Skip to content

Commit b2ec8e9

Browse files
ttaylorrgitster
authored andcommitted
midx: do not require packs to be sorted in lexicographic order
The MIDX file format currently requires that pack files be identified by the lexicographic ordering of their names (that is, a pack having a checksum beginning with "abc" would have a numeric pack_int_id which is smaller than the same value for a pack beginning with "bcd"). As a result, it is impossible to combine adjacent MIDX layers together without permuting bits from bitmaps that are in more recent layer(s). To see why, consider the following example: | packs | preferred pack --------+-------------+--------------- MIDX #0 | { X, Y, Z } | Y MIDX #1 | { A, B, C } | B MIDX #2 | { D, E, F } | D , where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want to combine MIDX layers #0 and #1, to create a new layer #0' containing the packs from both layers. With the original three MIDX layers, objects are laid out in the bitmap in the order they appear in their source pack, and the packs themselves are arranged according to the pseudo-pack order. In this case, that ordering is Y, X, Z, B, A, C. But recall that the pseudo-pack ordering is defined by the order that packs appear in the MIDX, with the exception of the preferred pack, which sorts ahead of all other packs regardless of its position within the MIDX. In the above example, that means that pack 'Y' could be placed anywhere (so long as it is designated as preferred), however, all other packs must be placed in the location listed above. Because that ordering isn't sorted lexicographically, it is impossible to compact MIDX layers in the above configuration without permuting the object-to-bit-position mapping. Changing this mapping would affect all bitmaps belonging to newer layers, rendering the bitmaps associated with MIDX #2 unreadable. One of the goals of MIDX compaction is that we are able to shrink the length of the MIDX chain *without* invalidating bitmaps that belong to newer layers, and the lexicographic ordering constraint is at odds with this goal. However, packs do not *need* to be lexicographically ordered within the MIDX. As far as I can gather, the only reason they are sorted lexically is to make it possible to perform a binary search over the pack names in a MIDX, necessary to make `midx_contains_pack()`'s performance logarithmic in the number of packs rather than linear. Relax this constraint by allowing MIDX writes to proceed with packs that are not arranged in lexicographic order. `midx_contains_pack()` will lazily instantiate a `pack_names_sorted` array on the MIDX, which will be used to implement the binary search over pack names. This change produces MIDXs which may not be correctly read with external tools or older versions of Git. Though older versions of Git know how to gracefully degrade and ignore any MIDX(s) they consider corrupt, external tools may not be as robust. To avoid unintentionally breaking any such tools, guard this change behind a version bump in the MIDX's on-disk format. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 parent 82c905e commit b2ec8e9

5 files changed

Lines changed: 69 additions & 16 deletions

File tree

Documentation/gitformat-pack.adoc

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -374,7 +374,9 @@ HEADER:
374374
The signature is: {'M', 'I', 'D', 'X'}
375375

376376
1-byte version number:
377-
Git only writes or recognizes version 1.
377+
Git writes the version specified by the "midx.version"
378+
configuration option, which defaults to 2. It recognizes
379+
both versions 1 and 2.
378380

379381
1-byte Object Id Version
380382
We infer the length of object IDs (OIDs) from this value:
@@ -413,7 +415,9 @@ CHUNK DATA:
413415
strings. There is no extra padding between the filenames,
414416
and they are listed in lexicographic order. The chunk itself
415417
is padded at the end with between 0 and 3 NUL bytes to make the
416-
chunk size a multiple of 4 bytes.
418+
chunk size a multiple of 4 bytes. Version 1 MIDXs are required to
419+
list their packs in lexicographic order, but version 2 MIDXs may
420+
list their packs in any arbitrary order.
417421

418422
Bitmapped Packfiles (ID: {'B', 'T', 'M', 'P'})
419423
Stores a table of two 4-byte unsigned integers in network order.

midx-write.c

Lines changed: 22 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,10 +36,13 @@ extern int cmp_idx_or_pack_name(const char *idx_or_pack_name,
3636

3737
static size_t write_midx_header(const struct git_hash_algo *hash_algo,
3838
struct hashfile *f, unsigned char num_chunks,
39-
uint32_t num_packs)
39+
uint32_t num_packs, int version)
4040
{
41+
if (version != MIDX_VERSION_V1 && version != MIDX_VERSION_V2)
42+
BUG("unexpected MIDX version: %d", version);
43+
4144
hashwrite_be32(f, MIDX_SIGNATURE);
42-
hashwrite_u8(f, MIDX_VERSION);
45+
hashwrite_u8(f, version);
4346
hashwrite_u8(f, oid_version(hash_algo));
4447
hashwrite_u8(f, num_chunks);
4548
hashwrite_u8(f, 0); /* unused */
@@ -105,6 +108,8 @@ struct write_midx_context {
105108

106109
uint32_t preferred_pack_idx;
107110

111+
int version; /* must be MIDX_VERSION_V1 or _V2 */
112+
108113
int incremental;
109114
uint32_t num_multi_pack_indexes_before;
110115

@@ -410,7 +415,9 @@ static int write_midx_pack_names(struct hashfile *f, void *data)
410415
if (ctx->info[i].expired)
411416
continue;
412417

413-
if (i && strcmp(ctx->info[i].pack_name, ctx->info[i - 1].pack_name) <= 0)
418+
if (ctx->version == MIDX_VERSION_V1 &&
419+
i && strcmp(ctx->info[i].pack_name,
420+
ctx->info[i - 1].pack_name) <= 0)
414421
BUG("incorrect pack-file order: %s before %s",
415422
ctx->info[i - 1].pack_name,
416423
ctx->info[i].pack_name);
@@ -1025,6 +1032,12 @@ static bool midx_needs_update(struct multi_pack_index *midx, struct write_midx_c
10251032
if (!midx_checksum_valid(midx))
10261033
goto out;
10271034

1035+
/*
1036+
* If the version differs, we need to update.
1037+
*/
1038+
if (midx->version != ctx->version)
1039+
goto out;
1040+
10281041
/*
10291042
* Ignore incremental updates for now. The assumption is that any
10301043
* incremental update would be either empty (in which case we will bail
@@ -1100,6 +1113,7 @@ static int write_midx_internal(struct write_midx_opts *opts)
11001113
struct tempfile *incr;
11011114
struct write_midx_context ctx = {
11021115
.preferred_pack_idx = NO_PREFERRED_PACK,
1116+
.version = MIDX_VERSION_V2,
11031117
};
11041118
struct multi_pack_index *midx_to_free = NULL;
11051119
int bitmapped_packs_concat_len = 0;
@@ -1114,6 +1128,10 @@ static int write_midx_internal(struct write_midx_opts *opts)
11141128
ctx.repo = r;
11151129
ctx.source = opts->source;
11161130

1131+
repo_config_get_int(ctx.repo, "midx.version", &ctx.version);
1132+
if (ctx.version != MIDX_VERSION_V1 && ctx.version != MIDX_VERSION_V2)
1133+
die(_("unknown MIDX version: %d"), ctx.version);
1134+
11171135
ctx.incremental = !!(opts->flags & MIDX_WRITE_INCREMENTAL);
11181136

11191137
if (ctx.incremental)
@@ -1445,7 +1463,7 @@ static int write_midx_internal(struct write_midx_opts *opts)
14451463
}
14461464

14471465
write_midx_header(r->hash_algo, f, get_num_chunks(cf),
1448-
ctx.nr - dropped_packs);
1466+
ctx.nr - dropped_packs, ctx.version);
14491467
write_chunkfile(cf, &ctx);
14501468

14511469
finalize_hashfile(f, midx_hash, FSYNC_COMPONENT_PACK_METADATA,

midx.c

Lines changed: 28 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -149,7 +149,7 @@ static struct multi_pack_index *load_multi_pack_index_one(struct odb_source *sou
149149
m->signature, MIDX_SIGNATURE);
150150

151151
m->version = m->data[MIDX_BYTE_FILE_VERSION];
152-
if (m->version != MIDX_VERSION)
152+
if (m->version != MIDX_VERSION_V1 && m->version != MIDX_VERSION_V2)
153153
die(_("multi-pack-index version %d not recognized"),
154154
m->version);
155155

@@ -210,7 +210,8 @@ static struct multi_pack_index *load_multi_pack_index_one(struct odb_source *sou
210210
die(_("multi-pack-index pack-name chunk is too short"));
211211
cur_pack_name = end + 1;
212212

213-
if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0)
213+
if (m->version == MIDX_VERSION_V1 &&
214+
i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0)
214215
die(_("multi-pack-index pack names out of order: '%s' before '%s'"),
215216
m->pack_names[i - 1],
216217
m->pack_names[i]);
@@ -411,6 +412,7 @@ void close_midx(struct multi_pack_index *m)
411412
}
412413
FREE_AND_NULL(m->packs);
413414
FREE_AND_NULL(m->pack_names);
415+
FREE_AND_NULL(m->pack_names_sorted);
414416
free(m);
415417
}
416418

@@ -655,17 +657,40 @@ int cmp_idx_or_pack_name(const char *idx_or_pack_name,
655657
return strcmp(idx_or_pack_name, idx_name);
656658
}
657659

660+
661+
static int midx_pack_names_cmp(const void *a, const void *b, void *m_)
662+
{
663+
struct multi_pack_index *m = m_;
664+
return strcmp(m->pack_names[*(const size_t *)a],
665+
m->pack_names[*(const size_t *)b]);
666+
}
667+
658668
static int midx_contains_pack_1(struct multi_pack_index *m,
659669
const char *idx_or_pack_name)
660670
{
661671
uint32_t first = 0, last = m->num_packs;
662672

673+
if (m->version == MIDX_VERSION_V2 && !m->pack_names_sorted) {
674+
uint32_t i;
675+
676+
ALLOC_ARRAY(m->pack_names_sorted, m->num_packs);
677+
678+
for (i = 0; i < m->num_packs; i++)
679+
m->pack_names_sorted[i] = i;
680+
681+
QSORT_S(m->pack_names_sorted, m->num_packs, midx_pack_names_cmp,
682+
m);
683+
}
684+
663685
while (first < last) {
664686
uint32_t mid = first + (last - first) / 2;
665687
const char *current;
666688
int cmp;
667689

668-
current = m->pack_names[mid];
690+
if (m->pack_names_sorted)
691+
current = m->pack_names[m->pack_names_sorted[mid]];
692+
else
693+
current = m->pack_names[mid];
669694
cmp = cmp_idx_or_pack_name(idx_or_pack_name, current);
670695
if (!cmp)
671696
return 1;

midx.h

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,8 @@ struct git_hash_algo;
1111
struct odb_source;
1212

1313
#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
14-
#define MIDX_VERSION 1
14+
#define MIDX_VERSION_V1 1
15+
#define MIDX_VERSION_V2 2
1516
#define MIDX_BYTE_FILE_VERSION 4
1617
#define MIDX_BYTE_HASH_VERSION 5
1718
#define MIDX_BYTE_NUM_CHUNKS 6
@@ -71,6 +72,7 @@ struct multi_pack_index {
7172
uint32_t num_packs_in_base;
7273

7374
const char **pack_names;
75+
size_t *pack_names_sorted;
7476
struct packed_git **packs;
7577
};
7678

t/t5319-multi-pack-index.sh

Lines changed: 10 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ midx_read_expect () {
2121
EXTRA_CHUNKS="$5"
2222
{
2323
cat <<-EOF &&
24-
header: 4d494458 1 $HASH_LEN $NUM_CHUNKS $NUM_PACKS
24+
header: 4d494458 2 $HASH_LEN $NUM_CHUNKS $NUM_PACKS
2525
chunks: pack-names oid-fanout oid-lookup object-offsets$EXTRA_CHUNKS
2626
num_objects: $NUM_OBJECTS
2727
packs:
@@ -512,11 +512,6 @@ test_expect_success 'verify invalid chunk offset' '
512512
"improper chunk offset(s)"
513513
'
514514

515-
test_expect_success 'verify packnames out of order' '
516-
corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \
517-
"pack names out of order"
518-
'
519-
520515
test_expect_success 'verify missing pack' '
521516
corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "a" $objdir \
522517
"failed to load pack"
@@ -578,6 +573,15 @@ test_expect_success 'verify incorrect checksum' '
578573
$objdir "incorrect checksum"
579574
'
580575

576+
test_expect_success 'setup for v1-specific fsck tests' '
577+
git -c midx.version=1 multi-pack-index write
578+
'
579+
580+
test_expect_success 'verify packnames out of order (v1)' '
581+
corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \
582+
"pack names out of order"
583+
'
584+
581585
test_expect_success 'repack progress off for redirected stderr' '
582586
GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir repack 2>err &&
583587
test_line_count = 0 err

0 commit comments

Comments
 (0)