Skip to content

Commit e417277

Browse files
ychingitster
authored andcommitted
xdiff: re-diff shifted change groups when using histogram algorithm
After a diff algorithm has been run, the compaction phase (xdl_change_compact()) shifts and merges change groups to produce a cleaner output. However, this shifting could create a new matched group where both sides now have matching lines. This results in a wrong-looking diff output which contains redundant lines that are the same on both files. Fix this by detecting this situation, and re-diff the texts on each side to find similar lines, using the fall-back Myer's diff. Only do this for histogram diff as it's the only algorithm where this is relevant. Below contains an example, and more details. For an example, consider two files below: file1: A A A A A A A file2: A A x A A A A When using Myer's diff, the algorithm finds that only the "x" has been changed, and produces a final diff result (these are line diffs, but using word-diff syntax for ease of presentation): A A[-A-]{+x+}A AAA When using histogram diff, the algorithm first discovers the LCS "A AAA", which it uses as anchor, then produces an intermediate diff: {+A Ax+}A AAA[- AAA-]. This is a longer diff than Myer's, but it's still self-consistent. However, the compaction phase attempts to shift the first file's diff group upwards (note that this shift crosses the anchor that histogram had used), leading to the final results for histogram diff: [-A AA-]{+A Ax+}A AAA This is a technically correct patch but looks clearly redundant to a human as the first 3 lines should not be in the diff. The fix would detect that a shift has caused matching to a new group, and re-diff the "A AA" and "A Ax" parts, which results in "A A" correctly re-marked as unchanged. This creates the now correct histogram diff: A A[-A-]{+x+}A AAA This issue is not applicable to Myer's diff algorithm as it already generates a minimal diff, which means a shift cannot result in a smaller diff output (the default Myer's diff in xdiff is not guaranteed to be minimal for performance reasons, but it typically does a good enough job). It's also not applicable to patience diff, because it uses only unique lines as anchor for its splits, and falls back to Myer's diff within each split. Shifting requires both ends having the same lines, and therefore cannot cross the unique line boundaries established by the patience algorithm. In contrast histogram diff uses non-unique lines as anchors, and therefore shifting can cross over them. This issue is rare in a normal repository. Below is a table of repositories (`git log --no-merges -p --histogram -1000`), showing how many times a re-diff was done and how many times it resulted in finding matching lines (therefore addressing this issue) with the fix. In general it is fewer than 1% of diff's that exhibit this offending behavior: | Repo (1k commits) | Re-diff | Found matching lines | |--------------------|---------|----------------------| | llvm-project | 45 | 11 | | vim | 110 | 9 | | git | 18 | 2 | | WebKit | 168 | 1 | | ripgrep | 22 | 1 | | cpython | 32 | 0 | | vscode | 13 | 0 | Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 parent 9a2fb14 commit e417277

3 files changed

Lines changed: 210 additions & 2 deletions

File tree

t/meson.build

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -496,6 +496,7 @@ integration_tests = [
496496
't4070-diff-pairs.sh',
497497
't4071-diff-minimal.sh',
498498
't4072-diff-max-depth.sh',
499+
't4074-diff-shifted-matched-group.sh',
499500
't4100-apply-stat.sh',
500501
't4101-apply-nonl.sh',
501502
't4102-apply-rename.sh',
Lines changed: 164 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,164 @@
1+
#!/bin/sh
2+
3+
test_description='shifted diff groups re-diffing during histogram diff'
4+
5+
. ./test-lib.sh
6+
7+
test_expect_success 'shifted/merged diff group should re-diff to minimize patch' '
8+
test_write_lines A x A A A x A A A >file1 &&
9+
test_write_lines A x A Z A x A A A >file2 &&
10+
11+
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
12+
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
13+
14+
cat >expect <<-EOF &&
15+
diff --git a/file1 b/file2
16+
index $file1_h..$file2_h 100644
17+
--- a/file1
18+
+++ b/file2
19+
@@ -1,7 +1,7 @@
20+
A
21+
x
22+
A
23+
-A
24+
+Z
25+
A
26+
x
27+
A
28+
EOF
29+
30+
test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
31+
test_cmp expect output
32+
'
33+
34+
test_expect_success 'merged diff group with no shift' '
35+
test_write_lines A Z B x >file1 &&
36+
test_write_lines C D x Z E x >file2 &&
37+
38+
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
39+
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
40+
41+
cat >expect <<-EOF &&
42+
diff --git a/file1 b/file2
43+
index $file1_h..$file2_h 100644
44+
--- a/file1
45+
+++ b/file2
46+
@@ -1,4 +1,6 @@
47+
-A
48+
+C
49+
+D
50+
+x
51+
Z
52+
-B
53+
+E
54+
x
55+
EOF
56+
57+
test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
58+
test_cmp expect output
59+
'
60+
61+
test_expect_success 're-diff should preserve diff flags' '
62+
test_write_lines a b c a b c >file1 &&
63+
test_write_lines x " b" z a b c >file2 &&
64+
65+
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
66+
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
67+
68+
cat >expect <<-EOF &&
69+
diff --git a/file1 b/file2
70+
index $file1_h..$file2_h 100644
71+
--- a/file1
72+
+++ b/file2
73+
@@ -1,6 +1,6 @@
74+
-a
75+
-b
76+
-c
77+
+x
78+
+ b
79+
+z
80+
a
81+
b
82+
c
83+
EOF
84+
85+
test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
86+
test_cmp expect output &&
87+
88+
cat >expect_iwhite <<-EOF &&
89+
diff --git a/file1 b/file2
90+
index $file1_h..$file2_h 100644
91+
--- a/file1
92+
+++ b/file2
93+
@@ -1,6 +1,6 @@
94+
-a
95+
+x
96+
b
97+
-c
98+
+z
99+
a
100+
b
101+
c
102+
EOF
103+
104+
test_expect_code 1 git diff --no-index --histogram --ignore-all-space file1 file2 >output_iwhite &&
105+
test_cmp expect_iwhite output_iwhite
106+
'
107+
108+
test_expect_success 'shifting on either side should trigger re-diff properly' '
109+
test_write_lines a b c a b c a b c >file1 &&
110+
test_write_lines a b c a1 a2 a3 b c1 a b c >file2 &&
111+
112+
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
113+
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
114+
115+
cat >expect1 <<-EOF &&
116+
diff --git a/file1 b/file2
117+
index $file1_h..$file2_h 100644
118+
--- a/file1
119+
+++ b/file2
120+
@@ -1,9 +1,11 @@
121+
a
122+
b
123+
c
124+
-a
125+
+a1
126+
+a2
127+
+a3
128+
b
129+
-c
130+
+c1
131+
a
132+
b
133+
c
134+
EOF
135+
136+
test_expect_code 1 git diff --no-index --histogram file1 file2 >output1 &&
137+
test_cmp expect1 output1 &&
138+
139+
cat >expect2 <<-EOF &&
140+
diff --git a/file2 b/file1
141+
index $file2_h..$file1_h 100644
142+
--- a/file2
143+
+++ b/file1
144+
@@ -1,11 +1,9 @@
145+
a
146+
b
147+
c
148+
-a1
149+
-a2
150+
-a3
151+
+a
152+
b
153+
-c1
154+
+c
155+
a
156+
b
157+
c
158+
EOF
159+
160+
test_expect_code 1 git diff --no-index --histogram file2 file1 >output2 &&
161+
test_cmp expect2 output2
162+
'
163+
164+
test_done

xdiff/xdiffi.c

Lines changed: 45 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -793,6 +793,7 @@ static int group_slide_up(xdfile_t *xdf, struct xdlgroup *g)
793793
*/
794794
int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
795795
struct xdlgroup g, go;
796+
struct xdlgroup g_orig;
796797
long earliest_end, end_matching_other;
797798
long groupsize;
798799

@@ -806,10 +807,12 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
806807
if (g.end == g.start)
807808
goto next;
808809

810+
g_orig = g;
811+
809812
/*
810813
* Now shift the change up and then down as far as possible in
811814
* each direction. If it bumps into any other changes, merge
812-
* them.
815+
* them and restart the process.
813816
*/
814817
do {
815818
groupsize = g.end - g.start;
@@ -862,7 +865,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
862865
/*
863866
* Move the possibly merged group of changes back to
864867
* line up with the last group of changes from the
865-
* other file that it can align with.
868+
* other file that it can align with. This avoids breaking
869+
* a single change into a separate addition/deletion.
866870
*/
867871
while (go.end == go.start) {
868872
if (group_slide_up(xdf, &g))
@@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
915919
}
916920
}
917921

922+
/*
923+
* If we merged change groups during shifting, the new
924+
* combined group could now have matching lines in both files,
925+
* even if the original separate groups did not. Re-diff the
926+
* new group to find these matching lines to mark them as
927+
* unchanged.
928+
*
929+
* Only do this if the corresponding group in the other file is
930+
* non-empty, as it's trivial otherwise.
931+
*
932+
* Only do this for histogram diff as its LCS algorithm allows
933+
* for this scenario. In contrast, patience diff finds LCS
934+
* of unique lines that groups cannot be shifted across.
935+
* Myer's diff (standalone or used as fall-back in patience
936+
* diff) already finds minimal edits so it is not possible for
937+
* shifted groups to result in a smaller diff. (Without
938+
* XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
939+
* minimal, but it should be so most of the time)
940+
*/
941+
if (go.end != go.start &&
942+
XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
943+
(g.start != g_orig.start ||
944+
g.end != g_orig.end)) {
945+
xpparam_t xpp;
946+
xdfenv_t xe;
947+
948+
memset(&xpp, 0, sizeof(xpp));
949+
xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
950+
951+
xe.xdf1 = *xdf;
952+
xe.xdf2 = *xdfo;
953+
954+
if (xdl_fall_back_diff(&xe, &xpp,
955+
g.start + 1, g.end - g.start,
956+
go.start + 1, go.end - go.start)) {
957+
return -1;
958+
}
959+
}
960+
918961
next:
919962
/* Move past the just-processed group: */
920963
if (group_next(xdf, &g))

0 commit comments

Comments
 (0)