Tricky Cases

Tree diffing is challenging in some situations. This page discusses difficult cases, and how difftastic handles them.

Not all of these cases work well in difftastic yet.

Adding Delimiters

;; Before
x

;; After
(x)

Possible result: (x)

Desired result: (x)

This is tricky because x has changed its depth in the tree, but x itself is unchanged.

Not all tree diff algorithms handle this case. It is also challenging to display this case clearly: we want to highlight the changed delimiters, but not their content. This is challenging in larger expressions.

Difftastic: Difftastic considers nodes to be equal even at different depths, achieving the desired result in this case.

Changing Delimiters

;; Before
(x)

;; After
[x]

Desired result: (x), [x]

As with the wrapping case, we want to highlight the delimiters rather than the x.

Difftastic: Difftastic handles this correctly through its tree diffing.

Expanding Delimiters

;; Before
(x) y

;; After
(x y)

Possible result 1: (x) y, (x y)

Possible result 2: (x) y, (x y)

It's not clear which is better in this case.

Difftastic: Difftastic currently shows result 2, but this case is sensitive to the cost model. Some previous versions of difftastic have shown result 1.

Contracting Delimiters

;; Before
(x y)

;; After
(x) y

This case is similar to the expanding delimiter case.

Disconnected Delimiters

;; Before
(foo (bar))

;; After
(foo (novel) (bar))

Desired result: (foo (novel) (bar))

It is easy to end up with (foo (novel) (bar)), where a later pair of delimiters are chosen.

Rewrapping Large Nodes

;; Before
[[foo]]
(x y)

;; After
([[foo]] x y)

We want to highlight [[foo]] being moved inside the parentheses. However, a naive syntax differ prefers to consider a removal of () in the before and an addition of () in the after to be more minimal diff.

(Reported as issue 44.)

Reordering Within A List

;; Before
(x y)

;; After
(y x)

Desired result: (y x)

We want to highlight the list contents and not the delimiters.

Middle Insertions

// Before
foo(bar(123))

// After
foo(extra(bar(123)))

Desired result: foo(extra(bar(123)))

We want to consider both foo and bar to be unchanged. This case is challenging for diffing algorithms that do a bottom-up then top-down matching of trees.

Punctuation Atoms

// Before
foo(1, bar)

// After
foo(bar, 2)

Possible result: foo(bar, 2)

Desired result: foo(bar, 2)

There are two atoms inside the () that we could consider as unchanged, either the bar or the ,. (We can't consider both to be unchanged as they're reordered.)

We want to consider bar to be unchanged, as it's a more important atom than the , punctuation atom. Doing this is in a language-agnostic way is difficult, so difftastic has a small list of punctuation characters that always get lower priority than other atoms.

Sliders (Flat)

Sliders are a common problem in text based diffs, where lines are matched in a confusing way.

They typically look like this. The diff has to arbitrarily choose a line containing delimiter, and it chooses the wrong one.

+ }
+
+ function foo () {
  }

git-diff has some heuristics to reduce the risk of this (e.g. the "patience diff"), but it can still occur.

There's a similar problem in tree diffs.

;; Before
A B
C D

;; After
A B
A B
C D

Possible result:

A B
A B
C D

Preferred result:

A B
A B
C D

Ideally we'd prefer marking contiguous nodes as novel. From the perspective of a longest-common-subsequence algorithm, these two choices are equivalent.

Sliders (Nested)

// Before
old1(old2)

// After
old1(new1(old2))

Possible result: old1(new1(old2))

Desired result: old1(new1(old2))

The correct answer depends on the language. Most languages want to prefer the inner delimiter, whereas Lisps and JSON prefer the outer delimiter.

Minimising Depth Changes

// Before
if true {
  foo(123);
}
foo(456);

// After
foo(789);

Do we consider foo(123) or foo(456) to match with foo(789)? Difftastic prefers foo(456) by preferring nodes at the same nesting depth.

Replacements With Minor Similarities

// Before
function foo(x) { return x + 1; }

// After
function bar(y) { baz(y); }

Possible result: function bar(y) { baz(y); }

In this example, we've deleted a function and written a completely different one. A tree-based diff could match up the function and the outer delimiters, resulting in a confusing display showing lots of small changes.

As with sliders, the replacement problem can also occur in textual line-based diffs. Line-diffs struggle if there are a small number of common lines. The more precise, granular behaviour of tree diffs makes this problem much more common though.

Matching Substrings In Comments

// Before
/* The quick brown fox. */
foobar();

// After
/* The slow brown fox. */
foobaz();

foobar and foobaz are completely different, and their common prefix fooba should not be matched up. However, matching common prefixes or suffixes for comments is desirable.

Multiline Comments

// Before
/* Hello
 * World. */

// After
if (x) {
  /* Hello
   * World. */
}

The inner content of these two comments is technically different. We want to treat them as identical however.

Reflowing Doc Comments

Block comments have prefixes that aren't meaningful.

// Before
/* The quick brown fox jumps 
 * over the lazy dog. */

// After
/* The quick brown fox immediately
 * jumps over the lazy dog. */

The inner content has changed from jumps * over to immediately * jumps over. However, the * is decorative and we don't care that it's moved.

Small Changes To Large Strings

// Before
"""A very long string
with lots of words about
lots of stuff."""

// After
"""A very long string
with lots of NOVEL words about
lots of stuff."""

It would be correct to highlight the entire string literal as being removed and replaced with a new string literal. However, this makes it hard to see what's actually changed.

It's clear that variable names should be treated atomically, and comments are safe to show subword changes. It's not clear how to handle a small change in a 20 line string literal.

It's tempting to split strings on spaces and diff that, but users still want to know when whitespace changes inside strings. " " and " " are not the same.

Autoformatter Punctuation

// Before
foo("looooong", "also looooong");

// After
foo(
  "looooong",
  "novel",
  "also looooong",
);

Autoformatters (e.g. prettier) will sometimes add or remove punctuation when formatting. Commas and parentheses are the most common.

Syntactic diffing can ignore whitespace changes, but it has to assume punctuation is meaningful. This can lead to punctuation changes being highlighted, which may be quite far from the relevant content change.

Unordered Data Types

// Before
set(1, 2)

// After
set(2, 1)

Users may expect difftastic to find no changes here. This is difficult for several reasons.

For programming languages, side effects might make the order relevant. set(foo(), bar()) might behave differently to set(bar(), foo()).

For configuration languages like JSON or YAML, some parser implementations do actually expose ordering information (e.g. object_pairs_hook=OrderedDict in Python, or serde_json's preserve_order feature in Rust).

To make matters worse, unordered tree diffing is NP-hard.

For the unordered case, it turns out that all of the problems in general are NP-hard. Indeed, the tree edit distance and alignment distance problems are even MAX SNP-hard.

-- A survey on tree edit distance and related problems

Difftastic: Difftastic considers ordering to be meaningful everywhere, so it will always report ordering changes.

Novel Blank Lines

Blank lines are challenging for syntactic diffs. We are comparing syntactic tokens, so we don't see blank lines.

// Before
A
B

// After
A

B

Generally we want syntactic diffing to ignore blank lines. In this first example, this should show no changes.

This is occasionally problematic, as it can hide accidental code reformatting.

// Before
A
B

// After
A
X

Y
B

In this second example, we've inserted X and Y and a blank line. We want to highlight the blank line as an addition.

// Before
A


B

// After
A
X
B

In this third example, the syntactic diffing only sees an addition. From the user's perspective, there has also been a removal of two blank lines.

Invalid Syntax

There's no guarantee that the input we're given is valid syntax. Even if the code is valid, it might use syntax that isn't supported by the parser.

Difftastic: Difftastic will fall back to a text-based diff if any parse errors occur, to avoid diffing incomplete syntax trees. When this occurs, the file header reports the error count.

$ difft sample_files/syntax_error_1.js sample_files/syntax_error_2.js
sample_files/syntax_error_after.js --- Text (2 errors, exceeded DFT_PARSE_ERROR_LIMIT)
...

Users may opt-in to syntactic diffing by setting DFT_PARSE_ERROR_LIMIT to a larger value. In this mode, difftastic treats tree-sitter error nodes as atoms and performs a tree diff as normal.