Introduction

Difftastic is a structural diff tool that understands syntax. It supports over 30 programming languages and when it works, it's fantastic.

Difftastic is open source software (MIT license) and available on GitHub.

This copy of the manual describes version 0.63.0. The changelog records which features and bug fixes are in each version.

This manual is also available in Chinese.

Syntactic Diffing

Difftastic detects the language, parses the code, and then compares the syntax trees. Let's look at an example.

// old.rs
let ts_lang = guess(path, guess_src).map(tsp::from_language);
// new.rs
let ts_lang = language_override
    .or_else(|| guess(path, guess_src))
    .map(tsp::from_language);
$ difft old.rs new.rs

1 1 let ts_lang = language_override
. 2     .or_else(|| guess(path, guess_src))
. 3     .map(tsp::from_language);

Notice how difftastic recognises that .map is unchanged, even though it's now on a new line with whitespace.

A line-oriented diff does a much worse job here.

$ diff -u old.rs new.rs

@@ -1 +1,3 @@
-let ts_lang = guess(path, guess_src).map(tsp::from_language);
+let ts_lang = language_override
+    .or_else(|| guess(path, guess_src))
+    .map(tsp::from_language);

Some textual diff tools also highlight word changes (e.g. GitHub or git's --word-diff). They still don't understand the code though. Difftastic will always find matched delimiters: you can see the closing ) from or_else has been highlighted.

Fallback Textual Diffing

If input files are not in a format that difftastic understands, it uses a conventional line-oriented text diff with word highlighting.

Difftastic will also use textual diffing when given extremely large inputs.

Installation

Diffstatic can be installed as pre-built binaries or using various package managers.

Pre-Built Binaries

Diffstatic releases are published as GitHub releases with pre-built binaries for Windows, macOS and Linux. Open the latest release page, download the file matching your OS and CPU architecture, and extract the difft executable application file.

Package Manager

macOS

If you're a Homebrew user, you can install difftastic with brew.

$ brew install difftastic

Linux and Unix

If you're an Arch Linux user, you can install difftastic with pacman.

$ sudo pacman -S difftastic

If you're a Nix user, you can install difftastic with nix-env.

$ nix-env --install difftastic

If you're a Fedora user, you can install difftastic with dnf.

$ sudo dnf install difftastic

If you're a FreeBSD user, you can install difftastic with pkg.

$ sudo pkg install difftastic

Windows

If you're a Windows user using Windows Package Manager (WinGet), you can install difftastic with winget.

$ winget install difftastic

If you're a Windows user using Scoop, you can install difftastic with scoop.

$ scoop install difftastic

If you're a Windows user using Chocolatey, you can install difftastic with choco.

$ choco install difftastic

Full Package Listing

This table lists all the platforms that have packaged difftastic.

Packaging status

Installing From Source

Build Requirements

Difftastic is written in Rust, so you will need Rust installed. I recommend rustup to install Rust. Difftastic requires Rust version 1.66 or later.

You will also need a C++ compiler that supports C++14. If you're using GCC, you need at least version 8.

Build

You can download and build difftastic on crates.io with Cargo (which is part of Rust).

$ cargo install --locked difftastic

Difftastic uses the cc crate for building C/C++ dependencies. This allows you to use environment variables CC and CXX to control the compiler used (see the cc docs).

See contributing for instructions on debug builds.

(Optional) Install a MIME Database

If a MIME database is available, difftastic will use it to detect binary files more accurately. This is the same database used by the file command, so you probably already have it.

The MIME database path is specified in the XDG specification. The database should be at one of the following paths:

  • /usr/share/mime/magic
  • /usr/local/share/mime/magic
  • $HOME/.local/share/mime/magic

Usage

This page describes how to use the difft binary directly. See also the Git, Mercurial and Fossil pages for instructions on how to configure them to use difftastic.

File Arguments

Diffing Files

$ difft FIRST-FILE SECOND-FILE

# For example:
$ difft sample_files/simple_1.js sample_files/simple_2.js

Diffing Directories

$ difft FIRST-DIRECTORY SECOND-DIRECTORY

# For example:
$ difft sample_files/dir_1/ sample_files/dir_2/

Difftastic will recursively walk the two directories, diffing files with the same name.

The --skip-unchanged option is useful when diffing directories that contain many unchanged files.

Reading stdin

You can read a file from stdin by specifying - as the file path.

$ difft - SECOND-FILE

# For example:
$ cat sample_files/simple_1.js | difft - sample_files/simple_2.js

Files With Conflicts

(Added in version 0.50.)

If you have a file with <<<<<<< conflict markers, you can pass it as a single argument to difftastic. Difftastic will construct the two file states and diff those.

$ difft FILE-WITH-CONFLICTS

# For example:
$ difft sample_files/conflicts.el

Configuration Options

Every difftastic option can be set with a command line argument or an environment variable. For example, DFT_BACKGROUND=light is equivalent to --background=light.

Environment variables are often useful when using VCS tools like git, because they invoke the difft binary directly.

For a full list of configuration options, see --help.

$ difft --help
...
OPTIONS:
        --background <BACKGROUND>
            Set the background brightness. Difftastic will prefer brighter colours on dark backgrounds.

            [env: DFT_BACKGROUND=]
            [default: dark]
            [possible values: dark, light]
...

Exit Codes

2: Difftastic was given invalid arguments. This includes invalid usage (e.g. the wrong number of arguments) as well as paths that difftastic cannot read (e.g. non-existent paths or insufficient permissions).

1: When called with --exit-code, difftastic will return an exit code of 1 when it finds any syntactic changes (in text files) or byte changes (in binary files).

0: All other cases.

Git

Difftastic can be used an external diff command in git, allowing difftastic to be used with any git subcommand.

Warning: git v2.43.1 and earlier can crash when using an external diff and file permissions have changed.

If you can't upgrade git, use the difftool configuration described below.

One-Off Usage

You can set the diff.external configuration option when running git diff, or set the GIT_EXTERNAL_DIFF environment variable.

View uncommitted changes with difftastic:

$ git -c diff.external=difft diff

Other git commands also require the --ext-diff argument in order to use diff.external.

View changes from the most recent commit with difftastic:

$ git -c diff.external=difft show --ext-diff

View changes from recent commits on the current branch with difftastic:

$ git -c diff.external=difft log -p --ext-diff

Regular Usage

If you like difftastic, we recommend that you configure git aliases so you can use difftastic more easily.

[alias]
    # Difftastic aliases, so `git dlog` is `git log` with difftastic and so on.
    dlog = -c diff.external=difft log --ext-diff
    dshow = -c diff.external=difft show --ext-diff
    ddiff = -c diff.external=difft diff

The author likes the following additional aliases to reduce typing:

[alias]
    # `git log` with patches shown with difftastic.
    dl = -c diff.external=difft log -p --ext-diff

    # Show the most recent commit with difftastic.
    ds = -c diff.external=difft show --ext-diff

    # `git diff` with difftastic.
    dft = -c diff.external=difft diff

Difftastic By Default

If you want to use difftastic as your default diff tool, add the following to your ~/.gitconfig.

[diff]
    external = difft

This changes git diff to use difftastic, and other commands now only require --ext-diff.

$ git diff
$ git show --ext-diff
$ git log -p --ext-diff

If you've configured difftastic as the default diff tool, you can opt-out for an individual command with --no-ext-diff.

$ git diff --no-ext-diff

Difftool

Git also has a difftool feature which allows users to invoke CLI or GUI comparison tools.

For best results, we recommend using -c diff.external=difft as described above. Git passes more information to the external diff, including file permission changes and rename information, so difftastic can show more information.

To define a difftool named difftastic, add the following to your ~/.gitconfig.

[difftool "difftastic"]
    # See `man git-difftool` for a description of MERGED, LOCAL and REMOTE.
    cmd = difft "$MERGED" "$LOCAL" "abcdef1" "100644" "$REMOTE" "abcdef2" "100644"

You can now use difftastic as a difftool:

$ git difftool -t difftastic

For the best results when using difftastic as a difftool, we recommend the following additional git configuration:

[difftool]
    # Run the difftool immediately, don't ask 'are you sure' each time.
    prompt = false

[pager]
    # Use a pager if the difftool output is larger than one screenful,
    # consistent with the behaviour of `git diff`.
    difftool = true

[diff]
    # Set difftastic as the default difftool, so we don't need to specify
    # `-t difftastic` every time.
    tool = difftastic

Mercurial

Mercurial supports external diff tools with the Extdiff extension. Enable it by adding an entry to extensions in your .hgrc.

[extensions]
extdiff =

You can then run hg extdiff -p difft instead of hg diff (assumes the difft binary is on your $PATH).

You can also define an alias to run difftastic with hg. Add the following to your .hgrc to run difftastic with hg dft.

[extdiff]
cmd.dft = difft
# You can add further options which will be passed to the command line, e.g.
# opts.dft = --background light

All options of hg diff are also supported by hg dft; for example, hg dft --stat will show statistics of changed lines and hg dft -r 42 -r 45 will show the diff between two revisions.

hg log -p

Mercurial does not have a way of changing the default diff tool, at least to the author's knowledge.

If you just want to view the diff of the most recent commit, you can use the following.

hg dft -r .^ -r .

This is equivalent to hg log -l 1 -p, although it does not show the commit message.

Fossil

Fossil supports external diff commands by setting diff-command for the current repository:

fossil settings diff-command difft

To use difftastic for all repositories, use --global:

fossil settings diff-command --global difft

Skip difftastic on Fossil

If you set difftastic as Fossil's diff command, but you need to use Fossil's internal diff once, use -i to skip difftastic once:

fossil diff -i

If you want to remove difftastic from one repository (or globally), use unset:

fossil unset diff-command

unset also supports the --global flag.

Languages Supported

This page lists all the languages supported by difftastic. You can also view the languages supported in your current installed version with difft --list-languages.

Programming Languages

LanguageParser Used
Adabriot/tree-sitter-ada
Apexaheber/tree-sitter-sfapex
Bashtree-sitter/tree-sitter-bash
Ctree-sitter/tree-sitter-c
C++tree-sitter/tree-sitter-cpp
C#tree-sitter/tree-sitter-c-sharp
Clojuresogaiu/tree-sitter-clojure
CMakeuyha/tree-sitter-cmake
Common LisptheHamsta/tree-sitter-commonlisp
DartUserNobody14/tree-sitter-dart
Device Treejoelspadin/tree-sitter-devicetree
Elixirelixir-lang/tree-sitter-elixir
Elmelm-tooling/tree-sitter-elm
Elvishckafi/tree-sitter-elvish
ErlangWhatsApp/tree-sitter-erlang
Emacs Lispwilfred/tree-sitter-elisp
F#ionide/tree-sitter-fsharp
Gleamgleam-lang/tree-sitter-gleam
Gotree-sitter/tree-sitter-go
Hackslackhq/tree-sitter-hack
Hareecmma/tree-sitter-hare
Haskelltree-sitter/tree-sitter-haskell
Janetsogaiu/tree-sitter-janet-simple
Javatree-sitter/tree-sitter-java
JavaScript, JSXtree-sitter/tree-sitter-javascript
Juliatree-sitter/tree-sitter-julia
Kotlinfwcd/tree-sitter-kotlin
Luatree-sitter-grammars/tree-sitter-lua
Maketree-sitter-grammars/tree-sitter-make
Nixcstrahan/tree-sitter-nix
Objective-Camaanq/tree-sitter-objc
OCamltree-sitter/tree-sitter-ocaml
Perlganezdragon/tree-sitter-perl
PHPtree-sitter/tree-sitter-php
Pythontree-sitter/tree-sitter-python
QMLyuja/tree-sitter-qmljs
Rr-lib/tree-sitter-r
Racket6cdh/tree-sitter-racket
Rubytree-sitter/tree-sitter-ruby
Rusttree-sitter/tree-sitter-rust
Scalatree-sitter/tree-sitter-scala
Scheme6cdh/tree-sitter-scheme
Smaliamaanq/tree-sitter-smali
SolidityJoranHonig/tree-sitter-solidity
SQLm-novikov/tree-sitter-sql
Swiftalex-pinkus/tree-sitter-swift
TypeScript, TSXtree-sitter/tree-sitter-typescript
VHDLJLeemaster/tree-sitter-vhdl
Zigmaxxnino/tree-sitter-zig

Structured Text Formats

Language Detection

Difftastic guesses the language used based on the file extension, file name, and the contents of the first few lines.

To see the languages available, and the associated file names, use the --list-languages option.

$ difft --list-languages
...
XML
 *.ant *.csproj *.plist *.resx *.svg *.ui *.vbproj *.xaml *.xml *.xsd *.xsl *.xslt *.zcml App.config nuget.config packages.config .classpath .cproject .project
YAML
 *.yaml *.yml
Zig
 *.zig

You can override language detection for specific file globs using the --override option.

$ difft --override=GLOB:NAME FIRST-FILE SECOND-FILE

# For example, treating .h files as C rather than C++:
$ difft --override=*.h:c sample_files/preprocesor_1.h sample_files/preprocesor_2.h

See difft --help for more examples of --override usage.

Internals: Parsing

Difftastic uses tree-sitter to build a parse tree. The parse tree is then converted to a simpler tree which can be diffed.

Parsing with Tree-sitter

Difftastic relies on tree-sitter to understand syntax. You can view the parse tree that tree-sitter produces using the --dump-ts flag.

$ difft --dump-ts sample_files/javascript_simple_1.js | head
program (0, 0) - (7, 0)
  comment (0, 0) - (0, 8) "// hello"
  expression_statement (1, 0) - (1, 6)
    call_expression (1, 0) - (1, 5)
      identifier (1, 0) - (1, 3) "foo"
      arguments (1, 3) - (1, 5)
        ( (1, 3) - (1, 4) "("
        ) (1, 4) - (1, 5) ")"
    ; (1, 5) - (1, 6) ";"
  expression_statement (2, 0) - (2, 6)

Simplified Syntax

Difftastic converts the tree-sitter parse tree to a simplified syntax tree. The syntax tree is a uniform representation where everything is either an atom (e.g. integer literals, comments, variable names) or a list (consisting of the open delimiter, children and the close delimiter).

The flag --dump-syntax will display the syntax tree generated for a file.

$ difft --dump-syntax sample_files/before.js
[
    Atom id:1 {
        content: "// hello",
        position: "0:0-8",
    },
    List id:2 {
        open_content: "",
        open_position: "1:0-0",
        children: [
          ...

Conversion Process

The simple representation of the difftastic parse tree makes diffing much easier. Converting the detailed tree-sitter parse tree is a recursive tree walk, treating tree-sitter leaf nodes as atoms. There are two exceptions.

(1) Tree-sitter parse trees sometimes include unwanted structure. Some grammars consider string literals to be a single token, whereas others treat strings as a complex structure where the delimiters are separate.

tree_sitter_parser.rs uses atom_nodes to mark specific tree-sitter node names as flat atoms even if the node has children.

(2) Tree-sitter parse trees include open and closing delimiters as tokens. A list [1] will have a parse tree that includes [ and ] as nodes.

$ echo '[1]' > example.js
$ difft --dump-ts example.js
program (0, 0) - (1, 0)
  expression_statement (0, 0) - (0, 3)
    array (0, 0) - (0, 3)
      [ (0, 0) - (0, 1) "["
      number (0, 1) - (0, 2) "1"
      ] (0, 2) - (0, 3) "]"

tree_sitter_parser.rs uses open_delimiter_tokens to ensure that [ and ] are used as delimiter content in the enclosing list, rather than converting them to atoms.

Difftastic can match up atoms that occur in different parts of the simplified syntax tree. If e.g. a [ is treated as an atom, difftastic might match it with another [ elsewhere. The resulting diff would be unbalanced, highlighting different numbers of open and close delimiters.

Lossy Syntax Trees

The simplified syntax tree only stores node content and node position. It does not store whitespace between nodes, and position is largely ignored during diffing.

Internals: Diffing

Difftastic treats diff calculations as a route finding problem on a directed acyclic graph.

Graph Representation

A vertex in the graph represents a position in two syntax trees.

The start vertex has both positions pointing to the first syntax node in both trees. The end vertex has both positions just after the last syntax node in both trees.

Consider comparing A with X A.

START
+---------------------+
| Left: A  Right: X A |
|       ^         ^   |
+---------------------+

END
+---------------------+
| Left: A  Right: X A |
|        ^           ^|
+---------------------+

From the start vertex, we have two options:

  • we can mark the first syntax node on the left as novel, and advance to the next syntax node on the left (vertex 1 above), or
  • we can mark the first syntax node on the right as novel, and advance to the next syntax node on the right (vertex 2 above).
            START
            +---------------------+
            | Left: A  Right: X A |
            |       ^         ^   |
            +---------------------+
                   /       \
     Novel atom L /         \ Novel atom R
1                v       2   v
+---------------------+  +---------------------+
| Left: A  Right: X A |  | Left: A  Right: X A |
|        ^        ^   |  |       ^           ^ |
+---------------------+  +---------------------+

Choosing "novel atom R" to vertex 2 will turn out to be the best choice. From vertex 2, we can see three routes to the end vertex.

            2
            +---------------------+
            | Left: A  Right: X A |
            |       ^           ^ |
            +---------------------+
                   /    |   \
     Novel atom L /     |    \ Novel atom R
                 v      |     v
+---------------------+ | +---------------------+
| Left: A  Right: X A | | | Left: A  Right: X A |
|        ^          ^ | | |       ^            ^|
+---------------------+ | +---------------------+
  |                     |                    |
  | Novel atom R        | Nodes match        | Novel atom L
  |                     |                    |
  |         END         v                    |
  |         +---------------------+          |
  +-------->| Left: A  Right: X A |<---------+
            |        ^           ^|
            +---------------------+

Comparing Routes

We assign a cost to each edge. Marking a syntax node as novel is worse than finding a matching syntax node, so the "novel atom" edge has a higher cost than the "syntax nodes match" edge.

The best route is the lowest cost route from the start vertex to the end vertex.

Finding The Best Route

Difftastic uses Dijkstra's algorithm to find the best (i.e. lowest cost) route.

One big advantage of this algorithm is that we don't need to construct the graph in advance. Constructing the whole graph would require exponential memory relative to the number of syntax nodes. Instead, vertex neighbours are constructed as the graph is explored.

There are lots of resources explaining Dijkstra's algorithm online, but I particularly recommend the graph search section of Red Blob Games.

Tricky Cases

Tree diffing is challenging in some situations. This page discusses difficult cases, and how difftastic handles them.

Not all of these cases work well in difftastic yet.

Adding Delimiters

;; Before
x

;; After
(x)

Possible result: (x)

Desired result: (x)

This is tricky because x has changed its depth in the tree, but x itself is unchanged.

Not all tree diff algorithms handle this case. It is also challenging to display this case clearly: we want to highlight the changed delimiters, but not their content. This is challenging in larger expressions.

Difftastic: Difftastic considers nodes to be equal even at different depths, achieving the desired result in this case.

Changing Delimiters

;; Before
(x)

;; After
[x]

Desired result: (x), [x]

As with the wrapping case, we want to highlight the delimiters rather than the x.

Difftastic: Difftastic handles this correctly through its tree diffing.

Expanding Delimiters

;; Before
(x) y

;; After
(x y)

Possible result 1: (x) y, (x y)

Possible result 2: (x) y, (x y)

It's not clear which is better in this case.

Difftastic: Difftastic currently shows result 2, but this case is sensitive to the cost model. Some previous versions of difftastic have shown result 1.

Contracting Delimiters

;; Before
(x y)

;; After
(x) y

This case is similar to the expanding delimiter case.

Disconnected Delimiters

;; Before
(foo (bar))

;; After
(foo (novel) (bar))

Desired result: (foo (novel) (bar))

It is easy to end up with (foo (novel) (bar)), where a later pair of delimiters are chosen.

Rewrapping Large Nodes

;; Before
[[foo]]
(x y)

;; After
([[foo]] x y)

We want to highlight [[foo]] being moved inside the parentheses. However, a naive syntax differ prefers to consider a removal of () in the before and an addition of () in the after to be more minimal diff.

(Reported as issue 44.)

Reordering Within A List

;; Before
(x y)

;; After
(y x)

Desired result: (y x)

We want to highlight the list contents and not the delimiters.

Middle Insertions

// Before
foo(bar(123))

// After
foo(extra(bar(123)))

Desired result: foo(extra(bar(123)))

We want to consider both foo and bar to be unchanged. This case is challenging for diffing algorithms that do a bottom-up then top-down matching of trees.

Punctuation Atoms

// Before
foo(1, bar)

// After
foo(bar, 2)

Possible result: foo(bar, 2)

Desired result: foo(bar, 2)

There are two atoms inside the () that we could consider as unchanged, either the bar or the ,. (We can't consider both to be unchanged as they're reordered.)

We want to consider bar to be unchanged, as it's a more important atom than the , punctuation atom. Doing this is in a language-agnostic way is difficult, so difftastic has a small list of punctuation characters that always get lower priority than other atoms.

Sliders (Flat)

Sliders are a common problem in text based diffs, where lines are matched in a confusing way.

They typically look like this. The diff has to arbitrarily choose a line containing delimiter, and it chooses the wrong one.

+ }
+
+ function foo () {
  }

git-diff has some heuristics to reduce the risk of this (e.g. the "patience diff"), but it can still occur.

There's a similar problem in tree diffs.

;; Before
A B
C D

;; After
A B
A B
C D

Possible result:

A B
A B
C D

Preferred result:

A B
A B
C D

Ideally we'd prefer marking contiguous nodes as novel. From the perspective of a longest-common-subsequence algorithm, these two choices are equivalent.

Sliders (Nested)

// Before
old1(old2)

// After
old1(new1(old2))

Possible result: old1(new1(old2))

Desired result: old1(new1(old2))

The correct answer depends on the language. Most languages want to prefer the inner delimiter, whereas Lisps and JSON prefer the outer delimiter.

Minimising Depth Changes

// Before
if true {
  foo(123);
}
foo(456);

// After
foo(789);

Do we consider foo(123) or foo(456) to match with foo(789)? Difftastic prefers foo(456) by preferring nodes at the same nesting depth.

Replacements With Minor Similarities

// Before
function foo(x) { return x + 1; }

// After
function bar(y) { baz(y); }

Possible result: function bar(y) { baz(y); }

In this example, we've deleted a function and written a completely different one. A tree-based diff could match up the function and the outer delimiters, resulting in a confusing display showing lots of small changes.

As with sliders, the replacement problem can also occur in textual line-based diffs. Line-diffs struggle if there are a small number of common lines. The more precise, granular behaviour of tree diffs makes this problem much more common though.

Matching Substrings In Comments

// Before
/* The quick brown fox. */
foobar();

// After
/* The slow brown fox. */
foobaz();

foobar and foobaz are completely different, and their common prefix fooba should not be matched up. However, matching common prefixes or suffixes for comments is desirable.

Multiline Comments

// Before
/* Hello
 * World. */

// After
if (x) {
  /* Hello
   * World. */
}

The inner content of these two comments is technically different. We want to treat them as identical however.

Reflowing Doc Comments

Block comments have prefixes that aren't meaningful.

// Before
/* The quick brown fox jumps 
 * over the lazy dog. */

// After
/* The quick brown fox immediately
 * jumps over the lazy dog. */

The inner content has changed from jumps * over to immediately * jumps over. However, the * is decorative and we don't care that it's moved.

Small Changes To Large Strings

// Before
"""A very long string
with lots of words about
lots of stuff."""

// After
"""A very long string
with lots of NOVEL words about
lots of stuff."""

It would be correct to highlight the entire string literal as being removed and replaced with a new string literal. However, this makes it hard to see what's actually changed.

It's clear that variable names should be treated atomically, and comments are safe to show subword changes. It's not clear how to handle a small change in a 20 line string literal.

It's tempting to split strings on spaces and diff that, but users still want to know when whitespace changes inside strings. " " and " " are not the same.

Autoformatter Punctuation

// Before
foo("looooong", "also looooong");

// After
foo(
  "looooong",
  "novel",
  "also looooong",
);

Autoformatters (e.g. prettier) will sometimes add or remove punctuation when formatting. Commas and parentheses are the most common.

Syntactic diffing can ignore whitespace changes, but it has to assume punctuation is meaningful. This can lead to punctuation changes being highlighted, which may be quite far from the relevant content change.

Unordered Data Types

// Before
set(1, 2)

// After
set(2, 1)

Users may expect difftastic to find no changes here. This is difficult for several reasons.

For programming languages, side effects might make the order relevant. set(foo(), bar()) might behave differently to set(bar(), foo()).

For configuration languages like JSON or YAML, some parser implementations do actually expose ordering information (e.g. object_pairs_hook=OrderedDict in Python, or serde_json's preserve_order feature in Rust).

To make matters worse, unordered tree diffing is NP-hard.

For the unordered case, it turns out that all of the problems in general are NP-hard. Indeed, the tree edit distance and alignment distance problems are even MAX SNP-hard.

-- A survey on tree edit distance and related problems

Difftastic: Difftastic considers ordering to be meaningful everywhere, so it will always report ordering changes.

Novel Blank Lines

Blank lines are challenging for syntactic diffs. We are comparing syntactic tokens, so we don't see blank lines.

// Before
A
B

// After
A

B

Generally we want syntactic diffing to ignore blank lines. In this first example, this should show no changes.

This is occasionally problematic, as it can hide accidental code reformatting.

// Before
A
B

// After
A
X

Y
B

In this second example, we've inserted X and Y and a blank line. We want to highlight the blank line as an addition.

// Before
A


B

// After
A
X
B

In this third example, the syntactic diffing only sees an addition. From the user's perspective, there has also been a removal of two blank lines.

Invalid Syntax

There's no guarantee that the input we're given is valid syntax. Even if the code is valid, it might use syntax that isn't supported by the parser.

Difftastic: Difftastic will fall back to a text-based diff if any parse errors occur, to avoid diffing incomplete syntax trees. When this occurs, the file header reports the error count.

$ difft sample_files/syntax_error_1.js sample_files/syntax_error_2.js
sample_files/syntax_error_after.js --- Text (2 errors, exceeded DFT_PARSE_ERROR_LIMIT)
...

Users may opt-in to syntactic diffing by setting DFT_PARSE_ERROR_LIMIT to a larger value. In this mode, difftastic treats tree-sitter error nodes as atoms and performs a tree diff as normal.

Contributing

Building

Install Rust with rustup, then clone the code.

$ git clone git@github.com:Wilfred/difftastic.git
$ cd difftastic

Difftastic uses Cargo for building.

$ cargo build

Debug builds are significantly slower than release builds. For files with more than fifty lines, it's usually worth using an optimised build.

$ cargo build --release

Manual

This website is generated with mdbook. mdbook can be installed with Cargo.

Note: difftastic uses an older Rust toolchain version. You have to run cargo install mdbook outside of the repository directory. Otherwise, installation fails.

$ cargo install mdbook

You can then use the mdbook binary to build and serve the site locally.

$ cd manual
$ mdbook serve

API Documentation

You can browse the internal API documentation generated by rustdoc here.

Difftastic's internal docs are not available on docs.rs, as it does not support binary crates today.

Testing

$ cargo test

There are also several files in sample_files/ that you can use.

The best way to test difftastic is to look at history from a real project. Set GIT_EXTERNAL_DIFF to point to your current build.

For example, you can run difftastic on its own source code.

$ GIT_EXTERNAL_DIFF=./target/release/difft git log -p --ext-diff -- src

Logging

Difftastic uses the pretty_env_logger library to log some additional debug information.

$ DFT_LOG=debug cargo run sample_files/old.jsx sample_files/new.jsx

See the env_logger documentation for full details.

Releasing

Use Cargo to create a new release, and tag it in git. Difftastic has a helper script for this:

$ ./scripts/release.sh

You can now increment the version in Cargo.toml and add a new entry to CHANGELOG.md.

Vendoring

Git Subtrees

Tree-sitter parsers are sometimes packaged on npm, sometimes packaged on crates.io, and have different release frequencies. Difftastic uses git subtrees (not git submodules) to track parsers.

Updating a parser

To update a parser, pull commits from the upstream git repository. For example, the following command will update the Java parser:

$ git subtree pull --prefix=vendored_parsers/tree-sitter-java git@github.com:tree-sitter/tree-sitter-java.git master

To see when each parser was last updated, use the following shell command:

$ for d in $(git log | grep git-subtree-dir | tr -d ' ' | cut -d ":" -f2 | sort); do echo "$d"; git log --pretty="  %cs" -n 1 $d; done

Adding A Parser

Finding a parser

New parsers for difftastic must be reasonably complete and maintained.

There are many tree-sitter parsers available, and the tree-sitter website includes a list of some well-known parsers.

Add the source code

Once you've found a parser, add it as a git subtree to vendored_parsers/. We'll use tree-sitter-json as an example.

$ git subtree add --prefix=vendored_parsers/tree-sitter-json https://github.com/tree-sitter/tree-sitter-json.git master

Configure the build

Cargo does not allow packages to include subdirectories that contain a Cargo.toml. Add a symlink to the src/ parser subdirectory.

$ cd vendored_parsers
$ ln -s tree-sitter-json/src tree-sitter-json-src

You can now add the parser to build by including the directory in build.rs.

TreeSitterParser {
    name: "tree-sitter-json",
    src_dir: "vendored_parsers/tree-sitter-json-src",
    extra_files: vec![],
},

If your parser includes custom C or C++ files for lexing (e.g. a scanner.cc), add them to extra_files.

Configure parsing

Add an entry to tree_sitter_parser.rs for your language.

Json => {
    let language = unsafe { tree_sitter_json() };
    TreeSitterConfig {
        language,
        atom_nodes: vec!["string"].into_iter().collect(),
        delimiter_tokens: vec![("{", "}"), ("[", "]")],
        highlight_query: ts::Query::new(
            language,
            include_str!("../../vendored_parsers/highlights/json.scm"),
        )
        .unwrap(),
        sub_languages: vec![],
    }
}

atom_nodes is a list of tree-sitter node names that should be treated as atoms even though the nodes have children. This is common for things like string literals or interpolated strings, where the node might have children for the opening and closing quote.

If you don't set atom_nodes, you may notice added/removed content shown in white. This is usually a sign that child node should have its parent treated as an atom.

delimiter_tokens are delimiters that difftastic stores on the enclosing list node. This allows difftastic to distinguish delimiter tokens from other punctuation in the language.

If you don't set delimiter_tokens, difftastic will consider the tokens in isolation, and may think that a ( was added but the ) was unchanged.

You can use difft --dump-ts foo.json to see the results of the tree-sitter parser, and difft --dump-syntax foo.json to confirm that you've set atoms and delimiters correctly.

sub-languages is empty for most languages: see the code documentation for details.

Configure language detection

Update language_name in guess_language.rs to detect your new language. Insert a match arm like:

Json => "json",

There may also file names or shebangs associated with your language; configure those by adapting the language_globs, from_emacs_mode_header and from_shebang functions in that file. GitHub's linguist definitions are a useful source of common file extensions.

Syntax highlighting (Optional)

To add syntax highlighting for your language, you'll also need a symlink to the queries/highlights.scm file, if available.

$ cd vendored_parsers/highlights
$ ln -s ../tree-sitter-json/queries/highlights.scm json.scm

Test It

Search GitHub for a popular repository in your target language (example search) and confirm that git history looks sensible with difftastic.

Add a regression test

Finally, add a regression test for your language. This ensures that the output for your test file doesn't change unexpectedly.

Regression test files live in sample_files/ and have the form foo_1.abc and foo_2.abc.

$ nano simple_1.json
$ nano simple_2.json

Run the regression test script and update the .expected file.

$ ./sample_files/compare_all.sh
$ cp sample_files/compare.result sample_files/compare.expected

Maintenance

To update a parser that is already imported, use git subtree pull.

$ git subtree pull --prefix=vendored_parsers/tree-sitter-json git@github.com:tree-sitter/tree-sitter-json.git master

Profiling

If you have a file that's particularly slow, you can use cargo-flamegraph to see which functions are slow.

$ CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --bin difft -- sample_files/slow_1.rs sample_files/slow_2.rs

It's also worth looking at memory usage, as graph traversal bugs can lead to huge memory consumption.

$ /usr/bin/time -v ./target/release/difft sample_files/slow_1.rs sample_files/slow_2.rs

If timing measurement are noisy, Linux's perf tool will report instructions executed, which is more stable.

$ perf stat ./target/release/difft sample_files/slow_1.rs sample_files/slow_2.rs
$ perf stat ./target/release/difft sample_files/typing_1.ml sample_files/typing_2.ml

Many more profiling techniques are discussed in The Rust Performance Book.

Glossary

Atom: An atom is an item in difftastic's syntax tree structure that has no children. It represents things like literals, variable names, and comments. See also 'list'.

Delimiter: A paired piece of syntax. A list has an open delimiter and a close delimiter, such as [ and ]. Delimiters may not be punctuation (e.g. begin and end) and may be empty strings (e.g. infix syntax converted to difftastic's syntax tree).

Hunk: A group of lines displayed together in the diff output. Increasing the number of context lines increases the size of the hunk.

LHS: Left-hand side. Difftastic compares two items, and LHS refers to the first item. See also 'RHS'.

List: A list is an item in difftastic's syntax tree structure that has an open delimiter, children, and a close delimiter. It represents things like expressions and function definitions. See also 'atom'.

Novel: An addition or a removal. Syntax is novel if it occurs in only one of the two items being compared.

RHS: Right-hand side. Difftastic compares two items, and RHS refers to the second item. See also 'LHS'.

Root: A syntax tree without a parent node. Roots represent top-level definitions in the file being diffed.

Slider: A diffing situation where there are multiple minimal diffs possible, due to adjacent content. It is possible to 'slide' to produce better results in this situation. See the discussion in Tricky Cases.

Syntax node: An item in difftastic's syntax tree structure. Either an atom or a list.

Token: A small piece of syntax tracked by difftastic (e.g. $x, function or ]), for highlighting and aligned display. This is either an atom or a non-empty delimiter.

Alternative Projects

Many different tools exist for diffing files. This section of the manual discusses the design of other tools that have influenced difftastic.

Tree Diffing

This page summarises some of the other tree diffing tools available.

If you're in a hurry, start by looking at Autochrome. It's extremely capable, and has an excellent description of the design.

If you're interested in a summary of the academic literature, this blog post (and its accompanying paper -- mirrored under a CC BY-NC license) are great resources.

json-diff (2012)

Languages: JSON
Algorithm: Pairwise comparison
Output: CLI colours

json-diff performs a structural diff of JSON files. It considers subtrees to be different if they don't match exactly, so e.g. "foo" and ["foo"] are entirely different.

json-diff is also noteworthy for its extremely readable display of results.

GumTree (2014)

Languages: ~10 programming languages
Parser: Several, including srcML
Algorithm: Top-down, then bottom-up
Output: HTML, Swing GUI, or text

GumTree can parse several programming languages and then performs a tree-based diff, outputting an HTML display.

The GumTree algorithm is described in the associated paper 'Fine-grained and accurate source code differencing' by Falleri et al (DOI, PDF). It performs a greedy top-down search for identical subtrees, then performs a bottom-up search to match up the rest.

Tree Diff (2017)

Languages: S-expression data format
Algorithm: A* search
Output: Merged s-expression file

Tristan Hume wrote a tree diffing algorithm during his 2017 internship at Jane Street. The source code is not available, but he has a blog post discussing the design in depth.

This project finds minimal diffs between s-expression files used as configuration by Jane Street. It uses A* search to find the minimal diff between them, and builds a new s-expression with a section marked with :date-switch for the differing parts.

(Jane Street also has patdiff, but that seems to be a line-oriented diff with some whitespace/integer display polish. It doesn't understand that e.g. whitespace in "foo " is meaningful).

Autochrome (2017)

Languages: Clojure
Parser: Custom, preserves comments
Algorithm: Dijkstra (previously A* search)
Output: HTML

Autochrome parses Clojure with a custom parser that preserves comments. Autochrome uses Dijkstra's algorithm to compare syntax trees.

Autochrome's webpage includes worked examples of the algorithm and a discussion of design tradeoffs. It's a really great resource for understanding tree diffing techniques in general.

graphtage (2020)

Languages: JSON, XML, HTML, YAML, plist, and CSS
Parser: json5, pyYAML, ignores comments
Algorithm: Levenshtein distance
Output: CLI colours

graphtage compares structured data by parsing into a generic file format, then displaying a diff. It even allows things like diffing JSON against YAML.

As with json-diff, it does not consider ["foo"] and "foo" to have any similarities.

Diffsitter (2020)

Parser: Tree-sitter
Algorithm: Longest-common-subsequence
Output: CLI colours

Diffsitter is another tree-sitter based diff tool. It uses LCS diffing on the leaves of the syntax tree.

sdiff (2021)

Languages: Scheme
Parser: Scheme's built-in read, ignores comments
Algorithm: MH-Diff from the Chawathe paper
Output: CLI colours

Semantically meaningful S-expression diff: Tree-diff for lisp source code was presented at FOSDEM 2021.