Introduction

Difftastic is a structural diff tool that understands syntax. It supports over 20 programming languages and when it works, it's fantastic.

Difftastic is open source software (MIT license) and available on GitHub.

This copy of the manual describes version 0.29.0. The changelog records which features and bug fixes are in each version.

Syntactic Diffing

Difftastic detects the language, parses the code, and then compares the syntax trees. Let's look at an example.

// old.rs
let ts_lang = guess(path, guess_src).map(tsp::from_language);
// new.rs
let ts_lang = language_override
    .or_else(|| guess(path, guess_src))
    .map(tsp::from_language);
$ difft old.rs new.rs
1 1 let ts_lang = language_override
. 2     .or_else(|| guess(path, guess_src))
. 3     .map(tsp::from_language);

Notice how difftastic recognises that .map is unchanged, even though it's now on a new line with whitespace.

A line-oriented diff does a much worse job here.

$ diff -u old.rs new.rs
@@ -1 +1,3 @@
-let ts_lang = guess(path, guess_src).map(tsp::from_language);
+let ts_lang = language_override
+    .or_else(|| guess(path, guess_src))
+    .map(tsp::from_language);

Some textual diff tools also highlight word changes (e.g. GitHub or git's --word-diff). They still don't understand the code though. Difftastic will always find matched delimiters: you can see the closing ) from or_else has been highlighted.

Fallback Textual Diffing

If input files are not in a format that difftastic understands, it uses a conventional line-oriented text diff with word highlighting.

Difftastic will also use textual diffing when given extremely large inputs.

Installation

Installing a binary

Difftastic provides GitHub releases with prebuilt binaries.

Packages are also available on the following platforms.

Packaging status

Installing via homebrew (on macOS or Linux)

Difftastic can be installed with Homebrew on macOS or Linux.

$ brew install difftastic

Installing from source

Build Requirements

Difftastic is written in Rust, so you will need Rust installed. I recommend rustup to install Rust. Difftastic requires Rust version 1.56 or later.

You will also need a C++ compiler that supports C++14. If you're using GCC, you need at least version 8.

Build

You can download and build difftastic on crates.io with Cargo (which is part of Rust).

$ cargo install difftastic

Difftastic uses the cc crate for building C/C++ dependencies. This allows you to use environment variables CC and CXX to control the compiler used (see the cc docs).

See contributing for instructions on debug builds.

(Optional) Install MIME Database

If a MIME database is available, difftastic will use it to detect binary files more accurately. This is the same database used by the file command, so you probably already have it.

The MIME database path is specified in the XDG specification. The database should be at one of the following paths:

  • /usr/share/mime/magic
  • /usr/local/share/mime/magic
  • $HOME/.local/share/mime/magic

Usage

Diffing Files

$ difft sample_files/before.js sample_files/after.js

Difftastic uses the file extension to decide which parser to use.

Diffing Directories

$ difft sample_files/dir_before/ sample_files/dir_after/

Difftastic will recursively walk the two directories, diffing files with the same name.

The --skip-unchanged option is useful when diffing directories that contain many unchanged files.

Language Detection

Difftastic guesses the language used based on the file extension, file name, and the contents of the first lines.

You can override the language detection by passing the --language option. Difftastic will treat input files as if they had that extension, and ignore other language detection heuristics.

$ difft --language cpp before.c after.c

Git

Git supports external diff tools. You can use GIT_EXTERNAL_DIFF for a one-off git command.

$ GIT_EXTERNAL_DIFF=difft git diff
$ GIT_EXTERNAL_DIFF=difft git log -p --ext-diff
$ GIT_EXTERNAL_DIFF=difft git show e96a7241760319 --ext-diff

If you want to use difftastic by default, use git config.

# Set git configuration for the current repository.
$ git config diff.external difft

# Set git configuration for all repositories.
$ git config --global diff.external difft

After running git config, git diff will use difft automatically. Other git commands require --ext-diff to use diff.external.

$ git diff
$ git log -p --ext-diff
$ git show e96a7241760319 --ext-diff

git-difftool

git difftool is a git command for viewing the current changes with a different diff tool. It's useful if you want to use difftastic occasionally.

Add the following to your .gitconfig to use difftastic as your difftool.

[diff]
        tool = difftastic

[difftool]
        prompt = false

[difftool "difftastic"]
        cmd = difft "$LOCAL" "$REMOTE"

You can then run git difftool to see current changes with difftastic.

$ git difftool

We also recommend the following settings to get the best difftool experience.

# Use a pager for large output, just like other git commands.
[pager]
        difftool = true

# `git dft` is less to type than `git difftool`.
[alias]
        dft = difftool

Mercurial

Mercurial supports external diff tools with the Extdiff extension. Enable it by adding an entry to extensions in your .hgrc.

[extensions]
extdiff =

You can then run hg extdiff -p difft (assumes the difft binary is on your $PATH).

You can also define an alias to run difftastic with hg. Add the following to your .hgrc to run difftastic with hg dft.

[extdiff]
cmd.dft = difft
opts.dft = --missing-as-empty

hg log -p

Mercurial does not have a way of changing the default diff tool, at least to the author's knowledge.

If you just want to view the diff of the most recent commit, you can use the following.

GIT_PAGER_IN_USE=1 hg dft -r .^ -r . | less

This is equivalent to hg log -l 1 -p, although it does not show the commit message.

Languages Supported

Difftastic supports the following programming languages.

LanguageParser Used
Bashtree-sitter/tree-sitter-bash
Ctree-sitter/tree-sitter-c
C++tree-sitter/tree-sitter-cpp
C#tree-sitter/tree-sitter-c-sharp
Clojuresogaiu/tree-sitter-clojure (branched)
Common LisptheHamsta/tree-sitter-commonlisp
DartUserNobody14/tree-sitter-dart
Elixirelixir-lang/tree-sitter-elixir
Elvishckafi/tree-sitter-elvish
Emacs Lispwilfred/tree-sitter-elisp
Gleamgleam-lang/tree-sitter-gleam
Gotree-sitter/tree-sitter-go
Haskelltree-sitter/tree-sitter-haskell
Janetsogaiu/tree-sitter-janet-simple
Javatree-sitter/tree-sitter-java
JavaScript, JSXtree-sitter/tree-sitter-javascript
Kotlinfwcd/tree-sitter-kotlin
Luanvim-treesitter/tree-sitter-lua
Nixcstrahan/tree-sitter-nix
OCamltree-sitter/tree-sitter-ocaml
Perlganezdragon/tree-sitter-perl
PHPtree-sitter/tree-sitter-php
Pythontree-sitter/tree-sitter-python
Rubytree-sitter/tree-sitter-ruby
Rusttree-sitter/tree-sitter-rust (forked)
Scalatree-sitter/tree-sitter-scala
Swiftalex-pinkus/tree-sitter-swift
TypeScript, TSXtree-sitter/tree-sitter-typescript
Zigmaxxnino/tree-sitter-zig

Difftastic also supports the following structured text formats.

LanguageParser Used
CSStree-sitter/tree-sitter-css
HCLMichaHoffmann/tree-sitter-hcl
JSONtree-sitter/tree-sitter-json
TOMLikatyang/tree-sitter-toml
YAMLikatyang/tree-sitter-yaml

Parsing

Difftastic uses tree-sitter to build a parse tree. The parse tree is then converted to a simpler tree which can be diffed.

Parsing with Tree-sitter

Difftastic relies on tree-sitter to understand syntax. You can view the parse tree that tree-sitter produces using the --dump-ts flag.

$ difft --dump-ts sample_files/javascript_simple_before.js | head
program (0, 0) - (7, 0)
  comment (0, 0) - (0, 8) "// hello"
  expression_statement (1, 0) - (1, 6)
    call_expression (1, 0) - (1, 5)
      identifier (1, 0) - (1, 3) "foo"
      arguments (1, 3) - (1, 5)
        ( (1, 3) - (1, 4) "("
        ) (1, 4) - (1, 5) ")"
    ; (1, 5) - (1, 6) ";"
  expression_statement (2, 0) - (2, 6)

Simplified Syntax

Difftastic converts the tree-sitter parse tree to a simplified syntax tree. The syntax tree is a uniform representation where everything is either an atom (e.g. integer literals, comments, variable names) or a list (consisting of the open delimiter, children and the close delimiter).

The flag --dump-syntax will display the syntax tree generated for a file.

$ difft --dump-syntax sample_files/before.js
[
    Atom id:1 {
        content: "// hello",
        position: "0:0-8",
    },
    List id:2 {
        open_content: "",
        open_position: "1:0-0",
        children: [
          ...

Conversion Process

The simple representation of the difftastic parse tree makes diffing much easier. Converting the detailed tree-sitter parse tree is a recursive tree walk, treating tree-sitter leaf nodes as atoms. There are two exceptions.

(1) Tree-sitter parse trees sometimes include unwanted structure. Some grammars consider string literals to be a single token, whereas others treat strings as a complex structure where the delimiters are separate.

tree_sitter_parser.rs uses atom_nodes to mark specific tree-sitter node names as flat atoms even if the node has children.

(2) Tree-sitter parse trees include open and closing delimiters as tokens. A list [1] will have a parse tree that includes [ and ] as nodes.

$ echo '[1]' > example.js
$ difft --dump-ts example.js
program (0, 0) - (1, 0)
  expression_statement (0, 0) - (0, 3)
    array (0, 0) - (0, 3)
      [ (0, 0) - (0, 1) "["
      number (0, 1) - (0, 2) "1"
      ] (0, 2) - (0, 3) "]"

tree_sitter_parser.rs uses open_delimiter_tokens to ensure that [ and ] are used as delimiter content in the enclosing list, rather than converitng them to atoms.

Difftastic can match up atoms that occur in different parts of the simplified syntax tree. If e.g. a [ is treated as an atom, difftastic might match it with another [ elsewhere. The resulting diff would be unbalanced, highlighting different numbers of open and close delimiters.

Lossy Syntax Trees

The simplified syntax tree only stores node content and node position. It does not store whitespace between nodes, and position is largely ignored during diffing.

Diffing

Difftastic treats diff calculations as a route finding problem on a directed acyclic graph.

Graph Representation

A vertex in the graph represents a position in two syntax trees.

The start vertex has both positions pointing to the first syntax node in both trees. The end vertex has both positions just after the last syntax node in both trees.

Consider comparing A with X A.

            START
            +---------------------+
            | Left: A  Right: X A |
            |      ^         ^    |
            +---------------------+
                   /       \
     Novel atom L /         \ Novel atom R
1                v       2   v
+---------------------+  +---------------------+
| Left: A  Right: X A |  | Left: A  Right: X A |
|        ^       ^    |  |      ^           ^  |
+---------------------+  +---------------------+

From the start vertex, we have two options:

  • we can mark the first syntax node on the left as novel, and advance to the next syntax node on the left (vertex 1 above), or
  • we can mark the first syntax node on the right as novel, and advance to the next syntax node on the right (vertex 2 above).

Choosing "novel atom R" to vertex 2 will turn out to be the best choice. From vertex 2, we can see three routes to the end vertex.

            2
            +---------------------+
            | Left: A  Right: X A |
            |      ^           ^  |
            +---------------------+
                   /    |   \
     Novel atom L /     |    \ Novel atom R
3                v      | 4   v
+---------------------+ | +---------------------+
| Left: A  Right: X A | | | Left: A  Right: X A |
|        ^         ^  | | |      ^             ^|
+---------------------+ | +---------------------+
  |                     |                    |
  | Novel atom R        | Nodes match        | Novel atom L
  |                     |                    |
  |         END         v                    |
  |         +---------------------+          |
  +-------->| Left: A  Right: X A |<---------+
            |        ^           ^|
            +---------------------+

Comparing Routes

We assign a cost to each edge. Marking a syntax node as novel is worse than finding a matching syntax node, so the "novel atom" edge has a higher cost than the "syntax nodes match" edge.

The best route is the lowest cost route from the start vertex to the end vertex.

Finding The Best Route

Difftastic uses Dijkstra's algorithm to find the best (i.e. lowest cost) route.

One big advantage of this algorithm is that we don't need to construct the graph in advance. Constructing the whole graph would require exponential memory relative to the number of syntax nodes. Instead, vertex neighbours are constructed as the graph is explored.

There are lots of resources explaining Dijkstra's algorithm online, but I particularly recommend the graph search section of Red Blob Games.

Tricky Cases

Tree diffing is challenging in some situations. This page demonstrates difficult cases observed during development.

Not all of these cases work well in difftastic yet.

Adding Delimiters

;; Before
x

;; After
(x)

Desired result: (x)

This is tricky because x has changed its depth in the tree, but x itself is unchanged.

Not all tree diff algorithms handle this case. It is also challenging to display this case clearly: we want to highlight the changed delimiters, but not their content. This is challenging in larger expressions.

Changing Delimiters

;; Before
(x)

;; After
[x]

As with the wrapping case, we want to highlight the delimiters rather than the x.

Expanding Delimiters

;; Before
(x) y

;; After
(x y)

Desired output: (x y)

In this case, we want to highlight y. Highlighting the delimiters could make x look changed.

Contracting Delimiters

;; Before
(x y)

;; After
(x) y

This should be highlighted similar to the expanding delimiter case.

Disconnected Delimiters

;; Before
(foo (bar))

;; After
(foo (novel) (bar))

Desired result: (foo (novel) (bar)

It is easy to end up with (foo (novel) (bar)), where a later pair of delimiters are chosen.

Rewrapping Large Nodes

;; Before
[[foo]]
(x y)

;; After
([[foo]] x y)

We want to highlight [[foo]] being moved inside the parentheses. However, a naive syntax differ prefers consider a removal of () in the before and an addition of () in the after to be more minimal diff.

(Reported as issue 44.)

Reordering Within A List

;; Before
(x y)

;; After
(y x)

Desired result: (y x)

We want to highlight the list contents and not the delimiters.

Middle Insertions

// Before
foo(bar(123))

// After
foo(extra(bar(123)))

Desired result: foo(extra(bar(123)))

We want to consider both foo and bar to be unchanged. This case is challenging for diffing algorithms that do a bottom-up then top-down matching of trees.

Sliders

Sliders are a common problem in text based diffs, where lines are matched in a confusing way.

They typically look like this. The diff has to arbitrarily choose a line containing delimiter, and it chooses the wrong one.

+ }
+
+ function foo () {
  }

git-diff has some heuristics to reduce the risk of this (e.g. the "patience diff"), but it can still occur.

There's a similar problem in tree diffs.

;; Before
A B
C D

;; After
A B
A B
C D

Ideally we'd prefer marking contiguous nodes as novel, so we highlight A B rather than B\nA. From the perspective of a longest-common-subsequence algorithm, these two choices are equivalent.

Minimising Depth Changes

// Before
if true {
  foo(123);
}
foo(456);

// After
foo(789);

Do we consider foo(123) or foo(456) to match with foo(789)? Difftastic prefers foo(456) by preferring nodes at the same nesting depth.

Replacements With Minor Similarities

// Before
function foo(x) { return x + 1; }

// After
function bar(y) { baz(y); }

In this example, we've deleted a function and written a completely different one. A tree-based diff could match up the function and the outer delimiters, resulting in a confusing display showing lots of small changes.

As with sliders, the replacement problem can also occur in textual line-based diffs. Line-diffs struggle if there are a small number of common lines. The more precise, granular behaviour of tree diffs makes this problem much more common though.

Matching Substrings In Comments

// Before
/* The quick brown fox. */
foobar();

// After
/* The slow brown fox. */
foobaz();

foobar and foobaz are completely different, and their common prefix fooba should not be matched up. However, matching common prefixes or suffixes for comments is desirable.

Multiline Comments

// Before
/* Hello
 * World. */

// After
if (x) {
  /* Hello
   * World. */
}

The inner content of these two comments is technically different. We want to treat them as identical however.

Reflowing Doc Comments

Block comments have prefixes that aren't meaningful.

// Before
/* The quick brown fox jumps 
 * over the lazy dog. */

// After
/* The quick brown fox immediately
 * jumps over the lazy dog. */

The inner content has changed from jumps * over to immediately * jumps over. However, the * is decorative and we don't care that it's moved.

Small Changes To Large Strings

// Before
"""A very long string
with lots of words about
lots of stuff."""

// After
"""A very long string
with lots of NOVEL words about
lots of stuff."""

It would be correct to highlight the entire string literal as being removed and replaced with a new string literal. However, this makes it hard to see what's actually changed.

It's clear that variable names should be treated atomically, and comments are safe to show subword changes. It's not clear how to handle a small change in a 20 line string literal.

It's tempting to split strings on spaces and diff that, but users still want to know when whitespace changes inside strings. " " and " " are not the same.

Autoformatter Punctuation

// Before
foo("looooong", "also looooong");

// Before
foo(
  "looooong",
  "novel",
  "also looooong",
);

Autoformatters (e.g. prettier) will sometimes add or remove punctuation when formatting. Commas and parentheses are the most common.

Syntactic diffing can ignore whitespace changes, but it has to assume punctuation is meaningful. This can lead to punctuation changes being highlighted, which may be quite far from the relevant content change.

Novel Blank Lines

Blank lines are challenging for syntactic diffs. We are comparing syntactic tokens, so we don't see blank lines.

// Before
A
B

// After
A

B

Generally we want syntactic diffing to ignore blank lines. In this first example, this should show no changes.

This is occasionally problematic, as it can hide accidental code reformatting.

// Before
A
B

// After
A
X

Y
B

In this second example, we've inserted X and Y and a blank line. We want to highlight the blank line as an addition.

// Before
A


B

// After
A
X
B

In this third example, the syntactic diffing only sees an addition. From the user's perspective, there has also been a removal of two blank lines.

Invalid Syntax

There's no guarantee that the input we're given is valid syntax. Even if the code is valid, it might use syntax that isn't supported by the parser.

Tree-sitter provided explicit error nodes, and difftastic treats them as atoms so it can run the same tree diff algorithm regardless.

Contributing

Building

Install Rust with rustup, then clone the code.

$ git clone git@github.com:Wilfred/difftastic.git
$ cd difftastic

Difftastic uses Cargo for building.

$ cargo build

Debug builds are significantly slower than release builds. For files with more than fifty lines, it's usually worth using an optimised build.

$ cargo build --release

Manual

This website is generated with mdbook. mdbook can be installed with Cargo.

$ cargo install mdbook

You can then use the mdbook binary to build and serve the site locally.

$ cd manual
$ mdbook serve

API Documentation

You can browse the internal API documentation generated by rustdoc here.

Difftastic's internal docs are not available on docs.rs, as it does not support binary crates today.

Testing

$ cargo test

There are also several files in sample_files/ that you can use.

The best way to test difftastic is to look at history from a real project. Set GIT_EXTERNAL_DIFF to point to your current build.

For example, you can run difftastic on its own source code.

$ GIT_EXTERNAL_DIFF=./target/release/difft git log -p --ext-diff -- src

Logging

Difftastic uses the pretty_env_logger library to log some additional debug information.

$ RUST_LOG=debug cargo run sample_files/old.jsx sample_files/new.jsx

See the env_logger documentation for full details.

Profiling

If you have a file that's particularly slow, you can use cargo-flamegraph to see which functions are slow.

$ CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --bin difft sample_files/slow_before.rs sample_files/slow_after.rs

It's also worth looking at memory usage, as graph traversal bugs can lead to huge memory consumption.

$ /usr/bin/time -v ./target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs

If timing measurement are noisy, Linux's perf tool will report instructions executed, which is more stable.

$ perf stat ./target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
$ perf stat ./target/release/difft sample_files/typing_old.ml sample_files/typing_new.ml

Many more profiling techniques are discussed in the The Rust Performance Book.

Releasing

Use Cargo to create a new release, and tag it in git. Difftastic has a helper script for this:

$ ./scripts/release.sh

You can now increment the version in Cargo.toml and add a new entry to CHANGELOG.md.

Vendoring

Git Subtrees

Tree-sitter parsers are sometimes packaged on npm, sometimes packaged on crates.io, and have different release frequencies. Difftastic uses git subtrees (not git submodules) to track parsers.

Updating a parser

To update a parser, pull commits from the upstream git repository. For example, the following command will update the Java parser:

$ git subtree pull --prefix=vendor/tree-sitter-java git@github.com:tree-sitter/tree-sitter-java.git master

To see when each parser was last updated, use the following shell command:

$ for d in $(git log | grep git-subtree-dir | tr -d ' ' | cut -d ":" -f2 | sort); do echo "$d"; git log --pretty="  %cs" -n 1 $d; done

Adding A Parser

Finding a parser

New parsers for difftastic must be reasonably complete and maintained.

There are many tree-sitter parsers available, and the tree-sitter website includes a list of some well-known parsers.

Add the source code

Once you've found a parser, add it as a git subtree to vendor/. We'll use tree-sitter-json as an example.

$ git subtree add --prefix=vendor/tree-sitter-json git@github.com:tree-sitter/tree-sitter-json.git master

Configure the build

Cargo does not allow packages to include subdirectories that contain a Cargo.toml. Add a symlink to the src/ parser subdirectory.

$ cd vendor
$ ln -s tree-sitter-json/src tree-sitter-json-src

You can now add the parser to build by including the directory in build.rs.

TreeSitterParser {
    name: "tree-sitter-json",
    src_dir: "vendor/tree-sitter-json-src",
    extra_files: vec![],
},

If your parser includes custom C or C++ files for lexing (e.g. a scanner.cc), add them to extra_files.

Configure parsing

Add an entry to tree_sitter_parser.rs for your language.

Json => {
    let language = unsafe { tree_sitter_json() };
    TreeSitterConfig {
        name: "JSON",
        language,
        atom_nodes: vec!["string"].into_iter().collect(),
        delimiter_tokens: vec![("{", "}"), ("[", "]")],
        highlight_query: ts::Query::new(
            language,
            include_str!("../vendor/highlights/json.scm"),
        )
        .unwrap(),
    }
}

name is the human-readable name shown in the UI.

atom_nodes is a list of tree-sitter node names that should be treated as atoms even though the nodes have children. This is common for things like string literals or interpolated strings, where the node might have children for the opening and closing quote.

If you don't set atom_nodes, you may notice added/removed content shown in white. This is usually a sign that child node should have its parent treated as an atom.

delimiter_tokens are delimiters that difftastic stores on the enclosing list node. This allows difftastic to distinguish delimiter tokens from other punctuation in the language.

If you don't set delimiter_tokens, difftastic will consider the tokens in isolation, and may think that a ( was added but the ) was unchanged.

You can use difft --dump-ts foo.json to see the results of the tree-sitter parser, and difft --dump-syntax foo.json to confirm that you've set atoms and delimiters correctly.

Configure language detection

Update from_extension in guess_language.rs to detect your new language.

"json" => Some(Json),

There may also file names or shebangs associated with your language. GitHub's linguist definitions is a useful source of common file extensions.

Syntax highlighting (Optional)

To add syntax highlighting to the package, you'll also need a symlink to the highlights.scm, if available.

$ cd vendor/highlights
$ ln -s ../tree-sitter-json/queries/highlights.scm json.scm

Add a regression test

Finally, add a regression test for your language. This ensures that the output for your test file doesn't change unexpectedly.

Regression test files live in sample_files/ and have the form foo_before.abc and foo_after.abc.

$ nano simple_before.json
$ nano simple_after.json

Run the regression test script and update the .expected file.

$ ./sample_files/compare_all.sh
$ cp sample_files/compare.result sample_files/compare.expected

Glossary

Atom: An atom is an item in difftastic's syntax tree structure that has no children. It represents things like literals, variable names, and comments. See also 'list'.

Delimiter: A paired piece of syntax. A list has an open delimiter and a close delimiter, such as [ and ]. Delimiters may not be punctuation (e.g. begin and end) and may be empty strings (e.g. infix syntax converted to difftastic's syntax tree).

LHS: Left-hand side. Difftastic compares two items, and LHS refers to the first item. See also 'RHS'.

List: A list is an item in difftastic's syntax tree structure that has an open delimiter, children, and a close delimiter. It represents things like expressions and function definitions. See also 'atom'.

Novel: An addition or a removal. Syntax is novel if it occurs in only one of the two items being compared.

RHS: Right-hand side. Difftastic compares two items, and RHS refers to the second item. See also 'LHS'.

Root: A syntax tree without a parent node. Roots represent top-level definitions in the file being diffed.

Syntax node: An item in difftastic's syntax tree structure. Either an atom or a list.

Token: A small piece of syntax tracked by difftastic (e.g. $x, function or ]), for highlighting and aligned display. This is either an atom or a non-empty delimiter.

Alternative Projects

Many different tools exist for diffing files. This section of the manual discusses the design of other tools that have influenced difftastic.

Tree Diffing

This page summarises some of the other tree diffing tools available.

If you're in a hurry, start by looking at Autochrome. It's extremely capable, and has an excellent description of the design.

If you're interested in a summary of the academic literature, this blog post (and its accompanying paper -- mirrored under a CC BY-NC license) are great resources.

json-diff (2012)

Languages: JSON
Algorithm: Pairwise comparison
Output: CLI colours

json-diff performs a structural diff of JSON files. It considers subtrees to be different if they don't match exactly, so e.g. "foo" and ["foo"] are entirely different.

json-diff is also noteworthy for its extremely readable display of results.

GumTree (2014)

Languages: ~10 programming languages
Parser: Several, including srcML
Algorithm: Top-down, then bottom-up
Ouput: HTML, Swing GUI, or text

GumTree can parse several programming languages and then performs a tree-based diff, outputting an HTML display.

The GumTree algorithm is described in the associated paper 'Fine-grained and accurate source code differencing' by Falleri et al (DOI, PDF). It performs a greedy top-down search for identical subtrees, then performs a bottom-up search to match up the rest.

Tree Diff (2017)

Languages: S-expression data format
Algorithm: A* search
Output: Merged s-expression file

Tristan Hume wrote a tree diffing algorithm during his 2017 internship and Jane Street. The source code is not available, but he has a blog post discussing the design in depth.

This project finds minimal diffs between s-expression files used as configuration by Jane Street. It uses A* search to find the minimal diff between them, and builds a new s-expression with a section marked with :date-switch for the differing parts.

(Jane Street also has patdiff, but that seems to be a line-oriented diff with some whitespace/integer display polish. It doesn't understand that e.g. whitespace in "foo " is meaningful).

Autochrome (2017)

Languages: Clojure
Parser: Custom, preserves comments
Algorithm: Dijkstra (previously A* search)
Output: HTML

Autochrome parses Clojure with a custom parser that preserves comments. Autochrome uses Dijkstra's algorithm to compare syntax trees.

Autochrome's webpage includes worked examples of the algorithm and a discussion of design tradeoffs. It's a really great resource for understanding tree diffing techniques in general.

graphtage (2020)

Languages: JSON, XML, HTML, YAML, plist, and CSS
Parser: json5, pyYAML, ignores comments
Algorithm: Levenshtein distance
Output: CLI colours

graphtage compares structured data by parsing into a generic file format, then displaying a diff. It even allows things like diffing JSON against YAML.

As with json-diff, it does not consider ["foo"] and "foo" to have any similarities.

Diffsitter (2020)

Parser: Tree-sitter
Algorithm: Longest-common-subsequence
Output: CLI colours

Diffsitter is another tree-sitter based diff tool. It uses LCS diffing on the leaves of the syntax tree.

sdiff (2021)

Languages: Scheme
Parser: Scheme's built-in read, ignores comments
Algorithm: MH-Diff from the Chawathe paper
Output: CLI colours

Semantically meaningful S-expression diff: Tree-diff for lisp source code was presented at FOSDEM 2021.