Introduction
Difftastic is a structural diff tool that understands syntax. It supports over 30 programming languages and when it works, it's fantastic.
Difftastic is open source software (MIT license) and available on GitHub.
This copy of the manual describes version 0.61.0. The changelog records which features and bug fixes are in each version.
This manual is also available in Chinese.
Syntactic Diffing
Difftastic detects the language, parses the code, and then compares the syntax trees. Let's look at an example.
// old.rs
let ts_lang = guess(path, guess_src).map(tsp::from_language);
// new.rs
let ts_lang = language_override
.or_else(|| guess(path, guess_src))
.map(tsp::from_language);
$ difft old.rs new.rs
1 1 let ts_lang = language_override
. 2 .or_else(|| guess(path, guess_src))
. 3 .map(tsp::from_language);
Notice how difftastic recognises that .map
is unchanged, even though
it's now on a new line with whitespace.
A line-oriented diff does a much worse job here.
$ diff -u old.rs new.rs
@@ -1 +1,3 @@
-let ts_lang = guess(path, guess_src).map(tsp::from_language);
+let ts_lang = language_override
+ .or_else(|| guess(path, guess_src))
+ .map(tsp::from_language);
Some textual diff tools also highlight word changes (e.g. GitHub or
git's --word-diff
). They still don't understand the code
though. Difftastic will always find matched delimiters: you can see
the closing )
from or_else
has been highlighted.
Fallback Textual Diffing
If input files are not in a format that difftastic understands, it uses a conventional line-oriented text diff with word highlighting.
Difftastic will also use textual diffing when given extremely large inputs.
Installation
Difftastic provides GitHub releases with prebuilt binaries for Windows, macOS and Linux.
If you're a Homebrew user, you can install
difftastic with brew
.
$ brew install difftastic
If you're an Arch Linux user, you can install
difftastic
with pacman
.
$ sudo pacman -S difftastic
If you're a Nix user, you can install
difftastic
with nix-env
.
$ nix-env --install difftastic
If you're a Fedora user, you can install difftastic with dnf
.
$ sudo dnf install difftastic
If you're a FreeBSD user, you can install
difftastic
with pkg
.
$ sudo pkg install difftastic
If you're a Windows user, you can install
difftastic
with scoop
.
$ scoop install difftastic
If you're a Windows user, you can install
difftastic
with choco
.
$ choco install difftastic
Full Package Listing
This table lists all the platforms that have packaged difftastic.
Installing From Source
Build Requirements
Difftastic is written in Rust, so you will need Rust installed. I recommend rustup to install Rust. Difftastic requires Rust version 1.65 or later.
You will also need a C++ compiler that supports C++14. If you're using GCC, you need at least version 8.
Build
You can download and build difftastic on crates.io with Cargo (which is part of Rust).
$ cargo install --locked difftastic
Difftastic uses the cc
crate for building C/C++ dependencies. This
allows you to use environment variables CC
and CXX
to control the
compiler used (see the cc
docs).
See contributing for instructions on debug builds.
(Optional) Install a MIME Database
If a MIME database is available, difftastic will use it to detect
binary files more accurately. This is the same database used by the
file
command, so you probably already have it.
The MIME database path is specified in the XDG specification. The database should be at one of the following paths:
/usr/share/mime/magic
/usr/local/share/mime/magic
$HOME/.local/share/mime/magic
Usage
File Arguments
Diffing Files
$ difft FIRST-FILE SECOND-FILE
# For example:
$ difft sample_files/simple_1.js sample_files/simple_2.js
Diffing Directories
$ difft FIRST-DIRECTORY SECOND-DIRECTORY
# For example:
$ difft sample_files/dir_1/ sample_files/dir_2/
Difftastic will recursively walk the two directories, diffing files with the same name.
The --skip-unchanged
option is useful when diffing directories that
contain many unchanged files.
Reading stdin
You can read a file from stdin by specifying -
as the file path.
$ difft - SECOND-FILE
# For example:
$ cat sample_files/simple_1.js | difft - sample_files/simple_2.js
Files With Conflicts
(Added in version 0.50.)
If you have a file with <<<<<<<
conflict markers, you can pass it as
a single argument to difftastic. Difftastic will construct the two
file states and diff those.
$ difft FILE-WITH-CONFLICTS
# For example:
$ difft sample_files/conflicts.el
Language Detection
Difftastic guesses the language used based on the file extension, file
name, and the contents of the first lines. To see the languages
available, and the associated file names, use the --list-languages
option.
$ difft --list-languages
You can override language detection for specific file globs using the
--override
option.
$ difft --override=GLOB:NAME FIRST-FILE SECOND-FILE
# For example, treating .h files as C rather than C++:
$ difft --override=*.h:c sample_files/preprocesor_1.h sample_files/preprocesor_2.h
Options
Difftastic includes a range of configuration CLI options, see difft --help
for the full list.
Difftastic can also be configured with environment variables. These
are also visible in --help
.
For example, DFT_BACKGROUND=light
is equivalent to
--background=light
. This is useful when using VCS tools like git,
where you are not invoking the difft
binary directly.
Exit Codes
2: Difftastic was given invalid arguments. This includes invalid usage (e.g. the wrong number of arguments) as well as paths that difftastic cannot read (e.g. non-existent paths or insufficient permissions).
1: When called with --exit-code
, difftastic will return an exit code
of 1 when it finds any syntactic changes (in text files) or byte changes
(in binary files).
0: All other cases.
Git
Difftastic has good support for git.
One-Off Usage
You can use
GIT_EXTERNAL_DIFF
for a one-off git command.
View uncommitted changes with difftastic:
$ GIT_EXTERNAL_DIFF=difft git diff
View changes from the most recent commit with difftastic:
$ GIT_EXTERNAL_DIFF=difft git show HEAD --ext-diff
View changes from recent commits on the current branch with difftastic:
$ GIT_EXTERNAL_DIFF=difft git log -p --ext-diff
Regular Usage
If you like difftastic, we recommend that you configure git aliases so you can use difftastic more easily.
Add the following to your ~/.gitconfig
to use difftastic as a
difftool.
[diff]
tool = difftastic
[difftool]
prompt = false
[difftool "difftastic"]
cmd = difft "$LOCAL" "$REMOTE"
[pager]
difftool = true
You can now use the following command to see changes with difftastic,
equivalent to git diff
:
$ git difftool
We recommend that you set up a shorter alias for this command in your
~/.gitconfig
:
# `git dft` is less to type than `git difftool`.
[alias]
dft = difftool
For other commands, we also recommend that you set up aliases that are equivalent to the one-off commands shown above.
# `git dlog` to show `git log -p` with difftastic.
[alias]
dlog = -c diff.external=difft log -p --ext-diff
Difftastic By Default
If you want to use difftastic as your default diff tool, add the
following to your ~/.gitconfig
.
[diff]
external = difft
This only applies to git diff
. For other git commands, you still
need to specify --ext-diff
, or use an alias as described above.
$ git diff
$ git show HEAD --ext-diff
$ git log -p --ext-diff
Conversely, if you need to copy/paste a diff in the standard diff format, you
can use git diff --no-ext-diff
.
Mercurial
Mercurial supports external diff
tools with the
Extdiff extension. Enable it by adding an entry to extensions
in
your .hgrc
.
[extensions]
extdiff =
You can then run hg extdiff -p difft
instead of hg diff
(assumes the difft
binary is on your $PATH
).
You can also define an alias to run difftastic with hg. Add the
following to your .hgrc
to run difftastic with hg dft
.
[extdiff]
cmd.dft = difft
# You can add further options which will be passed to the command line, e.g.
# opts.dft = --background light
All options of hg diff
are also supported by hg dft
; for example,
hg dft --stat
will show statistics of changed lines and hg dft -r 42 -r 45
will show the diff between two revisions.
hg log -p
Mercurial does not have a way of changing the default diff tool, at least to the author's knowledge.
If you just want to view the diff of the most recent commit, you can use the following.
hg dft -r .^ -r .
This is equivalent to hg log -l 1 -p
, although it does not show the
commit message.
Fossil
Fossil supports external diff
commands by setting diff-command
for the current repository:
fossil settings diff-command difft
To use difftastic for all repositories, use --global
:
fossil settings diff-command --global difft
Skip difftastic on Fossil
If you set difftastic as Fossil's diff
command, but you need to use Fossil's internal diff once, use -i
to skip difftastic once:
fossil diff -i
If you want to remove difftastic from one repository (or globally), use unset
:
fossil unset diff-command
unset
also supports the --global
flag.
Languages Supported
This page lists all the languages supported by difftastic. You can
also view the languages supported in your current installed version
with difft --list-languages
.
Programming Languages
Structured Text Formats
Language | Parser Used |
---|---|
CSS | tree-sitter/tree-sitter-css |
HCL | MichaHoffmann/tree-sitter-hcl |
HTML | tree-sitter/tree-sitter-html |
JSON | tree-sitter/tree-sitter-json |
LaTeX | latex-lsp/tree-sitter-latex |
Newick | delehef/tree-sitter-newick |
SCSS | serenadeai/tree-sitter-scss |
TOML | ikatyang/tree-sitter-toml |
XML | ObserverOfTime/tree-sitter-xml |
YAML | ikatyang/tree-sitter-yaml |
Internals: Parsing
Difftastic uses tree-sitter to build a parse tree. The parse tree is then converted to a simpler tree which can be diffed.
Parsing with Tree-sitter
Difftastic relies on tree-sitter to understand syntax. You can view
the parse tree that tree-sitter produces using the --dump-ts
flag.
$ difft --dump-ts sample_files/javascript_simple_1.js | head
program (0, 0) - (7, 0)
comment (0, 0) - (0, 8) "// hello"
expression_statement (1, 0) - (1, 6)
call_expression (1, 0) - (1, 5)
identifier (1, 0) - (1, 3) "foo"
arguments (1, 3) - (1, 5)
( (1, 3) - (1, 4) "("
) (1, 4) - (1, 5) ")"
; (1, 5) - (1, 6) ";"
expression_statement (2, 0) - (2, 6)
Simplified Syntax
Difftastic converts the tree-sitter parse tree to a simplified syntax tree. The syntax tree is a uniform representation where everything is either an atom (e.g. integer literals, comments, variable names) or a list (consisting of the open delimiter, children and the close delimiter).
The flag --dump-syntax
will display the syntax tree generated for a
file.
$ difft --dump-syntax sample_files/before.js
[
Atom id:1 {
content: "// hello",
position: "0:0-8",
},
List id:2 {
open_content: "",
open_position: "1:0-0",
children: [
...
Conversion Process
The simple representation of the difftastic parse tree makes diffing much easier. Converting the detailed tree-sitter parse tree is a recursive tree walk, treating tree-sitter leaf nodes as atoms. There are two exceptions.
(1) Tree-sitter parse trees sometimes include unwanted structure. Some grammars consider string literals to be a single token, whereas others treat strings as a complex structure where the delimiters are separate.
tree_sitter_parser.rs
uses atom_nodes
to mark specific tree-sitter
node names as flat atoms even if the node has children.
(2) Tree-sitter parse trees include open and closing delimiters as
tokens. A list [1]
will have a parse tree that includes [
and ]
as nodes.
$ echo '[1]' > example.js
$ difft --dump-ts example.js
program (0, 0) - (1, 0)
expression_statement (0, 0) - (0, 3)
array (0, 0) - (0, 3)
[ (0, 0) - (0, 1) "["
number (0, 1) - (0, 2) "1"
] (0, 2) - (0, 3) "]"
tree_sitter_parser.rs
uses open_delimiter_tokens
to ensure that
[
and ]
are used as delimiter content in the enclosing list,
rather than converitng them to atoms.
Difftastic can match up atoms that occur in different parts of the
simplified syntax tree. If e.g. a [
is treated as an atom,
difftastic might match it with another [
elsewhere. The resulting
diff would be unbalanced, highlighting different numbers of open and
close delimiters.
Lossy Syntax Trees
The simplified syntax tree only stores node content and node position. It does not store whitespace between nodes, and position is largely ignored during diffing.
Internals: Diffing
Difftastic treats diff calculations as a route finding problem on a directed acyclic graph.
Graph Representation
A vertex in the graph represents a position in two syntax trees.
The start vertex has both positions pointing to the first syntax node in both trees. The end vertex has both positions just after the last syntax node in both trees.
Consider comparing A
with X A
.
START
+---------------------+
| Left: A Right: X A |
| ^ ^ |
+---------------------+
END
+---------------------+
| Left: A Right: X A |
| ^ ^|
+---------------------+
From the start vertex, we have two options:
- we can mark the first syntax node on the left as novel, and advance to the next syntax node on the left (vertex 1 above), or
- we can mark the first syntax node on the right as novel, and advance to the next syntax node on the right (vertex 2 above).
START
+---------------------+
| Left: A Right: X A |
| ^ ^ |
+---------------------+
/ \
Novel atom L / \ Novel atom R
1 v 2 v
+---------------------+ +---------------------+
| Left: A Right: X A | | Left: A Right: X A |
| ^ ^ | | ^ ^ |
+---------------------+ +---------------------+
Choosing "novel atom R" to vertex 2 will turn out to be the best choice. From vertex 2, we can see three routes to the end vertex.
2
+---------------------+
| Left: A Right: X A |
| ^ ^ |
+---------------------+
/ | \
Novel atom L / | \ Novel atom R
v | v
+---------------------+ | +---------------------+
| Left: A Right: X A | | | Left: A Right: X A |
| ^ ^ | | | ^ ^|
+---------------------+ | +---------------------+
| | |
| Novel atom R | Nodes match | Novel atom L
| | |
| END v |
| +---------------------+ |
+-------->| Left: A Right: X A |<---------+
| ^ ^|
+---------------------+
Comparing Routes
We assign a cost to each edge. Marking a syntax node as novel is worse than finding a matching syntax node, so the "novel atom" edge has a higher cost than the "syntax nodes match" edge.
The best route is the lowest cost route from the start vertex to the end vertex.
Finding The Best Route
Difftastic uses Dijkstra's algorithm to find the best (i.e. lowest cost) route.
One big advantage of this algorithm is that we don't need to construct the graph in advance. Constructing the whole graph would require exponential memory relative to the number of syntax nodes. Instead, vertex neighbours are constructed as the graph is explored.
There are lots of resources explaining Dijkstra's algorithm online, but I particularly recommend the graph search section of Red Blob Games.
Tricky Cases
Tree diffing is challenging in some situations. This page discusses difficult cases, and how difftastic handles them.
Not all of these cases work well in difftastic yet.
Adding Delimiters
;; Before
x
;; After
(x)
Possible result: (x)
Desired result: (x)
This is tricky because x
has changed its depth in the tree, but x
itself is unchanged.
Not all tree diff algorithms handle this case. It is also challenging to display this case clearly: we want to highlight the changed delimiters, but not their content. This is challenging in larger expressions.
Difftastic: Difftastic considers nodes to be equal even at different depths, achieving the desired result in this case.
Changing Delimiters
;; Before
(x)
;; After
[x]
Desired result: (x)
, [x]
As with the wrapping case, we want to highlight the delimiters rather
than the x
.
Difftastic: Difftastic handles this correctly through its tree diffing.
Expanding Delimiters
;; Before
(x) y
;; After
(x y)
Possible result 1: (x) y
, (x y)
Possible result 2: (x) y
, (x y)
It's not clear which is better in this case.
Difftastic: Difftastic currently shows result 2, but this case is sensitive to the cost model. Some previous versions of difftastic have shown result 1.
Contracting Delimiters
;; Before
(x y)
;; After
(x) y
This case is similar to the expanding delimiter case.
Disconnected Delimiters
;; Before
(foo (bar))
;; After
(foo (novel) (bar))
Desired result: (foo (novel) (bar))
It is easy to end up with
(foo (novel) (bar))
,
where a later pair of delimiters are chosen.
Rewrapping Large Nodes
;; Before
[[foo]]
(x y)
;; After
([[foo]] x y)
We want to highlight [[foo]]
being moved inside the
parentheses. However, a naive syntax differ prefers to consider a removal
of ()
in the before and an addition of ()
in the after to be more
minimal diff.
(Reported as issue 44.)
Reordering Within A List
;; Before
(x y)
;; After
(y x)
Desired result: (y x)
We want to highlight the list contents and not the delimiters.
Middle Insertions
// Before
foo(bar(123))
// After
foo(extra(bar(123)))
Desired result: foo(extra(bar(123)))
We want to consider both foo
and bar
to be unchanged. This case is
challenging for diffing algorithms that do a bottom-up then top-down
matching of trees.
Punctuation Atoms
// Before
foo(1, bar)
// After
foo(bar, 2)
Possible result: foo(bar, 2)
Desired result: foo(bar, 2)
There are two atoms inside the ()
that we could consider as
unchanged, either the bar
or the ,
. (We can't consider both to be
unchanged as they're reordered.)
We want to consider bar
to be unchanged, as it's a more important
atom than the ,
punctuation atom. Doing this is in a
language-agnostic way is difficult, so difftastic has a small list of
punctuation characters that always get lower priority than other
atoms.
Sliders (Flat)
Sliders are a common problem in text based diffs, where lines are matched in a confusing way.
They typically look like this. The diff has to arbitrarily choose a line containing delimiter, and it chooses the wrong one.
+ }
+
+ function foo () {
}
git-diff has some heuristics to reduce the risk of this (e.g. the "patience diff"), but it can still occur.
There's a similar problem in tree diffs.
;; Before
A B
C D
;; After
A B
A B
C D
Possible result:
A B
A B
C D
Preferred result:
A B
A B
C D
Ideally we'd prefer marking contiguous nodes as novel. From the perspective of a longest-common-subsequence algorithm, these two choices are equivalent.
Sliders (Nested)
// Before
old1(old2)
// After
old1(new1(old2))
Possible result: old1(new1(old2))
Desired result: old1(new1(old2))
The correct answer depends on the language. Most languages want to prefer the inner delimiter, whereas Lisps and JSON prefer the outer delimiter.
Minimising Depth Changes
// Before
if true {
foo(123);
}
foo(456);
// After
foo(789);
Do we consider foo(123)
or foo(456)
to match with foo(789)
?
Difftastic prefers foo(456)
by preferring nodes at the same nesting depth.
Replacements With Minor Similarities
// Before
function foo(x) { return x + 1; }
// After
function bar(y) { baz(y); }
Possible result: function bar(y) { baz(y); }
In this example, we've deleted a function and written a completely
different one. A tree-based diff could match up the function
and the
outer delimiters, resulting in a confusing display showing lots of
small changes.
As with sliders, the replacement problem can also occur in textual line-based diffs. Line-diffs struggle if there are a small number of common lines. The more precise, granular behaviour of tree diffs makes this problem much more common though.
Matching Substrings In Comments
// Before
/* The quick brown fox. */
foobar();
// After
/* The slow brown fox. */
foobaz();
foobar
and foobaz
are completely different, and their common
prefix fooba
should not be matched up. However, matching common
prefixes or suffixes for comments is desirable.
Multiline Comments
// Before
/* Hello
* World. */
// After
if (x) {
/* Hello
* World. */
}
The inner content of these two comments is technically different. We want to treat them as identical however.
Reflowing Doc Comments
Block comments have prefixes that aren't meaningful.
// Before
/* The quick brown fox jumps
* over the lazy dog. */
// After
/* The quick brown fox immediately
* jumps over the lazy dog. */
The inner content has changed from jumps * over
to immediately * jumps over
. However, the *
is decorative and we don't care that
it's moved.
Small Changes To Large Strings
// Before
"""A very long string
with lots of words about
lots of stuff."""
// After
"""A very long string
with lots of NOVEL words about
lots of stuff."""
It would be correct to highlight the entire string literal as being removed and replaced with a new string literal. However, this makes it hard to see what's actually changed.
It's clear that variable names should be treated atomically, and comments are safe to show subword changes. It's not clear how to handle a small change in a 20 line string literal.
It's tempting to split strings on spaces and diff that, but users
still want to know when whitespace changes inside strings. " "
and
" "
are not the same.
Autoformatter Punctuation
// Before
foo("looooong", "also looooong");
// After
foo(
"looooong",
"novel",
"also looooong",
);
Autoformatters (e.g. prettier) will sometimes add or remove punctuation when formatting. Commas and parentheses are the most common.
Syntactic diffing can ignore whitespace changes, but it has to assume punctuation is meaningful. This can lead to punctuation changes being highlighted, which may be quite far from the relevant content change.
Unordered Data Types
// Before
set(1, 2)
// After
set(2, 1)
Users may expect difftastic to find no changes here. This is difficult for several reasons.
For programming languages, side effects might make the order
relevant. set(foo(), bar())
might behave differently to set(bar(), foo())
.
For configuration languages like JSON or YAML, some parser
implementations do actually expose ordering information
(e.g. object_pairs_hook=OrderedDict
in Python, or serde_json's
preserve_order
feature in Rust).
To make matters worse, unordered tree diffing is NP-hard.
For the unordered case, it turns out that all of the problems in general are NP-hard. Indeed, the tree edit distance and alignment distance problems are even MAX SNP-hard.
Difftastic: Difftastic considers ordering to be meaningful everywhere, so it will always report ordering changes.
Novel Blank Lines
Blank lines are challenging for syntactic diffs. We are comparing syntactic tokens, so we don't see blank lines.
// Before
A
B
// After
A
B
Generally we want syntactic diffing to ignore blank lines. In this first example, this should show no changes.
This is occasionally problematic, as it can hide accidental code reformatting.
// Before
A
B
// After
A
X
Y
B
In this second example, we've inserted X and Y and a blank line. We want to highlight the blank line as an addition.
// Before
A
B
// After
A
X
B
In this third example, the syntactic diffing only sees an addition. From the user's perspective, there has also been a removal of two blank lines.
Invalid Syntax
There's no guarantee that the input we're given is valid syntax. Even if the code is valid, it might use syntax that isn't supported by the parser.
Difftastic: Difftastic will fall back to a text-based diff if any parse errors occur, to avoid diffing incomplete syntax trees. When this occurs, the file header reports the error count.
$ difft sample_files/syntax_error_1.js sample_files/syntax_error_2.js
sample_files/syntax_error_after.js --- Text (2 errors, exceeded DFT_PARSE_ERROR_LIMIT)
...
Users may opt-in to syntactic diffing by setting
DFT_PARSE_ERROR_LIMIT
to a larger value. In this mode, difftastic
treats tree-sitter error nodes as atoms and performs a tree diff as
normal.
Contributing
Building
Install Rust with rustup, then clone the code.
$ git clone git@github.com:Wilfred/difftastic.git
$ cd difftastic
Difftastic uses Cargo for building.
$ cargo build
Debug builds are significantly slower than release builds. For files with more than fifty lines, it's usually worth using an optimised build.
$ cargo build --release
Manual
This website is generated with mdbook. mdbook can be installed with Cargo.
$ cargo install mdbook
You can then use the mdbook
binary to build and serve the site
locally.
$ cd manual
$ mdbook serve
API Documentation
You can browse the internal API documentation generated by rustdoc here.
Difftastic's internal docs are not available on docs.rs, as it does not support binary crates today.
Testing
$ cargo test
There are also several files in sample_files/
that you can use.
The best way to test difftastic is to look at history from a real
project. Set GIT_EXTERNAL_DIFF
to point to your current build.
For example, you can run difftastic on its own source code.
$ GIT_EXTERNAL_DIFF=./target/release/difft git log -p --ext-diff -- src
Logging
Difftastic uses the pretty_env_logger
library to log some additional
debug information.
$ DFT_LOG=debug cargo run sample_files/old.jsx sample_files/new.jsx
See the env_logger
documentation for full details.
Releasing
Use Cargo to create a new release, and tag it in git. Difftastic has a helper script for this:
$ ./scripts/release.sh
You can now increment the version in Cargo.toml and add a new entry to CHANGELOG.md.
Vendoring
Git Subtrees
Tree-sitter parsers are sometimes packaged on npm, sometimes packaged on crates.io, and have different release frequencies. Difftastic uses git subtrees (not git submodules) to track parsers.
Updating a parser
To update a parser, pull commits from the upstream git repository. For example, the following command will update the Java parser:
$ git subtree pull --prefix=vendored_parsers/tree-sitter-java git@github.com:tree-sitter/tree-sitter-java.git master
To see when each parser was last updated, use the following shell command:
$ for d in $(git log | grep git-subtree-dir | tr -d ' ' | cut -d ":" -f2 | sort); do echo "$d"; git log --pretty=" %cs" -n 1 $d; done
Adding A Parser
Finding a parser
New parsers for difftastic must be reasonably complete and maintained.
There are many tree-sitter parsers available, and the tree-sitter website includes a list of some well-known parsers.
Add the source code
Once you've found a parser, add it as a git subtree to
vendored_parsers/
. We'll use
tree-sitter-json as
an example.
$ git subtree add --prefix=vendored_parsers/tree-sitter-json https://github.com/tree-sitter/tree-sitter-json.git master
Configure the build
Cargo does not allow packages to include subdirectories that contain a
Cargo.toml
. Add a symlink to the src/
parser subdirectory.
$ cd vendored_parsers
$ ln -s tree-sitter-json/src tree-sitter-json-src
You can now add the parser to build by including the directory in
build.rs
.
TreeSitterParser {
name: "tree-sitter-json",
src_dir: "vendored_parsers/tree-sitter-json-src",
extra_files: vec![],
},
If your parser includes custom C or C++ files for lexing (e.g. a
scanner.cc
), add them to extra_files
.
Configure parsing
Add an entry to tree_sitter_parser.rs
for your language.
Json => {
let language = unsafe { tree_sitter_json() };
TreeSitterConfig {
language,
atom_nodes: vec!["string"].into_iter().collect(),
delimiter_tokens: vec![("{", "}"), ("[", "]")],
highlight_query: ts::Query::new(
language,
include_str!("../../vendored_parsers/highlights/json.scm"),
)
.unwrap(),
sub_languages: vec![],
}
}
atom_nodes
is a list of tree-sitter node names that should be
treated as atoms even though the nodes have children. This is common
for things like string literals or interpolated strings, where the
node might have children for the opening and closing quote.
If you don't set atom_nodes
, you may notice added/removed content
shown in white. This is usually a sign that child node should have its
parent treated as an atom.
delimiter_tokens
are delimiters that difftastic stores on
the enclosing list node. This allows difftastic to distinguish
delimiter tokens from other punctuation in the language.
If you don't set delimiter_tokens
, difftastic will consider the
tokens in isolation, and may think that a (
was added but the )
was unchanged.
You can use difft --dump-ts foo.json
to see the results of the
tree-sitter parser, and difft --dump-syntax foo.json
to confirm that
you've set atoms and delimiters correctly.
sub-languages
is empty for most languages: see the code documentation for details.
Configure language detection
Update language_name
in guess_language.rs
to detect your new
language. Insert a match arm like:
Json => "json",
There may also file names or shebangs associated with your language; configure those
by adapting the language_globs
, from_emacs_mode_header
and from_shebang
functions
in that file.
GitHub's linguist definitions
are a useful source of common file extensions.
Syntax highlighting (Optional)
To add syntax highlighting for your language, you'll also need a symlink
to the queries/highlights.scm
file, if available.
$ cd vendored_parsers/highlights
$ ln -s ../tree-sitter-json/queries/highlights.scm json.scm
Test It
Search GitHub for a popular repository in your target language (example search) and confirm that git history looks sensible with difftastic.
Add a regression test
Finally, add a regression test for your language. This ensures that the output for your test file doesn't change unexpectedly.
Regression test files live in sample_files/
and have the form
foo_1.abc
and foo_2.abc
.
$ nano simple_1.json
$ nano simple_2.json
Run the regression test script and update the .expected
file.
$ ./sample_files/compare_all.sh
$ cp sample_files/compare.result sample_files/compare.expected
Maintenance
To update a parser that is already imported, use git subtree pull
.
$ git subtree pull --prefix=vendored_parsers/tree-sitter-json git@github.com:tree-sitter/tree-sitter-json.git master
Profiling
If you have a file that's particularly slow, you can use cargo-flamegraph to see which functions are slow.
$ CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --bin difft -- sample_files/slow_1.rs sample_files/slow_2.rs
It's also worth looking at memory usage, as graph traversal bugs can lead to huge memory consumption.
$ /usr/bin/time -v ./target/release/difft sample_files/slow_1.rs sample_files/slow_2.rs
If timing measurement are noisy, Linux's perf
tool will report
instructions executed, which is more stable.
$ perf stat ./target/release/difft sample_files/slow_1.rs sample_files/slow_2.rs
$ perf stat ./target/release/difft sample_files/typing_1.ml sample_files/typing_2.ml
Many more profiling techniques are discussed in The Rust Performance Book.
Glossary
Atom: An atom is an item in difftastic's syntax tree structure that has no children. It represents things like literals, variable names, and comments. See also 'list'.
Delimiter: A paired piece of syntax. A list has an open delimiter
and a close delimiter, such as [
and ]
. Delimiters may not be
punctuation (e.g. begin
and end
) and may be empty strings (e.g. infix
syntax converted to difftastic's syntax tree).
Hunk: A group of lines displayed together in the diff output. Increasing the number of context lines increases the size of the hunk.
LHS: Left-hand side. Difftastic compares two items, and LHS refers to the first item. See also 'RHS'.
List: A list is an item in difftastic's syntax tree structure that has an open delimiter, children, and a close delimiter. It represents things like expressions and function definitions. See also 'atom'.
Novel: An addition or a removal. Syntax is novel if it occurs in only one of the two items being compared.
RHS: Right-hand side. Difftastic compares two items, and RHS refers to the second item. See also 'LHS'.
Root: A syntax tree without a parent node. Roots represent top-level definitions in the file being diffed.
Slider: A diffing situation where there are multiple minimal diffs possible, due to adjacent content. It is possible to 'slide' to produce better results in this situation. See the discussion in Tricky Cases.
Syntax node: An item in difftastic's syntax tree structure. Either an atom or a list.
Token: A small piece of syntax tracked by difftastic (e.g. $x
,
function
or ]
), for highlighting and aligned display. This is
either an atom or a non-empty delimiter.
Alternative Projects
Many different tools exist for diffing files. This section of the manual discusses the design of other tools that have influenced difftastic.
Tree Diffing
This page summarises some of the other tree diffing tools available.
If you're in a hurry, start by looking at Autochrome. It's extremely capable, and has an excellent description of the design.
If you're interested in a summary of the academic literature, this blog post (and its accompanying paper -- mirrored under a CC BY-NC license) are great resources.
json-diff (2012)
Languages: JSON
Algorithm: Pairwise comparison
Output: CLI colours
json-diff performs a
structural diff of JSON files. It considers subtrees to be different
if they don't match exactly, so e.g. "foo"
and ["foo"]
are
entirely different.
json-diff is also noteworthy for its extremely readable display of results.
GumTree (2014)
Languages: ~10 programming
languages
Parser: Several, including srcML
Algorithm: Top-down, then bottom-up
Output: HTML, Swing GUI, or text
GumTree can parse several programming languages and then performs a tree-based diff, outputting an HTML display.
The GumTree algorithm is described in the associated paper 'Fine-grained and accurate source code differencing' by Falleri et al (DOI, PDF). It performs a greedy top-down search for identical subtrees, then performs a bottom-up search to match up the rest.
Tree Diff (2017)
Languages: S-expression data format
Algorithm: A* search
Output: Merged s-expression file
Tristan Hume wrote a tree diffing algorithm during his 2017 internship at Jane Street. The source code is not available, but he has a blog post discussing the design in depth.
This project finds minimal diffs between s-expression files used as
configuration by Jane Street. It uses A* search to find the minimal
diff between them, and builds a new s-expression with a section marked
with :date-switch
for the differing parts.
(Jane Street also has patdiff, but that seems to be a line-oriented
diff with some whitespace/integer display polish. It doesn't
understand that e.g. whitespace in "foo "
is meaningful).
Autochrome (2017)
Languages: Clojure
Parser: Custom, preserves comments
Algorithm: Dijkstra (previously A* search)
Output: HTML
Autochrome parses Clojure with a custom parser that preserves comments. Autochrome uses Dijkstra's algorithm to compare syntax trees.
Autochrome's webpage includes worked examples of the algorithm and a discussion of design tradeoffs. It's a really great resource for understanding tree diffing techniques in general.
graphtage (2020)
Languages: JSON, XML, HTML, YAML, plist, and CSS
Parser: json5, pyYAML, ignores comments
Algorithm: Levenshtein distance
Output: CLI colours
graphtage compares structured data by parsing into a generic file format, then displaying a diff. It even allows things like diffing JSON against YAML.
As with json-diff, it does not consider ["foo"]
and "foo"
to have
any similarities.
Diffsitter (2020)
Parser: Tree-sitter
Algorithm: Longest-common-subsequence
Output: CLI colours
Diffsitter is another tree-sitter based diff tool. It uses LCS diffing on the leaves of the syntax tree.
sdiff (2021)
Languages: Scheme
Parser: Scheme's built-in read
, ignores comments
Algorithm: MH-Diff from the Chawathe paper
Output: CLI colours
Semantically meaningful S-expression diff: Tree-diff for lisp source code was presented at FOSDEM 2021.