Internals: Parsing

Difftastic uses tree-sitter to build a parse tree. The parse tree is then converted to a simpler tree which can be diffed.

Parsing with Tree-sitter

Difftastic relies on tree-sitter to understand syntax. You can view the parse tree that tree-sitter produces using the --dump-ts flag.

$ difft --dump-ts sample_files/javascript_simple_before.js | head
program (0, 0) - (7, 0)
  comment (0, 0) - (0, 8) "// hello"
  expression_statement (1, 0) - (1, 6)
    call_expression (1, 0) - (1, 5)
      identifier (1, 0) - (1, 3) "foo"
      arguments (1, 3) - (1, 5)
        ( (1, 3) - (1, 4) "("
        ) (1, 4) - (1, 5) ")"
    ; (1, 5) - (1, 6) ";"
  expression_statement (2, 0) - (2, 6)

Simplified Syntax

Difftastic converts the tree-sitter parse tree to a simplified syntax tree. The syntax tree is a uniform representation where everything is either an atom (e.g. integer literals, comments, variable names) or a list (consisting of the open delimiter, children and the close delimiter).

The flag --dump-syntax will display the syntax tree generated for a file.

$ difft --dump-syntax sample_files/before.js
[
    Atom id:1 {
        content: "// hello",
        position: "0:0-8",
    },
    List id:2 {
        open_content: "",
        open_position: "1:0-0",
        children: [
          ...

Conversion Process

The simple representation of the difftastic parse tree makes diffing much easier. Converting the detailed tree-sitter parse tree is a recursive tree walk, treating tree-sitter leaf nodes as atoms. There are two exceptions.

(1) Tree-sitter parse trees sometimes include unwanted structure. Some grammars consider string literals to be a single token, whereas others treat strings as a complex structure where the delimiters are separate.

tree_sitter_parser.rs uses atom_nodes to mark specific tree-sitter node names as flat atoms even if the node has children.

(2) Tree-sitter parse trees include open and closing delimiters as tokens. A list [1] will have a parse tree that includes [ and ] as nodes.

$ echo '[1]' > example.js
$ difft --dump-ts example.js
program (0, 0) - (1, 0)
  expression_statement (0, 0) - (0, 3)
    array (0, 0) - (0, 3)
      [ (0, 0) - (0, 1) "["
      number (0, 1) - (0, 2) "1"
      ] (0, 2) - (0, 3) "]"

tree_sitter_parser.rs uses open_delimiter_tokens to ensure that [ and ] are used as delimiter content in the enclosing list, rather than converitng them to atoms.

Difftastic can match up atoms that occur in different parts of the simplified syntax tree. If e.g. a [ is treated as an atom, difftastic might match it with another [ elsewhere. The resulting diff would be unbalanced, highlighting different numbers of open and close delimiters.

Lossy Syntax Trees

The simplified syntax tree only stores node content and node position. It does not store whitespace between nodes, and position is largely ignored during diffing.