Adding A Parser
Finding a parser
New parsers for difftastic must be reasonably complete and maintained.
There are many tree-sitter parsers available, and the tree-sitter website includes a list of some well-known parsers.
Add the source code
Once you've found a parser, add it as a git subtree to vendored_parsers/. We'll use tree-sitter-json as an example.
$ git subtree add --prefix=vendored_parsers/tree-sitter-json https://github.com/tree-sitter/tree-sitter-json.git master
Configure the build
Cargo does not allow packages to include subdirectories that contain a Cargo.toml. Add a symlink to the parser's src/ subdirectory instead.
$ cd vendored_parsers
$ ln -s tree-sitter-json/src tree-sitter-json-src
You can now add the parser to the build by including the directory in build.rs.
TreeSitterParser {
    name: "tree-sitter-json",
    src_dir: "vendored_parsers/tree-sitter-json-src",
    extra_files: vec![],
},
If your parser includes custom C or C++ files for lexing (e.g. a scanner.cc), add them to extra_files.
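For a parser that ships an external scanner, the entry might look like the one built in main below. This is only a sketch: the struct is a stand-in whose field types are assumed, tree-sitter-foo is a made-up parser name, the sources helper is hypothetical, and the parser.c file name assumes the usual tree-sitter layout. The real definitions live in build.rs.

```rust
// Stand-in for the struct used in build.rs; the field types are an assumption.
struct TreeSitterParser {
    name: &'static str,
    src_dir: &'static str,
    extra_files: Vec<&'static str>,
}

// Hypothetical helper: list every source file that would need compiling,
// assuming the generated parser is parser.c inside src_dir.
fn sources(p: &TreeSitterParser) -> Vec<String> {
    let mut files = vec![format!("{}/parser.c", p.src_dir)];
    for extra in &p.extra_files {
        files.push(format!("{}/{}", p.src_dir, extra));
    }
    files
}

fn main() {
    // tree-sitter-foo is a made-up parser with a C++ scanner.
    let parser = TreeSitterParser {
        name: "tree-sitter-foo",
        src_dir: "vendored_parsers/tree-sitter-foo-src",
        extra_files: vec!["scanner.cc"],
    };
    println!("{}:", parser.name);
    for file in sources(&parser) {
        println!("  {}", file);
    }
}
```

Paths in extra_files are written relative to src_dir here, which is why the symlink created earlier matters.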
Configure parsing
Add an entry to tree_sitter_parser.rs for your language.
Json => {
    let language = unsafe { tree_sitter_json() };
    TreeSitterConfig {
        language,
        atom_nodes: vec!["string"].into_iter().collect(),
        delimiter_tokens: vec![("{", "}"), ("[", "]")],
        highlight_query: ts::Query::new(
            language,
            include_str!("../../vendored_parsers/highlights/json.scm"),
        )
        .unwrap(),
        sub_languages: vec![],
    }
}
atom_nodes is a list of tree-sitter node names that should be treated as atoms even though the nodes have children. This is common for things like string literals or interpolated strings, where the node might have children for the opening and closing quote.
If you don't set atom_nodes, you may notice added/removed content shown in white. This is usually a sign that a child node should have its parent treated as an atom.
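To make the effect of atom_nodes concrete, here is a toy model of the conversion. It is not difftastic's actual code: Node, Syntax, to_syntax and string_node are hypothetical names, and the child kinds of the string node are illustrative rather than taken from the real grammar.

```rust
use std::collections::HashSet;

// Toy stand-in for a tree-sitter node: a kind, its source text, and children.
struct Node {
    kind: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

// Toy stand-in for the syntax tree that gets diffed.
#[derive(Debug, PartialEq)]
enum Syntax {
    Atom(String),
    List(Vec<Syntax>),
}

// A node with no children, or whose kind is listed in atom_nodes, becomes a
// single atom holding its whole text. Any other node becomes a list.
fn to_syntax(node: &Node, atom_nodes: &HashSet<&str>) -> Syntax {
    if node.children.is_empty() || atom_nodes.contains(node.kind) {
        Syntax::Atom(node.text.to_string())
    } else {
        Syntax::List(
            node.children
                .iter()
                .map(|child| to_syntax(child, atom_nodes))
                .collect(),
        )
    }
}

fn leaf(kind: &'static str, text: &'static str) -> Node {
    Node { kind, text, children: vec![] }
}

// The JSON string "hi", modelled with children for both quotes and the
// content (illustrative child kinds).
fn string_node() -> Node {
    Node {
        kind: "string",
        text: "\"hi\"",
        children: vec![leaf("\"", "\""), leaf("string_content", "hi"), leaf("\"", "\"")],
    }
}

fn main() {
    let node = string_node();
    let unset: HashSet<&str> = HashSet::new();
    let atoms: HashSet<&str> = vec!["string"].into_iter().collect();
    // Without atom_nodes the string is split into three pieces.
    println!("{:?}", to_syntax(&node, &unset));
    // With atom_nodes it is kept as one unit.
    println!("{:?}", to_syntax(&node, &atoms));
}
```

In this model, leaving atom_nodes empty means the quotes and the content are diffed as separate atoms, which corresponds to the partially highlighted output described above.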
delimiter_tokens are delimiters that difftastic stores on the enclosing list node. This allows difftastic to distinguish delimiter tokens from other punctuation in the language.
If you don't set delimiter_tokens, difftastic will consider the tokens in isolation, and may think that a ( was added but the ) was unchanged.
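The idea of storing delimiters on the list node can be sketched as follows. This is a simplified model, not difftastic's implementation: List and to_list are hypothetical names, and the sketch assumes the delimiters are the first and last tokens of the node.

```rust
// Toy stand-in for a list node that keeps its delimiters apart from its
// children.
#[derive(Debug, PartialEq)]
struct List {
    open: String,
    children: Vec<String>,
    close: String,
}

// If the first and last tokens form one of the configured pairs, store them
// on the list and keep only the tokens in between as children. Otherwise
// every token stays a child and both delimiters are left empty.
fn to_list(tokens: &[&str], delimiter_tokens: &[(&str, &str)]) -> List {
    if let (Some(first), Some(last)) = (tokens.first(), tokens.last()) {
        for (open, close) in delimiter_tokens {
            if tokens.len() >= 2 && first == open && last == close {
                return List {
                    open: open.to_string(),
                    children: tokens[1..tokens.len() - 1]
                        .iter()
                        .map(|t| t.to_string())
                        .collect(),
                    close: close.to_string(),
                };
            }
        }
    }
    List {
        open: String::new(),
        children: tokens.iter().map(|t| t.to_string()).collect(),
        close: String::new(),
    }
}

fn main() {
    let pairs = [("{", "}"), ("[", "]")];
    // The brackets are recognised and moved onto the list node.
    println!("{:?}", to_list(&["[", "1", ",", "2", "]"], &pairs));
    // Parentheses are not configured, so they remain ordinary children.
    println!("{:?}", to_list(&["(", "1", ")"], &pairs));
}
```

Because the opening and closing tokens travel together on the list in this model, a diff cannot report one of them as changed while the other is unchanged.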
You can use difft --dump-ts foo.json to see the results of the tree-sitter parser, and difft --dump-syntax foo.json to confirm that you've set atoms and delimiters correctly.
sub_languages is empty for most languages: see the code documentation for details.
Configure language detection
Update language_name in guess_language.rs to detect your new language. Insert a match arm like:
Json => "json",
There may also be file names or shebangs associated with your language; configure those by adapting the language_globs, from_emacs_mode_header and from_shebang functions in that file.
GitHub's linguist definitions are a useful source of common file extensions.
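As a rough picture of what detection involves, the sketch below maps a file extension to a language and a language to its name. It is a simplification under stated assumptions: the Language enum here has only two variants, guess_by_extension is a hypothetical helper that looks at the extension alone, and the "rust" name is illustrative. The real functions in guess_language.rs also handle globs, Emacs mode headers and shebangs.

```rust
use std::path::Path;

// Stand-in for the language enum; two variants are enough for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Language {
    Json,
    Rust,
}

// Same shape as the match arm shown above.
fn language_name(language: Language) -> &'static str {
    match language {
        Language::Json => "json",
        Language::Rust => "rust",
    }
}

// Hypothetical helper: guess a language from the file extension alone.
// Returns None when the path has no extension or the extension is unknown.
fn guess_by_extension(path: &str) -> Option<Language> {
    match Path::new(path).extension()?.to_str()? {
        "json" => Some(Language::Json),
        "rs" => Some(Language::Rust),
        _ => None,
    }
}

fn main() {
    for path in &["sample_files/simple_1.json", "src/main.rs", "README"] {
        match guess_by_extension(path) {
            Some(language) => println!("{} -> {}", path, language_name(language)),
            None => println!("{} -> unknown", path),
        }
    }
}
```

Files without an extension fall through to None in this sketch, which is the case that shebang and file name detection exist to cover.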
Syntax highlighting (Optional)
To add syntax highlighting for your language, you'll also need a symlink to the queries/highlights.scm file, if available.
$ cd vendored_parsers/highlights
$ ln -s ../tree-sitter-json/queries/highlights.scm json.scm
Test It
Search GitHub for a popular repository in your target language (example search) and confirm that git history looks sensible with difftastic.
Add a regression test
Finally, add a regression test for your language. This ensures that the output for your test file doesn't change unexpectedly.
Regression test files live in sample_files/ and have the form foo_1.abc and foo_2.abc.
$ nano simple_1.json
$ nano simple_2.json
Run the regression test script and update the .expected file.
$ ./sample_files/compare_all.sh
$ cp sample_files/compare.result sample_files/compare.expected
Maintenance
To update a parser that is already imported, use git subtree pull.
$ git subtree pull --prefix=vendored_parsers/tree-sitter-json git@github.com:tree-sitter/tree-sitter-json.git master