Next: Developing major modes with tree-sitter, Previous: User-defined “Things” and Navigation, Up: Parsing Program Source [Contents][Index]
Sometimes, the source of a programming language can contain snippets of other languages; HTML + CSS + JavaScript is one example. In that case, text segments written in different languages need to be assigned different parsers. Traditionally, this is achieved by using narrowing. While tree-sitter works with narrowing (see Narrowing), the recommended way is instead to specify regions of buffer text (i.e., ranges) in which a parser will operate. This section describes functions for setting and getting ranges for a parser.
Generally, when there are multiple languages at play, there is a “primary”, or “host”, language. The parser for this language, the primary parser, parses the entire buffer. Parsers for other languages are “embedded” or “guest” parsers, which only work on part of the buffer. The parse tree of the primary parser is usually used to determine the ranges in which the embedded parsers operate.
Major modes should set treesit-primary-parser to the primary parser before calling treesit-major-mode-setup, so that Emacs can configure the primary parser correctly for font-lock and other features.
Lisp programs should call treesit-update-ranges to make sure the ranges for each parser are correct before using parsers in a buffer, and call treesit-language-at to figure out the language responsible for the text at some position. These two functions don’t work by themselves; they need major modes to set treesit-range-settings and treesit-language-at-point-function, which do the actual work. These functions and variables are explained in more detail towards the end of the section.
In short, multi-language major modes should set treesit-primary-parser, treesit-range-settings, and treesit-language-at-point-function before calling treesit-major-mode-setup.
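As a rough sketch of that ordering, the body of a multi-language major mode could look like the following. The mode name my-html-ts-mode and the helper my-html--language-at are hypothetical, and the sketch assumes that grammars for html, javascript, and css are installed. The range rules and node types are the ones used in the HTML example later in this section.

```elisp
(require 'treesit)

(defun my-html--language-at (pos)
  "Return the language at POS.  Hypothetical helper for this sketch."
  (let* ((node (treesit-node-at pos 'html))
         (parent (and node (treesit-node-parent node))))
    (cond
     ;; Anything that is not raw text inside an element is plain HTML.
     ((not (and parent (equal (treesit-node-type node) "raw_text")))
      'html)
     ((equal (treesit-node-type parent) "script_element") 'javascript)
     ((equal (treesit-node-type parent) "style_element") 'css)
     (t 'html))))

(define-derived-mode my-html-ts-mode prog-mode "MyHTML"
  "Hypothetical multi-language mode for HTML with embedded JS and CSS."
  (when (treesit-ready-p 'html)
    ;; 1. The primary (host) parser covers the whole buffer.
    (setq-local treesit-primary-parser (treesit-parser-create 'html))
    ;; 2. Tell `treesit-update-ranges' where embedded parsers operate.
    (setq-local treesit-range-settings
                (treesit-range-rules
                 :embed 'javascript :host 'html
                 '((script_element (raw_text) @capture))
                 :embed 'css :host 'html
                 '((style_element (raw_text) @capture))))
    ;; 3. Tell `treesit-language-at' how to map a position to a language.
    (setq-local treesit-language-at-point-function #'my-html--language-at)
    ;; 4. Only now let Emacs configure font-lock and other features.
    (treesit-major-mode-setup)))
```

The three variables are set first so that treesit-major-mode-setup sees the complete configuration when it runs.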
The function treesit-parser-set-included-ranges sets up parser to operate on ranges. The parser will only read the text of the specified ranges. Each range in ranges is a pair of the form (beg . end).
The ranges in ranges must come in order and must not overlap. That is, the following form must return non-nil:

     (cl-loop for idx from 1 to (1- (length ranges))
              for prev = (nth (1- idx) ranges)
              for next = (nth idx ranges)
              always (<= (car prev) (cdr prev)
                         (car next) (cdr next)))
If ranges violates this constraint, or something else goes wrong, this function signals the treesit-range-invalid error. The signal data contains a specific error message and the ranges that were being set.
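The ordering constraint can also be checked before calling the function. The predicate below is a minimal sketch; the name my-ranges-valid-p is hypothetical, and it only approximates the check that Emacs performs internally, on the assumption that each range must also satisfy beg <= end.

```elisp
(defun my-ranges-valid-p (ranges)
  "Return non-nil if RANGES is ordered and non-overlapping.
RANGES is a list of (BEG . END) cons cells.  Hypothetical helper."
  (let ((ok t)
        (last-end nil))
    (dolist (range ranges ok)
      (let ((beg (car range))
            (end (cdr range)))
        ;; Each range must be well-formed and must start at or after
        ;; the end of the previous range.
        (unless (and (<= beg end)
                     (or (null last-end) (<= last-end beg)))
          (setq ok nil))
        (setq last-end end)))))

(my-ranges-valid-p '((1 . 9) (16 . 24) (24 . 25)))  ; adjacent ranges are fine
(my-ranges-valid-p '((1 . 9) (5 . 12)))             ; overlapping ranges are not
```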
This function can also be used for disabling ranges. If ranges is nil, the parser is set to parse the whole buffer.
Example:
(treesit-parser-set-included-ranges parser '((1 . 9) (16 . 24) (24 . 25)))
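A caller that cannot guarantee well-formed ranges may want to handle the error rather than let it propagate. The following is a minimal sketch, assuming that falling back to the whole buffer is acceptable; the helper name my-set-ranges-or-reset is hypothetical.

```elisp
(require 'treesit)

(defun my-set-ranges-or-reset (parser ranges)
  "Set RANGES on PARSER; on invalid input, fall back to the whole buffer.
Return non-nil if RANGES was applied.  Hypothetical helper."
  (condition-case err
      (progn
        (treesit-parser-set-included-ranges parser ranges)
        t)
    (treesit-range-invalid
     ;; The signal data holds an error message and the offending ranges.
     (message "Invalid ranges: %S" (cdr err))
     ;; Passing nil disables ranges, so PARSER sees the whole buffer again.
     (treesit-parser-set-included-ranges parser nil)
     nil)))
```

With this helper, a call such as (my-set-ranges-or-reset parser '((16 . 24) (1 . 9))) should log the problem and return nil, because the ranges are out of order.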
The function treesit-parser-included-ranges returns the ranges set for parser. The return value has the same form as the ranges argument of treesit-parser-set-included-ranges: a list of cons cells of the form (beg . end). If parser doesn’t have any ranges, the return value is nil.
(treesit-parser-included-ranges parser) ⇒ ((1 . 9) (16 . 24) (24 . 25))
The function treesit-query-range matches source with query and returns the ranges of captured nodes. The return value is a list of cons cells of the form (beg . end), where beg and end specify the beginning and the end of a region of text.
For convenience, source can be a language symbol, a parser, or a node. If it’s a language symbol, this function matches in the root node of the first parser using that language; if a parser, this function matches in the root node of that parser; if a node, this function matches in that node.
The argument query is the query used to capture nodes (see Pattern Matching Tree-sitter Nodes). The capture names don’t matter. The arguments beg and end, if both non-nil, limit the range in which this function queries.
Like other query functions, this function signals the treesit-query-error error if query is malformed.
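The sketch below illustrates two of the points above: the capture name is arbitrary (@css here), and beg and end restrict the search. The helper name my-css-ranges-in-region is hypothetical, and the sketch assumes that an html parser already exists in the current buffer.

```elisp
(require 'treesit)

(defun my-css-ranges-in-region (beg end)
  "Return (BEG . END) ranges of CSS text found between BEG and END.
Hypothetical helper; assumes an html parser exists in the buffer."
  (treesit-query-range
   'html                                  ; language symbol as the source
   '((style_element (raw_text) @css))     ; any capture name works
   beg end))
```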
It should suffice for general Lisp programs to call the following two functions in order to support program sources that mix multiple languages.
The function treesit-update-ranges updates ranges for parsers in the buffer. It makes sure the parsers’ ranges are set correctly between beg and end, according to treesit-range-settings. If omitted, beg defaults to the beginning of the buffer, and end defaults to the end of the buffer.
For example, fontification functions use this function before querying for nodes in a region.
The function treesit-language-at returns the language of the text at buffer position pos. Under the hood it calls treesit-language-at-point-function and returns its return value. If treesit-language-at-point-function is nil, this function returns the language of the first parser in the value returned by treesit-parser-list. If there is no parser in the buffer, it returns nil.
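Taken together, the two functions can be used as in the following sketch of an interactive command. The command name my-describe-language-at-point is hypothetical; the sketch assumes the current major mode has set treesit-range-settings and treesit-language-at-point-function.

```elisp
(require 'treesit)

(defun my-describe-language-at-point ()
  "Show which tree-sitter language covers the text at point.
Hypothetical command for illustration."
  (interactive)
  ;; Make sure every parser's ranges are current before asking.
  (treesit-update-ranges)
  (message "Language at point: %s" (treesit-language-at (point))))
```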
Normally, in a set of languages that can be mixed together, there is a host language and one or more embedded languages. A Lisp program usually first parses the whole document with the host language’s parser, retrieves some information, sets ranges for the embedded languages with that information, and then parses the embedded languages.
Take a buffer containing HTML, CSS, and JavaScript as an example. A Lisp program will first parse the whole buffer with an HTML parser, then query the parser for style_element and script_element nodes, which correspond to CSS and JavaScript text, respectively. Then it sets the ranges of the CSS and JavaScript parsers to the ranges which their corresponding nodes span.
Given a simple HTML document:
     <html>
       <script>1 + 2</script>
       <style>body { color: "blue"; }</style>
     </html>
a Lisp program will first parse with an HTML parser, then set ranges for the CSS and JavaScript parsers:
     ;; Create parsers.
     (setq html (treesit-parser-create 'html))
     (setq css (treesit-parser-create 'css))
     (setq js (treesit-parser-create 'javascript))

     ;; Set CSS ranges.
     (setq css-range
           (treesit-query-range
            'html
            '((style_element (raw_text) @capture))))
     (treesit-parser-set-included-ranges css css-range)

     ;; Set JavaScript ranges.
     (setq js-range
           (treesit-query-range
            'html
            '((script_element (raw_text) @capture))))
     (treesit-parser-set-included-ranges js js-range)
Emacs automates this process in treesit-update-ranges. A multi-language major mode should set treesit-range-settings so that treesit-update-ranges knows how to perform this process automatically. Major modes should use the helper function treesit-range-rules to generate a value that can be assigned to treesit-range-settings. The settings in the following example directly translate into the operations shown above.
     (setq treesit-range-settings
           (treesit-range-rules
            :embed 'javascript
            :host 'html
            '((script_element (raw_text) @capture))

            :embed 'css
            :host 'html
            '((style_element (raw_text) @capture))))

     ;; Major modes with multiple languages should always set
     ;; `treesit-language-at-point-function' (which see).
     (setq treesit-language-at-point-function
           (lambda (pos)
             (let* ((node (treesit-node-at pos 'html))
                    (parent (treesit-node-parent node)))
               (cond
                ((and node parent
                      (equal (treesit-node-type node) "raw_text")
                      (equal (treesit-node-type parent) "script_element"))
                 'javascript)
                ((and node parent
                      (equal (treesit-node-type node) "raw_text")
                      (equal (treesit-node-type parent) "style_element"))
                 'css)
                (t 'html)))))
The function treesit-range-rules is used to set treesit-range-settings. It takes care of compiling queries and other post-processing, and outputs a value that treesit-range-settings can have.
It takes a series of query-specs, where each query-spec is a query preceded by zero or more keyword/value pairs. Each query is a tree-sitter query in either the string, s-expression, or compiled form, or a function.
If query is a tree-sitter query, it should be preceded by two keyword/value pairs, where the :embed keyword specifies the embedded language, and the :host keyword specifies the host language.
If the query is given the :local keyword whose value is t, the range set by this query has a dedicated local parser; otherwise the range shares a parser with other ranges for the same language.
By default, a parser sees its ranges as a continuum, rather than treating them as separate independent segments. Therefore, if the embedded ranges are semantically independent segments, they should be processed by local parsers, described below.
A local parser assigned to a range can be retrieved by treesit-local-parsers-at and treesit-local-parsers-on.
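A rule that requests local parsers might look like the sketch below, which reuses the script_element query from the HTML example above. The wrapper function my-local-js-range-settings is hypothetical, and whether each script body should really be parsed independently is an assumption made for illustration.

```elisp
(require 'treesit)

(defun my-local-js-range-settings ()
  "Return range settings giving each script body its own parser.
Hypothetical helper; assumes html and javascript grammars exist."
  (treesit-range-rules
   :embed 'javascript
   :host 'html
   :local t                ; one dedicated parser per captured range
   '((script_element (raw_text) @capture))))
```

A major mode would assign the returned value to treesit-range-settings.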
treesit-update-ranges uses query to figure out how to set the ranges for parsers for the embedded language. It queries query in a host language parser, computes the ranges which the captured nodes span, and applies these ranges to embedded language parsers.
If query is a function, it doesn’t need any keyword and value pair. It should be a function that takes 2 arguments, start and end, and sets the ranges for parsers in the current buffer in the region between start and end. It is fine for this function to set ranges in a larger region that encompasses the region between start and end.
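The following sketch shows what such a function could look like. Both function names are hypothetical. The sketch recomputes ranges for the whole buffer and ignores its arguments, and it assumes that an empty range is preferable to nil when the buffer has no embedded JavaScript, since nil ranges make a parser cover the whole buffer.

```elisp
(require 'treesit)

(defun my-js-range-function (_start _end)
  "Set ranges for the JavaScript parser from the html parse tree.
Hypothetical range function; it covers the whole buffer, which is a
larger region than the one requested."
  (let ((js (treesit-parser-create 'javascript))
        (ranges (treesit-query-range
                 'html '((script_element (raw_text) @capture)))))
    (treesit-parser-set-included-ranges
     js
     ;; nil would mean "parse the whole buffer", so use an empty
     ;; range when there is no embedded JavaScript (an assumption
     ;; of this sketch).
     (or ranges (list (cons (point-min) (point-min)))))))

(defun my-install-js-range-function ()
  "Use `my-js-range-function' as the only range rule in this buffer."
  (setq-local treesit-range-settings
              (treesit-range-rules #'my-js-range-function)))
```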
The variable treesit-range-settings helps treesit-update-ranges in updating the ranges for parsers in the buffer. It is a list of settings where the exact format of a setting is considered internal. You should use treesit-range-rules to generate a value that this variable can have.
The value of treesit-language-at-point-function should be a function that takes a single argument, pos, which is a buffer position, and returns the language of the buffer text at pos. This variable is used by treesit-language-at.
The function treesit-local-parsers-at returns all the local parsers at pos in the current buffer. pos defaults to point.
Local parsers are those which only parse a limited region marked by an overlay with a non-nil treesit-parser property. If language is non-nil, only return parsers for that language.
The function treesit-local-parsers-on is the same as treesit-local-parsers-at, but it returns the local parsers in the range between beg and end instead of at point.
beg and end default to the beginning and end of the accessible portion of the buffer.
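As a small illustration of the two lookup functions, the sketch below collects the languages of the local parsers either at point or within a region. The helper name my-local-parser-languages is hypothetical.

```elisp
(require 'treesit)

(defun my-local-parser-languages (&optional beg end)
  "Return the languages of local parsers between BEG and END.
If BEG or END is nil, look at point instead.  Hypothetical helper."
  (mapcar #'treesit-parser-language
          (if (and beg end)
              (treesit-local-parsers-on beg end)
            (treesit-local-parsers-at (point)))))
```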