38.7 Parsing Text in Multiple Languages

Sometimes the source of a program contains snippets of other languages; HTML with embedded CSS and JavaScript is one example. In that case, text segments written in different languages need to be assigned to different parsers. Traditionally, this is achieved by narrowing. While tree-sitter works with narrowing (see narrowing), the recommended way is instead to specify the regions of buffer text (i.e., ranges) in which a parser will operate. This section describes functions for setting and getting ranges for a parser.

Generally, when multiple languages are at play, there is a “primary”, or “host”, language. The parser for this language, the primary parser, parses the entire buffer. Parsers for other languages are “embedded” or “guest” parsers, which only work on part of the buffer. The parse tree of the primary parser is usually used to determine the ranges in which the embedded parsers operate.

Major modes should set treesit-primary-parser to the primary parser before calling treesit-major-mode-setup, so that Emacs can configure the primary parser correctly for font-lock and other features.

Lisp programs should call treesit-update-ranges to make sure the ranges for each parser are correct before using parsers in a buffer, and call treesit-language-at to figure out the language responsible for the text at some position. These two functions don’t work by themselves; they need major modes to set treesit-range-settings and treesit-language-at-point-function, which do the actual work. These functions and variables are explained in more detail towards the end of the section.

In short, multi-language major modes should set treesit-primary-parser, treesit-range-settings, and treesit-language-at-point-function before calling treesit-major-mode-setup.
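
The ordering described above can be sketched as a mode body. This is only an outline under assumptions: the mode name my-html-ts-mode and the helper my-html--language-at are hypothetical, the HTML, CSS, and JavaScript grammars are assumed to be installed, and the node type names are the ones used in the HTML example later in this section.

```elisp
(require 'treesit)

(defun my-html--language-at (pos)
  "Return the language at POS (hypothetical helper for this sketch)."
  (let* ((node (treesit-node-at pos 'html))
         (parent (and node (treesit-node-parent node)))
         (in-raw-text (and parent
                           (equal (treesit-node-type node) "raw_text"))))
    (cond
     ((and in-raw-text
           (equal (treesit-node-type parent) "script_element"))
      'javascript)
     ((and in-raw-text
           (equal (treesit-node-type parent) "style_element"))
      'css)
     (t 'html))))

(define-derived-mode my-html-ts-mode prog-mode "MyHTML"
  "Hypothetical multi-language mode, shown only to illustrate ordering."
  (when (and (treesit-ready-p 'html t)
             (treesit-ready-p 'css t)
             (treesit-ready-p 'javascript t))
    ;; 1. The primary (host) parser covers the whole buffer.
    (setq-local treesit-primary-parser (treesit-parser-create 'html))
    ;; 2. Tell `treesit-update-ranges' where the embedded languages live.
    (setq-local treesit-range-settings
                (treesit-range-rules
                 :embed 'javascript
                 :host 'html
                 '((script_element (raw_text) @capture))
                 :embed 'css
                 :host 'html
                 '((style_element (raw_text) @capture))))
    ;; 3. Tell `treesit-language-at' how to map a position to a language.
    (setq-local treesit-language-at-point-function #'my-html--language-at)
    ;; 4. Only now let Emacs configure font-lock and other features.
    (treesit-major-mode-setup)))
```

The point of the sketch is the order of the four steps; a real mode would also set up font-lock, indentation, and similar settings before step 4.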

Getting and setting ranges

Function: treesit-parser-set-included-ranges parser ranges

This function sets up parser to operate on ranges. The parser will only read the text of the specified ranges. Each range in ranges is a pair of the form (beg . end).

The ranges in ranges must come in order and must not overlap. That is, the following check must return non-nil:

(cl-loop for idx from 1 to (1- (length ranges))
         for prev = (nth (1- idx) ranges)
         for next = (nth idx ranges)
         always (<= (car prev) (cdr prev)
                    (car next) (cdr next)))

If ranges violates this constraint, or if something else goes wrong, this function signals the treesit-range-invalid error. The signal data contains a specific error message and the ranges that were being set.

This function can also be used for disabling ranges. If ranges is nil, the parser is set to parse the whole buffer.
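
A caller that computes ranges dynamically may prefer to handle the error rather than let it propagate. The following is a minimal sketch; the helper name my-set-ranges-or-reset is hypothetical, and falling back to parsing the whole buffer is just one possible policy.

```elisp
(require 'treesit)

(defun my-set-ranges-or-reset (parser ranges)
  "Try to restrict PARSER to RANGES; on invalid ranges, parse everything.
Return non-nil if RANGES were applied.  Hypothetical helper."
  (condition-case err
      (progn
        (treesit-parser-set-included-ranges parser ranges)
        t)
    (treesit-range-invalid
     ;; The signal data carries an explanation and the offending ranges.
     (message "Invalid ranges: %S" (cdr err))
     ;; Passing nil removes any range restriction from PARSER.
     (treesit-parser-set-included-ranges parser nil)
     nil)))
```

Under the ordering constraint stated above, a call such as (my-set-ranges-or-reset parser '((16 . 24) (1 . 9))) would be expected to take the error branch, because the ranges are out of order.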

Example:

(treesit-parser-set-included-ranges
 parser '((1 . 9) (16 . 24) (24 . 25)))

Function: treesit-parser-included-ranges parser

This function returns the ranges set for parser. The return value has the same form as the ranges argument of treesit-parser-set-included-ranges: a list of cons cells of the form (beg . end). If parser doesn’t have any ranges, the return value is nil.

(treesit-parser-included-ranges parser)
    ⇒ ((1 . 9) (16 . 24) (24 . 25))

Function: treesit-query-range source query &optional beg end

This function matches source with query and returns the ranges of captured nodes. The return value is a list of cons cells of the form (beg . end), where beg and end specify the beginning and the end of a region of text.

For convenience, source can be a language symbol, a parser, or a node. If it’s a language symbol, this function matches in the root node of the first parser using that language; if a parser, this function matches in the root node of that parser; if a node, this function matches in that node.

The argument query is the query used to capture nodes (see Pattern Matching Tree-sitter Nodes). The capture names don’t matter. The arguments beg and end, if both non-nil, limit the range in which this function queries.

Like other query functions, this function signals the treesit-query-error error if query is malformed.
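
As an illustration of the beg and end arguments, the sketch below collects only the script ranges found within the visible portion of a window. It assumes that an HTML parser already exists in the window's buffer and reuses the node names from the HTML example later in this section; the function name is hypothetical.

```elisp
(require 'treesit)

(defun my-visible-script-ranges (&optional window)
  "Return (BEG . END) ranges of script text shown in WINDOW.
WINDOW defaults to the selected window.  Hypothetical helper; it
assumes the window's buffer already has an HTML parser."
  (with-current-buffer (window-buffer window)
    (treesit-query-range
     'html
     '((script_element (raw_text) @capture))
     (window-start window)
     ;; A non-nil second argument asks for an up-to-date end position.
     (window-end window t))))
```

The capture name @capture is arbitrary here, since capture names don’t matter to this function.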

Supporting multiple languages in Lisp programs

It should suffice for general Lisp programs to call the following two functions in order to support program sources that mix multiple languages.

Function: treesit-update-ranges &optional beg end

This function updates ranges for parsers in the buffer. It makes sure the parsers’ ranges are set correctly between beg and end, according to treesit-range-settings. If omitted, beg defaults to the beginning of the buffer, and end defaults to the end of the buffer.

For example, fontification functions use this function before querying for nodes in a region.

Function: treesit-language-at pos

This function returns the language of the text at buffer position pos. Under the hood it calls treesit-language-at-point-function and returns its return value. If treesit-language-at-point-function is nil, this function returns the language of the first parser in the return value of treesit-parser-list. If there is no parser in the buffer, it returns nil.
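
Put together, a command that reports the language at point might look like the following sketch. The command name is hypothetical, and refreshing only the current line is an arbitrary choice made for the example.

```elisp
(require 'treesit)

(defun my-show-language-at-point ()
  "Display the language of the text at point.  Hypothetical command."
  (interactive)
  ;; Make sure the embedded parsers' ranges are current around point.
  (treesit-update-ranges (line-beginning-position)
                         (line-end-position))
  (message "Language at point: %s"
           (or (treesit-language-at (point)) "none")))
```

In a buffer without any parser, treesit-language-at returns nil, so the command reports "none".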

Supporting multiple languages in major modes

Normally, in a set of languages that can be mixed together, there is a host language and one or more embedded languages. A Lisp program usually first parses the whole document with the host language’s parser, retrieves some information, sets ranges for the embedded languages with that information, and then parses the embedded languages.

Take a buffer containing HTML, CSS, and JavaScript as an example. A Lisp program will first parse the whole buffer with an HTML parser, then query the parser for style_element and script_element nodes, which correspond to CSS and JavaScript text, respectively. Then it sets the range of the CSS and JavaScript parsers to the range which their corresponding nodes span.

Given a simple HTML document:

<html>
  <script>1 + 2</script>
  <style>body { color: "blue"; }</style>
</html>

a Lisp program will first parse with an HTML parser, then set ranges for CSS and JavaScript parsers:

;; Create parsers.
(setq html (treesit-parser-create 'html))
(setq css (treesit-parser-create 'css))
(setq js (treesit-parser-create 'javascript))

;; Set CSS ranges.
(setq css-range
      (treesit-query-range
       'html
       '((style_element (raw_text) @capture))))
(treesit-parser-set-included-ranges css css-range)

;; Set JavaScript ranges.
(setq js-range
      (treesit-query-range
       'html
       '((script_element (raw_text) @capture))))
(treesit-parser-set-included-ranges js js-range)

Emacs automates this process in treesit-update-ranges. A multi-language major mode should set treesit-range-settings so that treesit-update-ranges knows how to perform this process automatically. Major modes should use the helper function treesit-range-rules to generate a value that can be assigned to treesit-range-settings. The settings in the following example directly translate into operations shown above.

(setq treesit-range-settings
      (treesit-range-rules
       :embed 'javascript
       :host 'html
       '((script_element (raw_text) @capture))
       :embed 'css
       :host 'html
       '((style_element (raw_text) @capture))))

;; Major modes with multiple languages should always set
;; `treesit-language-at-point-function' (which see).
(setq treesit-language-at-point-function
      (lambda (pos)
        (let* ((node (treesit-node-at pos 'html))
               (parent (treesit-node-parent node)))
          (cond
           ((and node parent
                 (equal (treesit-node-type node) "raw_text")
                 (equal (treesit-node-type parent) "script_element"))
            'javascript)
           ((and node parent
                 (equal (treesit-node-type node) "raw_text")
                 (equal (treesit-node-type parent) "style_element"))
            'css)
           (t 'html)))))

Function: treesit-range-rules &rest query-specs

This function is used to set treesit-range-settings. It takes care of compiling queries and other post-processing, and outputs a value that treesit-range-settings can have.

It takes a series of query-specs, where each query-spec is a query preceded by zero or more keyword/value pairs. Each query is a tree-sitter query in either the string, s-expression, or compiled form, or a function.

If query is a tree-sitter query, it should be preceded by two keyword/value pairs, where the :embed keyword specifies the embedded language, and the :host keyword specifies the host language.

If the query is given the :local keyword whose value is t, the range set by this query has a dedicated local parser; otherwise the range shares a parser with other ranges for the same language.

By default, a parser sees its ranges as a continuum, rather than treating them as separate independent segments. Therefore, if the embedded ranges are semantically independent segments, they should be processed by local parsers, described below.

The local parsers assigned to a range can be retrieved with treesit-local-parsers-at and treesit-local-parsers-on.

treesit-update-ranges uses query to figure out how to set the ranges for parsers for the embedded language. It queries query in a host language parser, computes the ranges which the captured nodes span, and applies these ranges to embedded language parsers.

If query is a function, it doesn’t need any keyword/value pairs. It should be a function that takes two arguments, start and end, and sets the ranges for parsers in the current buffer in the region between start and end. It is fine for this function to set ranges in a larger region that encompasses the region between start and end.
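
For illustration, a function-valued query might look like the sketch below. The function name is hypothetical, the node names are borrowed from the HTML example earlier in this section, and recomputing the ranges for the whole buffer instead of only the region between start and end is a simplification that the paragraph above permits.

```elisp
(require 'treesit)

(defun my-update-css-ranges (_start _end)
  "Set the CSS parser's ranges from the HTML parse tree.
Hypothetical range function for `treesit-range-rules'."
  (let ((ranges (treesit-query-range
                 'html
                 '((style_element (raw_text) @capture))))
        ;; By default this reuses an existing CSS parser, if there is one.
        (css (treesit-parser-create 'css)))
    (treesit-parser-set-included-ranges
     css
     ;; A nil range list would mean "parse the whole buffer", so use
     ;; an empty range when the buffer contains no CSS text at all.
     (or ranges (list (cons (point-min) (point-min)))))))

;; In the major mode body; a function needs no :embed or :host keyword:
;; (setq-local treesit-range-settings
;;             (treesit-range-rules #'my-update-css-ranges))
```

The empty-range fallback is a precaution chosen for this sketch, based on the rule that nil ranges make a parser cover the whole buffer.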

Variable: treesit-range-settings

This variable helps treesit-update-ranges in updating the ranges for parsers in the buffer. It is a list of settings where the exact format of a setting is considered internal. You should use treesit-range-rules to generate a value that this variable can have.

Variable: treesit-language-at-point-function

This variable’s value should be a function that takes a single argument, pos, which is a buffer position, and returns the language of the buffer text at pos. This variable is used by treesit-language-at.

Function: treesit-local-parsers-at &optional pos language

This function returns all the local parsers at pos in the current buffer. pos defaults to point.

Local parsers are those which only parse a limited region marked by an overlay with a non-nil treesit-parser property. If language is non-nil, only return parsers for that language.

Function: treesit-local-parsers-on &optional beg end language

This function is the same as treesit-local-parsers-at, but it returns the local parsers in the range between beg and end instead of at a single position.

beg and end default to the entire accessible portion of the buffer.
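
As a small sketch of how these functions might be used, the helper below lists the languages of the local parsers that cover a position. The helper name is hypothetical; treesit-parser-language is the standard accessor for a parser's language.

```elisp
(require 'treesit)

(defun my-local-parser-languages-at (&optional pos)
  "Return the languages of the local parsers at POS (default: point).
Hypothetical helper."
  (mapcar #'treesit-parser-language
          (treesit-local-parsers-at pos)))
```

The result is nil wherever no local parser is active, for instance in a buffer whose range rules don’t use the :local keyword.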
