Parser

The role of the parser is to turn source-code which is fed as a string (in the form of files) into a tree-representation of that source-code. This tree is typically called the Abstract Syntax Tree (AST). The step of parsing consists of two distinct stages. The first one is the lexical analysis (Lexer) which is performed by a lexer. After lexing we perform the syntactical analysis (Parser) to construct the syntax tree.

                                                                   ┌──┐
       ┌──────────────┐                                            │  │
       │              │                                            └──┘
       │  Source Code │                                            /  \
       │              │   ┌─────────┐        ┌──────────┐         /    \
       │  ──────────  │   │         │        │          │     ┌──┐      ┌──┐
       │              ├───►  Lexer  │        │  Parser  ├────►│  │      │  │
       │  ─────────   │   │         │        │          │     └──┘      └──┘
       │              │   └────┬────┘        └──────────┘      /\        /\
       │  ────        │        │                  ▲           /  \      /  \
       │              │        │                  │        ┌──┐ ┌──┐ ┌──┐ ┌──┐
       │  ────────    │        ▼                  │        │  │ │  │ │  │ │  │
       │              │   ┌───────────────────────┴──┐     └──┘ └──┘ └──┘ └──┘
       │              │   │                          │
       └──────────────┘   │  ┌───┐ ┌───┐ ┌───┐ ┌───┐ │       Abstract Syntax
                          │  │ T │ │ T │ │ T │ │...│ │            Tree
                          │  └───┘ └───┘ └───┘ └───┘ │
                          │                          │
                          └──────────────────────────┘
                                 Token-Stream

Lexer

The lexer performs the lexical analysis. This step turns the source-string into a sequence of well known tokens. The Lexer (or sometimes also called tokenizer) splits the source-string into tokens (or words). Each token has a distinct type which corresponds to a grammar's element. Typical token-types are keywords, numbers, identifiers, brackets, dots, etc. So with the help of this token-stream it is much easier for the parser to spot certain patterns. E.g. a floating-point number consists of the token-sequence: number, dot, number.

The lexer is implemented in the lexer-module. It uses the logos crate to create a lexer that is able to identify all different terminal-symbols. Compared to other languages, Structured Text has a quite high number of keywords and other tokens, so RuSTy's lexer identifies a quite large number of different tokens.

Parser

The parser takes the token stream and creates the corresponding AST that represents the source code in a structured, hierarchical way. The parser is implemented in the parser module whereas the model for the AST is implemented in the ast module.

AST - Abstract Syntax Tree

The abstract syntax tree is a tree representation of the source code. Some parser implementations use a generic tree-data-structure consisting of Nodes which can have an arbitrary number of children. These nodes usually have dynamic properties like a type and an optional value and sometimes they even have dynamic properties stored in a map to make this representation even more flexible.

While this approach needs very little source code we decided to favour a less flexible approach. The RuSTy-AST models every single ast-node as its own struct with all necessary fields including the possible child-nodes. While this approach needs much more code and hand-written changes, its benefits lie in the clearness and simplicity of the data-structure. Every element of the AST is easily identified, debugged and understood. E.g. while in a generic node based AST it is easily possible to have a binary-statement with no, one, or seven child-nodes, the RuSTy-AST enforces the structure of every node. So the RuSTy-Binary-Statement has exactly two children. It is impossible to construct it differently.

Example

So an assignment a := 3; will be parsed with the help of the following Structures:

struct Reference {
   name: string
}

struct LiteralInteger {
   value: i128
}

struct Assignment {
   left: Box<AstStatement>,
   right: Box<AstStatement>
}

Recursive Descent Parser

There are a lot of different frameworks to generate parsers from formal grammars. While they generate highly optimized parsers we felt we wanted more control and more understanding of the parsing process and the resulting AST. The fact that at that point in time we were pretty new to rust itself, writing the parser by hand also gave us more practice and a stronger feeling of control and understanding. Using a parser-generator framework will definitely be an option for future improvements.

As for now, the parser is a hand-written recursive descent parser inside the parser-module.

As the parser reads the token stream Reference, KeywordEquals, Number, Semicolon it instantiates the corresponding syntax tree:

                      ┌─────────────────┐
                      │   Assignment    │
                      └──────┬──┬───────┘
                   left      │  │     right
                 ┌───────────┘  └──────────┐
                 ▼                         ▼
        ┌──────────────────┐     ┌──────────────────┐
        │    Reference     │     │  LiteralInteger  │
        ├──────────────────┤     ├──────────────────┤
        │    name: 'a'     │     │    value: '3'    │
        └──────────────────┘     └──────────────────┘