“I'll keep writing long letters, lol.”
― Me
I know, it's been a while. I was planning to publish this one before 2023 ends, yet here we are and I'm back with a bunch of exciting (at least for me) stuff. Here is a brief summary of what we'll be covering in this chapter:
- Syntax changes and additions
- Testing in Rust
- Executing code
- Error handling and reporting
1. Parsing changes and improvements
Last time I expressed that our language would combine elements of PHP and SQL. However, upon consideration, I found that adopting a subset of JavaScript syntax is not only more achievable but also more enjoyable for me to work with, as I am way more familiar with JavaScript than PHP. Also, JSON is a natural choice for a document database data format, so it makes sense to use a subset of JavaScript syntax. So we're building a MongoDB-esque document database... I guess? I don't know, I'm just making these things up as I go.
Regarding the language's current state, we have definitely diverged from Crafting Interpreters, which is not extremely surprising. Chapter 12 in the book explains how to implement classes, and I'm not planning to implement classes in LykiaQL. Currently, there are only object literals that can be used to create objects; there are no classes in our language. Maybe we can do something similar for a data definition language (DDL) in the future, but OOP classes are not relevant to our language's scope.
If you are curious enough to see the changelog in detail, you can check out the diff.
1.1. Overview
- Our lexer effectively tokenizes the input. It still does, but now it's more robust, less buggy and can handle more complex input.
- The parser adeptly transforms those tokens into an Abstract Syntax Tree (AST). It is a more complicated parser now.
- Although our interpreter was functional, I haven't yet demonstrated its workings thoroughly as it wasn't fully complete at that point. We'll get into that towards the end of this chapter as well.
Since our last meeting, there has been some significant progress in the parser. So let me just show you what it can parse now, and then we can discuss the internals.
1.1.1. Array and object literals, get and set expressions
Arrays and object literals are essential parts of JavaScript and JSON, so it is hard to imagine a language that is a subset of JavaScript without them. That's why they're also incorporated now. Here are some examples:
var $obj0 = {
a : "letter A",
b : 'letter B',
1: "one",
"Hello": "It's me."
};
var $obj1 = {
'a b c' : 1,
b : {
c : 2,
d : 3
}
};
var $arr0 = [1, 2, 3, 4, 5];
var $arr1 = [[1, 2], [3, 4], [5, 6]];
print($obj1.b.c); // Get
print($obj1['a b c']); // Get
$obj1.b.c = 10; // Set
$obj1['a b c'] = 20; // Set
print($arr0[0]); // Get
print($arr1[1][0]); // Get
$arr0[0] = 10; // Set
$arr1[1][0] = 20; // Set
1.1.2. Wider support for SQL
As you may remember, we were following SQLite syntax to incorporate SQL into our language (for nerds -- here is a link to SQLite's select grammar). I'm still following the same syntax, but of course there are some cases I simply cut out. Common table expressions, window functions and recursive queries are not supported. Even with those missing pieces, our grammar has matured enough to cover the famous SQL clauses such as from, where, group by, having, order by, limit and offset, plus subqueries and joins, of course.
Some naive selects
SELECT users.* from users;
SELECT id, users.name as username from users;
Subqueries
SELECT * from (SELECT * from users) u; -- they see me nestin' they hatin'
Where, limit and offset
SELECT * from users limit 5 offset 10;
SELECT * from users limit 10, 5;
SELECT * from users where
(id > 100 and name = 'John')
or (id < 10 and name = 'Jane');
Joins, group by and having
SELECT * from users
inner join orders on users.id = orders.user_id
inner join baskets on baskets.order_id = orders.id
where users.id = $user_id;
SELECT avg(salary) from employees
group by department_id;
SELECT avg(salary) from employees
group by department_id, job_id having avg(salary) > 1000;
These select examples are mostly indistinguishable from traditional SQL, but our inserts and updates are significantly different. Here are some examples:
INSERT INTO db.users values ({
name: 'John',
surname: 'Doe',
age: 42,
});
So, instead of pairing the values keyword with a column list and a positional tuple, we're using an object literal to insert data. This is not only more concise, but also more readable than
INSERT INTO db.users (name, surname, age) values ('John', 'Doe', 42); -- booo!
1.1.3. Function syntax
For the sake of increasing JavaScript compatibility, I've changed our fun keyword to function (yeah, there is no "fun" in JavaScript). Plus, I've added support for anonymous functions. Here are some examples:
var $f = function($x, $y) {
    return $x + $y;
};

(function () {
    print("hello");
})();
1.2. Some highlights
Here in this section, I'll try to cover the entertaining stuff. There is no cohesion between the subsections, so you can skip to the parts that you're interested in. You can skip this section altogether, too. I won't be offended.
1.2.1. The "=" problem
So we're stitching together a language that combines elements of JS and SQL. As you might know, many mainstream programming languages use == for comparison and = for assignment. SQL, on the other hand, uses = for comparison. So when we see an = operator in SQL, we intuitively understand that it is for comparison, not assignment. On the imperative side of our language, however, we still need a way to assign values to variables. So, we have two options:
- We can use := for assignment and = for comparison.
- We can track the context and use = for assignment and == for comparison in imperative context, and = for comparison in declarative context.
I opted for the latter, because I want to keep the imperative part as close to JavaScript as possible. So, here's what it looks like:
var $x = 10;
var $y = 20;
var $z = $x + $y;
if ($z == 30) {
print("z is 30");
}
SELECT * from users where id = 10;
So both = and == are used for comparison in declarative context, and = is used for assignment in imperative context. If you use both languages (highly probable if you're reading this), your brain will probably do the context tracking for you.
OK, how?
I won't go into too much detail, but the simple solution is to maintain a select_depth counter: if it is greater than 0, we're in a select context; otherwise we're in plain script context. So, we can use this counter to determine which operator to use for comparison. Here's the relevant code:
pub struct Parser<'a> {
tokens: &'a Vec<Token>,
current: usize,
arena: AstArena,
in_select_depth: usize,
in_array_depth: usize,
in_object_depth: usize,
}
impl<'a> Parser<'a> {
// ...
fn sql_select_inner(&mut self) -> ParseResult<SqlSelect> {
self.in_select_depth += 1;
// Extremely complex SELECT parsing...
self.in_select_depth -= 1;
Ok(SqlSelect {
core,
order_by,
limit,
})
}
// ...
fn equality(&mut self) -> ParseResult<ExprId> {
if self.in_select_depth > 0 {
binary!(self, [sym!(BangEqual), sym!(Equal)], comparison);
} else {
binary!(self, [sym!(BangEqual), sym!(EqualEqual)], comparison);
}
}
// ...
}
Every time we enter a select expression, we increment the counter, and every time we exit one, we decrement it. That's it. You may ask why I didn't use a boolean flag for this. Well, there can be nested select expressions, so we need to keep track of the depth. If we used a boolean flag, we would incorrectly set the flag to false when we exit a sub-select, while the outer select expressions are still active.
This depth tracking worked for distinguishing the logical and operators (&& vs. and) and the logical or operators (|| vs. or) as well, but I won't bore you with their implementation, too. It is almost the same as the equality operator.
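To make the nested-select pitfall concrete, here is a tiny, self-contained toy (not the actual LykiaDB parser; the struct and method names are made up for illustration) showing why a depth counter survives nesting while a single boolean flag would not:
// Toy context tracker, illustration only.
struct Ctx {
    select_depth: usize,
}

impl Ctx {
    fn enter_select(&mut self) {
        self.select_depth += 1;
    }
    fn exit_select(&mut self) {
        self.select_depth -= 1;
    }
    fn in_select(&self) -> bool {
        self.select_depth > 0
    }
}

fn main() {
    let mut ctx = Ctx { select_depth: 0 };
    ctx.enter_select(); // outer SELECT
    ctx.enter_select(); // nested sub-SELECT
    ctx.exit_select();  // done with the sub-SELECT...
    // ...but we are still inside the outer SELECT. A single boolean flag
    // would have been reset to false at this point.
    assert!(ctx.in_select());
    ctx.exit_select();
    assert!(!ctx.in_select());
}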
1.2.2. Refactoring the scanner
In the beginning of the project, for the sake of moving rapidly, I implemented a quick and dirty scanner. I was aware that it was not the Rustacean way of doing things, and I was planning to refactor it later. The dirty part of the scanner was its imperative nature. For example, it was using a mutable variable to keep track of the current character, and it was incrementing this variable every time we called the advance method (I can feel your irritation, Rust gods, and I'm sorry). Obviously this was not ideal, as Rust has powerful functional features like iterators, and we should use them whenever possible. Also, if a language recommends you do something in a certain way, you should probably do it the recommended way. So, I refactored the scanner to be more functional. It is still not perfect, but I eliminated some unnecessary mutability and made it more idiomatic. For instance:
Initially we were using a mutable current variable to keep track of the current character, incrementing it every time we called the advance method. Here's the old way of iterating over the characters:
impl Scanner {
pub fn scan(source: &str) -> Result<Vec<Token>, ScanError> {
let mut scanner = Scanner {
chars: source.chars().collect(),
current: 0,
// ...
};
// ...
}
// ...
}
fn advance(&mut self) -> char {
let c = self.chars[self.current];
self.current += 1;
c
}
And here's the new way:
impl<'a> Scanner<'a> {
pub fn scan(source: &str) -> Result<Vec<Token>, ScanError> {
let mut scanner = Scanner {
chars: source.chars().enumerate().peekable(),
// ...
};
// ...
}
// ...
}
fn advance(&mut self) -> (usize, char) {
self.chars.next().unwrap()
}
So, instead of iterating over the array manually, we're using an iterator now. We're also using the peekable method to get the next character without consuming it, which is useful for lookaheads.
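As a quick illustration of why that matters, here's a minimal, self-contained snippet (not the actual scanner, just plain std iterators) that uses peek to decide between a one-character and a two-character token without consuming anything prematurely:
fn main() {
    let source = "a == b";
    // Same iterator shape as the scanner above: indices plus lookahead.
    let mut chars = source.chars().enumerate().peekable();
    while let Some((i, c)) = chars.next() {
        if c == '=' {
            // peek() inspects the next character without advancing.
            if let Some(&(_, '=')) = chars.peek() {
                chars.next(); // consume the second '='
                println!("EqualEqual at {}", i);
            } else {
                println!("Equal at {}", i);
            }
        }
    }
}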
1.2.4. Separate Identifier nodes in AST
Prior to this improvement, "Token" nodes were used for representing identifiers in the AST. This was not ideal, because irrelevant information was also stored in the AST. So, I've added a new node type for identifiers, and now we're storing only the identifier's name in the AST.
// lang/mod.rs
#[derive(Debug, Clone, Eq, PartialEq, Serialize)]
#[serde(tag = "@type")]
pub struct Identifier {
pub name: String,
pub dollar: bool,
}
For example:
// lang/ast/expr.rs
#[derive(Debug, Eq, PartialEq, Serialize)]
#[serde(tag = "@type")]
pub enum Expr {
// ...
#[serde(rename = "Expr::Function")]
Function {
name: Option<Identifier>, // was Token before
parameters: Vec<Identifier>, // was Token before
body: Rc<Vec<StmtId>>,
#[serde(skip)]
span: Span,
},
// ...
}
1.2.5. Serializing AST
I know you're thinking about all those serde attributes you've seen in the previous section. Yes, I've added serialization support to the AST. This is useful not only for debugging purposes, but also for testing. We'll get into testing in the next section, but for now, let me show you how we can serialize an AST node:
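Here's a rough sketch, assuming serde_json as the output format (just one of the serde-compatible options); the snippet redefines the Identifier struct from above so it compiles on its own:
use serde::Serialize;

// Same shape as the Identifier node shown earlier.
#[derive(Debug, Serialize)]
#[serde(tag = "@type")]
pub struct Identifier {
    pub name: String,
    pub dollar: bool,
}

fn main() {
    let node = Identifier {
        name: "users".to_string(),
        dollar: false,
    };
    // The #[serde(tag = "@type")] attribute injects an "@type" field
    // carrying the type name, so the output is self-describing:
    // { "@type": "Identifier", "name": "users", "dollar": false }
    println!("{}", serde_json::to_string_pretty(&node).unwrap());
}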
1.2.6. Spans
- Error types
- Utilizing spans for source code marking
- Pretty reporting libraries
1.3. Parsing SQL
Parsing SQL is a complicated process — not difficult to understand, but there are just too many cases to consider. Thus I picked only one part, FROM clauses, as an example. It is one of the slightly more complicated parts of SQL parsing. I'll try to explain the process step by step, but I won't go into too much detail. If you're interested in the details, you can check out the source.
a. sql_select_from:
This function is responsible for parsing the FROM clause of a SELECT statement. If the next token is the keyword FROM, it calls the sql_select_subquery_join function to parse the subquery or join clause associated with the FROM keyword. It returns an Option<SqlCollectionSubquery>, where Some indicates a successful parse, and None indicates the absence of a FROM clause.
fn sql_select_from(&mut self) -> ParseResult<Option<SqlCollectionSubquery>> {
if self.match_next(skw!(From)) {
return Ok(Some(self.sql_select_subquery_join()?));
}
Ok(None)
}
b. sql_select_subquery_join:
This function parses a group of subqueries joined by various join types (LEFT, RIGHT, INNER). It utilizes a loop to handle multiple subqueries joined by commas. For each subquery, it first parses the left part using sql_select_subquery_collection. It then checks for join keywords (LEFT, RIGHT, INNER), determines the type of join, parses the right part, and includes an optional join constraint. The parsed subqueries are grouped together, and the loop continues until no more commas are encountered. The final result is a SqlCollectionSubquery::Group containing the parsed subqueries.
fn sql_select_subquery_join(&mut self) -> ParseResult<SqlCollectionSubquery> {
let mut subquery_group: Vec<SqlCollectionSubquery> = vec![];
loop {
let left = self.sql_select_subquery_collection()?;
subquery_group.push(left);
while self.match_next_one_of(&[skw!(Left), skw!(Right), skw!(Inner), skw!(Join)]) {
// If the next token is a join keyword, then it must be a join subquery
let peek = self.peek_bw(1);
let join_type = if peek.tok_type == skw!(Inner) {
self.expected(skw!(Join))?;
SqlJoinType::Inner
} else if peek.tok_type == skw!(Left) {
optional_with_expected!(self, skw!(Outer), skw!(Join));
SqlJoinType::Left
} else if peek.tok_type == skw!(Right) {
optional_with_expected!(self, skw!(Outer), skw!(Join));
SqlJoinType::Right
} else if peek.tok_type == skw!(Join) {
SqlJoinType::Inner
} else {
return Err(ParseError::UnexpectedToken {
token: peek.clone(),
});
};
let right = self.sql_select_subquery_collection()?;
let join_constraint: Option<SqlExpr> = if self.match_next(skw!(On)) {
Some(SqlExpr::Default(self.expression()?))
} else {
None
};
let left_popped = subquery_group.pop().unwrap();
subquery_group.push(SqlCollectionSubquery::Join {
left: Box::new(left_popped),
right: Box::new(right),
join_type,
constraint: join_constraint,
});
}
if !self.match_next(sym!(Comma)) {
break;
}
}
return Ok(SqlCollectionSubquery::Group {
values: subquery_group,
});
}
c. sql_select_subquery_collection:
This function handles parsing the various forms of subqueries within the FROM clause. If the next token is a left parenthesis, it checks whether it's a SELECT statement or a nested (parenthesized) subquery. For a SELECT statement, it calls sql_select to parse the expression and, if the next token is an AS keyword, parses the alias. For a nested subquery, it calls sql_select_subquery_join; for a regular collection, it falls back to sql_collection_identifier. It returns a SqlCollectionSubquery representing the parsed subquery or collection.
fn sql_select_subquery_collection(&mut self) -> ParseResult<SqlCollectionSubquery> {
if self.match_next(sym!(LeftParen)) {
if self.cmp_tok(&skw!(Select)) {
let expr = self.sql_select()?;
self.expected(sym!(RightParen))?; // closing paren
let alias: Option<Token> =
optional_with_expected!(self, skw!(As), Identifier { dollar: false });
return Ok(SqlCollectionSubquery::Select {
expr,
alias: alias.map(|t| t.extract_identifier().unwrap()),
});
}
// If the next token is a left paren, then it must be either a select statement or a recursive subquery
let parsed = self.sql_select_subquery_join()?;
self.expected(sym!(RightParen))?; // closing paren
return Ok(parsed);
} else if let Some(collection) = self.sql_collection_identifier()? {
return Ok(SqlCollectionSubquery::Collection(collection));
} else {
Err(ParseError::UnexpectedToken {
token: self.peek_bw(0).clone(),
})
}
}
These functions collectively work to parse the FROM clause of a SELECT statement in SQL, handling various scenarios such as subqueries, joins, and collections. The code demonstrates a systematic approach to parsing complex SQL syntax, considering different possibilities and combinations that may occur in a real-world SQL query.
2. Testing in Rust
- Exploring unit tests and integration tests in Rust
- Test execution mechanisms
- Code coverage using codecov
- CI/CD using GitHub Actions
Testing is a fundamental aspect of software development that ensures the reliability and correctness of code. In Rust, testing is seamlessly integrated into the development process. This section delves into the multifaceted world of testing in Rust, covering everything from unit tests and integration tests to test execution mechanisms, code coverage analysis using codecov, and configuration of GitHub Actions.
Unit Tests
Rust's unit testing framework is both expressive and comprehensive. Unit tests are typically defined within the same module as the code they are testing, residing in a tests submodule. Annotations like #[cfg(test)] signal to the compiler that the enclosed code is only relevant during testing. The assert! macro is a powerful tool for checking conditions within tests, ensuring that the expected behavior matches the actual outcomes. So here is an example of a unit test from scanner.rs:
// The actual scanner code above this line
#[cfg(test)]
mod test {
use crate::lang::tokenizer::token::TokenType::Eof;
use crate::{kw, lexm, skw};
use super::*;
fn assert_tokens(source: &str, expected_tokens: Vec<Token>) {
let tokens = Scanner::scan(source).unwrap();
assert_eq!(tokens.len(), expected_tokens.len());
// assert that the elements are equal
for (token, expected) in tokens.iter().zip(expected_tokens.iter()) {
assert_eq!(token, expected);
}
}
#[test]
fn test_single_char_tokens() {
assert_tokens(
"(){};,+-*/.",
vec![
Token {
tok_type: sym!(LeftParen),
lexeme: lexm!("("),
literal: None,
span: Span {
start: 0,
end: 1,
line: 0,
line_end: 0,
},
},
Token {
tok_type: sym!(RightParen),
lexeme: lexm!(")"),
literal: None,
span: Span {
start: 1,
end: 2,
line: 0,
line_end: 0,
},
},
Token {
tok_type: sym!(LeftBrace),
lexeme: lexm!("{"),
literal: None,
span: Span {
line: 0,
start: 2,
end: 3,
line_end: 0,
},
},
Token {
tok_type: sym!(RightBrace),
lexeme: lexm!("}"),
literal: None,
span: Span {
line: 0,
start: 3,
end: 4,
line_end: 0,
},
},
Token {
tok_type: sym!(Semicolon),
lexeme: lexm!(";"),
literal: None,
span: Span {
line: 0,
start: 4,
end: 5,
line_end: 0,
},
},
Token {
lexeme: lexm!(","),
tok_type: sym!(Comma),
literal: None,
span: Span {
line: 0,
start: 5,
end: 6,
line_end: 0,
},
},
Token {
tok_type: sym!(Plus),
lexeme: lexm!("+"),
literal: None,
span: Span {
line: 0,
start: 6,
end: 7,
line_end: 0,
},
},
Token {
tok_type: sym!(Minus),
lexeme: lexm!("-"),
literal: None,
span: Span {
line: 0,
start: 7,
end: 8,
line_end: 0,
},
},
Token {
tok_type: sym!(Star),
lexeme: lexm!("*"),
literal: None,
span: Span {
line: 0,
start: 8,
end: 9,
line_end: 0,
},
},
Token {
tok_type: sym!(Slash),
lexeme: lexm!("/"),
literal: None,
span: Span {
line: 0,
start: 9,
end: 10,
line_end: 0,
},
},
Token {
tok_type: sym!(Dot),
lexeme: lexm!("."),
literal: None,
span: Span {
line: 0,
start: 10,
end: 11,
line_end: 0,
},
},
Token {
tok_type: Eof,
lexeme: None,
literal: None,
span: Span {
line: 0,
start: 12,
end: 12,
line_end: 0,
},
},
],
);
}
// ...
Integration Tests
For scenarios where broader testing is necessary, integration tests offer a more holistic approach. These tests reside in a separate tests directory and operate as standalone programs. By spawning new processes, integration tests closely mimic real-world usage scenarios, providing a higher-level validation of the overall system behavior. Here is an example from tests/runtime/blocks.rs:
use lykiadb_server::runtime::{
error::ExecutionError,
interpreter::{
test_helpers::{exec_assert, get_runtime},
InterpretError,
},
types::RV,
};
use std::sync::Arc;
#[test]
fn test_blocks_0() {
exec_assert(
"var $a = \"global a\";
var $b = \"global b\";
var $c = \"global c\";
{
var $a = \"outer a\";
var $b = \"outer b\";
{
var $a = \"inner a\";
TestUtils.out($a);
TestUtils.out($b);
TestUtils.out($c);
}
TestUtils.out($a);
TestUtils.out($b);
TestUtils.out($c);
}
TestUtils.out($a);
TestUtils.out($b);
TestUtils.out($c);",
vec![
RV::Str(Arc::new("inner a".to_string())),
RV::Str(Arc::new("outer b".to_string())),
RV::Str(Arc::new("global c".to_string())),
RV::Str(Arc::new("outer a".to_string())),
RV::Str(Arc::new("outer b".to_string())),
RV::Str(Arc::new("global c".to_string())),
RV::Str(Arc::new("global a".to_string())),
RV::Str(Arc::new("global b".to_string())),
RV::Str(Arc::new("global c".to_string())),
],
);
}
2.1. Test Execution Mechanisms
In Rust, test execution is a seamless process facilitated by the cargo command-line tool. Running cargo test automatically discovers and executes all unit tests within the project. This simplicity extends to integration tests, which can be executed using cargo test --test <test_module>. Rust's test harness ensures a consistent and efficient execution of tests, making it easy for developers to validate their code regularly.
2.2. Code Coverage Using Codecov
Code coverage analysis is a crucial aspect of understanding how thoroughly tests exercise the codebase. While Rust does not have native support for code coverage, external tools like Codecov seamlessly integrate with Rust projects. So I integrated Codecov and now I can see the code coverage of the project. You can check out the report by clicking here. The GitHub Actions integration simply looks like this:
name: Coverage
on: [pull_request, push]
jobs:
  coverage:
    runs-on: ubuntu-latest
    env:
      CARGO_TERM_COLOR: always
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust
        run: rustup update stable
      - name: Install cargo-llvm-cov
        uses: taiki-e/install-action@cargo-llvm-cov
      - name: Generate code coverage
        run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: lcov.info
          fail_ci_if_error: true
Of course, there are other code coverage tools like tarpaulin; you can use another tool if you want. But I'm happy with Codecov, so I'll stick with it.
3. Executing code
OK. I think I have bored you, way more than you need, with converting raw code to an AST. Even if you just skimmed through this post and the previous one, you know that we have a comprehensive grammar and a parser that produces ASTs. Those ASTs, however, are not more than some deeply nested data structures if we don't execute them. In this section, we see a pretty naive way of executing code: tree-walk interpreting. We'll also examine the visitor pattern, which is a design pattern commonly used in tree-walk interpreters and software engineering in general.
Tree-walk interpreting
A tree-walk interpreter, a classic and straightforward approach, traverses the abstract syntax tree (AST) recursively, executing operations as it encounters each node. While conceptually simple, this method ensures a clear understanding of the language's structure. However, it might not be the most performant choice, especially for complex languages or large-scale applications, as it incurs a significant runtime overhead. For LykiaQL, most of the resource-intensive stuff will be dispatched to the database engine, so I didn't diverge from the initial tree-walk interpreter approach presented in the book.
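To make the idea concrete, here is a stripped-down sketch (a toy expression type, not LykiaQL's actual AST) of what "walking the tree" means: the evaluator simply recurses into each node and computes a value on the way back up:
// Toy AST, illustration only.
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// The tree-walk interpreter: one recursive function over the tree.
fn eval(e: &Expr) -> f64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(l, r) => eval(l) + eval(r),
        Expr::Mul(l, r) => eval(l) * eval(r),
    }
}

fn main() {
    // (1 + 2) * 3
    let ast = Expr::Mul(
        Box::new(Expr::Add(
            Box::new(Expr::Num(1.0)),
            Box::new(Expr::Num(2.0)),
        )),
        Box::new(Expr::Num(3.0)),
    );
    assert_eq!(eval(&ast), 9.0);
}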
Random extra information
As you may know, there are different ways to execute code. This part just mentions some extra useless information about different execution techniques. If you're not interested in this kind of stuff, you can skip this part.
Bytecode and VMs
The term "bytecode" refers to a compact, platform-independent set of instructions that sits between the source code and machine code. Our program, which runs other programs, (commonly referred to as a virtual machine) reads and executes this bytecode, rather than traversing the AST. Despite being more performant than a tree-walk interpreter, bytecode still incurs a runtime overhead.
Java is probably the most iconic example of a language that uses bytecode. The Java compiler transforms the source code into bytecode, which is then executed by the Java Virtual Machine (JVM). The JVM is a runtime environment that provides the necessary resources for executing Java programs. It is responsible for loading the bytecode, verifying its correctness, and executing it. The JVM is also responsible for memory management, garbage collection, etc.
In the realm of databases, SQLite is a popular example of a database engine that uses bytecode. SQLite compiles SQL statements into bytecode, which is then executed by the SQLite Virtual Machine. The VM is a runtime environment that provides the necessary resources for executing SQLite queries.
So this was definitely a possible approach for LykiaQL, but I didn't want to spend too much time on the runtime and I opted for the tree-walk interpreter approach. Maybe someday, when I'm mad enough, I'll implement a bytecode VM, too.
JIT (Just-In-Time Compilation) and Native Compilation
JIT compilation involves translating bytecode into machine code at runtime, potentially enhancing performance by exploiting runtime information. On the other hand, native compilation transforms the entire source code into machine code ahead of execution, offering maximum performance but often at the cost of longer startup times. Both JIT and native compilation are powerful techniques employed by languages seeking to balance execution speed and resource efficiency.
After this brief introduction, let's get back to our tiny, dead-simple tree-walk interpreter. We'll start with the visitor pattern, then we'll examine the variable resolving process.
3.1. The Visitor Pattern
The visitor pattern is a design pattern commonly used in tree-walk interpreters. It allows the separation of operations from the data structure, enabling the addition of new operations without modifying the existing structure. By implementing visitor interfaces, each node in the AST can accept visitors, facilitating extensibility and maintainability in the interpreter's codebase. For the relevant chapter in the book, click here. The Java version of the visitor pattern is a bit more verbose than the Rust version, but in essence, they are doing the same thing in a similar way. Before proceeding, please take a look at Visitor Pattern/Rust Design Patterns for a simple example. Our approach will be essentially the same, except for the spicier AST we have.
The following example demonstrates the visitor pattern in action. The VisitorMut trait defines a visit method per node category, and the Interpreter struct implements it. The visit_expr method retrieves the expression node from the arena and matches it against the various expression types. For each expression type, the corresponding match arm handles the work and the result is returned. By design, the visit methods are responsible for recursively visiting the child nodes, ensuring that the entire AST is traversed.
Since the visiting code is separated from the AST, it is easy to add new operations without modifying the existing structure. For example, we can add a new operation for pretty printing the AST by implementing the same visitor trait on a new type. There is no possibility of missing a node type, as the compiler will complain if we forget to handle one in the match. Adding a brand-new node type, on the other hand, does require touching every visitor, but again the compiler's exhaustiveness checking points out exactly where.
// VisitorMut is a simple trait that defines the visit methods
// for each node type, with mutable self in the signature.
impl VisitorMut<RV, HaltReason> for Interpreter {
fn visit_stmt(...) { ... }
fn visit_expr(&mut self, eidx: ExprId) -> Result<RV, HaltReason> {
let a = Rc::clone(&self.arena);
// We simply retrieve the expression node from the arena
let e = a.get_expression(eidx);
match e {
Expr::Get { object, name, span } => {
let object_eval = self.visit_expr(*object)?;
// ...
}
Expr::Select { query, span: _ } => // ...
Expr::Insert { command, span: _ } => // ...
Expr::Update { command, span: _ } => // ...
Expr::Delete { command, span: _ } => // ...
Expr::Literal {
value,
raw: _,
span: _,
} => {
//...
},
Expr::Grouping { expr, span: _ } => {
//...
}
Expr::Unary {
operation,
expr,
span: _,
} => // ...
Expr::Binary {
operation,
left,
right,
span: _,
} => // ...
Expr::Variable { name, span: _ } => // ...
Expr::Assignment { dst, expr, span: _ } => {
//...
}
Expr::Logical {
left,
operation,
right,
span: _,
} => {
//...
}
//...
}
}
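Before moving on, here is a stripped-down illustration of why this separation pays off (a toy AST and visitor trait, not LykiaDB's real VisitorMut or Expr types): two independent operations over the same tree, neither of which required touching the node definitions:
// Toy AST and visitor trait, illustration only.
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
}

trait Visitor<T> {
    fn visit_expr(&mut self, e: &Expr) -> T;
}

// Operation #1: evaluate the tree.
struct Evaluator;
impl Visitor<f64> for Evaluator {
    fn visit_expr(&mut self, e: &Expr) -> f64 {
        match e {
            Expr::Num(n) => *n,
            Expr::Add(l, r) => self.visit_expr(l) + self.visit_expr(r),
        }
    }
}

// Operation #2: pretty-print the tree, added without modifying Expr.
struct Printer;
impl Visitor<String> for Printer {
    fn visit_expr(&mut self, e: &Expr) -> String {
        match e {
            Expr::Num(n) => n.to_string(),
            Expr::Add(l, r) => format!("({} + {})", self.visit_expr(l), self.visit_expr(r)),
        }
    }
}

fn main() {
    let ast = Expr::Add(Box::new(Expr::Num(1.0)), Box::new(Expr::Num(2.0)));
    let mut eval = Evaluator;
    let mut printer = Printer;
    assert_eq!(eval.visit_expr(&ast), 3.0);
    assert_eq!(printer.visit_expr(&ast), "(1 + 2)");
}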
In a nutshell, the Visitor just walks through the AST and executes the operations defined for each node type. The ultimate goal of the visitor methods is just to return an RV (Runtime Value) or a HaltReason. The RV is the result of the operation, and the HaltReason is the reason for halting the execution. The HaltReason can be a return statement or simply an error.
types.rs
#[derive(Debug, Clone)]
pub enum RV {
Str(Arc<String>),
Num(f64),
Bool(bool),
Object(Shared<FxHashMap<String, RV>>),
Array(Shared<Vec<RV>>),
Callable(Option<usize>, Arc<Function>),
Undefined,
NaN,
Null,
}
// ...
#[derive(Clone)]
pub enum Function {
Lambda {
function: fn(&mut Interpreter, &[RV]) -> Result<RV, HaltReason>,
},
Stateful(Shared<dyn Stateful + Send + Sync>),
UserDefined {
name: String,
program: Arc<Program>,
parameters: Vec<String>,
closure: EnvId,
body: Arc<Vec<StmtId>>,
},
}
pub trait Stateful {
fn call(&mut self, interpreter: &mut Interpreter, rv: &[RV]) -> Result<RV, HaltReason>;
}
interpreter.rs
#[derive(Debug)]
pub enum HaltReason {
Error(InterpretError),
Return(RV),
}
You can see the full code examples in the following gist:
4. Error handling and reporting
Anything that can go wrong will go wrong. So, we need to handle errors gracefully. In the previous blog post, we discussed the error handling mechanisms in Rust. In this one, we'll see how an error is reported to the user. We're using a fancy error reporting library called ariadne for this purpose. It provides a simple and expressive API for generating and displaying error reports, making it easier to communicate errors to users. The error reporting mechanism is an essential part of any interpreter, as it helps users understand what went wrong and how to fix it.
Whenever we encounter an error during parsing, resolution, or interpretation, we generate an error report via this function call:
report_error(filename, &content, err.clone())
Ariadne generates a detailed error report, including the error message, source code snippet, and a helpful hint for resolving the issue. By providing users with clear and concise error messages, we can enhance the user experience and facilitate debugging. The error report is printed to the console, making it easy for users to identify and address the issue. Here is an example of an error report generated by Ariadne:
REPL mode
lykia > call
[03] Error: Missing token at line 1
╭─[prompt:?:?]
│
1 │ call
│ ──┬─
│ ╰─── Add a Symbol(Semicolon) token after "call".
───╯
lykia >
Here is the body of that function:
use crate::lang::{
parser::{resolver::ResolveError, ParseError},
tokenizer::{scanner::ScanError, token::Span},
};
use super::interpreter::InterpretError;
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ExecutionError {
Scan(ScanError),
Parse(ParseError),
Resolve(ResolveError),
Interpret(InterpretError),
}
impl From<ParseError> for ExecutionError {
fn from(err: ParseError) -> Self {
ExecutionError::Parse(err)
}
}
impl From<ScanError> for ExecutionError {
fn from(err: ScanError) -> Self {
ExecutionError::Scan(err)
}
}
pub fn report_error(filename: &str, source: &str, error: ExecutionError) {
use ariadne::{Color, Label, Report, ReportKind, Source};
// Generate & choose some colours for each of our elements
let out = Color::Fixed(81);
let print = |message: &str, hint: &str, span: Span| {
Report::build(ReportKind::Error, filename, 12)
.with_code(3)
.with_message(format!("{} at line {}", message, span.line + 1))
.with_label(
Label::new((filename, span.start..span.end))
.with_message(hint)
.with_color(out),
)
.finish()
.print((filename, Source::from(&source)))
.unwrap();
};
match error {
ExecutionError::Scan(ScanError::UnexpectedCharacter { span }) => {
print("Unexpected character", "Remove this character", span);
}
ExecutionError::Scan(ScanError::UnterminatedString { span }) => {
print(
"Unterminated string",
"Terminate the string with a double quote (\").",
span,
);
}
ExecutionError::Scan(ScanError::MalformedNumberLiteral { span }) => {
print(
"Malformed number literal",
"Make sure that number literal is up to spec.",
span,
);
}
ExecutionError::Parse(ParseError::MissingToken { token, expected }) => {
print(
"Missing token",
&format!(
"Add a {:?} token after \"{}\".",
expected,
token.lexeme.unwrap()
),
token.span,
);
}
ExecutionError::Parse(ParseError::NoTokens) => {
print(
"There is nothing to parse.",
"What about adding some tokens?",
Span::default(),
);
}
ExecutionError::Parse(ParseError::InvalidAssignmentTarget { left }) => {
print(
"Invalid assignment target",
&format!("No values can be assigned to {}", left.lexeme.unwrap()),
left.span,
);
}
ExecutionError::Parse(ParseError::UnexpectedToken { token }) => {
print(
"Unexpected token",
&format!(
"Unexpected token {}.",
token.lexeme.unwrap_or("None".to_string())
),
token.span,
);
}
ExecutionError::Interpret(InterpretError::ArityMismatch {
span,
expected,
found,
}) => {
print(
"Function arity mismatch",
&format!(
"Function expects {} arguments, while provided {}.",
expected, found
),
span,
);
}
ExecutionError::Interpret(InterpretError::UnexpectedStatement { span }) => {
print("Unexpected statement", "Remove this.", span);
}
ExecutionError::Interpret(InterpretError::NotCallable { span }) => {
print(
"Not callable",
"Expression does not yield a callable.",
span,
);
}
// ...
}
}
Afterword
There are many things I wanted to cover in this chapter, but I didn't want to make it even longer. If you have any questions, please don't hesitate to reach out.
So, what's next? I'm planning to continue this series in 2024 (but conclude it way before the end of 2024), and I'm hoping to cover the following topics very soon:
- Query planning (yay!)
- Storage engine (:gigachad:)
- Indexing (more :gigachad:)
- Query optimization (even more :gigachad:)