“I didn't have time to write a short letter, so I wrote a long one instead.”
― Mark Twain
Welcome, fellow enthusiasts, to a journey through the inner workings of LykiaDB. Today, we're diving into the fascinating realm of language scanning and parsing, demystifying the technical aspects. Buckle up; it's going to be a fun ride!
Here's the table of contents for the first chapter of "LykiaQL":
- An Overview of LykiaQL's Features
- Preliminary: Scanning and Parsing
- Hands-on: Scanning
- Hands-on: Parsing
- Expanding Horizons: Incorporating SQL Syntax into LykiaQL
- Conclusion
An Overview of LykiaQL's Features
Much like many wonderful discoveries in life, the inception of LykiaDB was far from a meticulously planned affair; in fact, it was quite the opposite. This journey began with the humble intention of creating a straightforward programming language. However, as time went on, the project began to take on a life of its own, revealing unforeseen potential for transformation into something far more captivating – akin to a toy database.
In this article, I'll try to give you an overview of how I implemented the language part. Before we proceed, I must give a friendly heads up and offer a sincere apology to our esteemed audience of experienced programmers. I fully comprehend that some of the initial explanations may come across as overly simplistic. However, I believe it is essential to provide a solid foundation, as it leaves less room for wrong assumptions.
Another thing I'd like to mention is that, even though most of the features you'll see here are implemented as of the time of writing, there are still some missing pieces, such as function namespaces, and a couple of others that will be implemented soon. Also, the syntax is not finalized yet and is subject to change. Still, these are just technicalities, and they don't affect the fundamental concepts that much. So let's get started!
Data Types
LykiaQL, as of now, does not come with explicit and strong typing, yet it naturally has some basic data types. These are similar to JavaScript's data types:
- Strings
- Numbers
- Booleans
- Objects
- Arrays
- Functions
and some constants to express Undefined, NaN and Null. The language is not strongly typed, so the syntax itself doesn't contain any type annotations. However, the runtime does have a type system, which is used to ensure that operations are performed on the correct types. We'll see how the runtime types work in the next chapter.
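To make that concrete, here is a minimal sketch of what a runtime value enum covering these types could look like. The actual RV type used by LykiaDB (which you'll see referenced in later snippets) may be shaped differently, and the Object/Array payloads below are just placeholder choices for illustration.
use std::collections::HashMap;
use std::rc::Rc;
// A hypothetical runtime value enum, for illustration only.
// The real RV type in LykiaDB may differ in variants and payloads.
pub enum RV {
    Str(Rc<String>),
    Num(f64),
    Bool(bool),
    Object(Rc<HashMap<String, RV>>),
    Array(Rc<Vec<RV>>),
    // A full implementation would also carry a callable variant for functions.
    Undefined,
    NaN,
    Null,
}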
Declarations
In LykiaQL, like in many other programming languages, declarations are fundamental statements used to define variables and their initial values. These variables serve as placeholders for data that can be manipulated and processed throughout the program. For instance:
var $x = 1; // Declares a variable (wow, such an observation, Sherlock)
In this example, we declare a variable $x and initialize it with the value 1. Declarations allow us to store and manage data, providing the foundation for various operations.
Expressions
Expressions are also fundamental, in terms of enabling the evaluation and computation of values. They serve as a means to perform calculations and generate results. Consider the following expression:
var $y = $x + 1; // Assigns the value of $x + 1 to $y
In this instance, the expression $x + 1 is evaluated, resulting in the sum of the current value of $x and 1. This calculated value is then assigned to the variable $y. Expressions empower LykiaQL to manipulate and transform data efficiently, facilitating various data-driven operations within your programs.
If statements
If statements are essential control flow structures in our language that enable conditional execution of code blocks. They evaluate a specified condition and execute different branches of code based on whether the condition is true or false. Here's an example:
var $z = 100;
if ($z > 50) {
print("> 50");
} else if ($z > 20) {
print("50 > $z > 20");
} else {
print("< 20");
}
In this scenario, the program checks the value of $z and prints different messages depending on the value's relationship with specific conditions. If statements allow us to make decisions and tailor the behavior according to the data.
Loops
Loops are crucial for repetitive tasks and iteration in a language. They enable the execution of a block of code multiple times until a specified condition is met. Here are a few examples:
for (var $i = 0; $i < 10000000; $i = $i+1) {
if ($i > 20) break;
if ($i < 10) continue;
for (var $j = 0; $j < 10000000; $j = $j + 1) {
print("" + $j + ", " + $i);
if ($j > 3) break;
}
}
while(true) {
print("Hello");
break;
}
loop {
print("Hello");
}
These examples showcase different types of loops in LykiaQL. The for loop is used for precise control over iterations, while the while and loop constructs offer more flexible looping options. Loops are indispensable for performing repetitive tasks, such as data processing, until specific conditions are met.
Functions
My favorite part of programming languages, functions... They are essential for modularizing code and making it more maintainable. Here's an example:
fun add($a, $b) {
return $a + $b;
}
Even though they can be used for the side effects they cause, I prefer my functions to be pure, like a monk's vow of silence – with no unnecessary chatter (wink wink FP folks).
fun makeCounter() {
var $i = 0;
fun count() {
$i = $i + 1;
return $i;
}
return count;
}
var $count = makeCounter();
print($count()); // prints "1".
print($count()); // prints "2".
For functions, we follow some modern trends and treat them as first-class citizens. Meaning that they can be assigned to variables, passed as arguments to other functions, and returned from other functions. Combined with captured variables, like $i in makeCounter above, this gives us closures.
The moment you've been waiting for: Queries
If you've taken Programming Languages 101, you've likely encountered the terms "imperative" and "declarative" programming. These paradigms represent opposing philosophies in the programming world. Imperative programming focuses on instructing the computer on how to perform a task, whereas declarative programming is all about specifying what you want to achieve.
While modern programming languages have blurred the lines between these paradigms, many mainstream languages (such as JavaScript and Python) are rooted in imperative programming. To illustrate the difference, consider the following example of finding records with the names "Bob" or "David" in an array of records:
# An imperative example from Python:
records = [
{"id": 1, "name": "Alice"},
{"id": 2, "name": "Bob"},
{"id": 3, "name": "Charlie"},
{"id": 4, "name": "David"},
]
found_records = []
for record in records:
if record["name"] == "Bob" or record["name"] == "David":
found_records.append(record)
In this imperative example, we explicitly iterate through the array of records, checking each one for a matching condition (in this case, the names "Bob" or "David"). We then manually add the matching records to the found_records array. This approach focuses on how you want to achieve the goal, which may require extra effort to comprehend for non-technical readers.
-- Declarative:
SELECT * FROM records WHERE name IN ('Bob', 'David');
In contrast, query languages like SQL are declarative. The SQL query succinctly expresses your intention: selecting all records from the "records" table where the "name" column matches either "Bob" or "David." Here, the emphasis is on what you want to achieve, making it more intuitive.
Now that we've clarified these paradigms, let's return to LykiaQL. As you might have guessed, LykiaQL embraces both paradigms: it is a query language where queries are first-class citizens. They can be executed against both a database and simple variables, offering a versatile approach to problem-solving. It is declarative:
var $p = SELECT *, false as deleted FROM posts
UNION
SELECT *, true as deleted FROM deleted_posts;
print($p);
And also imperative:
var $p = SELECT *, false as deleted FROM posts
UNION
SELECT *, true as deleted FROM deleted_posts;
for (var $i = 0; $i < $p.length; $i = $i + 1) {
print($p[$i]);
}
Combining both paradigms may seem straightforward, but it can introduce subtle issues. Keen readers might have noticed one of these issues, prompting me to adopt the practice of adding a $ prefix to all variable names. This practice resolves conflicts between SQL identifiers and LykiaQL identifiers. When an identifier is prefixed with $, it is always treated as a runtime variable, ensuring a clear distinction.
var $user_id = 5;
var $p = SELECT * FROM posts where user_id = $user_id;
print($p);
The above code illustrates the challenge with a simple example. Note how user_id and $user_id are used for predicates. The former serves as an SQL identifier, while the latter represents a runtime variable.
Incorporating imperative features into a SQL-like language or vice versa is a well-established concept. Examples include PL/SQL and T-SQL, which extend SQL with procedural capabilities. Conversely, the reverse approach has been successfully implemented on a massive scale with LINQ, a concept familiar to .NET Framework users. LykiaQL follows this approach as well. Consider the following code snippet, which prints the names of users older than 20:
var $users = [
{"name": "Alice", "age": 20},
{"name": "Bob", "age": 30},
{"name": "Charlie", "age": 40},
{"name": "David", "age": 50},
];
var $names = SELECT name FROM $users WHERE age > 20;
This is a valid LykiaQL script that selects the names of users older than 20, whether the data comes from a database or a simple variable. This feature enables querying data from diverse sources, such as databases, files, and external APIs. I can hear your inner voice asking, "y tho". Well, the answer is simple: Because we can. And it's fun.
Heads up
In LykiaQL, SQL keywords are case insensitive, so you can write select or SELECT, it doesn't matter. However, other keywords are case sensitive, so you can't write var as VAR. You may ask why. Well, the answer is simple: Because I'm a crazy human-being.
With this foundation, in my opinion, we are now prepared to see what's going on under the hood. Let's dive in!
Preliminary: Scanning and Parsing
Starting from this section, things get real and we start writing some code. It's important to note that the example interpreter in the book ("Crafting Interpreters", which this project loosely follows) is written in Java, which makes expressing the same ideas in Rust somewhat challenging, especially if you're new to Rust, as I am. Consequently, crafting the right solutions took several iterations and experiments. As I learn and evolve alongside this project, I want to extend a friendly warning to experienced Rust programmers: some of the code snippets might not represent the best and most idiomatic Rust practices, and I'm open to receiving your valuable feedback and suggestions.
In "Crafting Interpreters," over 100 pages are dedicated to explaining the concepts I'm about to introduce. To keep this article concise, I'll focus on the most crucial aspects. If you're interested in diving deeper into the details, please check the book out.
Before we proceed, it's essential to clarify the fundamental concepts of scanning and parsing and their significance. Our ultimate goal is to transform a string of characters into a syntax tree—a data structure that represents the source code in a format interpretable by a program. This data structure is often referred to as an abstract syntax tree (AST). Achieving this goal involves two essential steps: scanning and parsing. While some compilers and interpreters merge these steps, we'll keep them separate. Here's a simple glossary:
- Source: We begin with a string of characters, which constitutes the source code of our program. This source can be stored in a file or provided as a command input.
- Token: Tokens are meaningful units of code, encompassing elements like keywords, identifiers, numbers, and operators. We'll explore concrete examples of tokens shortly, so stay tuned.
- Scanning: Initially, a code file or command input is essentially a gibberish byte sequence to a computer. The meaning emerges when we start extracting tokens from it, effectively converting it into a sequence of tokens. Here's a pseudo-code example illustrating scanning:
input = "var $x = 50 + 25"
output = scan(input)
// [VAR, IDENTIFIER, EQUAL, NUMBER, PLUS, NUMBER]
- Parsing: While tokens provide more meaningful representations compared to raw byte sequences, they still fall short of fully capturing a program's structure. To address this, we must assemble these tokens into a syntax tree. Here's a pseudo-code example demonstrating parsing (we'll delve into the details shortly):
input = [VAR, IDENTIFIER, EQUAL, NUMBER, PLUS, NUMBER]
output = parse(input)
/*
{
type: "Program",
body: [
{
type: "VariableDeclaration",
identifier: "$x",
init: {
type: "BinaryExpression",
operator: "+",
left: {
type: "NumberLiteral",
value: 50
},
right: {
type: "NumberLiteral",
value: 25
}
}
}
]
}
*/
Hands-on: Scanning
As mentioned earlier, scanning is the process of converting a string of characters into a sequence of tokens. Before proceeding, let's define what a token looks like in LykiaQL:
pub struct Token {
// The type of the token, e.g., TokenType::Num
pub tok_type: TokenType,
// The actual string of the token, if it is an identifier or a string
pub lexeme: Option<Rc<String>>,
// Runtime value, if exists, e.g., RV::Num(45). Ignore this for now.
pub literal: Option<RV>,
// The line number where the token is located
pub line: u32
}
Rust Hint
In Rust, Option is an enumeration type that is used to represent the presence or absence of a value. It's often employed in scenarios where a value might be missing or where there's a need to handle potential absence gracefully.
None: None is one of the variants of the Option enum. It signifies the absence of a value. When a function or operation returns None, it means that no meaningful value is available.
Some: Some is the other variant of the Option enum. It's used to wrap a value that is present or meaningful. When a function or operation returns Some(value), it indicates that a valid value is contained within the Some variant.
The Option type, along with None and Some, is integral to Rust's approach to handling potentially nullable or missing values in a type-safe manner. It encourages developers to explicitly handle the absence of values, reducing the chances of null pointer errors and improving code reliability.
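As a tiny, self-contained illustration (not from the LykiaDB codebase), here is how Option, Some and None typically show up in practice:
// Returns Some(first digit) if the string contains one, otherwise None.
fn find_first_digit(s: &str) -> Option<char> {
    s.chars().find(|c| c.is_ascii_digit())
}
fn main() {
    match find_first_digit("var $x = 42") {
        Some(c) => println!("first digit: {}", c),
        None => println!("no digit found"),
    }
}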
And token types are listed here:
#[derive(Debug, Clone, Eq, PartialEq)]
pub enum TokenType {
Str,
Num,
Null,
False,
True,
Identifier {
dollar: bool,
},
Symbol(Symbol),
Keyword(Keyword),
SqlKeyword(SqlKeyword),
Eof
}
Please pay attention to the Symbol, Keyword and SqlKeyword variants. They are basically other enums used for categorizing tokens, to keep them more organized. For instance, the Symbol enum contains all the symbols, such as +, -, *, /, ( and ). The Keyword enum contains all the keywords, such as var, if, else, for and while. And the SqlKeyword enum contains all the SQL keywords, such as SELECT, FROM and WHERE.
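Their shape is roughly as follows; note that this is an abbreviated sketch for illustration, and the actual enums in the codebase contain more variants than listed here.
#[derive(Debug, Clone, Eq, PartialEq)]
pub enum Symbol {
    Plus, Minus, Star, Slash,
    LeftParen, RightParen, LeftBrace, RightBrace,
    Comma, Semicolon,
    Bang, BangEqual,
    Equal, EqualEqual,
    Less, LessEqual,
    Greater, GreaterEqual,
}
#[derive(Debug, Clone, Eq, PartialEq)]
pub enum Keyword {
    Var, Fun, If, Else, For, While, Loop, Break, Continue, Return,
}
#[derive(Debug, Clone, Eq, PartialEq)]
pub enum SqlKeyword {
    Select, From, Where, Union, All, Except, Intersect, Order, By, Limit,
}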
As we now know what tokens are, we can continue with the scanning process. In fact, a scanner is nothing more than a simple state machine. It reads the input character by character, and depending on the current state, it decides what to do with the character. In a nutshell, it does the following:
initialize Scanner
while not end of input:
peek at the next character and mark the "starting point"
determine "token type" based on character features
call function to "consume" characters belonging to the token
if something goes wrong, halt the process and propagate the error
add the "token" to the list
repeat until end of input
And here is the real Rust equivalent of this pseudo-code:
fn scan_token(&mut self) -> Result<(), ScanError> {
let c = self.advance();
// As you can see, Rust's pattern matching is a great fit for this task
match c {
'\n' => self.line += 1,
' ' | '\r' | '\t' => (),
'"' => self.string()?,
'0'..='9' => self.number()?,
'A'..='z' => self.identifier(false),
'$' => self.identifier(true),
'!' => self.add_double_token(&c.to_string(),'=', sym!(Bang), sym!(BangEqual)),
'=' => self.add_double_token(&c.to_string(),'=', sym!(Equal), sym!(EqualEqual)),
'<' => self.add_double_token(&c.to_string(),'=', sym!(Less), sym!(LessEqual)),
'>' => self.add_double_token(&c.to_string(),'=', sym!(Greater), sym!(GreaterEqual)),
'/' => {
if self.match_next('/') {
while !self.is_at_end() && self.peek(0) != '\n' {
self.advance();
}
} else {
self.add_token(&c.to_string(),sym!(Slash));
}
},
other => {
if let Some(sym) = SYMBOLS.get(&other) {
self.add_token(&other.to_string(), sym.clone());
} else {
return Err(ScanError::UnexpectedCharacter {line: self.line, chr: c})
}
}
}
Ok(())
}
I'd like to use this great opportunity to pause the scanning stuff for just a bit, and give you some Rust hints:
Rust Hint
Rust approaches "class" and "object" concepts a bit differently, and separates the state from the behavior. Here, the state is contained within a struct, while the behavior is implemented separately in impl blocks. You have the flexibility to add multiple impl blocks as needed.
pub struct Dog {
name: String,
age: u32,
}
impl Dog {
pub fn new(name: String, age: u32) -> Self {
Dog {
name,
age,
}
}
pub fn bark(&self) {
println!("Woof! My name is {}", self.name);
}
}
Rust Hint
In Rust, the concept of self is equivalent to Python's self. This means that when you define methods within a Rust struct (similar to a class in Python), you explicitly pass the instance of the struct as the first argument to the methods. If a method doesn't require access to the instance (i.e., it's a static method), you can call it using the :: operator, like String::from("Hello"). It's customary in Rust to define constructors as static methods and name them new. So, if you encounter a method named new, it's likely a constructor.
Rust Hint
Going back to the scan_token method, Rust does not have exceptions and try/catch blocks. Instead, it uses the Result type (a breeze from the functional programming universe), which is an enum with two variants: Ok and Err. The Ok variant is used to indicate that the operation was successful (in the example we're using the Scanner's internal state to store the scanned result, so we return the unit type () instead of returning something else in the Ok variant), and the Err variant is used to indicate that the operation failed. So, in the code above, Result<(), ScanError> means that the function returns a Result type, where the Ok variant contains an empty tuple and the Err variant contains a ScanError. ScanError is also an enum, and it has several variants, such as UnexpectedCharacter, UnterminatedString, etc. So, if the function returns Ok(()), it means that the operation was successful, and if it returns Err(ScanError::UnexpectedCharacter), it means that the operation failed because an unexpected character was encountered.
Again, attentive readers may ask the question "what if something wrong happens in the string() function?". Well, the answer is simple: we halt the execution and propagate the error to the caller using the ? operator. It basically saves us from writing a couple of extra lines, making our code more readable and concise. To gain deeper and more complete knowledge about error handling in Rust, I also strongly encourage you to check out the well-known blog post, Error Handling in Rust.
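If you want a compact, self-contained illustration of the Result and ? pattern (again, a toy example rather than the actual scanner code, and the field of UnterminatedString is made up here), it looks like this:
#[derive(Debug)]
enum ScanError {
    UnexpectedCharacter { line: u32, chr: char },
    UnterminatedString { line: u32 },
}
// A toy fallible operation.
fn expect_digit(c: char, line: u32) -> Result<u32, ScanError> {
    c.to_digit(10)
        .ok_or(ScanError::UnexpectedCharacter { line, chr: c })
}
// `?` propagates the Err variant to the caller, just like in scan_token.
fn sum_two_digits(a: char, b: char, line: u32) -> Result<u32, ScanError> {
    Ok(expect_digit(a, line)? + expect_digit(b, line)?)
}
fn main() {
    println!("{:?}", sum_two_digits('4', '2', 0)); // Ok(6)
    println!("{:?}", sum_two_digits('4', 'x', 0)); // Err(UnexpectedCharacter { .. })
}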
With this understanding, you're now prepared to explore the actual code: a "Scanner" struct and its associated functionality for converting source code into a sequence of tokens. The key components include:
- The Scanner struct holds information about the source code, the generated tokens, the current position, and the current line number (a rough sketch of the struct is shown right after this list).
- We define ScanError to handle potential scanning errors.
- Scanner has several methods to perform scanning:
  - scan: This is the main entry point to initiate scanning. It takes the source code as input and returns a Result containing either a vector of tokens or a ScanError.
  - match_next: Checks if the next character in the source code matches the expected character.
  - peek: Peeks at the character at an offset from the current position.
  - advance: Moves the current position forward and returns the character at that position.
  - add_token: Adds a new token to the list of tokens being generated.
  - add_double_token: Handles double-character tokens like '==' or '!='.
  - is_at_end: Checks if the scanner has reached the end of the source code.
  - string: Scans and handles string literals.
  - number: Scans and handles numeric literals.
  - identifier: Scans and handles identifiers (variable names, keywords).
  - scan_token: Scans and processes a single token.
  - scan_tokens: Initiates the scanning process for the entire source code.
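For reference, here is a rough sketch of the state such a Scanner struct might hold; the field names are my guesses for illustration rather than the exact ones used in LykiaDB:
pub struct Scanner {
    // The source code, kept as characters for easy indexing.
    chars: Vec<char>,
    // Tokens produced so far.
    tokens: Vec<Token>,
    // Index of the first character of the token currently being scanned.
    start: usize,
    // Index of the character currently being examined.
    current: usize,
    // Current line number, used for error reporting.
    line: u32,
}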
So let's take a look at an example input and expected output:
// example:
let tokens: Vec<Token> = Scanner::scan("var $x = 50 + 25").unwrap();
println!("{:?}", tokens);
// output:
[
Token {
tok_type: Keyword(Var),
lexeme: Some("var"),
literal: None,
line: 0
},
Token {
tok_type: Identifier { dollar: true },
lexeme: Some("$x"),
literal: Some(Str("$x")),
line: 0
},
Token {
tok_type: Symbol(Equal),
lexeme: Some("="),
literal: None,
line: 0
},
Token {
tok_type: Num,
lexeme: Some("50"),
literal: Some(Num(50.0)),
line: 0
},
Token {
tok_type: Symbol(Plus),
lexeme: Some("+"),
literal: None,
line: 0
},
Token {
tok_type: Num,
lexeme: Some("25"),
literal: Some(Num(25.0)),
line: 0
},
Token {
tok_type: Eof,
lexeme: Some(" "),
literal: None,
line: 1
}
]
Hands-on: Parsing
It's impressive that you're still reading... Here we are, finally. Parsing, the most exciting part of this article.
Now we have a sequence of tokens, but we still don't know what they mean.
Parsing will make those tokens more meaningful (I'm not talking about THE meaning, it is just the meaning of tokens...) by converting them into a syntax tree. As Crafting Interpreters dictates, we'll be using Recursive Descent Parsing to do that, which is a top-down parsing technique that uses a set of mutually recursive functions to process the input. In this approach, we start from the most generic parsing rule, and each function implements a single grammar rule. Functions call other functions to process the input, and the process continues recursively until the entire input is processed. Here is the grammar:
program -> statement* EOF
declaration -> varDeclaration
| funDeclaration
| statement
statement -> ifStatement
| whileStatement
| forStatement
| loopStatement
| breakStatement
| continueStatement
| returnStatement
| block
| expressionStatement
varDeclaration -> "var" IDENTIFIER ("=" expression)? ";"
funDeclaration -> "fun" IDENTIFIER "(" parameters? ")" block
parameters -> IDENTIFIER ("," IDENTIFIER)*
block -> "{" statement* "}"
ifStatement -> "if" "(" expression ")" statement ("else" statement)?
whileStatement -> "while" "(" expression ")" statement
forStatement -> "for" "(" (varDeclaration | expressionStatement)? ";" expression? ";" expression? ")" statement
loopStatement -> "loop" statement
breakStatement -> "break" ";"
continueStatement -> "continue" ";"
returnStatement -> "return" expression? ";"
expressionStatement -> expression ";"
expression -> assignment
assignment -> conditional
conditional -> logicalOr ("?" expression ":" conditional)?
logicalOr -> logicalAnd ("||" logicalAnd)*
logicalAnd -> equality ("&&" equality)*
equality -> comparison (("!=" | "==") comparison)*
comparison -> term ((">" | ">=" | "<" | "<=") term)*
term -> factor (("+" | "-") factor)*
factor -> unary (("*" | "/") unary)*
unary -> ("!" | "-") unary | call;
call -> primary ("(" arguments? ")")*
primary -> NUMBER | STRING | "true" | "false" | "null"
| IDENTIFIER | "(" expression ")"
arguments -> expression ("," expression)*
Despite the seemingly complex rules, this is actually a straightforward language, and the grammar isn't as intricate as it might appear. Once you grasp the underlying structure, it becomes surprisingly simple. Let's examine a snippet from the parser code to illustrate:
fn statement(&mut self) -> ParseResult<StmtId> {
match_next!(self, kw!(If), if_statement);
match_next!(self, kw!(While), while_statement);
match_next!(self, kw!(For), for_statement);
match_next!(self, kw!(Loop), loop_statement);
match_next!(self, kw!(Break), break_statement);
match_next!(self, kw!(Continue), continue_statement);
match_next!(self, kw!(Return), return_statement);
match_next!(self, sym!(LeftBrace), block);
self.expression_statement()
}
fn if_statement(&mut self) -> ParseResult<StmtId> {
self.expected(sym!(LeftParen))?;
let condition = self.expression()?;
self.expected(sym!(RightParen))?;
let if_branch = self.statement()?;
if self.match_next(kw!(Else)) {
let else_branch = self.statement()?;
return Ok(self.arena.statement(Stmt::If(condition, if_branch, Some(else_branch))));
}
Ok(self.arena.statement(Stmt::If(condition, if_branch, None)))
}
Notice how statement calls functions like if_statement, and if_statement in turn calls expression and other functions. This is the essence of recursive descent parsing. Once you define the grammar rules, you just need to implement these functions, and they naturally call each other. It's almost like magic, isn't it?
% ...
statement -> ifStatement
| whileStatement
| forStatement
| loopStatement
| breakStatement
| continueStatement
| returnStatement
| block
| expressionStatement
ifStatement -> "if" "(" expression ")" statement ("else" statement)?
% ...
Now, let's delve into the fundamental building blocks of the syntax tree, which we've been discussing, namely the nodes. Our syntax tree consists of two types of nodes: statements and expressions. Here are the definitions:
pub enum Stmt {
Expression(ExprId),
Function(Token, Vec<Token>, Rc<Vec<StmtId>>),
Declaration(Token, ExprId),
Block(Vec<StmtId>),
If(ExprId, StmtId, Option<StmtId>),
Loop(Option<ExprId>, StmtId, Option<StmtId>),
Break(Token),
Continue(Token),
Return(Token, Option<ExprId>)
}
pub enum Expr {
Select(SqlSelect),
Binary(Token, ExprId, ExprId),
Grouping(ExprId),
Literal(RV),
Unary(Token, ExprId),
Variable(Token),
Assignment(Token, ExprId),
Logical(ExprId, Token, ExprId),
Call(ExprId, Token, Vec<ExprId>),
}
These nodes are allocated by our arena allocator, which is a simple data structure that stores the nodes in a contiguous memory block. This approach is more efficient than allocating nodes individually on the heap.
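In spirit, the arena is little more than a pair of vectors plus index types. Here is a minimal sketch under that assumption; the name ParserArena and the internals of ExprId and StmtId are illustrative guesses, not the exact LykiaDB definitions:
#[derive(Debug, Clone, Copy)]
pub struct ExprId(pub usize);
#[derive(Debug, Clone, Copy)]
pub struct StmtId(pub usize);
// A hypothetical arena; the real one in LykiaDB may differ.
pub struct ParserArena {
    expressions: Vec<Expr>,
    statements: Vec<Stmt>,
}
impl ParserArena {
    pub fn expression(&mut self, expr: Expr) -> ExprId {
        self.expressions.push(expr);
        ExprId(self.expressions.len() - 1)
    }
    pub fn statement(&mut self, stmt: Stmt) -> StmtId {
        self.statements.push(stmt);
        StmtId(self.statements.len() - 1)
    }
}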
Rust Tip
I don't want to get into the details of the arena allocator, at least in this chapter. In a nutshell, arenas in Rust provide a memory management approach optimized for allocations and deallocations, reducing the overhead associated with traditional heap allocation. There are plenty of resources out there that can help you gain a more in-depth understanding of how to use arenas effectively in your Rust projects.
Now, let's take a look at the full parser code that generates these nodes:
As previously mentioned, the structure of the parser code closely mirrors the grammar's simplicity. It consists of a series of functions that mutually call each other in a recursive manner. The only element that might appear perplexing is the ParseResult type, which is essentially an alias for Result<T, ParseError>. Here, ParseError is an enumeration that encompasses all conceivable parsing errors. So, when you encounter a function returning ParseResult<StmtId>, it signifies that the function yields a Result type: the Ok variant contains a StmtId, while the Err variant contains a ParseError. It's also worth noting that ExprId and StmtId are about arena allocation, but delving into those specifics is beyond the scope of this chapter.
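Spelled out, that boils down to something like the following; the ParseError variants shown here are invented for illustration, and the actual enum in the codebase almost certainly differs:
pub type ParseResult<T> = Result<T, ParseError>;
#[derive(Debug)]
pub enum ParseError {
    // Hypothetical variants, for illustration only.
    UnexpectedToken { token: Token },
    MissingToken { expected: TokenType, line: u32 },
}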
The output of the parser for our previous example is as follows:
// example:
let tokens: Vec<Token> = Scanner::scan("var $x = 50 + 25;").unwrap();
let stmts = Parser::parse(&tokens).unwrap();
println!("{:?}", stmts);
So the following output is just for demonstration purposes, and it's not the actual output of the parser. The actual output is a sequence of StmtIds, which are just indices of the nodes in the arena allocator. I've converted them to their corresponding nodes for the sake of readability. Once I find a better way to represent the nodes, I'll update this section.
// output:
[
Stmt::Declaration(
Token {
tok_type: Keyword(Var),
lexeme: Some("var"),
literal: None,
line: 0
},
ExprId(0)
// Variable(Token {
// tok_type: Identifier { dollar: true },
// lexeme: Some("$x"),
// literal: Some(Str("$x")),
// line: 0
// })
),
Stmt::Expression(
ExprId(1)
// Binary(Token, ExprId, ExprId)
// Token {
// tok_type: Symbol(Plus),
// lexeme: Some("+"),
// literal: None,
// line: 0
// },
// Literal(RV::Num(50.0)),
// Literal(RV::Num(25.0))
)
]
Expanding Horizons: Incorporating SQL Syntax into LykiaQL
You've already had a glimpse of the SQL-related scanning and parsing code snippets, but you don't know how I came up with them, as these are not part of the original language "Lox" explained in the book. Actually, we simply need a context-free grammar to extend our parser. Since I couldn't find a ready-to-use one, a quick Google search led me to the meticulously crafted railroad diagrams for the SQLite grammar. After studying them briefly, I came up with the following modification to the grammar:
% ... all the others
unary -> ("!" | "-") unary | selectWrap;
% instead of [ unary -> ("!" | "-") unary | call; ]
selectWrap -> call | select
select -> selectCore (compoundOp selectCore)*
("ORDER" "BY" orderingTerm ("," orderingTerm)*)?
("LIMIT" sqlExpr (("OFFSET" | ",") sqlExpr)?)?
compoundOp -> "UNION" "ALL" | "UNION" | "EXCEPT" | "INTERSECT"
orderingTerm-> sqlExpr ("ASC" | "DESC")?
selectCore -> "SELECT" ("DISTINCT" | "ALL")? resultColumn ("," resultColumn)*
("FROM" (joinClause | (tableOrSubquery ("," tableOrSubquery)*))?
("WHERE" sqlExpr)?
("GROUP" "BY" sqlExpr ("," sqlExpr)* ("HAVING" sqlExpr)?)?
resultColumn-> "*" | IDENTIFIER "." "*" | sqlExpr ("AS"? IDENTIFIER)?
joinClause -> tableOrSubquery ("INNER" | "LEFT" "OUTER") "JOIN" tableOrSubquery ("ON" sqlExpr)?
tableOrSubquery -> % Work in progress
sqlExpr -> % Work in progress
% Planning to link it to "expression" rule
I studied the diagrams mentioned above and transcribed the grammar onto paper with a pen. While I was in the process of digitizing it, GitHub's Copilot astonishingly anticipated the correct rules even faster than I could type them out. It was honestly annoying, yet I'm not complaining, as it functions as a poor man's formal verification...
So I extended the grammar with the rules I transcribed and modified the parser accordingly. It's not complete yet, but it has proven to work. We'll see how it evolves in the upcoming chapters.
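To give a flavor of how those rules map onto the parser, here is a heavily hedged sketch of the selectWrap and select entry points as recursive-descent functions. The helpers cmp_tok, match_next_one_of and select_core, the skw! macro, and the way compounds are combined are all assumptions made for illustration, not the actual LykiaDB code:
// selectWrap -> call | select
fn select_wrap(&mut self) -> ParseResult<ExprId> {
    // If the next token is the SELECT keyword, descend into the SQL grammar;
    // otherwise fall back to the regular call rule.
    if self.cmp_tok(&skw!(Select)) {
        return self.select();
    }
    self.call()
}
// select -> selectCore (compoundOp selectCore)*
fn select(&mut self) -> ParseResult<ExprId> {
    let core = self.select_core()?;
    // Consume any UNION / EXCEPT / INTERSECT compounds that follow.
    while self.match_next_one_of(&[skw!(Union), skw!(Except), skw!(Intersect)]) {
        let _compound = self.select_core()?;
        // ... combine `core` and `_compound` into an Expr::Select node here
    }
    Ok(core)
}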
Conclusion
Honestly, I have mixed feelings about publishing an article for an unfinished query language. Nonetheless, we are operating in an agile environment. As Reid Hoffman aptly put it, "If you are not embarrassed by the first version of your product, you've launched too late."
If you've made it through the entire article though, I want to express my admiration for your patience. Even as the author, reading this gave me a headache. I hope you've enjoyed this initial chapter on "LykiaQL" and found it informative. If you have any questions, suggestions or feedback, please don't hesitate to get in touch with me.
Looking ahead to the upcoming chapters on the language, it's a bit uncertain how many there will be, but our focus will shift towards runtime and basic query execution aspects. Query execution against a live datastore, which is more of an advanced topic, will come later, once the storage engine is in place, and we're still a couple of weeks away from that milestone. So, stay tuned for what's next. Au revoir!