tl;dr Just show me the code: LykiaDB
Friends, Romans, fellow nerds, lend me your ears!
Since you are reading this manifest, it is assumed that you have a decent understanding of what databases are and why we need them. You, however, are here, not to scratch the surface of yet another database -- but because you’re tired of regular everyday software development, and you have this insatiable curiosity about our very topic: how a database works internally. Obviously, there is no simple answer to that question. So we're taking a bumpy road, and building a database (or LykiaDB as I call it) to find out. If this introduction gives you goosebumps and a sudden excitement, you should probably continue reading.
Goal
Building a document database which;
- Understands SQL queries and allows scripting,
- Does concurrency control via MVCC,
- Is (hopefully) distributed,
- And stores data in a tiny, neat storage engine
with some tools such as;
- A GUI client,
- A CLI client,
- A JS driver.
Goes without saying, these items are tentative and are subject to change. Yet still, I'll try not to digress too much.
Motivation
I’ve spent a couple of years writing frontend code, backend code, refactoring code, doing some other handy-dandy software stuff and even some NLP and machine learning in graduate school (hey Hacettepe folks!) to cure my existential angst. Recently, however, things started worsening, making me feel a bit bored. As a result, I concluded that I needed to do something new, ummm, maybe a hobby. My options were:
- Playing basketball (plus socializing)
- Cooking
- Learning-Rust-by-creating-a-fully-fledged-database-from-scratch-because-im-out-of-my-mind.
As you might have guessed, I picked the easiest one. Jokes aside, since the last undergraduate semester in which I took the Advanced Databases course, databases have been in my bucket list. And after some years, I decided that it is the time.
My preliminary research revealed that databases are not that complex to build. Most of the building blocks we'll be looking at are covered by an average CS/CE syllabus. So like many of you, I have some varying degree of background knowledge and hands-on experience about each of them. And I hope this project to be beneficial to get deeper understanding about the following topics as we put everything together:
- Expression parsing, query planning and execution
- Concurrent and multithreaded programming
- Isolation and transaction management (ACID, BASE, concurrency control, locks).
- Distributed systems fundamentals (CAP theorem, consensus, Raft algorithm, etc.)
- Storage using advanced data structures and related algorithms (LSM trees, B-Trees, etc.)
- Operating system primitives
- Rust lang (If you’re a Rust expert, and think my Rust code is not idiomatic enough, please have mercy. I’m just a newbie.)
- Bonus: You can brag about being a database-maker for the rest of your life (Yet still please calm down, because egotism will take you only so far.).
As you may notice, building a database is definitely a great exercise that will help you to strengthen your engineering muscles, even if it is not to be used in real-world applications. Please trust me when I'm saying that I'm not just mad and the project is definitely a well-posed one. To support my argument with some tangible proof, I visited and collected plenty of databases in the wild, which are written for many different purposes (including just for learning stuff). If you don't believe me, you can check the list of databases on GitHub.
Scope
The outcome of the project is not intended for production use. Hence the reader should not expect it to be something complete/production-ready. Instead, it will be a toy database, which will be able to do some basic stuff. Having said that, I'm sure that my perfectionism will kick in frequently enough, forcing me to write modular, flexible, extensible and maybe somehow over-engineered code. Also, let me warn you about that, this is my very first attempt to create a database. Thus I am expecting to fail miserably at least a couple of times.
Roadmap
-
1. Parsing and execution: This part is basically about converting the raw text into a walkable tree. We also traverse that tree and interpret it.
Progress: For the time being, Lykia's interpreter only executes generic purpose programs. SQL, on the other hand, is only parsed, but not executed yet. You can see the examples in the repo and how to run them on your own.
-
2. Storage engine This is the component which stores (and sometimes persists) arbitrary binary data. It is probably the most important part of a database, so it will be interesting to work on it.
Progress: After the final cleanup to parsing, I'll start implementing the storage engine.
-
3. Planning: This is the stage in which we take the walkable SQL tree and convert into an executable plan.
Progress: Intuitively, one would expect the 2nd step to be planning. I, however, will do this in an out-of-order fashion, as I believe planning is a bit secondary compared to a storage engine.
-
4. Transaction management: TBA
-
5. Indexes: TBA
-
6. Replication: TBA
To be honest, I kickstarted the project in the early days of 2023. Yet still, doing research about the topic, clarifying scope and working in a full-time job have taken a significant amount of time and I just recently found the time required to turn this thing into something presentable. And regarding how I came up with this roadmap, I must mention some cool related work:
-
ToyDB: Even though LykiaDB is different in terms of functionality, there are enough similarities between it and ToyDB for me to mention. Transaction management and replication are those similar aspects, to be exact. Thus, I am heavily inspired by ToyDB's architecture. Thanks to @erikgrinaker for his amazing work, it actually cleared my mind while studying.
-
Rustbase: Another nice project to check is Rustbase. As it provides tooling alongside the DB itself, Rustbase's approach is more holistic compared to ToyDB, which is what I'm aiming at as well.
Parsing sneak peek
Alright! Let's see some real action.
I develop the project under a publicly accessible organization lykia-rs I created on GitHub. The organization functions as a host to the DB repo and all the other upcoming repositories such as a CLI tool, a simple GUI, a simple JS driver, etc. Drum roll please, ladies and gentlemen, please welcome LykiaDB!
For those who have no time to check yet another repo, here is a sneak peek of how Lykia works as a generic scripting language. It is currently developed enough to run the following script:
./target/release/lykia examples/fib.ly
# or cargo run --release examples/fib.ly
// Define a recursive Fibonacci function
fun fib($n) {
if ($n < 2) return $n;
return fib($n - 2) + fib($n - 1);
}
var $start = clock();
print(fib(35) == 9227465);
print("elapsed:", clock() - $start);
and print the result:
/*
Bool(true)
Str("elapsed:") Num(5.5153703689575195)
*/
As you can see, the syntax looks like a hideous mixture of JS and PHP (wait till SQL chimes in). Currently recognized grammar is diverged from the original one explained in the book Crafting Interpreters, which makes sense, as Lykia will be a super-set of simple SQL. I can hear your impatient buzz and preaching about why that is a terrible idea (please let me kindly remind you that this is my hobby project). We'll go into details of language design decisions in the parsing chapter and I'll explain why the things are how they are. Also, while we're at it, please stop bashing PHP, folks. It is already Fall 2023, we're adults and modern PHP is a good language.
Anyway, let's sprinkle some SQL to the language, and see how it looks like:
var $i = 5;
var $p = SELECT *, $i as five FROM some_collection
UNION
SELECT *, 6 as six FROM some_other_collection;
var $q = SELECT * FROM users where id != 5;
var $r = SELECT "darkness" as my_old_friend;
print($p);
print($q);
print($r);
Unfortunately, those queries will only be parsed, but not executed for the time being, as I haven't implemented those parts yet. For more examples, please visit main/examples.
Resources and references
Before closing the first post of this series, I'd like to mention the shoulders of giants we're standing on:
- For language design and implementation, as stated earlier, I used Crafting Interpreters as my main resource. There is a free web version of it, too.
- Storage engine: Since the project aims at educational goals, I won’t try hard implementing a complete LSM-tree. Instead, a naive Bitcask implementation would be enough. Pingcap has a great learning resource for that: pingcap/talent-plan. We're basically following that as the main resource for our storage engine.
- Planning: TBA
- Indexes: TBA
- Concurrency control: TBA
- Distributing data: TBA
Books
The following books give the reader a high-level overview of the topics we will cover. Especially for understanding storage engines better, I strongly encourage you to read the relevant chapters. If you cannot, don't worry though. A storage engine is one of the main pillars of a database, thus I'll try to give you some introduction in the upcoming posts as well.
- Kleppmann, Martin. Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O'Reilly Media, 2017.
- Petrov, Alex. Database Internals: A deep dive into how distributed data systems work. O'Reilly Media, 2019.
The Internet
In addition to those items, you can broaden your knowledge via checking the following items out:
-
pingcap/awesome-database-learning is an AMAZING repository containing many resources for learning all aspects of a database. Some of the advanced topics mentioned there are not really relevant for this project. So please don't let the detail level of referred resources scare you. You don't need to learn and understand them all.
-
Carnegie Mellon University has a great Database Group and there are plenty of high quality courses covering a wide range of DB topics on their YouTube channel.
That's all folks. See you soon.