Building a Database from Scratch: A series / Prologue

tl;dr Just show me the code: LykiaDB

Friends, Romans, fellow nerds, lend me your ears!

Since you are reading this manifest, it is assumed that you have a decent understanding of what databases are and why we need them. You, however, are here, not to scratch the surface of yet another database -- but because you’re tired of regular everyday software development, and you have this insatiable curiosity about our very topic: how a database works internally. Obviously, there is no simple answer to that question. So we're taking a bumpy road, and building a database (or LykiaDB as I call it) to find out. If this introduction gives you goosebumps and a sudden excitement, you should probably continue reading.

Goal

Building a document database which;

with some tools such as;

Goes without saying, these items are tentative and are subject to change. Yet still, I'll try not to digress too much.

Motivation

I’ve spent a couple of years writing frontend code, backend code, refactoring code, doing some other handy-dandy software stuff and even some NLP and machine learning in graduate school (hey Hacettepe folks!) to cure my existential angst. Recently, however, things started worsening, making me feel a bit bored. As a result, I concluded that I needed to do something new, ummm, maybe a hobby. My options were:

As you might have guessed, I picked the easiest one. Jokes aside, since the last undergraduate semester in which I took the Advanced Databases course, databases have been in my bucket list. And after some years, I decided that it is the time.

My preliminary research revealed that databases are not that complex to build. Most of the building blocks we'll be looking at are covered by an average CS/CE syllabus. So like many of you, I have some varying degree of background knowledge and hands-on experience about each of them. And I hope this project to be beneficial to get deeper understanding about the following topics as we put everything together:

As you may notice, building a database is definitely a great exercise that will help you to strengthen your engineering muscles, even if it is not to be used in real-world applications. Please trust me when I'm saying that I'm not just mad and the project is definitely a well-posed one. To support my argument with some tangible proof, I visited and collected plenty of databases in the wild, which are written for many different purposes (including just for learning stuff). If you don't believe me, you can check the list of databases on GitHub.

Scope

The outcome of the project is not intended for production use. Hence the reader should not expect it to be something complete/production-ready. Instead, it will be a toy database, which will be able to do some basic stuff. Having said that, I'm sure that my perfectionism will kick in frequently enough, forcing me to write modular, flexible, extensible and maybe somehow over-engineered code. Also, let me warn you about that, this is my very first attempt to create a database. Thus I am expecting to fail miserably at least a couple of times.

Roadmap

To be honest, I kickstarted the project in the early days of 2023. Yet still, doing research about the topic, clarifying scope and working in a full-time job have taken a significant amount of time and I just recently found the time required to turn this thing into something presentable. And regarding how I came up with this roadmap, I must mention some cool related work:

Parsing sneak peek

Alright! Let's see some real action.

I develop the project under a publicly accessible organization lykia-rs I created on GitHub. The organization functions as a host to the DB repo and all the other upcoming repositories such as a CLI tool, a simple GUI, a simple JS driver, etc. Drum roll please, ladies and gentlemen, please welcome LykiaDB!

For those who have no time to check yet another repo, here is a sneak peek of how Lykia works as a generic scripting language. It is currently developed enough to run the following script:

./target/release/lykia examples/fib.ly
# or cargo run --release examples/fib.ly
// Define a recursive Fibonacci function

fun fib($n) {
if ($n < 2) return $n;
return fib($n - 2) + fib($n - 1);
}

var $start = clock();
print(fib(35) == 9227465);
print("elapsed:", clock() - $start);

and print the result:

/*
Bool(true)
Str("elapsed:") Num(5.5153703689575195)
*/

As you can see, the syntax looks like a hideous mixture of JS and PHP (wait till SQL chimes in). Currently recognized grammar is diverged from the original one explained in the book Crafting Interpreters, which makes sense, as Lykia will be a super-set of simple SQL. I can hear your impatient buzz and preaching about why that is a terrible idea (please let me kindly remind you that this is my hobby project). We'll go into details of language design decisions in the parsing chapter and I'll explain why the things are how they are. Also, while we're at it, please stop bashing PHP, folks. It is already Fall 2023, we're adults and modern PHP is a good language.

Anyway, let's sprinkle some SQL to the language, and see how it looks like:

var $i = 5;

var $p = SELECT *, $i as five FROM some_collection
UNION
SELECT *, 6 as six FROM some_other_collection;

var $q = SELECT * FROM users where id != 5;

var $r = SELECT "darkness" as my_old_friend;

print($p);
print($q);
print($r);

Unfortunately, those queries will only be parsed, but not executed for the time being, as I haven't implemented those parts yet. For more examples, please visit main/examples.

Resources and references

Before closing the first post of this series, I'd like to mention the shoulders of giants we're standing on:

Books

The following books give the reader a high-level overview of the topics we will cover. Especially for understanding storage engines better, I strongly encourage you to read the relevant chapters. If you cannot, don't worry though. A storage engine is one of the main pillars of a database, thus I'll try to give you some introduction in the upcoming posts as well.

The Internet

In addition to those items, you can broaden your knowledge via checking the following items out:

That's all folks. See you soon.