Database

Databases – everyone needs them. But how do you pick one? There are so many options now, from different SQL providers like Supabase and PlanetScale, to NoSQL providers like Fauna DB and Firebase, to cached edge solutions like Worker KV and Cloudflare. All of these solutions provide the basic need:

I have data that my users have created; I want to store it somewhere and access it later.

But all of them work very differently and solve very different parts of that problem with fundamentally different philosophies. It's way too easy to pick the wrong data solution for what you or your project need.

I want to talk more about how these databases differ and, more importantly, how you can pick the right database for what you specifically are building. With all that said, let's dive in.

But What is a Database?

"an external thing that stores data"
Usually persisted data
Usually accessible via API/standard client
Choosing Database

Choosing the wrong database can be a disaster. In general, Relational Databases like MySQL and Postgres are the safest bet for the majority of applications. But that doesn't mean they're immune to their own problems.

What are the Types of Database?

SQL databases (SQLite, Postgres, MySQL)
Document databases (MongoDb, FireStore)
Graph (Fauna, Neo4j)
K/V stores (Redis, Worker KV, MemcaChed)

Where do I host my database?

Run it on a server yourself (AWS EC2, Railway, Heroku)

What these solutions do is they actually spin up a box on a server that is like a Docker image, effectively like an actual separated container that just runs the SQL server or runs the EdgeDB or runs the Neo4j, but it's a box that runs that database directly. You as the developer have to know which provider you want to use, how to deploy that container for it, and also pay attention to when it hits limits and make sure it scales accordingly.
You can run this on Kubernetes please, do not put your database in Kubernetes; you're going to have a very bad time.

Managed
1. Provided/autoscaled Hosts hosts on OSS (PlanetScale, Cockroach DB)
2. Proprietory infra/tech pairings (Fauna, Worker KV)
3. "Boring" (Supabase)
  
  With Supabase, you can host it yourself, or they just host it for you. When you get close to a limit, you click the button in the UI and promote yourself to the next tier. But it's not proprietary because it is open source.
  
  When I brought up Railway, Railways is under "run it on a server yourself." You're still thinking about the box; you're still thinking about the performance of the box. Supabase is a tier. If I go to the Supabase pricing, you'll see that the way that pricing and scaling work on databases is by the amount of storage you're taking, the amount of bandwidth, and the number of monthly users, rather than the size of the box.
  
  If you go to Railway's pricing, Railway's pricing is much less based on the usage of the number of users and the number of queries. It's based on the size of the box and the specs of it. So it's how much disk this box has, and how much CPU execution time are you spending. These are like a level or a spectrum of how much of the concern you want to own.

Database Spectrum

Please don't use MongoDb (UNLESS you have a very specific REASON)

Mongo is a DOCUMENT store. It is glorified JSON.

JSON isn't bad; I'll be very clear about that.

This isn't "JSON; this is terrible, and that's why is bad."


JSON is a structure that represents your data.
And Data is more complex than blobs and documents.

If you have arbitrary JSON blobs that have no relationship to each other and you want to store those in Mongo, go nuts; that's fine. But almost every problem has relationships.

Almost Every data problem has relations of some form

There are so few exceptions to this and believe me, I wish that wasn't true; I genuinely do.

But every time I am deep in a code base, especially ones that adopted Mongo, most of the problems we end up seeing are weird selection patterns that have to be developed because there are different tables and different document types that represent different things without relationships between them

Non Relational Example

Imagine we have a user object in our database, which contains some key properties: an ID, a name, and a list of comments. These comments represent all the comments this user has left. Now, the challenge is: How do we efficiently store and manage these comments within the user object, especially in a document-oriented database?

One approach is to expand the user object and create a separate "comments" sub-table within it. In this sub-table, each comment is given a unique identifier, like comment one, comment two, and so on. However, this approach introduces another level of complexity because comments themselves can have reactions or responses.

To address this, we further segment the comments. Now, each comment has its subfields, and within these comments, there can be reactions. Each reaction is associated with a user ID to identify who reacted. But here's the catch: To retrieve these user IDs, we often need to perform additional queries back to the user database to fetch the corresponding user data. This process can become cumbersome and inefficient, especially when dealing with a large number of comments and reactions.

This situation highlights a common challenge in relational models: efficiently managing relationships and nested data structures.

Relational Example

In our data model, we might have two main entities: "user" and "comment." A "user" entity has its unique ID, a user ID, and perhaps some content like a comment body. To keep things consistent with our previous example, let's also associate comments with reactions.

Now, we establish a relationship between these entities. Each comment is linked to a user, and this relationship goes both ways. If only there were an easier way to draw a two-sided arrow! These entities are inherently connected.

Specifically, a comment links to the user ID, pointing to the user who made the comment, and a user has comments, which are identified by a foreign key. This foreign key links back to the comment's ID. With this relationship, we can easily select all comments associated with a particular user.

Now, with these relationships in place, we have the flexibility to introduce new features as our application evolves. Let's consider a scenario where, during the initial app development, comments are simple and lack reactions. They only contain essential information like the comment ID, user ID, and content.

However, as the app grows, we receive feedback from our product manager (PM) who insists on adding reactions. Users should be able to react with emojis, different emojis, and multiple people can react in various ways. We might even want to sort and perform other complex operations with these reactions.

To accommodate this new feature, we can seamlessly create a new table, "reactions." This table will have a structure similar to the "comments" table, as reactions share common attributes. Each reaction entry will have its unique ID, a comment ID (serving as a foreign key to link it to the specific comment it's reacting to), a user ID (connecting it back to the user who reacted), and an emoji (representing the actual content of the reaction).

By establishing these relationships between the "reactions" table, "comments" table, and the "user" table, we can trace reactions back to users. Although it's not necessary to explicitly label this in the tables, the relationship exists. Additionally, we can link reactions back to the specific comments they are reacting to.

This relational structure allows us to smoothly adapt to new requirements and expand our application's functionality without the need for significant modifications or migrations. It's a robust foundation for handling evolving features while maintaining data integrity. In contrast, with a document-based system, such changes can be more challenging to implement and manage.

When you opt for a document-based approach instead of a structured relational model, you encounter several challenges:

Anticipating All Relations: Initially, you must try to foresee all possible connections between data elements. This can be a daunting task, as it requires predicting future requirements with precision.
Challenging Migrations: If your initial document structure doesn't account for future changes or new features, you may face a difficult and chaotic migration process. This often involves extensive efforts to reorganize and adapt your data model, leading to complexity and potential errors.
Avoiding Relations: In some cases, due to the complexity and unpredictability of managing relations in a document-centric approach, you might choose not to establish clear connections between data components. Instead, you handle everything within the application's code (user land). While this may provide a quick fix, it can lead to difficulties in maintaining data integrity and codebase clarity.

Let's talk about Database performance

Databases, love them, hate them, you probably still need them. SQL, no SQL, GraphQL, whatever you're using, it's not easy to work with, and the few things that are easy to work with, like Prisma, come with their own gotchas. It feels like when new tools are introduced that improve the developer experience with databases, they often come at a performance cost. I would love to see a future where we don't take a performance hit just to have a better developer experience.

🧠Thinking Program

Search This Blog