Sammy Steele (ssteele@figma.com) is currently the Senior Staff Engineer and Tech Lead for Figma’s Databases Team.
- Worked at Dropbox before.
- Was part of the 5-person team that built Dropbox’s petabyte-scale search system, Nautilus.
Dropbox operated its data centers on-prem since managed cloud services weren’t yet mature enough.
Dropbox stored petabyte-scale metadata and exabyte-scale file data. In 2013, the metadata was stored in horizontally sharded MySQL w/ 2000 shards.
They were outgrowing this system. A single organization (MIT) was about to outgrow a single database shard, which their sharding scheme didn’t account for.
Beyond Sharded SQL
- They ended up building their own Spanner-inspired distributed key-value store in-house (called Panda) since no open-source solutions were scaling to their usage. One org could now be split multiple times to get smaller shard ranges.
- (Project Theseus) Hardest part of the project wasn’t building Panda but running the migration to re-build the entire metadata stack on top of Panda with zero downtime. It took 3 years and a team of 20 engineers.
- Implemented atomic cross-shard txns. (2PC) across MySQL and Panda to enable the migration. Increased the %age rollout of cross-shard txns. over time and eventually moved reads off the legacy system.
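A minimal sketch of what an atomic cross-shard commit via two-phase commit could look like. The `Participant` interface and function names are hypothetical, not Dropbox’s actual API; this only illustrates the prepare-then-commit structure.

```go
package shardtxn

import (
	"context"
	"errors"
	"fmt"
)

// Participant is a hypothetical handle to one shard (e.g. a MySQL shard or a
// Panda range) taking part in a cross-shard transaction.
type Participant interface {
	Prepare(ctx context.Context, txnID string) error // phase 1: stage writes durably
	Commit(ctx context.Context, txnID string) error  // phase 2: make writes visible
	Abort(ctx context.Context, txnID string) error   // roll back staged writes
}

// commitAcrossShards runs a simplified two-phase commit: every participant
// must prepare successfully before any participant is allowed to commit.
func commitAcrossShards(ctx context.Context, txnID string, parts []Participant) error {
	// Phase 1: prepare everywhere; abort already-prepared shards on the first failure.
	for i, p := range parts {
		if err := p.Prepare(ctx, txnID); err != nil {
			for _, q := range parts[:i] {
				_ = q.Abort(ctx, txnID) // best-effort rollback of prepared shards
			}
			return fmt.Errorf("prepare failed, txn %s aborted: %w", txnID, err)
		}
	}
	// Phase 2: once all shards are prepared, the txn is logically committed;
	// commit failures here must be retried, never rolled back.
	var firstErr error
	for _, p := range parts {
		if err := p.Commit(ctx, txnID); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	if firstErr != nil {
		return errors.Join(errors.New("commit phase needs retry"), firstErr)
	}
	return nil
}
```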
Sammy joined Figma when they were just in the process of scaling out their databases. Back in 2022, Figma was on a single RDS instance, and DB traffic had been growing 3x every year for the previous 4 years.
Build vs. Buy for Database Scalability
- Sammy was pretty anti-build after Dropbox, where everything was in-house. While building new systems is fun as an infra engineer, someone has to keep maintaining them.
- Figma was operating on extremely tight time horizons with a small team and 6 months of runway (before the system ran out of headroom).
- Why not NoSQL?
- NoSQL does really well at scale but Figma relied a lot on joins, unique indexes and complicated relational models.
- Would have to rewrite the entire system from scratch to use a NoSQL DB.
- Spent 3 months exploring a NewSQL solution (like Cockroach, Vitess or Spanner) but couldn’t get it to work at their scale.
- Team had a lot of expertise and historical context in RDS Postgres. Needed an incremental migration path to deliver value fast.
- Built an in-house horizontally sharded system on top of RDS Postgres.
Logical vs. Physical Sharding
- A physical shard could have multiple logical shards.
- Can logically shard in-place w/ no data movement by using a virtual database layer (see the routing sketch after this list).
- Benefits of logical sharding:
- Incremental %age based rollouts
- Can use logical shards for reads but let the writes go to the un-sharded database
- Low-risk fallback path
- De-risk latency, availability, and correctness before physical sharding
- All the query, txn. logic, and application changes that come with horizontal sharding would already be running in production through the virtual shards.
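A rough sketch of the kind of routing table a virtual database layer could use: shard keys hash to a fixed set of logical shards, and each logical shard maps to whichever physical database currently owns it. The names and the shard count are illustrative assumptions, not Figma’s actual values.

```go
package sharding

import "hash/fnv"

// A fixed, large number of logical shards is chosen up front; physical
// databases can change underneath without re-hashing any keys.
// (Illustrative value, not Figma's real shard count.)
const numLogicalShards = 4096

// routingTable maps each logical shard to the physical database that
// currently owns it. Before physical sharding, every logical shard can
// point at the same un-sharded database.
type routingTable struct {
	physicalDBByLogicalShard []string // e.g. DSNs or host names, one per logical shard
}

// logicalShard hashes a shard key (e.g. an org or file ID) to a logical shard.
func logicalShard(shardKey string) int {
	h := fnv.New32a()
	h.Write([]byte(shardKey))
	return int(h.Sum32() % numLogicalShards)
}

// physicalDB resolves a shard key to the physical database to query.
func (rt *routingTable) physicalDB(shardKey string) string {
	return rt.physicalDBByLogicalShard[logicalShard(shardKey)]
}

// splitShard moves one logical shard to a new physical database; because
// keys hash to logical shards, no application-visible re-mapping is needed.
func (rt *routingTable) splitShard(logical int, newDB string) {
	rt.physicalDBByLogicalShard[logical] = newDB
}
```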
Figma doesn’t use micro-services. They have a few large services, including a Go service that powers the real-time features and a number of Ruby services, including a large Ruby monolith.
DBProxy
- They already ran a slightly modified version of pgBouncer as their database proxy. DBProxy is a Go service that sits b/w the application and pgBouncer layers.
- “Brains” of horizontal sharding. Application doesn’t need to know how the backend is actually sharded beyond including a shard key in queries.
- Also enables load-shedding, improved observability, replica hedging etc.
- If a replica goes down, the request is sent to multiple replicas and returns as soon as the first one succeeds (see the hedging sketch after this list).
- Most critical routes only talk to replicas; very few of them hit the primary.
- Lightweight query engine at its core.
- The query parser was taken from CockroachDB and has worked well for them; the logical and physical planners were built in-house.
- (Scatter-gather query) When a shard key isn’t specified, DBProxy aggregates the results from all the shards (see the planning sketch after the query-language rules below).
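A hedged sketch of replica read hedging in Go: the same read is issued to several replicas and the first successful response wins. The `Replica` interface and `hedgedRead` function are illustrative assumptions, not DBProxy’s actual code.

```go
package dbproxyish

import (
	"context"
	"errors"
)

// Replica is a hypothetical handle to one read replica.
type Replica interface {
	Query(ctx context.Context, sql string) ([]Row, error)
}

// Row is a hypothetical result row.
type Row map[string]any

// hedgedRead fans the same query out to every replica and returns the first
// successful result, cancelling the rest. If all replicas fail, the errors
// are joined and returned.
func hedgedRead(ctx context.Context, replicas []Replica, sql string) ([]Row, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		rows []Row
		err  error
	}
	// Buffered so late responders never block after a winner is chosen.
	results := make(chan result, len(replicas))
	for _, r := range replicas {
		go func(r Replica) {
			rows, err := r.Query(ctx, sql)
			results <- result{rows, err}
		}(r)
	}

	var errs []error
	for range replicas {
		res := <-results
		if res.err == nil {
			return res.rows, nil // first replica to answer wins
		}
		errs = append(errs, res.err)
	}
	return nil, errors.Join(errs...)
}
```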
Query Language
- All single-shard queries are allowed.
- No nested SQL queries.
- Multi-shard writes, aggregation and group-bys are allowed but discouraged.
- Joins are allowed iff the join is on the sharding key.
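A hedged sketch of the kind of routing decision implied by these rules: queries that constrain the shard key go to a single shard (or a few), everything else falls back to scatter-gather, and disallowed shapes are rejected. The `parsedQuery` struct and `planQuery` function are illustrative stand-ins; the real logical/physical planner sits on top of a full SQL parser.

```go
package dbproxyish

import "errors"

// parsedQuery is a stand-in for the output of a real SQL parser
// (DBProxy uses CockroachDB's parser; this struct is purely illustrative).
type parsedQuery struct {
	ShardKeyValues  []string // values the query constrains the shard key to
	HasSubquery     bool     // nested SQL queries are not allowed
	HasJoin         bool
	JoinsOnShardKey bool // joins are only allowed on the sharding key
}

// plan lists the physical shards the query will be sent to.
type plan struct {
	Shards []string
}

var errUnsupported = errors.New("query shape not supported by DBProxy")

// planQuery decides between single-shard routing and scatter-gather.
// routeByKey maps a shard-key value to its physical shard.
func planQuery(q parsedQuery, routeByKey func(shardKey string) string, allShards []string) (plan, error) {
	if q.HasSubquery {
		return plan{}, errUnsupported // no nested queries
	}
	if q.HasJoin && !q.JoinsOnShardKey {
		return plan{}, errUnsupported // joins only on the sharding key
	}
	if len(q.ShardKeyValues) > 0 {
		// Shard key present: route to exactly the shards that own those keys.
		shards := make([]string, 0, len(q.ShardKeyValues))
		for _, key := range q.ShardKeyValues {
			shards = append(shards, routeByKey(key))
		}
		return plan{Shards: shards}, nil
	}
	// No shard key: scatter to every shard and gather/merge the results.
	return plan{Shards: allShards}, nil
}
```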
Physical Sharding Operation
- Built in-house.
- Uses Postgres Logical Replication to sync state.
- Replica failovers are a no-downtime operation.
- Primary failovers take ~10s of downtime during the critical section (In comparison, AWS RDS may take anywhere from minutes to half an hour for major upgrades).
- Multi-step state machine w/ automatic rollback and recovery (sketched below, after the step list)
- Disable reads at pgBouncer (PGB) layer
- Revoke permissions from source RDS
- Kill active queries
- Since revoking permissions doesn’t stop in-flight queries
- Wait for replication sync
- Reverse logical replication
- Was logically replicating from the old (source) database to the new (target). Now replicating from the target back to the source (so a rollback stays possible).
- Re-route pgBouncer instances to point to target
- Re-enable permissions on target RDS
- Re-enable pgBouncer
- Core failover operation enables vertical partitioning, horizontal shard splits, and major pg version upgrades.
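A hedged sketch of what a failover state machine with automatic rollback could look like, following the steps above. The `step` type, `runFailover` function, and the step ordering comment are assumptions for illustration, not Figma’s actual tooling.

```go
package failover

import (
	"context"
	"fmt"
)

// step is one stage of the failover; each stage knows how to undo itself so
// the whole operation can roll back automatically if a later stage fails.
type step struct {
	name string
	run  func(ctx context.Context) error
	undo func(ctx context.Context) error
}

// runFailover executes the steps in order. On failure it undoes the already
// completed steps in reverse order, reporting both the failure and any
// rollback error.
func runFailover(ctx context.Context, steps []step) error {
	for i, s := range steps {
		if err := s.run(ctx); err != nil {
			for j := i - 1; j >= 0; j-- {
				if steps[j].undo == nil {
					continue
				}
				if uerr := steps[j].undo(ctx); uerr != nil {
					return fmt.Errorf("step %q failed (%v); rollback of %q also failed: %w",
						s.name, err, steps[j].name, uerr)
				}
			}
			return fmt.Errorf("step %q failed, rolled back: %w", s.name, err)
		}
	}
	return nil
}

// The steps above would be wired in roughly this order (bodies omitted):
//   disable reads at pgBouncer -> revoke permissions on source RDS ->
//   kill in-flight queries -> wait for replication sync ->
//   reverse logical replication -> re-route pgBouncer to the target ->
//   re-enable permissions on the target -> re-enable pgBouncer.
```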
Lessons Learned
- You can get a long way on one database… Until you can’t
- Figma was on a single pg instance from 2012 to 2020 w/o connection pooling or replicas.
- Very few companies hit the scale where vanilla pg stops working.
- Don’t be afraid to cross abstraction boundaries
- Engineers often get stuck behind abstractions.
- If something isn’t working, go read the code yourself. Applies to orgs too. Go talk to product engineers, read product code and review product designs.
- For example, they had issues with Postgres vacuuming. Txn. wrap-around was a recurring concern since pg uses a 32-bit integer for txn. ids (i.e. 2^32 ≈ 4 billion ids). Vacuuming is very expensive, and they observed their servers hitting 100% CPU at the very end of a vacuum. They had to read the pg source code to understand why this was happening and find a workaround. (There was a single heavily contended lock that read the entire file system into memory, because query plans get invalidated at the very end of a vacuum or analyze operation.)
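A small hedged example of watching wrap-around pressure from the outside: `age(datfrozenxid)` reports how many transaction ids each database has consumed since it was last frozen, which has to stay well below the ~2 billion hard limit (half the 2^32 id space). The connection string and driver choice are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works
)

func main() {
	// Placeholder DSN; point this at the database you want to inspect.
	db, err := sql.Open("postgres", "postgres://localhost/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// age(datfrozenxid) = txn ids consumed since the last full freeze.
	// Postgres stops accepting writes if this approaches ~2 billion, so
	// vacuum must keep it low.
	rows, err := db.Query(`SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var name string
		var xidAge int64
		if err := rows.Scan(&name, &xidAge); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d txn ids since last freeze\n", name, xidAge)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```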
- Data migrations are scary, so work twice as hard to make them safe.
- Do data migrations with gradual, %age-based rollouts and have the ability to roll back from unknowns.
- Know your customers and product
- Understand how infra tradeoffs map to your organization’s priorities and ensure they align.
- The databases team had to become experts in Figma’s product to ship horizontal sharding.
- DBProxy does cross-shard txns. by opening a txn. on every database involved, issuing a SELECT right before commit to ensure the DBs are still alive, and then committing in parallel. There is a ~5ms window in which a DB could die and cause a partial commit failure. At their commit volume, that would only happen 2-3 times a year, so it wasn’t a problem worth worrying about, except for a few critical txns., which were fixed (see the sketch below).
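A hedged sketch of that commit flow: one open transaction per shard, a cheap SELECT on each right before committing, then parallel commits. The function and its error handling are illustrative, not DBProxy’s actual implementation; the gap between the checks and the commits is exactly where a partial commit can still occur.

```go
package dbproxyish

import (
	"context"
	"database/sql"
	"fmt"
	"sync"
)

// commitAcross commits one already-open transaction per shard. A liveness
// check runs immediately before the parallel commits; a shard dying in the
// few-ms window between check and commit can still leave a partial commit.
func commitAcross(ctx context.Context, txns []*sql.Tx) error {
	// Pre-commit liveness check: a trivial SELECT on every shard's txn.
	for i, tx := range txns {
		var one int
		if err := tx.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
			// A shard is unreachable before anything committed: roll back all.
			for _, t := range txns {
				_ = t.Rollback()
			}
			return fmt.Errorf("liveness check failed on shard %d, txn aborted: %w", i, err)
		}
	}

	// Commit in parallel; a failure here is the rare partial-commit case.
	errs := make([]error, len(txns))
	var wg sync.WaitGroup
	for i, tx := range txns {
		wg.Add(1)
		go func(i int, tx *sql.Tx) {
			defer wg.Done()
			errs[i] = tx.Commit()
		}(i, tx)
	}
	wg.Wait()

	for i, err := range errs {
		if err != nil {
			return fmt.Errorf("partial commit: shard %d failed to commit: %w", i, err)
		}
	}
	return nil
}
```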
- Certain early company decisions are very hard to walk back
- Your data model, your ORM, and your consistency model are extremely hard to change.
- At Dropbox, they had perfect consistency from day one and folks got used to that. Part of why the project took 3 years was that they couldn’t take that guarantee away from customers and had to keep giving them perfect atomic cross-shard txns.
- At Figma, replicas came with nothing stronger than a stale-reads guarantee, so all the product code was written to deal with that from day one.
- There is implicit tension b/w the goal of building products fast in a company’s early days vs. needing to care about scale and reliability once you’re successful. Choose your battles wisely.
- Centralize on fewer technologies, done well
- Golden paths enable small infra teams to support large product orgs
- Optimize for the good of the company over individual team’s desires
- More technologies add complexity, reduce reliability, and increase developer costs
- Don’t gate-keep if you’re blocking the core business