Two quotes from the paper that I think will motivate people to read it:
"Rigorous correctness testing via simulation makes FDB extremely reliable. In the past several years, CloudKit [59] has deployed FDB for more than 0.5M disk years without a single data corruption event. Additionally, we constantly perform data consistency checks by comparing replicas of data records and making sure they are the same. To this date, no inconsistent data replicas have ever been found in our production clusters."
"For example, early versions of FDB depended on Apache Zookeeper for coordination, which was deleted after real-world fault injection found two independent bugs in Zookeeper (circa 2010) and was replaced by a de novo Paxos implementation written in Flow. No production bugs have ever been reported since."
Ehhhh, doesn't align with my experience. I think FDB is actually really poorly tested. When I was evaluating it as a replacement for the metadata key-value store at a major, public web services company, we found that injecting faults into virtual NVMe devices on individual replicas would cause corrupt results to be returned to clients. We also found that it would just crash-loop on Linux systems with huge pages, because although someone from the project had written a huge-page-aware C++ allocator "for performance", evidently nobody had ever actually tried to use it, including the author.
It's also really, really weird that their non-scalable architecture hits a brick wall at 25 machines. Ignoring the correctness flaws, it only works if you can either design around that limit by sharding and never offering cross-shard transactions, or assure yourself that your use case will never outgrow half a rack of equipment.
Can you fix a point in time? Software evolves, and one point I saw is that it wasn't well tested at first, and they changed that once production workloads told them it needed to change.
There weren't any, which is why that particular shop elected to roll their own distributed system on top of rocks.
In general I think people who think they want to do FoundationDB owe themselves a serious contemplation of the cost/benefit of using Cloud Spanner instead. Obviously you cannot do your own fault injection testing of Spanner, but it does have end-to-end checksums.
For the record, I said the same thing. But it's a management problem because on the one hand you have a known open project with demonstrable flaws, and on the other you have your own in-house developers and you will tend to discount the bugs they haven't written yet.
But, also for the same record, thinking you can implement a reliable, globally-replicated key-value store on top of FoundationDB that is cheaper and better than Cloud Spanner may be evidence of the same cognitive bias.
Thanks for the quotes, I've been wanting to read this paper for some time. Great to see they went through the consensus literature and made a decision to go with Active Disk Paxos, instead of stopping short and not fully understanding the consensus they're building on. The consensus and replication protocol is such a huge part of building a distributed database.
My understanding is that FDB relies heavily on deterministic simulations for testing, and that their async/await model is a big part of how they make sure they cover different possible interleavings in a deterministic way.
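To make the idea concrete, here's a toy sketch in Python of what "deterministic interleavings" means. To be clear, this is not FDB's simulator or Flow; `worker` and `run_simulation` are made-up names, and the only assumption is that every scheduling decision comes from a single seeded random number generator, so a given seed always replays the exact same interleaving.

```python
import random


def worker(name, log, steps):
    # Each yield is a point where the scheduler may switch to another task.
    for i in range(steps):
        log.append((name, i))
        yield


def run_simulation(seed, steps=3):
    # All scheduling choices come from this one seeded RNG, so the
    # interleaving is fully determined by the seed.
    rng = random.Random(seed)
    log = []
    tasks = [worker(name, log, steps) for name in ("a", "b", "c")]
    while tasks:
        task = rng.choice(tasks)
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)
    return log


if __name__ == "__main__":
    # Re-running with the same seed replays the same schedule.
    print(run_simulation(42) == run_simulation(42))
```

The payoff is that if some seed exposes a bug, you can rerun that seed and get the same failing schedule every time, and you explore other interleavings just by trying other seeds.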
"Rigorous correctness testing via simulation makes FDB extremely reliable. In the past several years, CloudKit [59] has deployed FDB for more than 0.5M disk years without a single data corruption event. Additionally, we constantly perform data consistency checks by comparing replicas of data records and making sure they are the same. To this date, no inconsistent data replicas have ever been found in our production clusters."
"For example, early versions of FDB depended on Apache Zookeeper for coordination, which was deleted after real-world fault injection found two independent bugs in Zookeeper (circa 2010) and was replaced by a de novo Paxos implementation written in Flow. No production bugs have ever been reported since."