"I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T."
I was intuitively wondering the same, but I'm having trouble reasoning about how the post's example with transactions 1, 2, 3, and 4 exhibits this behavior. In the example, is transaction 2 the only read-only transaction, and therefore the only transaction to read from the read replica? I.e., transactions 1, 3, and 4 use the primary and transaction 2 uses the read replica?
Write-Ahead Log (WAL) application is single-threaded and maintains consistency at a specific point in time within each instance. However, there can be anomalies between two instances. This behavior is expected because the RDS Multi-AZ cluster does not wait for changes to be applied to the shared buffers; it only waits for the WAL to sync.
This is similar to the behavior of PostgreSQL when synchronous_commit is set to on. Nothing unexpected.
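For anyone who wants to see that gap concretely on plain PostgreSQL streaming replication (the analogue of what's being described), here's a minimal sketch: it compares flush_lsn (WAL durably received by the standby) against replay_lsn (WAL actually applied and visible to reads) via pg_stat_replication on the primary. The DSN is a placeholder and psycopg2 is assumed to be installed; this is just an illustration, not the RDS-specific machinery.

```python
# Minimal sketch: compare how far each standby has flushed vs. replayed WAL.
# Assumes psycopg2 is installed; the DSN below is a placeholder.
import psycopg2

PRIMARY_DSN = "host=primary.example dbname=postgres user=monitor"  # placeholder

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        # pg_stat_replication is populated on the primary, one row per standby.
        cur.execute("""
            SELECT application_name,
                   flush_lsn,   -- WAL durably flushed on the standby
                   replay_lsn,  -- WAL actually applied (visible to reads)
                   pg_wal_lsn_diff(flush_lsn, replay_lsn) AS apply_lag_bytes
            FROM pg_stat_replication
        """)
        for name, flush, replay, lag in cur.fetchall():
            print(f"{name}: flushed={flush} replayed={replay} apply lag={lag} bytes")
```

With synchronous_commit=on, a commit waits for the flush but not the replay, so apply_lag_bytes is exactly the "synced but not yet applied" window described above.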
Ah, so something like: if the primary ordered transaction 3 < transaction 1, but transaction 2 observes only transaction 1 on the read-only secondary, potentially because the secondary orders transaction 1 < transaction 3?
The Persistence Infrastructure team develops and operates Discord’s real-time datastore systems that serve the data of Discord's 150M+ monthly active users—including over a trillion messages! We work across multiple systems areas: databases, disk storage and Rust-based data access services. We're a small team whose work has a huge impact on our organization's success and ability to grow.
Hey Mike, I've applied to Discord a few times, but have received automated rejections each time. Can I send you my resume to see what's causing the automated rejections? I love Discord and use it daily, so working for the team would be incredible.
I think there's some confusion about what distributed databases like Spanner, Cockroach, etc. do:
- A given row of data is configured to have N replicas, e.g. 3, 5, etc., but N doesn't have to equal the actual size of the database cluster. E.g. you can have a database cluster of 21 nodes with N being 3. N represents the "Raft cluster size", and the database is composed of many "Raft clusters", each responsible for a segment of the data space. You can horizontally scale the database cluster by adding more nodes and thus more "Raft clusters" (see the sketch after this list)
- Raft/Paxos are used to ensure linearizable writes to those replicas through a leader
- Systems like Spanner and Cockroach have higher write latencies because the replicas in a Raft cluster have to reach quorum, but read latency is mostly unaffected, since reads can always be served from the leader of a cluster via things like leader leases.
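To make the "many small Raft clusters inside one big database cluster" idea concrete, here's a minimal, purely illustrative sketch; the hash-based placement and node names are my own assumptions, not how any of these systems actually place replicas:

```python
# Illustrative sketch: a 21-node cluster storing many ranges, each replicated
# on N=3 nodes that form their own Raft group. Placement here is a toy
# hash-based scheme purely for illustration.
from dataclasses import dataclass

CLUSTER_NODES = [f"node-{i}" for i in range(21)]  # total cluster size
REPLICATION_FACTOR = 3                            # "Raft cluster size" per range


@dataclass
class RaftGroup:
    range_id: int
    replicas: list   # the N nodes holding this range
    leader: str      # serves reads/writes while it holds the lease

    @classmethod
    def place(cls, range_id: int) -> "RaftGroup":
        # Pick N consecutive nodes starting at a hash of the range id.
        start = hash(range_id) % len(CLUSTER_NODES)
        replicas = [CLUSTER_NODES[(start + i) % len(CLUSTER_NODES)]
                    for i in range(REPLICATION_FACTOR)]
        return cls(range_id, replicas, leader=replicas[0])


# The key space is split into many ranges; each range is its own Raft group.
groups = [RaftGroup.place(r) for r in range(8)]
for g in groups:
    print(f"range {g.range_id}: replicas={g.replicas} leader={g.leader}")

# A write to a key goes through the leader of its range and must reach a
# quorum (2 of 3 here); a read can be served by the leader alone under a lease.
```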
> The standard deposit insurance coverage limit is $250,000 per depositor, per FDIC-insured bank, per ownership category. Deposits held in different ownership categories are separately insured, up to at least $250,000, even if held at the same bank.
Demand Deposit Account - the kind of account that allows cleared funds to be withdrawn without advance notice. Probably 99.999% of the accounts one can write checks on, do ACH payments from, send wires from, or transfer money from - such as checking and savings accounts - fall into this category.
> The standard deposit insurance coverage limit is $250,000 per depositor, per FDIC-insured bank, per ownership category
My understanding is that this might cause an unwelcome surprise to (for example) someone with a personal account at Bank A, and a sweep account at Brokerage P that sends its funds into accounts at Banks A, B, and C.
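To make that surprise concrete, here's a small worked example; the dollar amounts are hypothetical, and it simply applies the "per depositor, per bank, per ownership category" aggregation quoted above (it is not advice about any particular sweep program):

```python
# Worked example: deposits of the same person, in the same ownership category,
# at the same bank aggregate toward a single $250,000 limit -- including funds
# a brokerage sweep happens to park at that bank. Numbers are hypothetical.
FDIC_LIMIT = 250_000

deposits = [
    ("Bank A", "personal checking", 200_000),
    ("Bank A", "sweep from Brokerage P", 100_000),  # lands at the same bank!
    ("Bank B", "sweep from Brokerage P", 100_000),
    ("Bank C", "sweep from Brokerage P", 100_000),
]

per_bank = {}
for bank, _desc, amount in deposits:
    per_bank[bank] = per_bank.get(bank, 0) + amount

for bank, total in per_bank.items():
    insured = min(total, FDIC_LIMIT)
    uninsured = max(total - FDIC_LIMIT, 0)
    print(f"{bank}: total={total:,} insured={insured:,} uninsured={uninsured:,}")

# Bank A ends up with $300,000 in the same ownership category, so $50,000 of
# it is uninsured, even though the depositor never opened a second account there.
```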
> Is it necessary to fully replicate the persistent store onto the striped SSD array? I admire such a simple solution, but I wonder if something like an LRU cache would achieve similar speedups while using fewer resources. On the other hand, it could be a small cost to pay for a more consistent and predictable workload.
One of the reasons an LRU cache like dm-cache wasn't feasible was that we had a higher-than-acceptable bad sector read rate, which would cause a cache like dm-cache to bubble a block device error up to the database. The database would then shut itself down when it encountered a disk-level error.
> How does md handle a synchronous write in a heterogenous mirror? Does it wait for both devices to be written?
Yes, md waits for both mirrors to be written.
> I'm also curious how this solution compares to allocating more ram to the servers, and either letting the database software use this for caching, or even creating a ramdisk and putting that in raid1 with the persistent storage. Since the SSDs are being treated as volatile anyways. I assume it would be prohibitively expensive to replicate the entire persistent store into main memory.
Yeah, we're talking many terabytes.
> I'd also be interested to know how this compares with replacing the entire persistent disk / SSD system with zfs over a few SSDs (which would also allow snapshoting). Of course it is probably a huge feature to be able to have snapshots be integrated into your cloud...
Would love if we could've used ZFS, but Scylla requires XFS.
Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly? (I have been doing this kind of thing using bcache over a RAID array of EBS drives on Amazon for around a decade now, and I was surprised you were so concerned about those read failures that you'd forgo the LRU benefits, when to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.)
> Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly?
Our MTTF estimates for our larger clusters implied a real risk of multiple nodes stopping simultaneously due to bad sector reads. Remediation in that case would basically require clearing and rewarming the cache, which for large data sets could be on the order of an hour or more, and we'd lose quorum availability during that time.
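To get a feel for why that risk matters at scale, here's a rough back-of-envelope sketch. Every number in it (MTTF, cluster size, rewarm time) is a made-up placeholder rather than anything Discord has published, and it assumes failures are independent and Poisson-distributed:

```python
# Back-of-envelope sketch with made-up numbers: given a per-node rate of
# "stop due to bad sector read" events and a ~1 hour cache-rewarm window,
# estimate how often a second replica in the same 3-node replica set fails
# while the first is still recovering (i.e., quorum is at risk).
import math

NODE_MTTF_HOURS = 30 * 24  # hypothetical: one bad-sector stop per node-month
REWARM_HOURS = 1.0         # hypothetical: time to clear and rewarm the cache
NODES = 72                 # hypothetical large cluster
RF = 3                     # replicas per replica set

rate_per_node = 1.0 / NODE_MTTF_HOURS

# Expected node stoppages across the cluster per year.
stops_per_year = NODES * rate_per_node * 24 * 365

# While one replica rewarms, the other RF-1 replicas of that data each have
# roughly rate * REWARM_HOURS chance of also stopping (Poisson approximation).
p_second_failure = 1 - math.exp(-(RF - 1) * rate_per_node * REWARM_HOURS)

quorum_scares_per_year = stops_per_year * p_second_failure
print(f"node stoppages/year: {stops_per_year:.0f}")
print(f"P(overlapping replica failure per stoppage): {p_second_failure:.4%}")
print(f"expected quorum-loss events/year: {quorum_scares_per_year:.2f}")
```

Even with these toy numbers the overlap happens a couple of times a year, which is a lot of hour-long quorum outages for a database expected to stay up.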
> to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.)
In Scylla/Cassandra, you need to run full repairs that scan over all of the data. Having an LRU cache doesn't work well with this.
Curious what the perf tradeoffs are of multiple EBS volumes vs. a single large one? I know of their fixed IOPS-per-100GB ratio or whatnot; maybe there is some benefit to splitting that up across devices.
Is it about the ability to dynamically add extra capacity over time or something?
Yeah: I do it when the max operations or bandwidth per disk is less than what I "should get" for the size and cost allocated. When I last did this, the per-volume maxes were simply much smaller; I'm recently doing it again because I have something like 40TB of data, which I split up among (I think) 14 drives that can then be one of the cheapest EBS variants designed more for cold storage (the overall utilization isn't anywhere near as high as the size would imply: it's just a lot of mostly dead data, some of which people care much more about).
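A bit of hedged arithmetic on why splitting can help: many EBS volume types come with a fixed per-volume baseline of IOPS and throughput, so N smaller volumes striped together give roughly N times that baseline for the same total capacity. The limits below are placeholders for illustration, not current EBS figures:

```python
# Illustrative arithmetic: if a volume type has a fixed per-volume baseline of
# IOPS/throughput regardless of size, N smaller volumes striped together give
# roughly N times the baseline for the same total capacity. The numbers below
# are placeholders, not current EBS limits.
TOTAL_TB = 40
PER_VOLUME_BASELINE_MIBPS = 125   # placeholder baseline throughput per volume
PER_VOLUME_BASELINE_IOPS = 3000   # placeholder baseline IOPS per volume

for num_volumes in (1, 4, 14):
    size_tb = TOTAL_TB / num_volumes
    agg_mibps = num_volumes * PER_VOLUME_BASELINE_MIBPS
    agg_iops = num_volumes * PER_VOLUME_BASELINE_IOPS
    print(f"{num_volumes:>2} x {size_tb:5.1f} TB -> "
          f"~{agg_mibps} MiB/s, ~{agg_iops} IOPS aggregate baseline")
```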
Is the bad sector read rate abnormally high? Are GCE's SSDs particularly error prone? Or is the failure rate typical, but a bad sector read is just incredibly expensive?
I assume you investigated using various RAID levels to make an LRU cache acceptably reliable?
It's also surprising to me that GCE doesn't provide a suitable out of the box storage solution. I thought a major benefit of the cloud is supposed to be not having to worry about things like device failures. I wonder what technical constraints are going on behind the scenes at GCE.
Yes, incredibly error prone. Bad sector reads were observed at an alarming rate over a short period - well beyond what you'd expect if you were to just buy an enterprise NVMe drive and slap it into a server.
Scylla (and Cassandra) provides cluster-level replication. Even with only local NVMes, a single node failure with loss of data would be tolerated. But relying on "ephemeral local SSDs" that nodes can lose if any VM is power-cycled adds additional risk that some incident could cause multiple replicas to lose their data.
It seems that the biggest issue then is that the storage primitives that are available (ephemeral local storage and persistent remote storage) make it hard to have high performance and highly resilient stateful systems.
That's a good observation. We've spent a lot of time on our control plane, which handles the various RAID1 failure modes; e.g. when a RAID1 degrades due to a failed local SSD, we force-stop the node so that it doesn't continue to operate as a slow node. Wait for part 2! :)
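For a rough sense of what detecting that condition can look like, here's a minimal sketch of a degraded-RAID1 watchdog based on the standard /proc/mdstat format. It's purely illustrative and is not the control plane described above; a real system would fence or force-stop the node rather than just print:

```python
# Minimal sketch of a degraded-RAID1 watchdog: parse /proc/mdstat and flag any
# md array whose member-status brackets show a missing device (an "_" in the
# [UU] field). What to do on detection -- here just a print -- is up to the
# operator's control plane; the real system described above is more involved.
import re
import time

STATUS_RE = re.compile(r"\[(?P<flags>[U_]+)\]")  # e.g. [UU] healthy, [U_] degraded

def degraded_arrays(mdstat_text: str) -> list:
    degraded = []
    current_md = None
    for line in mdstat_text.splitlines():
        if line.startswith("md"):
            current_md = line.split()[0]  # e.g. "md0"
        match = STATUS_RE.search(line)
        if match and current_md and "_" in match.group("flags"):
            degraded.append(current_md)
    return degraded

while True:
    with open("/proc/mdstat") as f:
        bad = degraded_arrays(f.read())
    if bad:
        # A real control plane might fence or force-stop the node here.
        print(f"degraded md arrays detected: {bad}")
    time.sleep(10)
```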