"I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T."
I was intuitively wondering the same, but I'm having trouble reasoning about how the post's example with transactions 1, 2, 3, and 4 exhibits this behavior. In the example, is transaction 2 the only read-only transaction, and therefore the only transaction to read from the read replica? I.e., transactions 1, 3, and 4 use the primary and transaction 2 uses the read replica?
Write-Ahead Log (WAL) application is single-threaded and maintains consistency at a specific point in time within each instance. However, there can be anomalies between two instances. This behavior is expected because the RDS Multi-AZ cluster does not wait for changes to be applied to the shared buffers; it only waits for the WAL to sync.
This is similar to the behavior of PostgreSQL when synchronous_commit is set to on. Nothing unexpected.
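For anyone who wants to see that gap concretely on plain PostgreSQL streaming replication (the analogue of what's being described), here's a minimal sketch: it compares flush_lsn (WAL durably received by the standby) against replay_lsn (WAL actually applied and visible to reads) via pg_stat_replication on the primary. The DSN is a placeholder and psycopg2 is assumed to be installed; this is just an illustration, not the RDS-specific machinery.

```python
# Minimal sketch: compare how far each standby has flushed vs. replayed WAL.
# Assumes psycopg2 is installed; the DSN below is a placeholder.
import psycopg2

PRIMARY_DSN = "host=primary.example dbname=postgres user=monitor"  # placeholder

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        # pg_stat_replication is populated on the primary, one row per standby.
        cur.execute("""
            SELECT application_name,
                   flush_lsn,   -- WAL durably flushed on the standby
                   replay_lsn,  -- WAL actually applied (visible to reads)
                   pg_wal_lsn_diff(flush_lsn, replay_lsn) AS apply_lag_bytes
            FROM pg_stat_replication
        """)
        for name, flush, replay, lag in cur.fetchall():
            print(f"{name}: flushed={flush} replayed={replay} apply lag={lag} bytes")
```

With synchronous_commit=on, a commit waits for the flush but not the replay, so apply_lag_bytes is exactly the "synced but not yet applied" window described above.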
Ah, so something like: if the primary ordered transaction 3 < transaction 1, but transaction 2 observes only transaction 1 on the read-only secondary, potentially because the secondary orders transaction 1 < transaction 3?
The Persistence Infrastructure team develops and operates Discord’s real-time datastore systems that serve the data of Discord's 150M+ monthly active users—including over a trillion messages! We work across multiple systems areas: databases, disk storage and Rust-based data access services. We're a small team whose work has a huge impact on our organization's success and ability to grow.
Hey Mike, I've applied to Discord a few times, but have received automated rejections each time. Can I send you my resume to see what's causing the automated rejections? I love Discord and use it daily, so working for the team would be incredible.
I think there's some confusion about what distributed databases like Spanner, Cockroach, etc. do:
- A given row of data is configured to have N replicas, e.g. 3, 5, etc., but N doesn't have to equal the actual size of the database cluster. E.g. you can have a database cluster of 21 nodes with N being 3. N represents the "Raft cluster size", and the database is composed of many "Raft clusters", each responsible for a segment of the data space. You can horizontally scale the database cluster by adding more nodes and thus more "Raft clusters" (see the sketch after this list)
- Raft/Paxos are used to ensure linearizable writes to those replicas through a leader
- Systems like Spanner and Cockroach have higher write latencies because the replicas in a Raft cluster have to reach quorum, but read latency is mostly unaffected, since reads can always be served from the leader of a cluster via things like leader leases.
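To make the "many small Raft clusters inside one big database cluster" idea concrete, here's a minimal, purely illustrative sketch; the hash-based placement and node names are my own assumptions, not how any of these systems actually place replicas:

```python
# Illustrative sketch: a 21-node cluster storing many ranges, each replicated
# on N=3 nodes that form their own Raft group. Placement here is a toy
# hash-based scheme purely for illustration.
from dataclasses import dataclass

CLUSTER_NODES = [f"node-{i}" for i in range(21)]  # total cluster size
REPLICATION_FACTOR = 3                            # "Raft cluster size" per range


@dataclass
class RaftGroup:
    range_id: int
    replicas: list   # the N nodes holding this range
    leader: str      # serves reads/writes while it holds the lease

    @classmethod
    def place(cls, range_id: int) -> "RaftGroup":
        # Pick N consecutive nodes starting at a hash of the range id.
        start = hash(range_id) % len(CLUSTER_NODES)
        replicas = [CLUSTER_NODES[(start + i) % len(CLUSTER_NODES)]
                    for i in range(REPLICATION_FACTOR)]
        return cls(range_id, replicas, leader=replicas[0])


# The key space is split into many ranges; each range is its own Raft group.
groups = [RaftGroup.place(r) for r in range(8)]
for g in groups:
    print(f"range {g.range_id}: replicas={g.replicas} leader={g.leader}")

# A write to a key goes through the leader of its range and must reach a
# quorum (2 of 3 here); a read can be served by the leader alone under a lease.
```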
> The standard deposit insurance coverage limit is $250,000 per depositor, per FDIC-insured bank, per ownership category. Deposits held in different ownership categories are separately insured, up to at least $250,000, even if held at the same bank.
Demand Deposit Account - the kind of account that allows cleared funds to be withdrawn without advance notice. Probably 99.999% of the accounts one can write checks on, do ACH payments from, send wires from, or transfer money from - such as checking and savings accounts - fall into this category.
> The standard deposit insurance coverage limit is $250,000 per depositor, per FDIC-insured bank, per ownership category
My understanding is that this might cause an unwelcome surprise to (for example) someone with a personal account at Bank A, and a sweep account at Brokerage P that sends its funds into accounts at Banks A, B, and C.
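To make that surprise concrete, here's a small worked example; the dollar amounts are hypothetical, and it simply applies the "per depositor, per bank, per ownership category" aggregation quoted above (it is not advice about any particular sweep program):

```python
# Worked example: deposits of the same person, in the same ownership category,
# at the same bank aggregate toward a single $250,000 limit -- including funds
# a brokerage sweep happens to park at that bank. Numbers are hypothetical.
FDIC_LIMIT = 250_000

deposits = [
    ("Bank A", "personal checking", 200_000),
    ("Bank A", "sweep from Brokerage P", 100_000),  # lands at the same bank!
    ("Bank B", "sweep from Brokerage P", 100_000),
    ("Bank C", "sweep from Brokerage P", 100_000),
]

per_bank = {}
for bank, _desc, amount in deposits:
    per_bank[bank] = per_bank.get(bank, 0) + amount

for bank, total in per_bank.items():
    insured = min(total, FDIC_LIMIT)
    uninsured = max(total - FDIC_LIMIT, 0)
    print(f"{bank}: total={total:,} insured={insured:,} uninsured={uninsured:,}")

# Bank A ends up with $300,000 in the same ownership category, so $50,000 of
# it is uninsured, even though the depositor never opened a second account there.
```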
> Is it necessary to fully replicate the persistent store onto the striped SSD array? I admire such a simple solution, but I wonder if something like an LRU cache would achieve similar speedups while using fewer resources. On the other hand, it could be a small cost to pay for a more consistent and predictable workload.
One of the reasons an LRU cache like dm-cache wasn't feasible was that we had a higher-than-acceptable bad sector read rate, which would cause a cache like dm-cache to bubble a block device error up to the database. The database would then shut itself down when it encountered a disk-level error.
> How does md handle a synchronous write in a heterogenous mirror? Does it wait for both devices to be written?
Yes, md waits for both mirrors to be written.
> I'm also curious how this solution compares to allocating more ram to the servers, and either letting the database software use this for caching, or even creating a ramdisk and putting that in raid1 with the persistent storage. Since the SSDs are being treated as volatile anyways. I assume it would be prohibitively expensive to replicate the entire persistent store into main memory.
Yeah, we're talking many terabytes.
> I'd also be interested to know how this compares with replacing the entire persistent disk / SSD system with zfs over a few SSDs (which would also allow snapshoting). Of course it is probably a huge feature to be able to have snapshots be integrated into your cloud...
Would love if we could've used ZFS, but Scylla requires XFS.
Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly? (I have been doing this kind of thing using bcache over a RAID array of EBS drives on Amazon for around a decade now, and I was surprised you were so concerned about those read failures that you'd forgo the LRU benefits, when to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.)
> Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly?
Our MTTF estimates for our larger clusters implied a real risk of multiple nodes stopping simultaneously due to bad sector reads. Remediation in that case would basically require clearing and rewarming the cache, which for large data sets could be on the order of an hour or more, and we'd lose quorum availability during that time.
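To get a feel for why that risk matters at scale, here's a rough back-of-envelope sketch. Every number in it (MTTF, cluster size, rewarm time) is a made-up placeholder rather than anything Discord has published, and it assumes failures are independent and Poisson-distributed:

```python
# Back-of-envelope sketch with made-up numbers: given a per-node rate of
# "stop due to bad sector read" events and a ~1 hour cache-rewarm window,
# estimate how often a second replica in the same 3-node replica set fails
# while the first is still recovering (i.e., quorum is at risk).
import math

NODE_MTTF_HOURS = 30 * 24  # hypothetical: one bad-sector stop per node-month
REWARM_HOURS = 1.0         # hypothetical: time to clear and rewarm the cache
NODES = 72                 # hypothetical large cluster
RF = 3                     # replicas per replica set

rate_per_node = 1.0 / NODE_MTTF_HOURS

# Expected node stoppages across the cluster per year.
stops_per_year = NODES * rate_per_node * 24 * 365

# While one replica rewarms, the other RF-1 replicas of that data each have
# roughly rate * REWARM_HOURS chance of also stopping (Poisson approximation).
p_second_failure = 1 - math.exp(-(RF - 1) * rate_per_node * REWARM_HOURS)

quorum_scares_per_year = stops_per_year * p_second_failure
print(f"node stoppages/year: {stops_per_year:.0f}")
print(f"P(overlapping replica failure per stoppage): {p_second_failure:.4%}")
print(f"expected quorum-loss events/year: {quorum_scares_per_year:.2f}")
```

Even with these toy numbers the overlap happens a couple of times a year, which is a lot of hour-long quorum outages for a database expected to stay up.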
> to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.)
In Scylla/Cassandra, you need to run full repairs that scan over all of the data. Having an LRU cache doesn't work well with this.
Curious what the perf tradeoffs are of multiple EBS volumes vs. a single large one? I know of their fixed IOPS-per-100GB ratio or whatnot; maybe there is some benefit to splitting that up across devices.
Is it about the ability to dynamically add extra capacity over time or something?
Yeah: I do it when the max operations or bandwidth per disk is less than what I "should get" for the size and cost allocated. When I last did this, the per-volume maxes were simply much smaller; I'm recently doing it again because I have something like 40TB of data, which I split up among (I think) 14 drives that can then be one of the cheapest EBS variants designed more for cold storage (the overall utilization isn't anywhere near as high as the size would imply: it's just a lot of mostly dead data, some of which people care much more about).
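A bit of hedged arithmetic on why splitting can help: many EBS volume types come with a fixed per-volume baseline of IOPS and throughput, so N smaller volumes striped together give roughly N times that baseline for the same total capacity. The limits below are placeholders for illustration, not current EBS figures:

```python
# Illustrative arithmetic: if a volume type has a fixed per-volume baseline of
# IOPS/throughput regardless of size, N smaller volumes striped together give
# roughly N times the baseline for the same total capacity. The numbers below
# are placeholders, not current EBS limits.
TOTAL_TB = 40
PER_VOLUME_BASELINE_MIBPS = 125   # placeholder baseline throughput per volume
PER_VOLUME_BASELINE_IOPS = 3000   # placeholder baseline IOPS per volume

for num_volumes in (1, 4, 14):
    size_tb = TOTAL_TB / num_volumes
    agg_mibps = num_volumes * PER_VOLUME_BASELINE_MIBPS
    agg_iops = num_volumes * PER_VOLUME_BASELINE_IOPS
    print(f"{num_volumes:>2} x {size_tb:5.1f} TB -> "
          f"~{agg_mibps} MiB/s, ~{agg_iops} IOPS aggregate baseline")
```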
Is the bad sector read rate abnormally high? Are GCE's SSDs particularly error prone? Or is the failure rate typical, but a bad sector read is just incredibly expensive?
I assume you investigated using various RAID levels to make an LRU cache acceptably reliable?
It's also surprising to me that GCE doesn't provide a suitable out of the box storage solution. I thought a major benefit of the cloud is supposed to be not having to worry about things like device failures. I wonder what technical constraints are going on behind the scenes at GCE.
Yes, incredibly error prone. Bad sector reads were observed at an alarming rate over a short period - well beyond what you'd expect if you were to just buy an enterprise NVMe drive and slap it into a server.
Scylla (and Cassandra) provides cluster-level replication. Even with only local NVMes, a single node failure with loss of data would be tolerated. But relying on "ephemeral local SSDs" that nodes can lose if any VM is power-cycled adds additional risk that some incident could cause multiple replicas to lose their data.
It seems that the biggest issue then is that the storage primitives that are available (ephemeral local storage and persistent remote storage) make it hard to have high performance and highly resilient stateful systems.
That's a good observation. We've spent a lot of time on our control plane, which handles the various RAID1 failure modes; e.g. when a RAID1 degrades due to a failed local SSD, we force-stop the node so that it doesn't continue to operate as a slow node. Wait for part 2! :)
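For a rough sense of what detecting that condition can look like, here's a minimal sketch of a degraded-RAID1 watchdog based on the standard /proc/mdstat format. It's purely illustrative and is not the control plane described above; a real system would fence or force-stop the node rather than just print:

```python
# Minimal sketch of a degraded-RAID1 watchdog: parse /proc/mdstat and flag any
# md array whose member-status brackets show a missing device (an "_" in the
# [UU] field). What to do on detection -- here just a print -- is up to the
# operator's control plane; the real system described above is more involved.
import re
import time

STATUS_RE = re.compile(r"\[(?P<flags>[U_]+)\]")  # e.g. [UU] healthy, [U_] degraded

def degraded_arrays(mdstat_text: str) -> list:
    degraded = []
    current_md = None
    for line in mdstat_text.splitlines():
        if line.startswith("md"):
            current_md = line.split()[0]  # e.g. "md0"
        match = STATUS_RE.search(line)
        if match and current_md and "_" in match.group("flags"):
            degraded.append(current_md)
    return degraded

while True:
    with open("/proc/mdstat") as f:
        bad = degraded_arrays(f.read())
    if bad:
        # A real control plane might fence or force-stop the node here.
        print(f"degraded md arrays detected: {bad}")
    time.sleep(10)
```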