For my company, neither egress nor storage cost is the big issue. It's the API call (PUT) cost.
We deal with payloads that are just a little too big for a database (we run Postgres and ClickHouse), but too frequent (~100 per second) and too small (think largish JSON blobs) to be cost-effective on S3.
We are write heavy. Reads are probably 1% but need to be instant for a good UI and API experience.
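For context on why the per-request pricing dominates here, a back-of-the-envelope calculation (assuming the standard S3 PUT price of roughly $0.005 per 1,000 requests, which varies by region and storage class):

    # Rough monthly PUT cost at ~100 writes/second, assuming S3 Standard
    # pricing of about $0.005 per 1,000 PUT requests (varies by region).
    puts_per_month = 100 * 60 * 60 * 24 * 30        # ~259 million requests
    cost = puts_per_month * 0.005 / 1000
    print(f"~${cost:,.0f}/month just for PUTs")     # roughly $1,300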
What I have seen done before is concatenating many small blobs into a single large blob that is stored on S3. This works great for batch processing afterwards.
If you need read access to the individual objects, one option is to merge them into a large blob and then create a small index file that keeps the offsets of each tiny blob. You fetch the index file, find the offset of the tiny blob you want, and do a range request at that offset into the large blob.
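A rough sketch of that pattern with boto3 (the bucket and key names are placeholders, and the index format is just a hypothetical JSON map of blob id to offset/length):

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-bucket"  # placeholder bucket name

    # Write side: concatenate small JSON blobs into one object, record offsets.
    def upload_batch(blobs, batch_key):
        index, parts, offset = {}, [], 0
        for blob_id, data in blobs.items():
            index[blob_id] = {"offset": offset, "length": len(data)}
            parts.append(data)
            offset += len(data)
        s3.put_object(Bucket=BUCKET, Key=batch_key, Body=b"".join(parts))
        s3.put_object(Bucket=BUCKET, Key=batch_key + ".index.json",
                      Body=json.dumps(index).encode())

    # Read side: fetch the index, then range-request only the blob you need.
    def fetch_blob(batch_key, blob_id):
        raw = s3.get_object(Bucket=BUCKET, Key=batch_key + ".index.json")["Body"].read()
        entry = json.loads(raw)[blob_id]
        start = entry["offset"]
        end = start + entry["length"] - 1
        resp = s3.get_object(Bucket=BUCKET, Key=batch_key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

Two PUTs per batch instead of one per blob, and each read is one cheap range GET (plus the index fetch, which you can cache).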
This mostly works when you're not read heavy. I recently built an index file for serving HTML files out of a tarball, as an alternative to uploading many small files.
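If it helps, the offsets for that kind of index can be pulled straight from the tarball with Python's tarfile module, roughly like this (a sketch; it only makes sense for an uncompressed tar, and the file names are made up):

    import json
    import tarfile

    # Build an offset index for every regular file in an uncompressed tarball,
    # so a client can later do "Range: bytes=offset-(offset+size-1)" requests
    # against the uploaded tarball instead of downloading the whole thing.
    def build_tar_index(tar_path):
        index = {}
        with tarfile.open(tar_path, "r") as tar:
            for member in tar:
                if member.isfile():
                    index[member.name] = {"offset": member.offset_data,
                                          "size": member.size}
        return index

    with open("site.tar.index.json", "w") as f:   # hypothetical file names
        json.dump(build_tar_index("site.tar"), f)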
Have you looked at Kinesis Firehose? It was pretty much built for this use case, although you will still need to see if you can define a partitioning scheme, probably in combination with an S3 Select query, to meet your query requirements.
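For reference, an S3 Select call against a JSON-lines object that Firehose delivered might look roughly like this (a sketch; the bucket, key, and field names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Query a single JSON-lines object delivered by Firehose, pulling only the
    # matching records instead of downloading the whole file.
    resp = s3.select_object_content(
        Bucket="example-bucket",                  # placeholder
        Key="firehose/2024/01/01/batch.json",     # placeholder partitioned key
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s WHERE s.check_id = 'abc123'",
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode())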
We are using Kinesis. It's fine. Great, actually. We still need to store user logs and generated data persistently. Cold storage is also not an option: this is data that needs to be accessible the moment the event that generates it happens. Don't want to push my product too much, but I run a synthetic monitoring company. Check my bio and you'll get the gist of the type of workloads.
You can host your own S3 API-compatible object storage service on some EC2 instances (exercise left to the reader to figure out how to make that reliable). Zero PUT cost, higher operational overhead.
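If you go that route (e.g. MinIO, which speaks the S3 API), existing SDK code mostly just needs its endpoint pointed at your own instances. A sketch, with the endpoint and credentials obviously placeholders:

    import boto3

    # Point the standard SDK at a self-hosted S3-compatible service (e.g. MinIO)
    # instead of AWS; the rest of the application code stays the same.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://objects.internal.example.com",   # placeholder
        aws_access_key_id="LOCAL_ACCESS_KEY",                   # placeholder
        aws_secret_access_key="LOCAL_SECRET_KEY",                # placeholder
    )
    s3.put_object(Bucket="logs", Key="2024/check-123.json", Body=b"{}")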
Yes, but it does blow up TOAST and has a big impact on deletion behavior on busy tables. We removed all larger JSON blobs from PG. Typical settings or config stuff stored as JSON in PG is fine; we use that all the time. But larger JSON blobs of several kilobytes are still an issue for semi-timeseries data.
Could you elaborate on the TOAST issues you're having? We're pretty liberal with our use of large JSONB objects and might hit a billion objects in a year or so.