For my company, neither egress nor storage cost is the big issue. It's the API call (PUT) cost.
We deal with payloads that are just a little too big for a database (we run Postgres and ClickHouse), but too frequent (~100 per second) and too small (think largish JSON blobs) to be cost-effective on S3.
We are write heavy. Reads are probably 1% but need to be instant for a good UI and API experience.
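For context on why the per-request pricing dominates here, a back-of-the-envelope calculation (assuming the standard S3 PUT price of roughly $0.005 per 1,000 requests, which varies by region and storage class):

    # Rough monthly PUT cost at ~100 writes/second, assuming S3 Standard
    # pricing of about $0.005 per 1,000 PUT requests (varies by region).
    puts_per_month = 100 * 60 * 60 * 24 * 30        # ~259 million requests
    cost = puts_per_month * 0.005 / 1000
    print(f"~${cost:,.0f}/month just for PUTs")     # roughly $1,300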
What I have seen done before is concatenating many small blobs into a single large blob that is stored on S3. This works great for batch processing afterwards.
If you need read access to the individual objects, one option is to merge them into a large blob and then create a small index file that keeps the offsets of each tiny blob. You fetch the index file, find the offset of the tiny blob you want, and do a range request at that offset into the large blob.
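A rough sketch of that pattern with boto3 (the bucket and key names are placeholders, and the index format is just a hypothetical JSON map of blob id to offset/length):

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-bucket"  # placeholder bucket name

    # Write side: concatenate small JSON blobs into one object, record offsets.
    def upload_batch(blobs, batch_key):
        index, parts, offset = {}, [], 0
        for blob_id, data in blobs.items():
            index[blob_id] = {"offset": offset, "length": len(data)}
            parts.append(data)
            offset += len(data)
        s3.put_object(Bucket=BUCKET, Key=batch_key, Body=b"".join(parts))
        s3.put_object(Bucket=BUCKET, Key=batch_key + ".index.json",
                      Body=json.dumps(index).encode())

    # Read side: fetch the index, then range-request only the blob you need.
    def fetch_blob(batch_key, blob_id):
        raw = s3.get_object(Bucket=BUCKET, Key=batch_key + ".index.json")["Body"].read()
        entry = json.loads(raw)[blob_id]
        start = entry["offset"]
        end = start + entry["length"] - 1
        resp = s3.get_object(Bucket=BUCKET, Key=batch_key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

Two PUTs per batch instead of one per blob, and each read is one cheap range GET (plus the index fetch, which you can cache).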
This mostly works when you're not read heavy. I recently built an index file for serving HTML files out of a tarball, as an alternative to uploading many small files.
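If it helps, the offsets for that kind of index can be pulled straight from the tarball with Python's tarfile module, roughly like this (a sketch; it only makes sense for an uncompressed tar, and the file names are made up):

    import json
    import tarfile

    # Build an offset index for every regular file in an uncompressed tarball,
    # so a client can later do "Range: bytes=offset-(offset+size-1)" requests
    # against the uploaded tarball instead of downloading the whole thing.
    def build_tar_index(tar_path):
        index = {}
        with tarfile.open(tar_path, "r") as tar:
            for member in tar:
                if member.isfile():
                    index[member.name] = {"offset": member.offset_data,
                                          "size": member.size}
        return index

    with open("site.tar.index.json", "w") as f:   # hypothetical file names
        json.dump(build_tar_index("site.tar"), f)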
Have you looked at Kinesis Firehose? It was pretty much built for this use case, although you will still need to see if you can define a partitioning scheme, probably in combination with an S3 Select query, to meet your query requirements.
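For reference, an S3 Select call against a JSON-lines object that Firehose delivered might look roughly like this (a sketch; the bucket, key, and field names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Query a single JSON-lines object delivered by Firehose, pulling only the
    # matching records instead of downloading the whole file.
    resp = s3.select_object_content(
        Bucket="example-bucket",                  # placeholder
        Key="firehose/2024/01/01/batch.json",     # placeholder partitioned key
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s WHERE s.check_id = 'abc123'",
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode())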
We are using Kinesis. It's fine. Great, actually. We still need to store user logs and generated data persistently. Cold storage is also not an option: this is data that needs to be accessible the moment the event that generates it happens. Don't want to push my product too much, but I run a synthetic monitoring company. Check my bio and you'll get the gist of the type of workloads.
You can host your own S3 API-compatible object storage service on some EC2 instances (exercise left to the reader to figure out how to make that reliable). Zero PUT cost, higher operational overhead.
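If you go that route (e.g. MinIO, which speaks the S3 API), existing SDK code mostly just needs its endpoint pointed at your own instances. A sketch, with the endpoint and credentials obviously placeholders:

    import boto3

    # Point the standard SDK at a self-hosted S3-compatible service (e.g. MinIO)
    # instead of AWS; the rest of the application code stays the same.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://objects.internal.example.com",   # placeholder
        aws_access_key_id="LOCAL_ACCESS_KEY",                   # placeholder
        aws_secret_access_key="LOCAL_SECRET_KEY",                # placeholder
    )
    s3.put_object(Bucket="logs", Key="2024/check-123.json", Body=b"{}")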
Yes, but it does blow up TOAST and has a big impact on deletion behavior on busy tables. We removed all larger JSON blobs from PG. Typical settings or config stuff stored as JSON in PG is fine; we use that all the time. But larger JSON blobs of several kilobytes are still an issue for semi-timeseries data.
Could you elaborate on the TOAST issues you're having? We're pretty liberal with our use of large JSONB objects and might hit a billion objects in a year or so.