
For my company, neither egress nor storage cost is the big issue. It’s the API call (PUT) cost.

We deal with payloads that are a little too big for a database (we run Postgres and ClickHouse) but too frequent (~100 per second) and too small (think largish JSON blobs) to be cost-effective on S3.

We are write heavy. Reads are probably 1% but need to be instant for a good UI and API experience.



Yeah, S3 is not for tiny blobs.

What I have seen done before is concatenating many small blobs into a single large blob that is stored on S3. This works great for batch processing afterwards.

If you need read access to the individual objects, one option is to merge them into a large blob and create a small index file that keeps the offset of each tiny blob. Then you fetch the index file, find the offset of the blob you want, and do a range request for that offset into the large blob.

This mostly works when you're not read heavy. I recently built an index file for serving HTML files out of a tarball, as an alternative to uploading many small files.
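For reference, a minimal sketch of the pack-plus-index idea in Python with boto3. The bucket name, key layout, and the JSON index format ({id: [offset, length]}) are all assumptions for illustration, not a standard:

  import json
  import boto3

  s3 = boto3.client("s3")
  BUCKET = "my-bucket"  # hypothetical bucket name

  def pack_and_upload(blobs, key):
      """Concatenate many small blobs into one object plus a sidecar index."""
      index, parts, offset = {}, [], 0
      for blob_id, data in blobs.items():
          index[blob_id] = [offset, len(data)]
          parts.append(data)
          offset += len(data)
      s3.put_object(Bucket=BUCKET, Key=key, Body=b"".join(parts))
      s3.put_object(Bucket=BUCKET, Key=key + ".index.json",
                    Body=json.dumps(index).encode())

  def fetch_one(key, blob_id):
      """Read the index, then range-request just the one blob we need."""
      index = json.loads(
          s3.get_object(Bucket=BUCKET, Key=key + ".index.json")["Body"].read())
      offset, length = index[blob_id]
      resp = s3.get_object(Bucket=BUCKET, Key=key,
                           Range="bytes={}-{}".format(offset, offset + length - 1))
      return resp["Body"].read()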


Have you looked at Kinesis Firehose? It was pretty much built for this use case, although you will still need to see whether you can define a partitioning scheme, probably in combination with an S3 Select query, that meets your query requirements.

https://aws.amazon.com/kinesis/data-firehose/?nc=sn&loc=0

https://aws.amazon.com/blogs/aws/s3-glacier-select/
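If the partitioning works out, an S3 Select call against a Firehose-delivered, newline-delimited JSON object might look roughly like this sketch. The bucket, key layout, and the customer_id field are assumptions about how your data could be laid out:

  import boto3

  s3 = boto3.client("s3")

  resp = s3.select_object_content(
      Bucket="firehose-landing",                 # hypothetical bucket
      Key="logs/2021/11/23/part-0000.json",      # hypothetical partitioned key
      ExpressionType="SQL",
      Expression="SELECT * FROM S3Object s WHERE s.customer_id = 'abc123'",
      InputSerialization={"JSON": {"Type": "LINES"}},
      OutputSerialization={"JSON": {}},
  )

  # The result comes back as an event stream; collect the record payloads.
  for event in resp["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode())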


We are using Kinesis. It’s fine. Great, actually. We still need to store user logs and generated data persistently. Cold storage is also not an option: this is data that needs to be accessible the moment the event that generates it happens. Don’t want to push my product too much, but I run a synthetic monitoring company. Check my bio and you’ll get the gist of the type of workloads.


> For my company, neither egress nor storage cost are the big issue. It’s the API call (PUT) cost.

> We are write heavy. Reads are probably 1% but need to be instant for a good UI and API experience.

It sounds like the recently released OVH High Performance Object Storage[1] might be a good fit.

It has better performance than S3[2], completely free API calls, and $0.015 / GB egress.

[1] https://corporate.ovhcloud.com/en/newsroom/news/high-perform...

[2] https://blog.ovhcloud.com/what-is-the-real-performance-of-th...


You can host your own S3 API-compatible object storage service on some EC2 instances (exercise left to the reader to figure out how to make that reliable). Zero PUT cost, higher operational overhead.

  Minio: https://github.com/minio/minio
  SeaweedFS: https://github.com/chrislusf/seaweedfs
  Ceph: https://ceph.com/en/discover/technology/
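Since these all speak the S3 API, switching mostly comes down to pointing your existing client at your own endpoint. A rough sketch with boto3 against a MinIO deployment (the endpoint URL and credentials are placeholders):

  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="http://minio.internal:9000",   # your MinIO/SeaweedFS/Ceph RGW endpoint
      aws_access_key_id="MINIO_ACCESS_KEY",        # placeholder credentials
      aws_secret_access_key="MINIO_SECRET_KEY",
  )

  # PUTs now hit your own hardware: no per-request API charge, but you pay in
  # instances, disks, and operational effort instead.
  s3.put_object(Bucket="events", Key="2021/11/23/event-1.json", Body=b"{}")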


This is a very naive question, so I might be very wrong, but isn't Postgres pretty flexible about storing objects now?


100/s is roughly 3 billion per year, and Postgres has a hard limit of about 4 billion blobs in a column (OIDs are 32-bit).

And that ignores the reality that RDS storage is comparatively expensive.


If you think a table might get anywhere near this size (blobs or not), I highly recommend table partitioning.
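For anyone who hasn't set it up before, here is a sketch of declarative range partitioning by time via psycopg2; the table and column names are made up for illustration:

  import psycopg2

  conn = psycopg2.connect("dbname=app")
  with conn, conn.cursor() as cur:
      # Parent table is partitioned by time, so no single table or index grows forever.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS events (
              id         bigserial,
              created_at timestamptz NOT NULL,
              payload    jsonb
          ) PARTITION BY RANGE (created_at);
      """)
      # One partition per month; old partitions can be detached or dropped cheaply.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS events_2021_11
              PARTITION OF events
              FOR VALUES FROM ('2021-11-01') TO ('2021-12-01');
      """)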


My problem is customer ops, not so much ours: they run our software.


That's not true. Our ops struggle too; 32-bit OIDs really are a barrier.


Yes, but it does blow up TOAST and has a big impact on deletion behavior on busy tables. We removed all larger JSON blobs from PG. Typical settings or config stuff stored as JSON in PG is fine; we use that all the time. But larger JSON blobs of several kilobytes are still an issue for semi-timeseries data.


Could you elaborate on the TOAST issues you're having? We're pretty liberal with our use of large JSONB objects and might hit a billion objects in a year or so.


Have you tried using PostgreSQL Large Objects (LOs)?
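For context, the basic large-object workflow through psycopg2 looks roughly like this sketch (names are illustrative). Note that large objects are addressed by 32-bit OIDs, which is the ~4 billion ceiling mentioned upthread:

  import psycopg2

  conn = psycopg2.connect("dbname=app")

  # Large objects must be accessed inside a transaction.
  with conn:
      lo = conn.lobject(0, "wb")   # oid=0 asks the server to assign a new OID
      lo.write(b'{"event": "check", "status": "ok"}')
      oid = lo.oid
      lo.close()

  with conn:
      print(conn.lobject(oid, "rb").read())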



