Hi HN, we're Joe and JD from Hydra (https://hydra.so/). We're excited to announce that our team has added parallel query execution and vectorization to columnar storage on Postgres. On our blog you can review the ClickBench benchmarks and read about how we built Hydra.
Starting today, we're offering 14-day free trials of our cloud-managed Hydra databases. Click the "Get Started for Free" button on (https://hydra.so/) to get one.
The project looks very interesting. But I had a look at the license of the Citus source code and it appears to be under the AGPL, and I didn't see an exception for the part of the code that you're including in Hydra. FWIU, AGPL code cannot be included in Apache-licensed code, although the reverse is allowed.
So, I'd be interested to know if I'm understanding this wrong or if there is some license exception I'm not seeing.
I'll answer my own question after doing some more digging. It seems that, as the columnar code is a self-contained PostgreSQL extension, you're only using the API, which would be fine as there is no linking involved.
The metadata can act as a basic form of indexing (or sometimes caching, though Hydra doesn't use metadata to calculate results yet), but it's not an index in the traditional sense. It's used to eliminate stripes and blocks from consideration during a scan.
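As a rough sketch of how that metadata gets used (the table and column names here are hypothetical, not taken from Hydra): a range filter like the one below lets the scan skip any stripe whose min/max for event_time falls entirely outside the requested window.

    -- "events" is assumed to be an existing columnar table with an
    -- event_time column. Stripes whose event_time min/max range does not
    -- overlap March can be eliminated before their column data is read.
    SELECT count(*)
    FROM events
    WHERE event_time >= '2023-03-01'
      AND event_time <  '2023-04-01';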
Columnar is not ideal for a `users` table where you want to select and update specific rows, often in very small, quick transactions (OLTP). You would want to continue to use a traditional (heap) table in that case. That's certainly something you can still do with Hydra; combining both kinds of tables is considered HTAP, and it's a unique use case of our product.
In contrast, columnar is best for "fact" tables -- data about something that happened (and thus does not change) that will be analyzed in aggregate. Those might be logs, events, transactions, etc.
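Here's a minimal sketch of that split (the table definitions are made up for illustration, and it assumes the columnar table access method is exposed as "columnar", as in the Citus-derived extension):

    -- "users" stays on the default heap (row) storage for OLTP-style
    -- lookups and updates.
    CREATE TABLE users (
        id         bigint PRIMARY KEY,
        email      text NOT NULL,
        created_at timestamptz DEFAULT now()
    );

    -- "events" is an append-mostly fact table stored columnar for
    -- aggregate analysis.
    CREATE TABLE events (
        event_time timestamptz NOT NULL,
        user_id    bigint,
        url        text
    ) USING columnar;

Both live in the same database, so a single query can mix the two -- that's the HTAP combination mentioned above.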
Yes, Hydra Columnar and PostGIS get along just fine. We've not looked into any PostGIS-specific optimizations yet, but if users run into issues, we'll be happy to investigate.
I'm a deep Postgres person and I'm very interested in the project. Congratulations, and welcome to the ecosystem as a new open source project.
One important recommendation: please do not run benchmarks on variable-performance storage (gp2 in this case), as it may yield different results depending on the burst-credit situation, potentially delivering more or less performance to different benchmark runs / scenarios.
Specifically, for 500GB you get the max throughput (250MB/s, which BTW is pretty low for an OLAP-style bench, in my opinion) but only 1.5K IOPS. See [1] for more information.
An easy alternative would have been gp3 volumes, which do not rely on burst credits and can be provisioned for up to 16K IOPS and 1 GB/s of throughput.
Yeah, I agree! However, ClickBench has used 500GB GP2 as the "standard" for some time, so I stuck with it for consistency. We use GP3 for our hosted service, and I tested on GP3 as well with the same settings as GP2; the results were very similar.
But something that is variable in time cannot be "consistent" by definition ;)
Good to know this is a known issue. My recommendation still holds: publish results with GP3; whatever others do (potentially wrongly) shouldn't prevent you from doing it right.
The one table used for ClickBench isn't that large. I assume these are the "warm" results that throw away the first execution of each query, and all the performance being measured here is in-memory.
That's an assumption. My #1 rule for a benchmark is that it should have reproducible results. Using variable-performance storage goes directly against reproducible results.
On a related topic: an OLAP benchmark with a small dataset that fits in memory caters only to what I'd consider a small set of OLAP use cases. I'd love to see one with a large dataset much bigger than memory.
How does it work for writes? I saw somewhere that the columnar tables are append-only. Are there any plans for storage that can be merged or updated into?
It is transactional, right? That part of Postgres doesn't change with the plugin.
Would you support a business model "install in customer cloud and we provide support" or is your own cloud the only option?
Currently columnar is append-only. There are plans to change / improve this!
Yes, Hydra is transactional -- that doesn't change.
Happy to make suggestions around "install and provide support." If you want to chat about it, click "get started for free" on hydra.so, book a time slot, and we can discuss.
I don't have a use case for this myself; I just wanted to share that for "traditionalist" companies, connecting to a "completely new cloud" is pretty difficult, while "run some software on EC2 with a support contract" is doable. I don't know if that is worth it to you as a possible model, of course.
A narrow distinction, but Hydra is Postgres - we only install an extension - while Greenplum and Redshift are forks but remain Postgres-compatible (to varying degrees). I'm not up on when Greenplum last merged updates from Postgres, but I would be concerned that it only runs on Ubuntu 18.04. If you have a look at the Greenplum install in ClickBench[1], you'll see it's not a typical Postgres setup. Hopefully we will be able to beat Greenplum straight-up soon. :)
Redshift is multi-node, which puts it in a different category -- with considerably higher costs.
I thought it was a genuine fork, just a very old (pre-v9 even) one?
Anyway, does it really matter? What is someone looking for a fast 'postgres' for analytics actually interested in?
(I didn't realise this was just an extension - in which case I'm amazed it's possible, but that obviously makes it an easy sell if you're already running pg. But if you're shopping around for managed solutions (which is obviously what Hydra wants to sell) with 'postgres' as a criterion, you're interested in the query language and maybe the wire protocol, surely?)
Many Postgres features aren't supported on Redshift (set returning functions, indexes, ...) and many tools that work just fine with Postgres error out because Redshift does things differently or doesn't support features that Postgres does.
Of course, that's the power of Postgres. You can join between columnar tables or between columnar and heap (row-based) tables. The performance of joins hasn't been a specific focus of our engineering work yet, but I made a little test of enriching an analytical query with user data here:
https://gist.github.com/wuputah/e62b83f86880bd7e6623809afe4c...
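Roughly, the shape of that kind of query looks like this (a purely illustrative sketch using the hypothetical users/events tables from earlier in the thread, not the contents of the gist):

    -- Aggregate over the columnar fact table, enriched with attributes
    -- from the row-based users table (hypothetical schema).
    SELECT u.email, count(*) AS views
    FROM events e
    JOIN users u ON u.id = e.user_id
    WHERE e.event_time >= now() - interval '7 days'
    GROUP BY u.email
    ORDER BY views DESC
    LIMIT 10;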
Hydra open source (https://github.com/HydrasDB/hydra)
Power to the Postgres people!