Hi HN, we're Joe and JD from Hydra (https://hydra.so/). We're excited to announce that our team has added parallel query execution and vectorization to columnar storage on Postgres. On our blog you can review the ClickBench benchmarks and read about how we built Hydra.
Starting today, we're offering 14-day free trials of our cloud-managed Hydra databases. Click the "Get Started for Free" button on (https://hydra.so/) to get one.
The project looks very interesting. But I had a look at the license of the Citus source code and it appears to be under the AGPL, and I didn't see an exception for the part of the code that you're including in Hydra. FWIU, AGPL code cannot be included in Apache-licensed code, although the reverse is allowed.
So, I'd be interested to know if I'm understanding this wrong or if there is some license exception I'm not seeing.
I'll answer my own question after doing some more digging. It seems that, as the columnar code is a self-contained PostgreSQL extension, you're only using the API, which would be fine as there is no linking involved.
The metadata can act as a basic form of indexing (or sometimes caching, though Hydra doesn't use metadata to calculate results yet), but it's not an index in the traditional sense. It's used to eliminate stripes and blocks from consideration during a scan.
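As a rough sketch of how that metadata gets used (the table and column names here are hypothetical, not taken from Hydra): a range filter like the one below lets the scan skip any stripe whose min/max for event_time falls entirely outside the requested window.

    -- "events" is assumed to be an existing columnar table with an
    -- event_time column. Stripes whose event_time min/max range does not
    -- overlap March can be eliminated before their column data is read.
    SELECT count(*)
    FROM events
    WHERE event_time >= '2023-03-01'
      AND event_time <  '2023-04-01';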
Columnar is not ideal for a `users` table where you want to select and update specific rows, often in very small, quick transactions (OLTP). You would want to continue to use a traditional (heap) table in that case. That's certainly something you can still do with Hydra; combining both kinds of tables is considered HTAP, and it's a unique use case of our product.
In contrast, columnar is best for "fact" tables -- data about something that happened (and thus does not change) that will be analyzed in aggregate. Those might be logs, events, transactions, etc.
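Here's a minimal sketch of that split (the table definitions are made up for illustration, and it assumes the columnar table access method is exposed as "columnar", as in the Citus-derived extension):

    -- "users" stays on the default heap (row) storage for OLTP-style
    -- lookups and updates.
    CREATE TABLE users (
        id         bigint PRIMARY KEY,
        email      text NOT NULL,
        created_at timestamptz DEFAULT now()
    );

    -- "events" is an append-mostly fact table stored columnar for
    -- aggregate analysis.
    CREATE TABLE events (
        event_time timestamptz NOT NULL,
        user_id    bigint,
        url        text
    ) USING columnar;

Both live in the same database, so a single query can mix the two -- that's the HTAP combination mentioned above.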
Yes, Hydra Columnar and PostGIS get along just fine. We've not looked into any PostGIS-specific optimizations yet, but if users run into issues, we'll be happy to investigate.
I'm a deep Postgres person and I'm very interested in the project. Congratulations, and welcome to the ecosystem as a new open source project.
One important recommendation: please do not run benchmarks on variable-performance storage (gp2 in this case), as it may yield different results depending on the burst-credit situation, potentially delivering more or less performance to different benchmark runs / scenarios.
Specifically, for 500GB you get the max throughput (250MB/s, which BTW is pretty low for an OLAP-style bench, in my opinion) but only 1.5K IOPS. See [1] for more information.
An easy alternative would have been gp3 volumes, which do not rely on burst credits and can be provisioned for up to 16K IOPS and 1 GB/s of throughput.
Yeah, I agree! However, ClickBench has used 500GB GP2 as the "standard" for some time, so I stuck with it for consistency. We use GP3 for our hosted service, and I tested on GP3 as well with the same settings as GP2; the results were very similar.
But something that is variable in time cannot be "consistent" by definition ;)
Good to know this is a known issue. My recommendation still holds: publish results with GP3; whatever others do (potentially wrongly) shouldn't prevent you from doing it right.
The one table used for ClickBench isn't that large. I assume these are the "warm" results that throw away the first execution of each query, and all the performance being measured here is in-memory.
That's an assumption. My #1 rule for a benchmark is that it should have reproducible results. Using variable-performance storage goes directly against reproducible results.
On a related topic: an OLAP benchmark with a small dataset that fits in memory caters only to what I'd consider a small set of OLAP use cases. I'd love to see one with a large dataset much bigger than memory.
How does it work for writes? I saw somewhere that the columnar tables are append-only. Are there any plans for storage that can be merged or updated into?
It is transactional, right? That part of Postgres doesn't change with the plugin.
Would you support a business model "install in customer cloud and we provide support" or is your own cloud the only option?
Currently columnar is append-only. There are plans to change / improve this!
Yes, Hydra is transactional -- that doesn't change.
Happy to make suggestions around "install and provide support." If you want to chat about it, click "get started for free" on hydra.so, book a time slot, and we can discuss.
I don't have a use case for this myself; I just wanted to share that for "traditionalist" companies, connecting to a "completely new cloud" is pretty difficult, while "run some software on EC2 with a support contract" is doable. I don't know if that is worth it to you as a possible model, of course.
A narrow distinction, but Hydra is Postgres - we only install an extension - while Greenplum and Redshift are forks but remain Postgres-compatible (to varying degrees). I'm not up on when Greenplum last merged updates from Postgres, but I would be concerned that it only runs on Ubuntu 18.04. If you have a look at the Greenplum install in ClickBench[1], you'll see it's not a typical Postgres setup. Hopefully we will be able to beat Greenplum straight-up soon. :)
Redshift is multi-node, which puts it in a different category -- with considerably higher costs.
I thought it was a genuine fork, just a very old (pre-v9 even) one?
Anyway, does it really matter? What is someone looking for a fast 'postgres' for analytics actually interested in?
(I didn't realise this was just an extension - in which case I'm amazed it's possible, but that obviously makes it an easy sell if you're already running pg. But if you're shopping around for managed solutions (which is obviously what Hydra wants to sell) with 'postgres' as a criterion, you're interested in the query language and maybe the wire protocol, surely?)
Many Postgres features aren't supported on Redshift (set returning functions, indexes, ...) and many tools that work just fine with Postgres error out because Redshift does things differently or doesn't support features that Postgres does.
Of course, that's the power of Postgres. You can join between columnar tables or between columnar and heap (row-based) tables. The performance of joins hasn't been a specific focus of our engineering work yet, but I made a little test of enriching an analytical query with user data here:
https://gist.github.com/wuputah/e62b83f86880bd7e6623809afe4c...
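Roughly, the shape of that kind of query looks like this (a purely illustrative sketch using the hypothetical users/events tables from earlier in the thread, not the contents of the gist):

    -- Aggregate over the columnar fact table, enriched with attributes
    -- from the row-based users table (hypothetical schema).
    SELECT u.email, count(*) AS views
    FROM events e
    JOIN users u ON u.id = e.user_id
    WHERE e.event_time >= now() - interval '7 days'
    GROUP BY u.email
    ORDER BY views DESC
    LIMIT 10;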
Hydra open source (https://github.com/HydrasDB/hydra)
Power to the Postgres people!