Open table formats are inevitable for analytical datasets (ensembleanalytics.io)
60 points by benjaminwootton on Jan 18, 2024 | 58 comments


The one and a half downsides not mentioned here:

1: Separating storage and compute tends to result in relatively high startup latency (Databricks, here's looking at you!) when you have to provision new compute. For a massive batch job, waiting 10 minutes for a cluster to come up is fine, but a lot of orgs don't realize the cost implications and end up with tiny developer/analyst clusters, or aggressively spin inactive clusters down, which results in long wait times. Analyst/developer ergonomics have not traditionally been a major concern in the big data space.

1.5: Dealing with multiple files instead of a data warehouse SQL interface requires work, and can introduce interesting performance issues. Obviously you can put a SQL interface in front of the data, but most RDBMS / Data Warehouses have a lot of functionality around maintaining your data that you may not get with the native file format, so you get soft-locked into the metadata format that comes from your data lake file format. It's all open, so this is only a half issue, but there are switching costs.


(I do some training for Databricks) 1. Yeah, cluster startup time can be not fun. Here are some solutions:

- Pools (keep instances around so you don't have to wait for the cloud to provision them)

- Serverless SQL warehouses (viable if you're doing only SQL)

- One job with multiple tasks that share the same job cluster (sketched below). Delta Live Tables does a similar thing, but with streaming autoscaling.

- Streaming: the cluster never needs to go down, and multiple streams can share the same cluster so they load balance each other.
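For the shared job cluster option, here's a rough, hypothetical sketch of what a Jobs API payload can look like: both tasks reference the same cluster definition, so compute is provisioned once for the whole job rather than once per task (cluster size, node type and notebook paths are invented):

    # Hypothetical sketch of a Databricks Jobs API payload. Both tasks reference
    # the same job_cluster_key, so one cluster serves the whole job instead of
    # one per task. Cluster size, node type and notebook paths are invented.
    job_spec = {
        "name": "nightly_etl",
        "job_clusters": [
            {
                "job_cluster_key": "shared",
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 4,
                },
            }
        ],
        "tasks": [
            {
                "task_key": "ingest",
                "job_cluster_key": "shared",
                "notebook_task": {"notebook_path": "/Jobs/ingest"},
            },
            {
                "task_key": "transform",
                "depends_on": [{"task_key": "ingest"}],
                "job_cluster_key": "shared",
                "notebook_task": {"notebook_path": "/Jobs/transform"},
            },
        ],
    }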


How do you keep the pool costs manageable?

I see a lot of companies that get sold on Databricks and then are surprised by the cost.


Pool costs become more manageable as more clusters share the same pool. You can also avoid any incremental cost by setting the idle-instance timeout to 0, but that makes the pool less useful. You can have more clusters take advantage of it by using the same instance families. If you purchase reserved instances from Azure/AWS, you might as well make a pool with those as well. You may also want to check out fleet instance types.


> The one and a half downsides not mentioned here:

Make that 2.5. If you are building new analytic products, the most significant issue is that 90% of the information in data lakes is already sitting there and not in an open table format like Iceberg. Instead, it's heaps of CSV and Parquet files, maybe in Hive format but maybe just named in some unique way. If your query engine can't read these, it's like being all dressed up with no place to go.


CSV and Parquet are open file formats.

They're not good structured table formats, but they are open compared to the binary storage you get with Oracle or Snowflake.


The vast majority of analytics needs do not warrant a cluster.

There are many other competitive query engines that do not suffer from the startup latency.


100% agree! Most analytics are done in Excel; most analytics should be done in a SQLite or DuckDB database!


I don't see any mention of sqlite. Is a sqlite file not the same thing they're talking about here? Pretty sure it has a spec and hasn't changed formats in many years so if you wanted to read it out in something that isn't sqlite, I imagine it wouldn't be too hard.


This is more about arranging files in cloud storage than about the engine used to query them.

You could store your data in a SQLite database, but that's not really interoperable in the way a bunch of Parquet files are. Source: tried it.


Here's one more problem not mentioned by others: SQLite has some functions that it declares but doesn't define, for example regexp(). So, if your table is defined with a constraint that uses regexp()... well, sucks to be you: whether that constraint can be applied may depend on which regexp() implementation you load.

I know this because I needed this functionality in Ada code, which links statically with SQLite, and I didn't want to also link with the PCRE library just to get regexp(), especially since GNATCOLL already, sort of, has regexp... except it doesn't have the "fancy" features, such as lookaheads / lookbehinds / Unicode support etc.

So, a table that uses regexp() in its constraint definition isn't portable. But, if you only use the table data without the schema, you lose a lot of valuable information...
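To make this concrete, here's a minimal sketch using Python's stdlib sqlite3 bindings (table and pattern invented): the application has to supply regexp() itself before a REGEXP constraint can be enforced.

    import re
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # SQLite declares the REGEXP operator but ships no implementation; without
    # this registration the INSERT below fails with "no such function: REGEXP".
    # (deterministic=True needs Python 3.8+ / SQLite 3.8.3+.)
    conn.create_function(
        "regexp", 2,
        lambda pattern, value: re.search(pattern, value) is not None,
        deterministic=True,
    )

    conn.execute("CREATE TABLE t (code TEXT CHECK (code REGEXP '^[A-Z]{3}[0-9]+$'))")
    conn.execute("INSERT INTO t VALUES ('ABC123')")   # constraint passes
    # conn.execute("INSERT INTO t VALUES ('nope')")   # would raise IntegrityError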

----

Also, come to think of it: unlike client-server databases (e.g. MySQL, PostgreSQL), SQLite doesn't have a wire-transfer format. Its interface returns values in the way the C language understands them. The on-disk binary format isn't at all designed for transfer because it's optimized for access efficiency. Among other things, this means a SQLite database file typically has tons of empty space and doesn't use an efficient (compressed) value representation.

So, trying to use SQLite format for transferring data isn't going to be a good arrangement. It's going to be wasteful and slow.


The main disadvantage of sqlite compared to these other formats is that sqlite is designed for single machine processing. This can be problematic for huge datasets or complex queries.

These other formats easily support using clusters to process your data.


A sqlite file is no more single machine processing than a parquet file.


If there are no writers it doesn't matter how many readers there are.


I agree that sqlite has a number of similar benefits - openness, table abstractions and concurrent transactions. It’s also a library, which is close to how Delta, Iceberg and Hudi are implemented.

I’m glad it’s had an uptick in interest recently, but I haven’t seen it mentioned for analytics yet.

I assume it’s row rather than column oriented?


Yes, SQLite is row oriented. It's not a very space-efficient format because it also doesn't support compression or compact binary representations of numbers.

But it doesn't rely on the JVM and the typical JVM ecosystem. That is a big benefit for some use cases, like dealing with numerical data on the edge.


Delta Lake also isn't JVM-based. The main implementation is a Rust library with first-party Python bindings.
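For instance, a minimal sketch with the deltalake package (paths and data are placeholders) reads and writes a Delta table with no JVM involved:

    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    # Write a DataFrame out as a Delta table; the local path stands in for an
    # object-store URI such as s3://...
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    write_deltalake("/tmp/events", df)

    # Read it back; the transaction log handling happens in the Rust core.
    dt = DeltaTable("/tmp/events")
    print(dt.version())      # current table version
    print(dt.to_pandas())    # materialise as a DataFrame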


If you are searching for the analytics equivalent you should check out duckdb.


DuckDB is about querying existing files, or am I missing something?


The documentation puts a lot of emphasis on loading/querying existing files, as that's the first thing you will want to do for an OLAP use case, but in general it's the same as sqlite in that you maintain a database file.
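A small sketch of both modes with the DuckDB Python API (file and table names are placeholders):

    import duckdb

    # Persistent mode: like sqlite3.connect(), this creates and maintains a
    # single database file on disk.
    con = duckdb.connect("analytics.duckdb")
    con.execute("CREATE TABLE events AS SELECT * FROM read_parquet('events/*.parquet')")

    # Ad-hoc mode: query a file directly without importing it first.
    print(duckdb.sql("SELECT count(*) FROM 'events/part-0.parquet'").fetchone())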


Thanks. That's good to know. The file format is not highlighted in the documentation but I found some information about it: https://duckdb.org/internals/storage.html

It seems to be a bit early to rely on it to store data in an object store, but I will do some tests to compare with SQLite:

> The DuckDB internal storage format is currently in flux, and is expected to change with each release until we reach v1.0.0.


The DuckDB file format is the compressed, column-oriented equivalent of the SQLite file format.


The OLAP, column-oriented equivalent of SQLite is DuckDB.


These are metadata layers on top of file formats, designed to process many files (where the schema may change), typically in object storage, with volumes in excess of a petabyte. There is nothing preventing you from using sqlite with Iceberg, for example - except implementing it (which may involve changes in your execution engine too). Iceberg already supports Parquet, ORC, and Avro (which is more row oriented than column oriented).

Iceberg and SQLite might be interesting if you wanted to colocate two tables in the same file, for example. A smart enough data access layer with an appropriate execution engine could possibly see the data for both.
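To illustrate the metadata-layer point, here's a hedged PyIceberg sketch (catalog URI and table name are made up): the catalog and manifests tell the scan which data files to touch, whatever format those files happen to be in.

    from pyiceberg.catalog import load_catalog

    # The catalog resolves a table name to its current metadata file; the
    # metadata points at manifests, which point at the Parquet/ORC/Avro data files.
    catalog = load_catalog(
        "demo",
        **{"type": "rest", "uri": "http://localhost:8181", "warehouse": "s3://warehouse"},
    )
    table = catalog.load_table("analytics.events")

    # The filter is evaluated against the metadata first, so files that cannot
    # contain matching rows are never read.
    arrow_table = table.scan(row_filter="event_date >= '2024-01-18'").to_arrow()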


Column orientation is extremely important for query and storage efficiency. Furthermore, you need a distributed query engine (Athena, BigQuery, etc.), and they all support Parquet.


DuckDB is SQLite but column oriented!


Yes, we are using DuckDB right now. Love it. Works extremely well if you can partition your data down to a manageable size (a few GBs).
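For example (paths and column names invented), with a Hive-style partition layout DuckDB only has to touch the files for the partitions the query asks for:

    import duckdb

    # Data laid out as events/dt=2024-01-18/part-*.parquet (invented layout);
    # hive_partitioning exposes 'dt' as a column and the filter skips all other
    # partitions. (For s3:// paths you'd also need the httpfs extension.)
    duckdb.sql("""
        SELECT count(*)
        FROM read_parquet('events/*/*.parquet', hive_partitioning = true)
        WHERE dt = '2024-01-18'
    """).show()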


I'm a fan of datasette.io, which is built around sqlite. Essentially it's a set of tools that you point at a sqlite db to explore datasettes.

- https://datasette.io/examples


Thanks for the upvotes.

Here is a more technical deep dive comparing the three main open formats - https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-ap...


I talk to customers basically all day about table formats. Only one customer has really brought up Hudi in a meaningful way. IMO, Hudi is basically out of contention for 95%+ of people looking at table formats.


For a Spark shop Delta is the default choice. If you deploy to AWS then Glue encourages you to go with Iceberg. What makes people use Hudi?


That comparison blog seems biased toward Hudi.


Biased in what way? The authors provide solid arguments for why they think Hudi is a good tool.


I think it is also wrong about the capabilities. For example, Redshift should be able to read Iceberg via Redshift Spectrum.

https://docs.aws.amazon.com/redshift/latest/dg/querying-iceb...


Biased in that the authors seem to favor Hudi and the arguments for seem based on that favor rather than an objective presentation of all relevant factors.


Is there any such open source table format for row based data?

I've been dabbling with implementing a database, and adopting a format would be much easier.


Apache Avro [1] is one, but it has been largely replaced by Apache Parquet [2], which is a hybrid row/columnar format (see the sketch below).

[1] https://avro.apache.org/

[2] https://parquet.apache.org/
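The "hybrid" part shows up in the file metadata: a Parquet file is a sequence of row groups (horizontal slices of rows), and within each row group the values are stored one column chunk at a time. A small pyarrow sketch (file name is a placeholder):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("events.parquet")
    meta = pf.metadata

    # A Parquet file is a sequence of row groups (horizontal slices of rows)...
    print(meta.num_row_groups, "row groups,", meta.num_rows, "rows")

    # ...and inside each row group the values are stored one column chunk at a
    # time, which is what makes the format "hybrid" row/columnar.
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.total_compressed_size)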


As a note, Iceberg also supports Avro in addition to Parquet (and ORC).


SQLite is something like that.


Can we fix the title of this to the actual title of the blog per HN guidelines? The article isn't just about or even focused on Iceberg.


This is the best reference for comparing the 3 formats:

https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-ap...


It’s written by a group -really- trying to make one of them a thing, even though it’s in decline, so anyone reading it should keep that lens in mind.


"Open table formats" here seem to be the same as "data lakehouse" I've read other places.

It'd be great to agree on some nomenclature. Neither are very descriptive (or simple) in my view.


They are closely related.

A lakehouse would be a collection of tables.

There would be some SQL engine such as Trino, Databricks or ClickHouse brokering access to these tables.

The lakehouse might be organised into layers of tables where we have raw, intermediate, processed tables.

The concept of a lakehouse doesn’t really work if you don’t have transactions and updates, which these table formats enable.

I always say lakehouse is a stupid name but an amazing concept, and I’m sure the data industry will go this way.
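On the transactions-and-updates point, a hedged sketch with the deltalake Python package (paths and data invented): each write or delete is an atomic commit to the table's log, and earlier versions stay queryable.

    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    # Every write is an atomic commit appended to the table's transaction log.
    write_deltalake("/tmp/orders", pd.DataFrame({"id": [1, 2]}))              # version 0
    write_deltalake("/tmp/orders", pd.DataFrame({"id": [3]}), mode="append")  # version 1

    dt = DeltaTable("/tmp/orders")
    dt.delete("id = 2")   # a delete is just another commit (recent deltalake versions)

    # Time travel: read the table as of an earlier version.
    print(DeltaTable("/tmp/orders", version=0).to_pandas())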


Does lakehouse have the same meaning as datalake?

I ask because, if I didn't know either word, the one would mean, to me, "tiny storage next to a big body of data" and the other would mean "a big body of data".


A data lake can be thought of as a file system. Imagine 100 CSVs in folders, usually stored on S3.

A data lakehouse involves adding things like the ability to query via SQL, the ability to update/insert/delete, and transactions.

Where before people needed warehouses for BI and lakes for data science, they can now have only one approach.

It’s likely to be a big trend as data moves to this format and arrangement, and the DBMS vendors like Snowflake have to play nicely with it.


Data warehouse, data lake, data lakehouse. The data world has some terrible terminology.


You can't resell someone a file system if you don't rename it to a "data lake". You can't resell someone an indexed file system from the 80's that can be queried with SQL unless you rename it a "data lakehouse".


Wait till you dive into ETL vs ELT - or even better, if you've been doing ETL since long before "ELT" was "a thing", even though everyone actually did ETL in an ELT fashion...

It's not only the names that suck, but something like 98% of the tools too.


Disclaimer: I work at Snowflake. Two quick points to mention.

1. Snowflake has always used blob stores + file data + metadata. Architecturally, it’s actually always been very lakehouse-y.

2. Parquet and Iceberg should be equivalent in performance and features. It’s more than playing nicely - it’s more of a choose-your-own-adventure where all things are equal.


> Where before people needed warehouses for BI and lakes for data science, they can now have only one approach.

This is all very interesting, and thank you for taking the time to explain. Any good starting points for someone who would like to know more?


Databricks popularised the concept and explain it very well - https://youtu.be/g11y-kJHr3I?si=j8FAkFsIjScHv24f

It’s a technology independent pattern though.



"(Data) lakehouse" is an amalgamation of data warehouses and data lakes. It's meant to enable querying and all the support (transaction, etc) of traditional data warehouses on a data lake (unstructured data lying on cheap storage).


> "(Data) lakehouse" is an amalgamation of data warehouses and data lakes. It's meant to enable querying and all the support (transaction, etc) of traditional data warehouses on a data lake (unstructured data lying on cheap storage).

Thank you for that. Do you have any suggestions on where one would start if they wanted to get a better idea and/or some experience using lakehouses?


That's what it originally meant, at least in my experience. It was when warehouses got access to commodity storage through virtualization options (Hey! I can read S3 from Redshift and it looks like a Redshift table) - similar to Postgres foreign data wrappers or PolyBase in SQL Server.

Databricks (with Delta as the underpinning) seems to have led the charge on the lakehouse meaning: your data lake + file formats/helpers + compute == data lake + data warehouse == lakehouse.

The latter seems to be the prevailing definition today, with the former aging in place.


Is this the new SQL "killer" ?


No! Haven't you heard? The new SQL "killer" is simply tying LLMs into your data and giving it commands in the common business parlance of your choice! kinda. sorta.



