dsq references a benchmark done by q (https://github.com/harelba/q/blob/master/t...

eatonphil · on Jan 11, 2022

Yeah frankly the q benchmark isn't the best even though dsq compares not terribly in it. It isn't well documented and exercises a very limited amount of functionality and isn't very rigorous from what I can see. That said, the caching q does is likely very solid (and not something dsq does).

The biggest risk (both in terms of SQL compliance and performance) I think with octosql (and cube2222 is here somewhere to disagree with me if I'm wrong) is that they have their own entire SQL engine whereas textql, q and dsq use SQLite. But q is also in Python whereas textql, octosql, and dsq are in Go.

In the next few weeks I'll be posting some benchmarks that I hope are a little fairer (or at least well-documented and reproducible). Though of course it would be appropriate to have independent benchmarks too since I now have a dog in the fight.

On a tangent, once the go-duckdb binding [0] matures I'd love to offer duckdb as an alternative engine flag within dsq (and DataStation). Would be neat to see.

[0] https://github.com/marcboeker/go-duckdb

cube2222 · on Jan 11, 2022

Hey there! OctoSQL author here.

I've released OctoSQL v0.4.0 recently[0] which is a grand rewrite and is 2-3 orders of magnitude faster than OctoSQL was before. It's also much more robust overall with static typing and the plugin system for easy extension. The q benchmarks are over a year old and haven't been updated to reflect that yet.

Take a look at the README[1] for more details.

My benchmarks should be reproducible, you can find the script in the benchmarks/ repo directory.

Btw if we're already talking @eatonphil I'd appreciate you updating your benchmarks to reflect these changes.

As far as the custom query engine goes - yes, there are both pros and cons to that. In the case of OctoSQL I'm building something that's much more dynamic - a full-blown dataflow engine - and can subscribe to multiple datasources to dynamically update queries as source data changes. This also means it can support streaming datasources. That is not possible with the other systems. It also means I don't have to load everything into a SQLite database before querying - I can optimize which columns I need to even read.

OctoSQL also let's you work with actual databases in your queries - like PostgreSQL or MySQL - and pushes predicates down to them, so it doesn't have to dump your whole database tables. That's useful if you need to do cross-db joins, or JSON-file-with-database joins.

As far as SQL compliance goes it gets hairy in the details - as usual. The overall dialect is based on MySQL as I'm using a parser based on vitess's one, but I don't support some syntax, and add original syntax too (type assertions, temporal extensions, object and type notation).

Stay tuned for a lot more supported datasources, as the plugin system lets me work on that much quicker.

[0]:https://github.com/cube2222/octosql/releases/tag/v0.4.0

[1]:https://github.com/cube2222/octosql#readme

eatonphil · on Jan 11, 2022

Neat! I'll give your script a shot.

em500 · on Jan 11, 2022

Another interesting comparison would be clickhouse-local:

https://altinity.com/blog/2019/6/11/clickhouse-local-the-pow...

eatonphil · on Jan 11, 2022

Awesome post, thanks for the link. I had no clue Clickhouse could do this. Shows off some more CLI tools in there too.

The big issue with ClickHouse is the incredibly non-standard SQL dialect and the random bugs that remain. It's an amazing project for analytics but you definitely have to be willing to hack around its SQL language (I say this as a massive fan of ClickHouse).

I wonder: does this mean I can embed ClickHouse in arbitrary software as a library? I'd be curious to provide that as an option in dsq.

em500 · on Jan 11, 2022

> I wonder: does this mean I can embed ClickHouse in arbitrary software as a library? I'd be curious to provide that as an option in dsq.

Not sure, but it's Apache licensed, so you likely can make it work if you want to. But realize that clickhouse(-local) is much heavier than sqlite / duckdb based solutions: the compiled binary is around 200mb iirc.

eatonphil · on Jan 11, 2022

Ah yeah that's pretty large.