Hacker News | bede's comments

For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further


Amazing, thank you!

I will take a look as soon as I get a chance. From a quick look at the BAM format, the tokenization portion should be easy, which means I can focus on the compression side, which is more interesting.
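
For reference, a minimal sketch (untested, straight from the SAM/BAM spec) of what that first tokenization step might look like for the BAM header:

    # BGZF is valid multi-member gzip, so Python's gzip module can decompress it
    # without an htslib binding.
    import gzip
    import struct

    def read_bam_header(path):
        with gzip.open(path, "rb") as fh:
            assert fh.read(4) == b"BAM\x01"              # magic
            (l_text,) = struct.unpack("<i", fh.read(4))
            sam_header = fh.read(l_text).rstrip(b"\x00").decode()   # plain-text SAM header
            (n_ref,) = struct.unpack("<i", fh.read(4))
            refs = []
            for _ in range(n_ref):
                (l_name,) = struct.unpack("<i", fh.read(4))
                name = fh.read(l_name)[:-1].decode()     # NUL-terminated
                (l_ref,) = struct.unpack("<i", fh.read(4))
                refs.append((name, l_ref))
            return sam_header, refs                      # alignment records follow from here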


Another format that might be worth looking at in the bioinformatics world is HDF5. It's sort of a generic container format, often used for storing multiple related large tables. It has some built-in compression (gzip, IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the HDF5 format with the self-describing decompression routines of OpenZL.
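
For instance, a minimal h5py sketch (made-up dataset; built-in gzip filter shown here, with packages like hdf5plugin registering zstd and other codecs as plugins):

    import h5py
    import numpy as np

    counts = np.random.poisson(5, size=(100_000, 64))    # stand-in for a real table

    with h5py.File("experiment.h5", "w") as f:
        f.create_dataset(
            "counts",
            data=counts,
            chunks=True,              # chunked layout is required for filters
            compression="gzip",       # built-in filter; plugins register other codecs
            compression_opts=4,
        )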



Author of [0] here. Congratulations and well done for resisting. Eager to try it!

Edit: Have you any specific advice for training a FASTA compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?


A Zstd maintainer clarified this: https://news.ycombinator.com/item?id=45251544

> Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses


Fascinating, thank you.


Thanks for reminding me to benchmark this!


I've only tested this when writing my own parser, where I could skip the record-end checks, so I don't know whether it improves performance in an existing parser. Excited to see what you find!


Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.
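
e.g. with pysam (hypothetical file and contig names; IIRC it builds the .fai index on first open if one is missing):

    import pysam

    fa = pysam.FastaFile("assembly.fasta")     # uses/creates assembly.fasta.fai
    print(fa.references[:5])                   # contig names
    print(fa.fetch("contig_1", 0, 80))         # random access without re-parsing the text
    fa.close()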

> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I even confused myself about this while writing :-)


Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
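
A quick way to convince yourself of the compatibility part (a toy sketch using Biopython's bgzf module and made-up content; in practice you'd just run the bgzip CLI over the FASTQ):

    from Bio import bgzf
    import gzip

    # Write a BGZF file: a series of small gzip members with recorded block
    # offsets, which is what makes parallel and random access possible.
    out = bgzf.BgzfWriter("reads.fastq.gz", "wb")
    out.write("@read1\nACGTACGT\n+\nIIIIIIII\n")
    out.close()

    # Any ordinary gzip reader can still decompress it.
    with gzip.open("reads.fastq.gz", "rt") as fh:
        print(fh.read(), end="")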


The BAM format is widely used, but assemblies still tend to be generated and exchanged as FASTA text. BAM is quite a big spec, and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (xkcd competing standards, etc.)

e.g. https://github.com/ArcInstitute/binseq


Thank you for clarifying this – yes the non-semantic nature of these particular line breaks is a key detail I omitted.


It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking of PNG: the PNG compression is "just" zlib, which wouldn't do a great job on raw RGBA, but there's a per-row filter step first, so e.g. we can store just the difference from the row above. Now big areas of block colour or vertical stripes are mostly zeros, and those compress well.
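
To make the idea concrete, a toy version of the Up filter:

    # Toy version of PNG's "Up" filter: each row is stored as the byte-wise
    # difference from the row above (mod 256), so flat areas and vertical
    # stripes turn into runs of zeros before the zlib stage sees them.
    def up_filter(rows):
        prev = bytes(len(rows[0]))
        filtered = []
        for row in rows:
            filtered.append(bytes((a - b) % 256 for a, b in zip(row, prev)))
            prev = row
        return filtered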

Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.

I can imagine a filter layer which says: this textual data comes in 78-character blocks punctuated with \n, so we strip those out and then compress; in the opposite direction we decompress and then put the newlines back.

For FASTA we can just unconditionally choose to remove the extra newlines but that may not be true for most inputs, so the filters would help there.
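
A toy version of that filter for FASTA, assuming uniform fixed-width wrapping (60 columns here) so the transform is exactly reversible:

    import zlib

    WRAP = 60

    def strip_wrap(text):
        out = []
        for record in text.split(">")[1:]:               # naive record split
            header, _, seq = record.partition("\n")
            out.append(">" + header + "\n" + seq.replace("\n", "") + "\n")
        return "".join(out)

    def restore_wrap(text, width=WRAP):
        out = []
        for record in text.split(">")[1:]:
            header, _, seq = record.partition("\n")
            seq = seq.rstrip("\n")
            lines = [seq[i:i + width] for i in range(0, len(seq), width)]
            out.append(">" + header + "\n" + "\n".join(lines) + "\n")
        return "".join(out)

    data = open("seqs.fasta").read()                      # hypothetical input
    packed = zlib.compress(strip_wrap(data).encode())     # filter, then compress
    roundtrip = restore_wrap(zlib.decompress(packed).decode())
    assert roundtrip == data                              # holds only for uniform wrapping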


For one approach to compressing FASTQ (which is a series of records built from three distinct line types), we broke the lines out into their own streams and then compressed each stream independently. That way the coders for the header line, sequence line, and quality line could each learn a model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case we simply combined "blocks" of streams together with a record separator).
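
In sketch form it looked roughly like this (hypothetical filename, and using the python-zstandard bindings here rather than the coder we actually used):

    import zstandard as zstd

    headers, seqs, quals = [], [], []
    with open("reads.fastq") as fh:
        for i, line in enumerate(fh):
            kind = i % 4
            if kind == 0:
                headers.append(line)
            elif kind == 1:
                seqs.append(line)
            elif kind == 3:
                quals.append(line)    # kind == 2 is the '+' separator, no information there

    cctx = zstd.ZstdCompressor(level=19)
    # One compressed blob per stream, so each one models a single kind of content.
    blobs = [cctx.compress("".join(stream).encode()) for stream in (headers, seqs, quals)]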

I'm still pretty amazed that periodic newlines hurt compression ratios so much, given that the compressor can use both Huffman coding and a lookback dictionary.

The best rule in sequence data storage is to store as little of it as possible.


Exactly. The line breaks interrupt the runs of otherwise identical bytes in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long-range matching come out different even though the underlying sequence is the same.
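
A toy way to see the phase effect (synthetic data, exact sizes will vary): two records sharing the same sequence, with the second offset by a few bases so the 60-column line breaks land at different positions within the shared part:

    import random
    import zlib

    random.seed(0)
    seq = "".join(random.choice("ACGT") for _ in range(20_000))

    def wrap(s, width=60):
        return "\n".join(s[i:i + width] for i in range(0, len(s), width))

    rec_a, rec_b = seq, "TTTTTTT" + seq        # 7-base offset: out of phase once wrapped

    wrapped   = (">a\n" + wrap(rec_a) + "\n>b\n" + wrap(rec_b) + "\n").encode()
    unwrapped = (">a\n" + rec_a + "\n>b\n" + rec_b + "\n").encode()

    # In the wrapped case, matches against the first copy can only run ~60 bytes
    # before hitting a mismatched newline, so it should compress worse.
    print(len(zlib.compress(wrapped, 9)), len(zlib.compress(unwrapped, 9)))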


Yes I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.


From what I've read (although I haven't tested it and I can't find my source from when I read it), dictionaries aren't very useful when the dataset is big, and just by using '--long' you can get most of that improvement.

Have any of you tested it?
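
For anyone who wants to try it, something like this should work as a harness (hypothetical paths; parameter names are from memory of the python-zstandard docs, so treat it as a sketch; the CLI equivalents are --train, -D and --long):

    import glob
    import zstandard as zstd

    samples = [open(p, "rb").read() for p in glob.glob("samples/*")]   # training needs many small samples
    data = open("big.fasta", "rb").read()

    plain = zstd.ZstdCompressor(level=19).compress(data)

    d = zstd.train_dictionary(112_640, samples)                        # ~110 KiB dictionary
    with_dict = zstd.ZstdCompressor(level=19, dict_data=d).compress(data)

    params = zstd.ZstdCompressionParameters.from_level(
        19, window_log=27, enable_ldm=True                             # roughly --long=27
    )
    with_long = zstd.ZstdCompressor(compression_params=params).compress(data)

    print(len(plain), len(with_dict), len(with_long))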


I don't think the size of the content matters; it's all about the patterns (and their repetitiveness) within it. FASTA is a great target, if I understand the format correctly.


Size of content does matter, because the start of your content effectively builds up its own dictionary. Once a dictionary has been built up from the content you are compressing, the supplied dictionary is no longer relevant, so it effectively only helps at the start of the content. Exactly what "start" means depends on your data and the algorithm, but as the size of the data you compress grows, the relative benefit of the dictionary drops.

Another way to look at it is that the total bytes saved by the dictionary will plateau. Say your dictionary saves 50% of the first MB, 10% of the next MB and 5% of the rest of the first 10MB: that matters a lot if you are compressing 2MB of data (30% savings!) but not so much if you are compressing 1GB (<1%).


I tried building a zstd dictionary for something (compressing many instances of the same Mono (.NET) binary-serialized class, mostly identical), and in that case it provided no real advantage. Honestly, I didn't dig into it too much, but I'll give --long a try shortly.

PS: what the author implicitly suggests cannot be replaced with zstd tweaks. It'll be interesting to look at the file in ImHex, especially if I can find an existing pattern file.

