
You have the right understanding.

We've found that maximizing chunk size gives the best retrieval performance and is easier to maintain since you don't have to customize chunking strategy per document type.

The upper limit for chunk size is set by your embedding model. After a certain size, encoding becomes too lossy and performance degrades.

There is a downside: blindly splitting into large chunks may cut a sentence or word off mid-way. We handle this by splitting at delimiters and adding overlap to cover abbreviations and other edge cases.
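
Roughly, the idea looks like this (a toy Python sketch of delimiter splitting with overlap, not memchunk's actual implementation):

```
import re

def chunk_text(text, max_chars=4096, overlap_chars=200):
    # Split after . ! ? or newline, keeping each delimiter attached to its piece.
    pieces = re.split(r"(?<=[.!?\n])", text)
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = current[-overlap_chars:]  # carry a small overlap into the next chunk
        current += piece
    if current:
        chunks.append(current)
    return chunks
```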


A big chunk size with overlap solves this. Chunks don't have to be "perfectly" split in order to work well.


True, but you don’t need 150GB/s delimiter scanning in that case either.


As the other comment said, it's a matter of good-enough chunk quality. We focus on big chunks (the largest we can make without hurting embedding quality) as fast as possible. In our experience, retrieval accuracy is mostly driven by embedding quality, so perfect splits don't move the needle much.

But as the number of files to ingest grows, chunking speed does become a bottleneck. We want faster everything (chunking, embedding, retrieval) but chunking was the first piece we tackled. Memchunk is the fastest we could build.


Which language are you thinking of? Ideally, how would you identify split points in this language?

I suppose we've only tested this with languages that do have delimiters - Hindi, English, Spanish, and French

There are two ways to control the splitting point. First is through delimiters, and the second is by setting chunk size. If you're parsing a language where chunks can't be described by either of those params, then I suppose memchunk wouldn't work. I'd be curious to see what does work though!
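
If a language has no usable delimiters, chunk size is the only lever left. A toy sketch of what that fallback looks like (plain Python, not memchunk's API):

```
def fixed_size_chunks(text, chunk_size=4096, overlap=200):
    # No delimiters to split on: cut at fixed offsets, overlapping across boundaries.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```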


There are certainly cases of Greek/Latin without any punctuation at all, typically in a historical context. Chinese & Japanese historically did not have any punctuation whatsoever.


Do the delimiters have to be single bytes? e.g. Japanese full stop (IDEOGRAPHIC FULL STOP) is 3 bytes in UTF-8.


No, delimiters can be multiple bytes. They have to be passed as a pattern.

// With a multi-byte pattern (the ideographic full stop 。 is 3 bytes in UTF-8)
let pattern = "。".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(pattern).prefix().collect();


> Chunking is generally a one-time process where users aren't latency sensitive.

This is not necessarily true. For example, in our use case we are constantly monitoring websites, blogs, and other sources for changes. When a new page is added, we need to chunk and embed it fast so it's searchable immediately. Chunking speed matters for us.

When you're processing changes constantly, chunking is in the hot path. I think as LLMs get used more in real time workflows, every part of the stack will start facing latency pressure.
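
In outline, the hot path is something like this (hypothetical sketch; `fetch_changed_pages`, `chunk`, `embed`, and `index` are placeholders for whatever crawler, chunker, embedding model, and vector store you use):

```
import time

def watch_loop(fetch_changed_pages, chunk, embed, index, poll_seconds=60):
    # A changed page must be chunked and embedded before it becomes searchable,
    # so chunking latency sits directly in the hot path.
    while True:
        for url, text in fetch_changed_pages():            # crawl / diff step
            chunks = chunk(text)                           # chunking step
            vectors = embed(chunks)                        # embedding step
            index.upsert(url, list(zip(chunks, vectors)))  # make it searchable
        time.sleep(poll_seconds)
```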


How much compute do your systems expend on chunking vs. the embedding itself?


Memchunk is already in Chonkie as the `FastChunker`

To install: pip install chonkie[fast]

```
from chonkie import FastChunker

chunker = FastChunker(chunk_size=4096)
chunks = chunker(huge_document)
```


We're the maintainers of Chonkie, a chunking library for RAG pipelines.

Recently, we've been using Chonkie to build deep research agents that watch topics for new developments and automatically update their reports. This requires chunking a large amount of data constantly.

While building this, we noticed Chonkie felt slow. We started wondering: what's the theoretical limit here? How fast can text chunking actually get if we throw out all the abstractions and go straight to the metal?

This post is about that rabbit hole and how it led us to build memchunk - the fastest chunking library, capable of chunking text at 1TB/s.

Blog: https://minha.sh/posts/so,-you-want-to-chunk-really-fast

GitHub: https://github.com/chonkie-inc/memchunk

Happy to answer any questions!


English word, clause, sentence, and paragraph boundaries do not always line up with delimiter characters.

How does the software handle these:

Mrs. Blue went to the sea shore with Mr. Black.

"What's for dinner?" Mrs. Blue asked.


How did the OTP case work? Where was the OTP received and how did the browser know?

Or did it bypass it entirely with cooperation from the website?


We set up an agent mailbox with Agentmail (https://agentmail.to/). Whoever owns the account (likely the developer) sets up a forwarding rule to this mailbox.

When our agent signs in, we input the forwarded otp code to get access.


We're working on Mongo integrations!


Awesome! Let me know if I can be helpful in any way in connecting you with Mongo resources.


We want to be the platform that connects documents to AI for all applications. Consequently, we want to cover all use cases, including the ones you mentioned :)


Yes :) Code chunker is fantastic for SQL

