
You have the right understanding.

We've found that maximizing chunk size gives the best retrieval performance and is easier to maintain since you don't have to customize chunking strategy per document type.

The upper limit for chunk size is set by your embedding model. After a certain size, encoding becomes too lossy and performance degrades.

There is a downside: blindly splitting into large chunks may cut a sentence or word off mid-way. We handle this by splitting at delimiters and adding overlap to cover abbreviations and other edge cases.
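
Roughly, the idea looks like this (a toy Python sketch of delimiter splitting with overlap, not memchunk's actual implementation):

```
import re

def chunk_text(text, max_chars=4096, overlap_chars=200):
    # Split after . ! ? or newline, keeping each delimiter attached to its piece.
    pieces = re.split(r"(?<=[.!?\n])", text)
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = current[-overlap_chars:]  # carry a small overlap into the next chunk
        current += piece
    if current:
        chunks.append(current)
    return chunks
```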


A big chunk size with overlap solves this. Chunks don't have to be "perfectly" split in order to work well.


True, but you don’t need 150GB/s delimiter scanning in that case either.


As the other comment said, it's a matter of good-enough chunk quality. We focus on big chunks (the largest we can make without hurting embedding quality) as fast as possible. In our experience, retrieval accuracy is mostly driven by embedding quality, so perfect splits don't move the needle much.

But as the number of files to ingest grows, chunking speed does become a bottleneck. We want faster everything (chunking, embedding, retrieval) but chunking was the first piece we tackled. Memchunk is the fastest we could build.


Which language are you thinking of? Ideally, how would you identify split points in this language?

I suppose we've only tested this with languages that do have delimiters - Hindi, English, Spanish, and French

There are two ways to control the splitting point. First is through delimiters, and the second is by setting chunk size. If you're parsing a language where chunks can't be described by either of those params, then I suppose memchunk wouldn't work. I'd be curious to see what does work though!
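
If a language has no usable delimiters, chunk size is the only lever left. A toy sketch of what that fallback looks like (plain Python, not memchunk's API):

```
def fixed_size_chunks(text, chunk_size=4096, overlap=200):
    # No delimiters to split on: cut at fixed offsets, overlapping across boundaries.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```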


There are certainly cases of Greek/Latin without any punctuation at all, typically in a historical context. Chinese & Japanese historically did not have any punctuation whatsoever.


Do the delimiters have to be single bytes? e.g. Japanese full stop (IDEOGRAPHIC FULL STOP) is 3 bytes in UTF-8.


No, delimiters can be multiple bytes. They have to be passed as a pattern.

// With a multi-byte pattern (the ideographic full stop 。 is 3 bytes in UTF-8)
let pattern = "。".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(pattern).prefix().collect();


> Chunking is generally a one-time process where users aren't latency sensitive.

This is not necessarily true. For example, in our use case we are constantly monitoring websites, blogs, and other sources for changes. When a new page is added, we need to chunk and embed it fast so it's searchable immediately. Chunking speed matters for us.

When you're processing changes constantly, chunking is in the hot path. I think as LLMs get used more in real time workflows, every part of the stack will start facing latency pressure.
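
In outline, the hot path is something like this (hypothetical sketch; `fetch_changed_pages`, `chunk`, `embed`, and `index` are placeholders for whatever crawler, chunker, embedding model, and vector store you use):

```
import time

def watch_loop(fetch_changed_pages, chunk, embed, index, poll_seconds=60):
    # A changed page must be chunked and embedded before it becomes searchable,
    # so chunking latency sits directly in the hot path.
    while True:
        for url, text in fetch_changed_pages():            # crawl / diff step
            chunks = chunk(text)                           # chunking step
            vectors = embed(chunks)                        # embedding step
            index.upsert(url, list(zip(chunks, vectors)))  # make it searchable
        time.sleep(poll_seconds)
```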


How much compute do your systems expend on chunking vs. the embedding itself?


Memchunk is already in Chonkie as the `FastChunker`

To install: pip install chonkie[fast]

```
from chonkie import FastChunker

chunker = FastChunker(chunk_size=4096)
chunks = chunker(huge_document)
```


We're the maintainers of Chonkie, a chunking library for RAG pipelines.

Recently, we've been using Chonkie to build deep research agents that watch topics for new developments and automatically update their reports. This requires chunking a large amount of data constantly.

While building this, we noticed Chonkie felt slow. We started wondering: what's the theoretical limit here? How fast can text chunking actually get if we throw out all the abstractions and go straight to the metal?

This post is about that rabbit hole and how it led us to build memchunk - the fastest chunking library, capable of chunking text at 1TB/s.

Blog: https://minha.sh/posts/so,-you-want-to-chunk-really-fast

GitHub: https://github.com/chonkie-inc/memchunk

Happy to answer any questions!


English word, clause, sentence, and paragraph boundaries do not always line up with delimiter characters.

How does the software handle these:

Mrs. Blue went to the sea shore with Mr. Black.

"What's for dinner?" Mrs. Blue asked.


How did the OTP case work? Where was the OTP received and how did the browser know?

Or did it bypass it entirely with cooperation from the website?


We set up an agent mailbox with Agentmail (https://agentmail.to/). Whoever owns the account (likely the developer) sets up a forwarding rule to this mailbox.

When our agent signs in, we input the forwarded otp code to get access.


We're working on Mongo integrations!


Awesome! Let me know if I can be helpful in any way in connecting you with Mongo resources.


We want to be the platform that connects documents to AI for all applications. Consequently, we want to cover all use cases, including the ones you mentioned :)


Yes :) Code chunker is fantastic for SQL

