One reason is clearly the fast pace at which NVIDIA is evolving the hardware. I would consider CUDA a very well documented platform in general. What it lacks is low-level tutorials, but this is where posts like this one can be a good resource.
We saw different results from pipelining with the attention kernel vs. the MLP kernel (since the MLP's W1 has to project the attention output into a much higher dimension, the arithmetic intensity shifts toward compute-bound characteristics).
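To make that shift concrete, here's a rough back-of-the-envelope sketch (all dimensions are made up for illustration, not taken from any post): the MLP up-projection does far more FLOPs per byte moved than a single per-head attention matmul with a small head dimension.

```python
# Rough arithmetic intensity (FLOPs per byte) of an (M, K) @ (K, N) matmul in fp16.
# All dimensions below are hypothetical, just to illustrate the shift described above.
def arithmetic_intensity(M, K, N, bytes_per_elem=2):
    flops = 2 * M * K * N                                    # 2 FLOPs (mul + add) per inner-product step
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A, read B, write C
    return flops / bytes_moved

S, d, d_head = 4096, 4096, 128   # sequence length, model dim, head dim (made up)
print(arithmetic_intensity(S, d, 4 * d))      # MLP W1 up-projection (d -> 4d): very high FLOPs/byte
print(arithmetic_intensity(S, d_head, S))     # per-head Q @ K^T: much lower FLOPs/byte
```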
Hi, thanks for the feedback! That's a good point. I did compare against torch, but at a high enough sequence length (~1024) the torch version starts to OOM because it has to materialize the full S×S score matrix in global memory. At small sequence lengths, torch does win, mostly thanks to its optimized cuBLAS matmuls.
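For context, the baseline I'm referring to is a naive attention along these lines (a sketch, not the exact benchmark code): the softmax(QKᵀ) step materializes a (batch, heads, S, S) score tensor in global memory, which is what grows quadratically with sequence length.

```python
import torch

# Naive attention baseline (sketch): the full S x S score matrix lives in global memory,
# so its footprint is batch * heads * S * S elements, i.e. quadratic in sequence length S.
def naive_attention(q, k, v):
    # q, k, v: (batch, heads, S, head_dim)
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale        # (batch, heads, S, S)
    probs = torch.softmax(scores, dim=-1)             # another (batch, heads, S, S) tensor
    return probs @ v

q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)                        # fine here; push S higher and memory explodes
```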
I’ve spent the last few weeks deconstructing FlashAttention. While the original paper is brilliant, I found that just reading it didn't give me a "gut feeling" for why certain engineering choices were made (for example, the transition from v1 to v2).
I decided to rebuild it from scratch using Triton. This post is a chronicle of that journey—moving beyond the high-level algorithm and into the "performance archaeology" of the GPU:
- Profiling with Nsight Compute to find the real bottlenecks.
- Looking at the generated PTX and SASS code.
- Debugging shared memory bank conflicts and MIO bottlenecks.
- Iterating through the logic to see why tiling and online softmax are hardware-necessitated, not just mathematical tricks (a quick sketch of the online-softmax idea follows right after this list).
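For readers who haven't met it before, here is a minimal sketch of the online-softmax idea in plain NumPy (not the Triton kernel from the post): a running max and running normalizer let you consume one tile at a time instead of ever holding the full row.

```python
import numpy as np

# Online softmax over a row processed tile by tile (sketch).
# m is the running max, l the running sum of exp(x - m); previous contributions are
# rescaled whenever the max changes, so the full row is never needed at once.
def online_softmax(row, tile_size):
    m, l, tiles = -np.inf, 0.0, []
    for start in range(0, len(row), tile_size):
        x = row[start:start + tile_size]
        m_new = max(m, x.max())
        l = l * np.exp(m - m_new) + np.exp(x - m_new).sum()   # rescale previous sum
        m = m_new
        tiles.append(x)
    # Final pass here only to produce the result for checking; the fused attention kernel
    # instead rescales a running output accumulator and never revisits earlier tiles.
    return np.concatenate([np.exp(x - m) for x in tiles]) / l

row = np.random.randn(1024).astype(np.float32)
ref = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
assert np.allclose(online_softmax(row, tile_size=128), ref, atol=1e-6)
```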
I’ve tried to keep it in the spirit of Simon Boehm’s matmul deep dive. Would love to hear from any GPU engineers on whether my interpretations of the SASS/bank conflict behavior match what you've seen in production.
I hope you finish this one though. It starts strong (I particularly liked how you looked into ncu and showed what each recommendation means; this is very helpful for beginners), but it ends on a note that isn't satisfying. You didn't explore tensor cores (particularly fp16 / tf32 / bf16), swizzling (which is the right way to solve the K-transpose issue, especially given that Triton itself provides a few ways to do this), or async loading (pipelining).
Do you have trouble accessing an H100 or similar chips? Wondering if there is anything that could help you finish this write-up.
Hi, thanks a lot for the feedback! I'm glad you enjoyed the profiling sections.
You've hit the nail on the head regarding the missing pieces. I actually hit a bit of a wall with my current hardware; using an RTX 2070 made it difficult to meaningfully explore the async loading (TMA) and pipelining optimizations that were used in FA3 and FA4. I also felt the write-up was already pushing the limits of a single post's length, so I decided to "ship it" as a first part.
I would love to dive into TMA for Part 2. Getting my hands on an H100 (or even an A100) would be highly appreciated! If you have any leads on hardware access, please let me know; I'd be glad to finish the story!
After battling Python dependencies, slow processing, and deployment headaches with tools like unstructured, I finally snapped—and built Ferrules, a blazing-fast document parser in Rust.
Why Ferrules?
- Speed – Native PDF parsing (pdfium), hardware-accelerated ML inference
- No Python – Single binary, zero dependency deployment, built-in tracing
- Smart Processing – Layout detection, OCR, intelligent element merging
- Flexible Outputs – JSON, HTML, Markdown (ideal for RAG pipelines)
Tech Highlights
- Runs layout detection on Apple Neural Engine/GPU
- Apple Vision API for high-quality OCR (macOS)
- Multithreaded, CLI + HTTP API server
- Debug mode with visual parsing output
If you're tired of Python-based parsers in production, check it out!
(P.S. Named after those metal rings on pencils, because it keeps your documents structured.)
Just reimplemented the infamous "'Clean' Code, Horrible Performance" video in Rust!
Casey Muratori showed how trying to be cute with your code and introducing unnecessary indirection can hurt performance. The enum version is 1.6x faster than the one using traits + dynamic dispatch, and the data-oriented version is 2.27x faster than the dynamic-dispatch one.
Hope you'll enjoy this short article and I'd be happy to get comments on the implementation and the subject in general!
I started working on a distributed task queue library a few months back. The library is available as a Python package on PyPI (daskqueue); just install it and start using it.
For all its greatness, Dask implements a central scheduler (basically a simple Tornado event loop) that is involved in every decision, which can sometimes create a central bottleneck. This is a pretty serious limitation when trying to use Dask in high-throughput situations.
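To illustrate what I mean (a sketch against the standard dask.distributed API, not daskqueue): every one of these tiny tasks is tracked and routed by the single scheduler process, so at high submission rates the per-task scheduling overhead, not the work itself, becomes the limit.

```python
from dask.distributed import Client

def tiny_task(i):
    return i * i

if __name__ == "__main__":
    # Local cluster: one scheduler process + worker processes.
    client = Client()
    # Each submit() is an individual task the central scheduler must track and route;
    # with very many tiny tasks, that per-task bookkeeping dominates the actual work.
    futures = [client.submit(tiny_task, i) for i in range(100_000)]
    results = client.gather(futures)
    print(sum(results))
```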
Daskqueue is a small Python library built on top of Dask and Dask Distributed that implements a very lightweight distributed task queue. Daskqueue also implements persistent queues for holding tasks on disk and surviving Dask cluster restarts.