I’m working on a SIMD parser for a YAML-like language and ran into what feels like a good SIMD coding challenge.
The task is intentionally minimal:
detect newlines (\n)
for each newline, identify the first non-space character that follows
Scanning for newlines alone is trivial and runs at memory bandwidth.
As soon as I add “find the first non-space after each newline,” throughput drops sharply.
There’s no branching, no backtracking, no variable-length tokens. In theory this should still be a linear, bandwidth-bound pass, but adding this second condition introduces a dependency I don’t know how to express efficiently in SIMD.
I’m interested in algorithmic / data-parallel approaches to this problem — not micro-optimizations.
If you treat this as a SIMD coding challenge, what approach would you try?
Another formulation:
# Bit-Parallel Challenge: O(1) "First Set Bit After Each Set Bit"
Given two 64-bit masks `A` and `B`, count positions where `B[i]=1` and the nearest set bit in `A|B` before position `i` is in `A`.
Equivalently: for each segment between consecutive bits in `A`, does `B` have any bit set?
*Example:* `A=0b10010000`, `B=0b01100110` → answer is 2 (positions 1 and 5)
Newline scan alone: 90% memory bandwidth. Adding this drops to 50%.
Is there an O(1) bit-parallel solution using x86 BMI/AVX2, or is O(popcount(A)) the lower bound?
1. Compute M1 = ~A & ~B, which is the mask of all spaces that are not newlines 2. Compute M2 = M1 + (A << 1) + 1, which is the first non-space or newline after each newline and then additional bits behind each such newline. 3. Compute M3 = M2 & ~M1, which removes the junk bits, leaving only the first match in each section
Here is what it looks like:
Note that this code treats newlines as non-spaces, meaning if a line comprises only spaces, the terminating NL character is returned. You can have it treat newlines as spaces (meaning a line of all spaces is not a match) by computing M4 = M3 & ~A.reply