Good point. Though, if we are talking about something coming down a network pipe...

hermitcrab · on March 7, 2023

If you start threads at positions 0/25/50/75 inside a CSV, how do you know if the characters at 25, 50 & 75 are inside or outside quoted data values? You could start at a carriage return, but that could also be inside quoting.
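To make the ambiguity concrete, here is a minimal sketch using Python's standard `csv` module. The sample data is made up for illustration. Splitting on line breaks reports one more "row" than the parser does, because one of the line breaks sits inside a quoted value.

```python
import csv
import io

# Made-up sample: the second record contains a line break inside quotes.
data = 'id,note\n1,"line one\nline two"\n2,plain\n'

# Naive approach: treat every line break as a record boundary.
# A thread that seeks to an arbitrary line break could land on "line two".
naive_rows = data.splitlines()

# A CSV parser tracks quote state from the start of the input,
# so it knows the embedded line break is part of the field.
parsed_rows = list(csv.reader(io.StringIO(data)))

print(len(naive_rows))   # 4
print(len(parsed_rows))  # 3
```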

mattewong · on March 9, 2023

Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format. But what I am saying is that, even if you could do that, the performance difference would be negligible compared to using a single thread that parses the CSV into rows and passes chunks of rows to 4 separate threads.

In fact, the single-thread parser approach (with multi-thread processing) might even be better, because it is not trying to access your hard disk in 4 places at the same time. Then again, if your threads are doing some non-trivial task with each row, then IO will not be your bottleneck either way.

This obviously starts to break down if you aren't reading the whole file and want to start some meaningful portion of the way in without ever processing what comes before it. The point is that the benefit of being able to implicitly shard a file, without saving it as separate files, might not be as impactful in practice as in theory.
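The single-parser approach described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `chunks`, `process_chunk`, and `parse_and_dispatch` are hypothetical names, the per-row work is a placeholder, and in CPython threads only help when that work releases the GIL (I/O, native code); otherwise a process pool would be the closer analogue.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(reader, size):
    """Yield lists of up to `size` parsed rows from a csv.reader."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk


def process_chunk(rows):
    # Placeholder for the real per-row work (hypothetical):
    # here we just count the fields in the chunk.
    return sum(len(row) for row in rows)


def parse_and_dispatch(fileobj, workers=4, chunk_size=1000):
    # One thread does all the parsing, so quote state is always known.
    reader = csv.reader(fileobj)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() consumes the chunk generator in the calling thread
        # and hands each chunk of rows to a worker.
        return list(pool.map(process_chunk, chunks(reader, chunk_size)))


sample = io.StringIO('a,b\n1,"x\ny"\n2,z\n')
totals = parse_and_dispatch(sample, chunk_size=2)
print(totals)  # [4, 2]
```

Note that the file is read sequentially in one place, which matches the point about not hitting the disk at 4 offsets at once.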

hermitcrab · on March 9, 2023

>Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format.

My mistake, I misread your answer!