I work in an ML ecosystem at the moment, and concurrency is a major problem in Python:
- Threads can't be used efficiently because of the GIL
- multiprocessing has to serialize everything through a single thread, often killing performance. (Unless you use shared-memory techniques, but that's less than ideal compared to threads)
- You can't use multiprocessing while inside a multiprocessing executor. This makes building things on top of frameworks/libs that use multiprocessing a nightmare... e.g. try to run a web server over something like Keras...
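A minimal sketch of that nesting failure (function names are made up; uses only the stdlib): `multiprocessing.Pool` workers are daemonic, and daemonic processes are not allowed to have children, so a pool opened inside a pool worker raises immediately:

```python
from multiprocessing import Pool

def inner(x):
    return x * 2

def outer(x):
    # This runs inside a daemonic Pool worker; starting another Pool
    # here triggers "daemonic processes are not allowed to have children".
    with Pool(2) as p:
        return p.map(inner, range(x))

if __name__ == "__main__":
    with Pool(2) as p:
        try:
            p.map(outer, [3])
        except Exception as e:
            print(type(e).__name__, e)
```

Frameworks that internally spin up worker pools leave your code running as one of these daemonic children, which is one way stacking your own pool on top can fail.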
Those are the top reasons I don't like Python, but if you have an appetite for more:
- The dependency ecosystem is a pain, between Python versions, package versions pinned or unpinned, requirements.txt, pipenv, poetry, conda... pick one and you're still sure to run into issues with other tools needing one system or another, or packages working a bit differently in conda, etc... (I use poetry, with conda or pyenv)
- The culture of "let's write code easily" is good to start with, but it becomes a problem as people, especially maybe in DS, don't go further than that... and you end up with bad practices all over the place, un-testable code (the test systems are also a pain to navigate), copy-and-pasted blobs, etc... Reading the code of some major libraries doesn't inspire confidence, especially compared to, say, Java, C++, or Go...
And one last note: I've seen way better emacs setups for Python, and better presentations. It's OK as it is, but I would not call it the Jimi Hendrix of Python like a comment said...
Could you give examples of where exactly in the ML process/lifecycle you're hitting these issues?
For example: "When training a [type] model with X characteristics, the GIL causes Y, which makes it impossible to do Z".
We're building our machine learning platform[0] to solve problems we have faced shipping ML products to enterprise, and are interested in your problems as well.
For example, we've faced the environment/dependencies/"runs on my machine" problems and have addressed these with Docker images. Our users can spin up a notebook server with near real-time collaboration to work with others, and no setup because the environment is there.
The same with training jobs: they can click a button and schedule a long-running notebook against a specific environment, to avoid "just yesterday I had X accuracy on my machine". The runs, models, parameters, and metrics are automatically tracked, because if we rely on a notebook author to do it, they might forget or have to context switch, and it's added cognitive load.
Some problems we faced were during deployment, too, where a "data scientist" writes a notebook to train a model and then we had to deploy that model by reading their notebook or digging into its dependencies. Now they can click a button and deploy whichever model they want. It really was hindering us, because they had to ask for help from someone else, who may have been working on something else.
I've been building a program that heavily uses multiprocessing for the past few months. It works quite well, but it did take me a little bit to figure out the best way to work with it.
> - Threads can't be used efficiently because of the GIL
Python's "threads" are actually fibers. Once you shift your thought process toward that, it's easy enough to work with them. Async is a better solution, though, because "threads" aren't smart about switching between themselves. Async makes concurrency smart.
But if you want to use real threads, multiprocessing's "processes" are actually system threads.
> - multiprocesses has to serialize everything in a single thread often killing performance. (Unless you use shared memory space techniques, but that's less than ideal compared to threads)
I'm not quite sure what you mean. Multiprocessing's processes have their own GIL and are "single-threaded", but you can still spawn fibers and more processes from them, as well as use async.
Or are you talking about using the Manager and namespaces to communicate between processes? That is a little slow, yes. High speed code should probably use something else. Most programs will be fine with it, but it is way slower than rolling your own solution. However, it does work easily, so that's something to be said about it. Shared memory space techniques do work, too, but they are a little obtuse. Personally, I rolled my own data structures using the multiprocessing primitives. You have to set them up ahead of time, but they're insanely fast. Or you can use redis pubsub for IPC. Or write to a memory-mapped file.
> - You can't use multiprocess while inside a multiprocess executor. This makes building things on top of frameworks/libs that use multiprocess a nightmare... e.g try to use a web server like over something like Keras...
I'm not sure what you mean. Multiprocessing simply spawns other Python processes. You can spawn processes from processes, so I don't know why you would have issues. Perhaps communication is an issue?
If you use numba (or Cython, C extensions, etc.) you can make functions run without holding the GIL, so they can run in parallel. Here's an example that should keep all your CPU cores pegged at 100% utilization for a while:
    import numba as nb
    from concurrent.futures import ThreadPoolExecutor
    from multiprocessing import cpu_count

    @nb.jit(nogil=True)
    def slow_calculation(x):
        out = 0
        for i in range(x):
            out += i**0.01
        return out

    ex = ThreadPoolExecutor(max_workers=cpu_count())
    futures = [ex.submit(slow_calculation, 100_000_000_000 + i)
               for i in range(cpu_count())]
Even without requiring the GIL, these are still child threads of the main process, correct? And because of that, wouldn't the OS keep them all on the same core? And if that's the case, would ProcessPoolExecutor solve that problem?
This is a presentation given by David Beazley (dabeaz), wherein he live codes, in emacs and from scratch, a concurrent socket server to illustrate concepts of IO vs CPU bound concurrency in Python. The presentation is not just illuminating on how socket programming works in Python, but it is also a fun and relatively unique example of live coding being an effective presentation tool. I was at the conference live for this session and someone next to me in the crowd, when it was over, said, "I think we just witnessed the Jimi Hendrix of Python."
> So, what happens when you lock a Python programmer in a secret vault containing 1.5 TBytes of C++ source code and no internet connection? Find out as I describe how I used Python as a secret weapon of "discovery" in an epic legal battle.
It's very disappointing to see someone who seems authoritative spreading the myth that the GIL prevents you from using threads to achieve concurrency.
It didn't work with his toy example, calling the Fibonacci function, because it's pure Python. Typically, if you have pure CPU-bound processing like this, you wouldn't want to use pure Python anyway, as it would be too slow. You'd either use a C extension library like numpy, scipy, or pytorch, or (more rarely) write the code yourself in Cython. In either case, the GIL is released whenever you make a call into them (you have to do it manually in Cython, but it's straightforward). If his example had been multiplying matrices together with np.dot, then threads wouldn't have been a problem.
The GIL is also released whenever you do I/O like reading data from a socket or reading from a file. This includes libraries that do this for you, such as those reading from a database (whether a remote DBMS like Postgres or a file-based one like SQLite).
Taken together, these cover 99% of cases where you want concurrency. In that 1% where the GIL is not released, fixing that pure Python computation code is often worth doing before adding concurrency anyway.
It's a Python conference demonstrating achieving goals in Python. Basically saying "eh, use C for CPU-bound work" isn't helpful.
I think you missed the point a bit, which was purely to show co-operative yielding. The fact that some libs use C and release the GIL really doesn't matter. This is another way. The real-world use case, as with basically all Python, is IO.
I'm not saying write your whole program in C. I'm saying, at worst, write a tiny corner of it in C so that you don't have to restructure the rest of your application - overall it would be simpler, even for most mostly-Python devs. Or, much more likely, use a library like numpy, which you were probably going to do anyway. I regularly use numpy and don't find myself thinking "I wish I didn't have to write all this C code...". (I must admit, an unfortunate consequence of this is that I've interviewed a few junior candidates who were so used to writing vectorised code that they seemed to be terrified of using a for loop!)
I admit I didn't watch the whole of the talk. (Personally I much prefer learning from text than video, and I was put off by the GIL bit anyway.) From the parts I saw, it seemed like the GIL issue was critical for motivating everything he did afterwards, but perhaps there were other reasons that became clear later. In that case, he could have avoided mentioning the GIL altogether (given that he ended up being so misleading about it).
> The real-world use case, as with basically all Python, is IO.
That is just straight up not true. Yes, IO is a valid use case of Python but it definitely isn't "basically all Python". I'm sure the vast majority of data scientists use Python, and for most of them their workloads are almost entirely computational (using deep learning libraries with C/C++/CUDA backends).
I think we have to consider that this talk is from 2015, when data science was a much smaller part of the Python ecosystem. I have used Python since 2003 for various things and not once have I touched numpy, but I bet today 1 in 3 Python devs use it, probably every day. I understand your point of view and I understand the speaker's own too - his is just more old-school.
We’ve been able to sidestep the GIL in many ways since forever (for example with alternative implementations like Jython, IronPython, PyPy etc), and I think everyone has known that for a long time, but here Dave just wanted to show what the actual problem is, beyond the buzzwords (it had become a bit of a myth in the early ‘10s, when Go suddenly exploded in the sysadmin/backend niche partially because “it can do parallelism better”).
About the fact this talk is from 2015: The GIL is released by many libraries that aren't about data science, so was and is relevant to many other types of application. In the Python standard library, for example, zipfile (which can operate on files or in-memory buffers) and hashlib come to mind. As I mentioned in another comment, hashlib calls have released the GIL at least as far back as 2009, well before this talk, and releasing the GIL was possible all the way back in 1998. Many (most?) Python libraries are implemented in C, and almost all of those that are will release the GIL.
About appreciating both points of view: I'm sorry if I'm making it sound like I think Python developers should always use threads and never multiprocessing. I don't believe that. And I don't believe the speaker should have reformulated their whole talk to use threads instead of processes. Just that it's such a pervasive myth that the GIL prevents all thread-based parallelism that they should have been careful to avoid reinforcing that. The speaker showed an example where the GIL actually did prevent parallelism; all I wanted was for them to add verbally that, by the way, if you were calling almost any CPU-bound library function rather than one you wrote yourself then this wouldn't be a problem.
> GIL prevents you from using threads to achieve concurrency
I didn't watch, but nothing in python prevents concurrency. In Python, a lot of things prevent parallelism. Threads/async are ways of achieving concurrency in python. Parallelism is achieved using multiprocessing. It is inefficient, but it works.
Also from certain perspectives parallelism is achieved with threads during IO.
So if 20 threads are making blocking IO calls, all 20 threads can make progress on those IO calls in parallel while having an additional parallel thread doing compute operations yielding a total of 21 threads executing in parallel.
Quote: "In data transmission, parallel communication is a method of conveying multiple binary digits (bits) simultaneously. It contrasts with serial communication, which conveys only a single bit at a time; this distinction is one way of characterizing a communications link."
Oops, you're right, I was talking about parallelism, not just concurrency. Thanks for the correction.
> Parallelism is achieved using multiprocessing.
As I said in my previous comment (but used the wrong term), threads are a perfectly reasonable way of getting parallelism in Python, so long as your application uses C-based libraries like numpy (or even some built-in modules like zipfile). In my experience, the vast majority of Python programs that might need parallelism fall into that category (or are IO bound) anyway.
One use case is IO-bound applications. Similar to NodeJS, which can handle 10K concurrent IO calls on a single thread, Python releases the GIL on IO. So you can write pure Python with multithreading and get all the benefits of "parallelism", depending on whether or not Python releases the GIL, which it does with certain libraries and on IO.
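A toy version of that claim in pure Python (`asyncio.sleep` stands in for the ~1s of network latency; no real sockets involved):

```python
import asyncio
import time

async def fake_request(i):
    # stand-in for a network call with ~1s of latency
    await asyncio.sleep(1.0)
    return i

async def main():
    t0 = time.perf_counter()
    # 10,000 concurrent "requests" interleaved on one thread
    results = await asyncio.gather(*(fake_request(i) for i in range(10_000)))
    elapsed = time.perf_counter() - t0
    print(f"10,000 'requests' done in {elapsed:.2f}s on one thread")
    return results

asyncio.run(main())
```

All 10,000 waits overlap, so the whole batch finishes in roughly the latency of one request plus scheduling overhead, not 10,000 seconds.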
I remember using the multiprocessing library to speed up some file I/O. Basically I found out that I could spawn a separate process for each physical storage interface, and I was able to hash like 7 TiB of data in a little over 7 hours. My understanding was that threads are subject to the GIL and thus cannot run on multiple cores, whereas processes don't have that same restriction, so I needed to use the multiprocessing library, NOT the threading library, to parallelize my I/O.
Also, the featured video is from 2015. Maybe it's just outdated information?
No, it isn't outdated. The claims are still issues. In the scenario you describe, multiprocessing should work pretty well because each process can run independently of the others and report one big result at the end (all the hashes that process computed). There are plenty of scenarios, however, where the different processes do have to talk to each other frequently, which, in Python, means you introduce a ton of overhead in serializing and deserializing that data (Python transmits pickled data between processes). From personal experience I can tell you that this can be a major problem in terms of performance.
Other languages give you more flexibility in how to share data between processes.
The fundamental claim that threads running pure Python are limited by the GIL still stands. Others do point out that you can get around this in C and some of the standard libraries in Python (like hashlib) do this for you. That goes a long way to helping the issue, of course, but as yet others point out, it is weird to support Python in this context by saying "Yes Python can use threads effectively; all you have to do is use C."
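It's easy to see the serialization tax in isolation (a toy payload; real numbers vary wildly with the data, but this is the pickle round-trip every cross-process message pays):

```python
import pickle
import time

# toy payload: 10,000 rows of 100 floats, like a batch of feature vectors
payload = [[float(i) for i in range(100)] for _ in range(10_000)]

t0 = time.perf_counter()
blob = pickle.dumps(payload)          # what the sending process does
restored = pickle.loads(blob)         # what the receiving process does
elapsed = time.perf_counter() - t0

assert restored == payload
print(f"{len(blob) / 1e6:.1f} MB pickled + unpickled in {elapsed * 1000:.1f} ms")
```

Do that on every inter-process message in a chatty workload and the overhead dominates; shared memory and C-level buffers exist precisely to skip this step.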
> My understanding was that threads are subject to GIL and thus cannot run on multiple cores
That's exactly the myth I'm trying to address. It's true that in some circumstances the GIL prevents threads from running in multiple cores, but not all. The GIL would definitely have been released during the file I/O calls in your program. The GIL could well have been released during hashing too, depending on what hashing library you were using, which would have enabled true thread-based parallelism. For example, the built in Python hashlib says [1]:
> Note: For better multithreading performance, the Python GIL is released for data larger than 2047 bytes at object creation or on update.
On the other hand, I'm absolutely not saying that nobody should use multiple processes. For some applications, it's no more complex to use the multiprocessing module than to use threads, and there's not much overhead in passing the data between processes. In that case, it's nice because you just don't even need to worry about whether the GIL is going to play a role, which can be a pain since sometimes there's no documentation saying whether specific functions release the GIL or not. All I'm saying is we should make clear to everyone that threads are an option that the GIL doesn't (always) prevent.
> Also, the featured video is from 2015. Maybe it's just outdated information?
Another note on the hashlib page said that the above feature was added in Python 3.1 (released in 2009). That's just about that particular module. I found documentation from Python 1.5 (released in 1998) that describes how C extensions can release the GIL [2].
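You can see it with nothing but the stdlib (buffer sizes here are arbitrary, just comfortably above the 2047-byte threshold the docs mention):

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

# four 32 MiB buffers, far above hashlib's GIL-release threshold
data = [os.urandom(32 * 1024 * 1024) for _ in range(4)]

def digest(buf):
    return hashlib.sha256(buf).hexdigest()

t0 = time.perf_counter()
sequential = [digest(buf) for buf in data]
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    threaded = list(ex.map(digest, data))
t_par = time.perf_counter() - t0

assert sequential == threaded
print(f"sequential: {t_seq:.3f}s   threaded: {t_par:.3f}s")
```

On a multi-core machine the threaded pass should come in well under the sequential one, because the hashing happens in C with the GIL dropped.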
You are completely wrong. The GIL is released on I/O. So threads can parallelize during I/O calls.
This means if your application is bottlenecked by IO, threading is better. If it's bottlenecked by both IO and compute, or only compute, then multiple processes are better.
For your specific case it sounds like the latter: your application is bound by both, so multiple processes are better. But what you're implying with your post, that multiprocessing over multithreading is the more "correct" way to "speed up" IO, is categorically wrong.
Dealing with the IO bottleneck by allowing IO calls to make progress while switching to other threads is a very common pattern in web development and the pattern has literally been baked into several frameworks and languages. NodeJS, GoLang, Erlang and Elixir can handle around 10k concurrent IO calls on a few threads without skipping a beat.
I think you're confusing async and parallelism. In multithreading, the computer will switch between multiple tasks whenever there is nothing to do, such as waiting for an I/O read or waiting for a network packet to arrive. Things don't happen at the same time; it's just a more efficient, dynamic scheduling of a sequence of tasks happening one after the other. It's not physically possible for multiple threads to run at the same time unless they are on separate cores.

In my use case, hashing 200000 files on 7 separate hard drives, the only way to speed up the computation was to read from all 7 drives AT THE SAME TIME. There is no way to schedule 10^6 disk reads and hash operations onto a single thread of execution that will reduce the runtime. Those async tricks can work in a more complicated program, but you have to understand, I was ONLY reading from disk. The hashing step was faster than the disk read (thanks xxhash) and the data was written to a sqlite file in a RAMdisk instantly, so pretty much the entire runtime was spent reading from disk. multiprocessing was able to parallelize that unambiguously.

Personally I don't trust Python's multithreading because I have no control over whether it executes on one core or multiple cores, and in the world of parallel (not async) it is a very common convention that processes are parallel, while threads are async.
However, that being said, I'm open-minded. I will try to benchmark reading from multiple hdds using multithreading and compare bandwidth to the multiprocessing approach. How's that?
No you're confused. I'm perfectly aware of the differences between parallelism and concurrency. You are not.
Have you ever heard of a popular web server framework called NodeJS? You can create servers with NodeJS that handle 10k concurrent IO requests.
Let's say each request takes about 1s to finish because of the internet. If you fire 10k of these requests at a NodeJS server, the server can echo ALL the requests back in probably 2s.
And get this. NodeJS is single threaded.
If you fire 10k requests and NONE of the requests are parallel, you would typically expect all requests to finish in 10k seconds. This is not what happens with a single thread of NodeJS. People write entire chat servers in NodeJS that handle IO requests hitting around 10k concurrent messages in flight. NodeJS handles ALL of this on a SINGLE thread, and from the user's perspective all these requests finish in seconds.
If you don't understand why the above happens it means you don't understand concurrency vs. parallelism in the context of IO.
>that processes are parallel, while threads are async.
No. Both processes and threads are async. Only processes are parallel on compute (for python) and threads become parallel on only IO (for python). And technically certain python libraries can parallelize threads.
That is the key. A single thread can handle parallel IO. Why? Because the time spent on an IO operation is usually waiting for an in-flight message to travel across the wire. So in a sense you can have 50 messages in flight across a wire handled by a single thread while the CPU just spends all this time waiting for the messages to arrive. It's a form of "parallelism" if you will, but you won't find any literature using the word "parallelism" in conjunction with single-threaded IO, even though this is technically what is going on.
>However, that being said, I'm open-minded. I will try to benchmark reading from multiple hdds using multithreading and compare bandwidth to the multiprocessing approach. How's that?
This is a useless test. Your code involves both compute and IO. So processes will parallelize BOTH IO and compute. Threads will only parallelize IO, so if you have say like 10 fixed threads and 10 fixed processes OF COURSE processes will beat threads for your test case.
To make it a fair test, and to see the benefits of Python threads over processes, you need to get rid of the hash operation. Maybe make your Python program copy each file to another file. Then try to make everything concurrent. By everything I mean: for every single file, fire off a new thread, and do the same for processes.
You will find that the threaded approach will take up much less resources and go much further before crashing.
Even better than threading though... high IO operations can actually be handled by a single thread. You can parallelize 200000 IO calls on a single thread with python async await or nodejs (i recommend node for this type of stuff).
Though the other bottleneck will be your HD in this case. When your HDs see 200000 IO calls, the HDs themselves will start serializing the requests. Also, with an HD the message flight time on the wire is super short. Likely your IO is spending a good chunk of time writing messages to program memory as well.
The most fair test is 200000 IO calls to multiple external services that can handle such volume. Don't touch your HD. Just focus tests on services that will not block or serialize IO.
>multiprocessing was able to parallelize that unambiguously.
No dude. Wrong again. Have you noticed that you're able to create more processes than you have cores? Have you ever thought about all the processes that are running on your OS? Likely more total processes than you have cores. How does this happen? Because not all processes are parallel.
For OS threads and for all processes your OS decides whether they will be parallel or whether they will not. I know of no API that gives you direct control over which Core to execute your thread/process on.
The only difference between threads and processes is that threads have shared memory with other threads and are less expensive to instantiate in terms of time and memory and processes have their own memory space and are therefore more "expensive". Python is unique in the sense that threads are not parallel on compute due to the GIL, but this is only true for python and python-like languages.
How's that for a wall of text? Hope you learned something junior.
Did you read my full post? I don't think you did. Let me quote it:
>To make it a fair test, and to see the benefits of Python threads over processes, you need to get rid of the hash operation. Maybe make your Python program copy each file to another file. Then try to make everything concurrent. By everything I mean: for every single file, fire off a new thread, and do the same for processes.
>Though the other bottleneck will be your HD in this case. When your HDs see 200000 IO calls, the HDs themselves will start serializing the requests. Also, with an HD the message flight time on the wire is super short. Likely your IO is spending a good chunk of time writing messages to program memory as well.
>The most fair test is 200000 IO calls to multiple external services that can handle such volume. Don't touch your HD. Just focus tests on services that will not block or serialize IO.
I'm very very knowledgeable about concurrency. The problem is you. You failed to read my post and the caveats of all the details I mentioned. You failed to adjust the test fully to make it fair. You just overall failed and you don't know much.
Let me explain why you failed. Like I quoted above. HD IO has very very short time of flight meaning that the actual operation of compute (writing the data to program memory) takes much more time. Your threads aren't parallelizing that.
Like I said before, THE ONLY DIFFERENCE between processes and threads is that Python threads DO NOT parallelize compute, and processes are MORE EXPENSIVE to spawn in terms of memory. SO if you parallelize off a fixed number of threads/processes, what do you think will occur? Of course processes will be faster.
I told you to adjust the test to spawn a new thread/process for every IO call to see the benefits. Because threads are cheaper in memory to spawn it will be able to spawn MANY more threads than it can processes. So if your operation is IO bound what you typically did with like 10 processes can likely be done with 1000 threads. <--- That is the difference.
You said you have 200000 files? That means 200000 threads/processes must be spawned for a fair test. And if you can, try to hit some server on the internet that can handle that many requests to fully see parallelization of IO, because time of flight is longer over the internet.
Additionally, your HDDs themselves can't parallelize 200000 requests, so in the end all your threads/processes will be bottlenecked by the HD itself, NOT by IO. That's why I told you to hit something over the internet that can handle 200000 IO calls. I mentioned all of this in my original response and you failed to understand.
Now do you Get it? I actually am trying to teach you something and you have the balls to tell me that I don't know about concurrency. It's quite obvious you have no idea what you're talking about.
If I was doing network I/O then there's a chance that threads would be faster... but you do realize that would also be multiprocessing on separate physical CPUs, right? A separate computer, running a separate process, is processing your request on the network at the same time. That's how async works. It's really just an illusion of single-threaded parallelism.
Anyways, the original scenario didn't require any network services and hashing each data chunk and writing to ramdisk really was instant. My script originally was supposed to enable integrity checks and deduplication of files on separate hard drives. There's no point turning it into a webapp just to shoehorn async in there.
Also this is a dumb conversation. I didn't actually run the benchmark. I just lied to you because I wanted to see you waste hours of your life typing out a wall of text for someone who doesn't care what you think. Also I still think you have no idea what you're talking about.
>Also this is a dumb conversation. I didn't actually run the benchmark. I just lied to you because I wanted to see you waste hours of your life typing out a wall of text for someone who doesn't care what you think. Also I still think you have no idea what you're talking about.
I was trying out of my heart to help out someone who didn't know anything. Looks like you took my altruism and threw it on the ground and stepped on it.
>If I was doing network I/O then there's a chance that threads would be faster... but you do realize that would also be multiprocessing on separate physical CPUs right?
Your HDD has its own processor inside it that handles reads and writes. Your CPU only sends it commands. So it's the same thing either way. It's just that servers are designed to be hit by 200000 simultaneous requests, and HDDs are not.
>That's how async works. It's really just an illusion of single-thread paralellism.
Except IO messages travel down the wire in parallel. This is what you need to get through your head. For compute, time is divided among threads in a single core but not for IO.
>Anyways, the original scenario didn't require any network services and hashing each data chunk and writing to ramdisk really was instant. My script originally was supposed to enable integrity checks and deduplication of files on separate hard drives. There's no point turning it into a webapp just to shoehorn async in there.
I didn't tell you to turn it into a webapp. I told you that your test didn't make sense and I told you to run a completely different test. I didn't tell you to shoehorn your app. If you're using sockets the API for interfacing with IO is exactly the same whether it's web OR an HDD. No shoehorning.
> If it's bottle necked by both IO and compute or only compute then multiple processes are better.
This is why I started this whole thread. The myth that you can't multithread compute in Python is so pervasive that even when we're in the middle of a thread specifically about the topic, and right next to my comment where I show hashlib specifically does release the GIL, there's still a comment saying that multiple processes are needed for compute parallelism.
My comment is more addressing the guy trying to use processing OVER threading to improve overall speed of completing multiple IO tasks.
Yeah I'm aware of your comment. And I can see how you could be pissed off over someone not mentioning what you're saying but please be aware that forum threads can go off into slight tangents when someone states something that is factually wrong.
Yes. I know... Programmers can turn to C or they have to use specific libraries to achieve parallelism with threading. I get it. Though you're describing special cases. Similar to the grammatical special cases in English: https://www.e-education.psu.edu/styleforstudents/c1_p6.html
Special cases make certain grammatical rules technically incorrect but the general notion behind those rules and why those rules are still pervasive and stated by people who are aware of the special cases is still valid. It's the same story with the python GIL. There could be multitudes of special cases but this does not negate the existence of a general grammatical rule: Compute in python threads is not parallel.
Do I go into a long winded technical discussion of all the special cases or do I just use the generality to point out his categorical mistake? I guess I should have taken the long winded route because if I don't I'll piss you off. Thank you for voting me down btw, I will be sure to address all stakeholders in HN threads in my future replies.
What is mind-boggling to me is how many of you out there can talk and code in parallel at the same time, while the text you're typing is completely different from the things you're saying.
I wish ML/DS would switch to Julia