Recently I decided to investigate the internals of our chosen application server, gunicorn, to better understand the async worker model. In particular, I wanted to understand how requests were concurrently processed, and any performance tradeoffs this entailed. These are some notes I made along the way.
First, let's cover some fundamentals of gunicorn's design. Gunicorn is a WSGI HTTP server. WSGI is a standard (PEP 333) which specifies how Python applications interface with a web server.
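To make that interface concrete, here is a minimal WSGI application of the kind gunicorn can serve. The `environ` and `start_response` names come from the spec itself; the body is just a hello-world sketch:

```python
def application(environ, start_response):
    # environ: a dict of CGI-style request variables supplied by the server
    # start_response: a callable the app uses to begin the HTTP response
    body = b"Hello, world!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]  # WSGI apps return an iterable of bytes
```

Any server that speaks WSGI (gunicorn included) can call this `application` object, which is what makes the server and the framework swappable.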
Gunicorn uses a pre-fork worker model, meaning that one master process forks to create child processes which handle the actual HTTP requests.
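As a rough illustration, the pre-fork idea looks like this. This is a toy sketch using `os.fork`, not gunicorn's actual code:

```python
import os

# The master forks its worker processes up front; each child would then run
# its own accept/handle loop against the shared listening socket.
NUM_WORKERS = 2
children = []
for _ in range(NUM_WORKERS):
    pid = os.fork()
    if pid == 0:
        # Child process: a real worker would handle requests in a loop here.
        os._exit(0)
    children.append(pid)

# The master waits on its workers; gunicorn additionally supervises them
# and respawns any that die.
statuses = [os.waitpid(pid, 0)[1] for pid in children]
```

Because the children are forked before any requests arrive, there is no per-request process-creation cost, which is the point of the pre-fork design.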
Where this gets more complex is that gunicorn can use different types of worker processes, depending on how it is configured. Two types of worker processes are sync and async (there is a third, tornado, which I won't cover here).
Sync workers handle one HTTP request at a time. These are suitable if, for instance, the work your application does is likely CPU-bound (apart from the actual socket I/O).
Async workers can handle multiple requests concurrently (but not in parallel). In effect, a worker can yield mid-request to handle another request, then resume handling the prior request at a later point. This is useful when, for instance, your application makes an external service call (i.e. is I/O bound) since it can handle another incoming request without waiting for the external service call to complete.
In our application, almost all requests make external service calls, so async workers are the most appropriate worker type.
We'll now look at how these async workers achieve non-blocking request processing. For simplicity, let's assume the only type of async worker available is gevent.
To understand these async workers, we must understand the libraries and programming model they are built upon.
Coroutines may switch their flow of control to another coroutine and resume execution at the same point later. The following pseudocode (adapted from the gevent tutorial) shows two coroutines, worker_a and worker_b, yielding execution to each other.
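A runnable version of that idea, using the greenlet package (assumed installed), with a `log` list added to make the switching order visible:

```python
from greenlet import greenlet

log = []

def worker_a():
    log.append("a1")
    gr_b.switch()      # explicitly yield control to worker_b
    log.append("a2")   # resumed here when worker_b switches back

def worker_b():
    log.append("b1")
    gr_a.switch()      # yield back; worker_a resumes after its switch()

gr_a = greenlet(worker_a)
gr_b = greenlet(worker_b)
gr_a.switch()          # start worker_a; returns when it finishes
```

After running, `log` is `["a1", "b1", "a2"]`: worker_a runs until it yields, worker_b runs until it yields back, and worker_a resumes exactly where it left off.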
There is an excellent description of coroutines here which would be a good thing to read at this point.
An important aspect here is that coroutines are cooperatively scheduled: there is no preemptive scheduler deciding when to switch, so a coroutine runs until it explicitly yields control. Because switches only happen at these explicit points, reasoning locally about concurrent access to shared data is substantially easier.
Greenlet is a library implementing coroutines in Python.
As a side note, greenlets are compatible with Python threads: it's perfectly possible to have multiple threads, each with its own greenlets, but they are isolated from one another, and switching between greenlets will not work across threads.
Now that we understand coroutines, let's look at how greenlet provides the foundation for async workers.
Greenlet provides a method of switching context between cooperating functions. This is very useful in an I/O bound context. This brings us to gevent, a coroutine-based networking library built on greenlet.
Here's an example:
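A minimal sketch (assuming gevent is installed), with an `order` list added to make the scheduling visible. Each call to `gevent.sleep(0)` cooperatively yields to the gevent hub, which lets the other greenlet run:

```python
import gevent

order = []

def foo():
    order.append("foo start")
    gevent.sleep(0)              # cooperatively yield to the hub
    order.append("foo resumed")

def bar():
    order.append("bar start")
    gevent.sleep(0)
    order.append("bar resumed")

# spawn schedules both greenlets; joinall blocks until both finish
gevent.joinall([gevent.spawn(foo), gevent.spawn(bar)])
```

The resulting order is `foo start`, `bar start`, `foo resumed`, `bar resumed`: each greenlet runs until it yields, then the other gets a turn.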
This example is adapted from gevent For the Working Python Developer (MIT License), which I recommend.
So in essence, when something is waiting for I/O, gevent makes it easy to do other work in the meantime. This obviously has big advantages for concurrent networking code, as well as anything else that is I/O bound where the caller can be safely suspended and resumed later.
Gevent will monkey patch blocking standard library functions (sockets, DNS resolution, sleeps and so on) so that they cooperatively yield instead of blocking. It's worth noting that, while we haven't experienced any problems with our actual application code, monkey patching is not always problem-free.
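For example (a sketch assuming gevent is installed), after `monkey.patch_all()` even `time.sleep` becomes cooperative, so two greenlets can "sleep" at the same time:

```python
from gevent import monkey
monkey.patch_all()  # patch before importing modules that use sockets, time, etc.

import time
import gevent

# time.sleep is now gevent's cooperative sleep: it yields to the hub instead
# of blocking the whole process, so the two sleeps below overlap.
start = time.monotonic()
gevent.joinall([gevent.spawn(time.sleep, 0.2),
                gevent.spawn(time.sleep, 0.2)])
elapsed = time.monotonic() - start  # ~0.2s rather than ~0.4s
```

This same mechanism is what lets an unmodified WSGI application block on an external service call while the worker handles other requests.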
I noted earlier that requests are handled concurrently by async workers. However, the unit of parallel work is still the OS process, so to utilise multiple CPU cores, multiple worker processes are required. This isn't necessarily a problem, but it's worth being precise about the distinction between concurrency (within a worker) and parallelism (across workers), because it affects how the application is scaled.
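In practice this means scaling both dimensions in the gunicorn invocation. A hypothetical example (the module path `myapp.wsgi:application` is illustrative):

```shell
# 4 worker processes give parallelism across CPU cores; within each worker,
# the gevent worker class multiplexes up to 1000 connections concurrently.
gunicorn --worker-class gevent --workers 4 --worker-connections 1000 \
    myapp.wsgi:application
```

`--workers` governs CPU-bound capacity, while `--worker-connections` governs how many in-flight I/O-bound requests each worker will juggle.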
Finally, you might find this talk on coroutines from PyCon useful.