Commit 638ed5f

Merge pull request #504 from wsot/patch-1

Edit grammar in why.rst

2 parents dbf5047 + c162f8e

1 file changed: docs/why.rst (116 additions, 113 deletions)

@@ -30,92 +30,92 @@ We have lots of web pages to index, so we simply handle them one by one:

.. image:: why_throughput.png
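
To make the setup concrete, the one-by-one approach might look like the sketch
below (the URLs and the ``index_page()`` helper are hypothetical):

.. code-block:: python

    # A minimal sketch of sequential, blocking processing; URLs are made up.
    import urllib.request

    urls = ["https://example.com/a", "https://example.com/b"]

    def index_page(url):
        html = urllib.request.urlopen(url).read()  # blocking I/O
        return len(html)                           # stand-in for CPU-bound work

    for url in urls:
        index_page(url)   # the next task cannot start until this one is done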

Let's assume the time of each task is constant: each second, 2 tasks are done.
Thus we can say that the throughput of the current system is 2 tasks/sec. How
can we improve the throughput? An obvious answer is to add more CPU cores:

.. image:: why_multicore.png

This simply doubles our throughput to 4 tasks/sec, and linearly scales as we
add more CPU cores, if the network is not a bottleneck. But can we improve
the throughput for each CPU core? The answer is yes, we can use
multi-threading:

.. image:: why_multithreading.png

Wait a second! The 2 threads barely finished 6 tasks in 2 seconds, a
throughput of only 2.7 tasks/sec, much lower than 4 tasks/sec with 2 cores.
What's wrong with multi-threading? From the diagram we can see:

* There are yellow bars taking up extra time.
* The green bars can still overlap with any bar in the other thread, but
  non-green bars cannot overlap with non-green bars in the other thread.

The yellow bars are time taken by `context switches
<https://en.wikipedia.org/wiki/Context_switch>`_, a necessary part of allowing
multiple threads or processes to run on a single CPU core concurrently.
One CPU core can do only one thing at a time (let's assume a world without
`Hyper-threading <https://en.wikipedia.org/wiki/Hyper-threading>`_ or similar),
so in order to run several threads concurrently the CPU must `split its
time <https://en.wikipedia.org/wiki/Time-sharing>`_ into small slices, and run
a little bit of each thread within these slices. The yellow bar is the
overhead for the CPU to switch context to run a different thread. The scale is
a bit dramatic, but it helps with the point.

Wait again here, the green bars are overlapping between threads. Is the CPU
doing two things at the same time? No, the CPU is doing nothing in the middle
of the green bar, because it's waiting for the HTTP response (I/O). That's how
multi-threading could improve the throughput to 2.7 tasks/sec, instead of
decreasing it to 1.7 tasks/sec. If you actually run CPU-intensive tasks with
multi-threading on a single core, there won't be any improvement. Like the
multiplexed red bars (in practice there might be more context switches
depending on the task), they appear to be running at the same time, but the
total time for all to finish is actually longer than running the tasks one
by one. That's also why this is called concurrency instead of parallelism.
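
For comparison, here is a sketch of the multi-threaded variant of the same
hypothetical crawler; the blocking downloads can overlap because each thread
releases the CPU (and, in CPython, the GIL) while waiting on the network:

.. code-block:: python

    # A sketch using a thread pool; urls and index_page() are the same
    # hypothetical helpers as in the earlier sequential sketch.
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    urls = ["https://example.com/a", "https://example.com/b"]

    def index_page(url):
        html = urllib.request.urlopen(url).read()  # I/O: overlaps across threads
        return len(html)                           # CPU: one thread at a time

    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(index_page, urls))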

As you might imagine, throughput will improve less with each additional
thread, until throughput begins to decrease because context switches are
wasting too much time, not to mention the extra memory footprint taken by new
threads. It is usually not practical to have tens of thousands of threads
running on a single CPU core. How, then, is it possible to have tens of
thousands of I/O-bound tasks running concurrently on a single CPU core? This
is the once-famous `C10k problem <https://en.wikipedia.org/wiki/C10k_problem>`_,
usually solved by asynchronous I/O:

.. image:: why_coroutine.png

.. note::

    Asynchronous I/O and coroutines are two different things, but they usually
    go together. Here we will stick with coroutines for simplicity.

Awesome! The throughput is 3.7 tasks/sec, nearly as good as the 4 tasks/sec of
2 CPU cores. Though this is not real data, compared to OS threads coroutines
take much less time to context switch and have a lower memory footprint, thus
making them an ideal option for the C10k problem.
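
A minimal sketch of the coroutine version is below; ``asyncio.sleep()`` stands
in for the real HTTP request, and the total runtime is roughly the longest
single wait rather than the sum of all waits:

.. code-block:: python

    # A sketch only: asyncio.sleep() simulates waiting for an HTTP response.
    import asyncio
    import time

    async def index_page(url):
        await asyncio.sleep(0.5)   # I/O wait: yields control to the event loop
        return url                 # stand-in for processing the response

    async def main():
        start = time.monotonic()
        await asyncio.gather(*(index_page(f"page-{i}") for i in range(10)))
        print(f"10 tasks in {time.monotonic() - start:.2f}s")  # ~0.5s, not 5s

    asyncio.run(main())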


Cooperative multitasking
------------------------

So what is a coroutine?

In the last diagram above, you may have noticed a difference compared to the
previous diagrams: the green bars are overlapping within the same thread.
That is because in the last diagram, our code is using asynchronous I/O,
whereas previously we were using blocking I/O. As the name suggests, blocking
I/O will block the thread until the I/O result is ready. Thus, there can be
only one blocking I/O operation running in a thread at a time. To achieve
concurrency with blocking I/O, either multi-threading or multi-processing must
be used. In contrast, asynchronous I/O allows thousands (or even more) of
concurrent I/O reads and writes within the same thread, with each I/O
operation blocking only the coroutine performing the I/O rather than the whole
thread. Like multi-threading, coroutines provide a means to have concurrency
during I/O, but unlike multi-threading this concurrency occurs within a single
thread.
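
In Python, for example, a coroutine is declared with ``async def`` and pauses
itself at each ``await``. A minimal sketch (the URL is hypothetical):

.. code-block:: python

    # Calling fetch() creates a coroutine object; it runs only when driven
    # by an event loop, and pauses itself at the await expression.
    import asyncio

    async def fetch(url):
        await asyncio.sleep(0.1)   # only this coroutine waits here
        return url

    asyncio.run(fetch("https://example.com"))
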
Threads are scheduled by the operating system using an approach called `preemptive
multitasking <https://en.wikipedia.org/wiki/Preemption_(computing)>`_. For
example, in the previous multi-threading diagram there was only one CPU core.
When Thread 2 tried to start processing the first web page content, Thread 1
hadn't finished processing its own. The OS brutally interrupted Thread 1 and
shared some resource (time) for Thread 2. But Thread 1 also needed CPU time to
finish
@@ -138,13 +138,13 @@ something like this:

    Thread 2: Alright ... but I really need the CPU.
    OS: You'll have it later. Thread 1, hurry up!

In contrast, coroutines are scheduled by themselves cooperatively with the
help of an event manager. The event manager lives in the same thread as the
coroutines and, unlike the OS scheduler that forces context switches on
threads, the event manager acts only when coroutines pause themselves. A
thread knows when it wants to run, but coroutines don't - only the event
manager knows which coroutine should run. The event manager may only trigger
the next coroutine to run after the previous coroutine yields control to wait
for an event (e.g. wait for an HTTP response). This approach to achieving
concurrency is called `cooperative multitasking
<https://en.wikipedia.org/wiki/Cooperative_multitasking>`_. It's like this:
@@ -165,13 +165,14 @@ wait for an HTTP response). This approach to achieve concurrency is called

    Coroutine 1: Arrrrh gotta kill myself with an exception :S
    Event manager: Up to you :/

For coroutines, a task cannot be paused externally; the task can only pause
itself from within. When there are a lot of coroutines, concurrency depends on
each of them pausing from time to time to wait for events. If you wrote a
coroutine that never paused, it would allow no concurrency at all when
running, because no other coroutine would have a chance to run. On the other
hand, you can feel safe in the code between pauses, because no other coroutine
can run at the same time to mess up shared states. That's why in the last
diagram, the red bars are not interleaved like threads.
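
To illustrate, here is a sketch of a coroutine that never pauses (the names
are hypothetical); because it contains no ``await``, nothing else can run
until it finishes:

.. code-block:: python

    # A sketch: a coroutine without any await monopolizes the thread.
    import asyncio

    async def selfish():
        return sum(range(10_000_000))   # CPU work with no await at all

    async def polite():
        print("finally got to run")     # printed only after selfish() ends

    async def main():
        await asyncio.gather(selfish(), polite())

    asyncio.run(main())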

.. tip::

@@ -183,71 +184,73 @@ Pros and cons
-------------

Asynchronous I/O may handle tens of thousands of concurrent I/O operations in
the same thread. This can save a lot of CPU time from context switching, and
memory from multi-threading. Therefore if you are dealing with lots of
I/O-bound tasks concurrently, asynchronous I/O can efficiently use limited CPU
and memory to deliver greater throughput.

With coroutines, you can naturally write sequential code that is cooperatively
scheduled. If your business logic is complex, coroutines can greatly improve
the readability of asynchronous I/O code.

However for a single task, asynchronous I/O can actually impair throughput. For
example, for a simple ``recv()`` operation blocking I/O would just block until
returning the result, but for asynchronous I/O additional steps are required:
register for the read event, wait until the event arrives, try to ``recv()``,
repeat until a result returns, and finally feed the result to a callback. With
coroutines, the framework cost is even larger. Thanks to uvloop_ this cost has
been minimized in Python, but it is still additional overhead compared to raw
blocking I/O.
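
To see where those steps come from, here is a rough sketch of the
register/wait/``recv()`` cycle using Python's standard ``selectors`` module (a
simplified sketch, not a real event loop; a production loop would repeat these
steps and dispatch to many sockets):

.. code-block:: python

    # A simplified sketch of asynchronous recv(): register interest in the
    # read event, wait for readiness, then recv() and hand off the result.
    import selectors
    import socket

    def on_data(data):                # hypothetical callback
        print("got", len(data), "bytes")

    sock = socket.create_connection(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    sock.setblocking(False)

    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    sel.select()                      # wait until the socket is readable
    on_data(sock.recv(4096))          # recv() now returns data immediately
    sel.unregister(sock)
    sock.close()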

Timing in asynchronous I/O is also less predictable because of its cooperative
nature. For example, in a coroutine you may want to sleep for 1 second.
However, if another coroutine received control and ran for 2 seconds, by the
time we get back to the first coroutine 2 seconds have already passed.
Therefore, ``sleep(1)`` means to wait for at least 1 second. In practice, you
should try your best to make sure that all code between ``await`` statements
finishes ASAP, being literally cooperative. Still, there can be code outside
your control, so it is important to keep this unpredictability of timing in
mind.
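
The sketch below demonstrates the effect: one coroutine asks to sleep for 1
second, but an uncooperative neighbor holds the thread for 2 seconds, so it
wakes up late:

.. code-block:: python

    # A sketch: sleep(1) means *at least* 1 second under cooperative scheduling.
    import asyncio
    import time

    async def sleeper():
        start = time.monotonic()
        await asyncio.sleep(1)
        print(f"woke after {time.monotonic() - start:.1f}s")  # ~2.0, not 1.0

    async def hog():
        time.sleep(2)   # blocking call: holds the thread, starves the loop

    async def main():
        await asyncio.gather(sleeper(), hog())

    asyncio.run(main())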

Finally, asynchronous programming is complicated. Writing good asynchronous
code is easier said than done, and debugging it is more difficult than
debugging similar synchronous code. Especially when a whole team is working on
the same piece of asynchronous code, it can easily go wrong. Therefore, a
general suggestion is to use asynchronous I/O carefully, for I/O-bound
high-concurrency scenarios only. It's not a drop-in that will provide a
performance boost, but more like a sharp blade for concurrency with two edges.
And if you are dealing with time-critical tasks, think again to be sure.


About Database and ORM
----------------------

Finally, GINO. We assume a scenario in which asynchronous I/O is required
for the server itself anyway, regardless of how we handle the database.

We now know that asynchronous I/O is for I/O-intensive tasks. But isn't it I/O
intensive to frequently talk to a remote database? It depends. Like Mike said,
"intensive" is relative to your actual code. Modern databases are very fast
and reliable, and the network is reliable if kept within a LAN. If the actual
database access time is a minority of the total time taken by the program, it
is not I/O intensive. Using asynchronous I/O for database connections and
queries in this case will not improve throughput much, and may make it worse
due to the asynchronous framework overheads mentioned earlier. It looks easier
to just use blocking database operations in your coroutines instead, without
harming performance.

However, using blocking operations in coroutines carries a high risk of
causing deadlocks. For example, imagine a coroutine starts a transaction and
updates a row before yielding control. A second coroutine tries to update the
same row before the first coroutine closes the transaction. The second
coroutine will block on the non-async update, waiting for the row lock to be
released and preventing any other coroutine from running. But releasing the
lock is done in the first coroutine, which is now blocked by the second
coroutine. Thus it will block forever.
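
The sketch below reproduces the shape of that deadlock, with a
``threading.Lock`` standing in for the database row lock (a deliberately
broken sketch - running it hangs forever):

.. code-block:: python

    # A sketch of the deadlock; threading.Lock stands in for the row lock.
    import asyncio
    import threading

    row_lock = threading.Lock()

    async def first():
        row_lock.acquire()       # "begin transaction, update the row"
        await asyncio.sleep(0)   # yield control before committing
        row_lock.release()       # "commit" - never reached

    async def second():
        row_lock.acquire()       # blocking call freezes the whole thread,
        row_lock.release()       # so first() can never release the lock

    async def main():
        await asyncio.gather(first(), second())

    asyncio.run(main())          # hangs forever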

This may happen even if you optimized all database interactions to be as
quick as possible. Race conditions happen under pressure, and anything that
can block will eventually block. Therefore, don't call blocking methods in
coroutines, ever. (Unless you are 100% sure it won't cause a deadlock.)

A simple fix would be to defer the database operations into threads, so that
they won't block the main thread and thus won't cause a deadlock easily. It
usually works, and there is even a library to do so. However, when it comes to
ORM, things become dirty.
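
For instance, with plain asyncio the deferral might look like this sketch
(``blocking_query()`` is a hypothetical blocking helper; ``asyncio.to_thread()``
needs Python 3.9+, older versions can use ``run_in_executor()``):

.. code-block:: python

    # A sketch of deferring a blocking call into a worker thread.
    import asyncio

    def blocking_query(sql):
        ...   # hypothetical: talk to the database with a blocking driver

    async def handler():
        # The event loop keeps running while a thread waits on the database.
        return await asyncio.to_thread(blocking_query, "SELECT 1")

    asyncio.run(handler())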

@@ -256,14 +259,14 @@ for example. In a larger project, you never know which statement has a side
effect to make an implicit database call, and block the main thread. Since you
cannot put only the underlying database access into the thread pool (you need
to ``await`` on the deferred database call), you'll start putting pieces of
code into the thread pool. But because coroutines run only in the main thread,
your code starts to fall apart. This is usually the time when I suggest
separating the server into two parts: "normal blocking with ORM" and
"asynchronous without ORM".

This is where GINO can be useful: it provides the convenience of database
abstraction in a classic asynchronous context. And thanks to asyncpg_, the
asynchronous overhead is by far exceeded by its incredible performance boost.
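
As a closing sketch, a minimal GINO model and query might look like this (the
connection URL and table are hypothetical; see the rest of this documentation
for the real API details):

.. code-block:: python

    # A minimal sketch of GINO usage; the database URL and model are made up.
    import asyncio

    from gino import Gino

    db = Gino()

    class User(db.Model):
        __tablename__ = "users"

        id = db.Column(db.Integer(), primary_key=True)
        nickname = db.Column(db.Unicode())

    async def main():
        await db.set_bind("postgresql://localhost/gino")  # asyncpg underneath
        await db.gino.create_all()
        user = await User.create(nickname="fantix")       # non-blocking insert
        print(await User.get(user.id))

    asyncio.run(main())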
