+++
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"
+++
Recently a [benchmark of concurrency implementations in different
languages][benchmark] made the rounds. In that article [Piotr Kołaczkowski][]
used ChatGPT to generate the examples in the different languages and
benchmarked them. That was a poor choice, as I found the article and read the
Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk
```elixir
tasks =
  for _ <- 1..num_tasks do
    Task.async(fn -> :timer.sleep(10000) end)
  end

Task.await_many(tasks, :infinity)
```
And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not talking about the fact that it uses 4 spaces for indentation.

For 1 million processes this code reported 3.94 GiB of memory used by the
process in Piotr's benchmark, but with a little work I managed to reduce that
about 4-fold, to around 0.93 GiB of RAM usage. In this article I will describe:
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd-snipe people like me)
## What are Erlang processes?

Erlang is ~~well~~ known for being a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang these have a small neat twist - each
process is isolated from the rest, and processes can communicate only via
message passing. That gives Erlang processes 2 features that are rarely spotted
in other implementations:
- Failure isolation - a bug, unhandled case, or other issue in a single process
  will not directly affect any other process in the system. The VM can send
  messages on process shutdown, and other processes may be killed because of
  that, but by itself shutting down a single process will not cause problems in
  any process not related to it (see the sketch after this list).
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer, there is no difference.
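A minimal sketch of failure isolation, assuming the two processes are not
linked: the spawned process crashes, and the calling process carries on
undisturbed.

```elixir
pid = spawn(fn -> raise "boom" end)

# Give the spawned process a moment to crash, then check on both sides.
:timer.sleep(100)

Process.alive?(pid)    #=> false - the spawned process died
Process.alive?(self()) #=> true - the crash did not propagate to us
```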
The above features and requirements result in some design choices, but for our
purposes only one is truly needed today - each process has its own stack and
heap, separate and (almost) independent from any other process.
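That per-process memory is what this whole article measures, and you can
inspect it for a single process with `Process.info/2` (the number below is
illustrative; it varies between VM versions and flags):

```elixir
pid = spawn(fn -> :timer.sleep(10000) end)

# Total size, in bytes, of this one process (stack, heap, and internal data).
Process.info(pid, :memory) #=> {:memory, 2688}
```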
### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal uses. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a
bad carpenter could count, on one hand, the times I have needed it). In general
it's a *here be dragons* area.
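For reference, the API itself is tiny; a minimal sketch of direct usage (with a
made-up `:my_key` key), which, again, you should rarely reach for:

```elixir
# Store and read a value in the current process's dictionary. The value is
# private to this process and dies with it.
Process.put(:my_key, 42)
Process.get(:my_key) #=> 42
```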
How is it relevant to us?
Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about the given process that can later be used for debugging purposes.
Some of the data it stores:

- The initial function that was run by the given process
- PIDs of all ancestors of the given process
Different process abstractions (like `gen_server`/`GenServer`, Elixir's
`Task`, etc.) can store even more metadata there: `logger` stores process
metadata in the process dictionary, and `rand` stores the state of its PRNGs
there. It's used quite extensively by some OTP features.
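You can see this metadata for yourself by dumping the process dictionary of a
freshly started `Task`; a quick sketch (the exact keys and values will differ
between OTP and Elixir versions):

```elixir
Task.async(fn -> IO.inspect(Process.get()) end) |> Task.await()
# Prints something along the lines of:
# ["$callers": [#PID<0.107.0>],
#  "$ancestors": [#PID<0.107.0>],
#  "$initial_call": {:erlang, :apply, 2}]
```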
### "Well behaved" OTP process

In addition to the above metadata, if the process is meant to be a "well
behaved" process in an OTP system, i.e. a process that can be observed and
debugged using OTP facilities, it must respond to some additional messages
defined by the [`sys`][] module. Without that, features like [`observer`][]
would not be able to "see" the contents of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html
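For example, any `GenServer`-based process (an `Agent` here, for brevity)
answers these `sys` messages, so you can peek at its state from the shell; a
minimal sketch:

```elixir
{:ok, pid} = Agent.start_link(fn -> %{count: 0} end)

# `:sys.get_state/1` works because Agent (via GenServer) implements the
# debugging messages defined by the `sys` module.
:sys.get_state(pid) #=> %{count: 0}
```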
## Process memory usage

As we have seen above, the `Task.async/1` function from Elixir **must** do much
more than just a simple "start a process and let it live". That was one of the
most important problems with the original code: it was using a system that
allocates quite substantial memory alongside the process itself, just to
operate that process. In general, that is the desirable approach (as you
**really, really want the debugging facilities**), but in a synthetic benchmark
it skews the very numbers the benchmark claims to measure.

If we want to avoid that additional memory overhead in our spawned processes we
need to go back to more primitive functions in Erlang, namely `erlang:spawn/1`
(`Kernel.spawn/1` in Elixir). But that means we cannot use `Task.await_many/2`
anymore, so we need to work around it with a custom implementation:
```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # Monitor is a built-in feature of Erlang that will inform you (by sending
    # a message) when the process you monitor dies. The returned value is a
    # type called "reference", which is simply a unique value created by the
    # VM. If the process is already dead, the message is delivered immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # `Kernel` module is imported by default, so no need for the `Kernel.` prefix
    spawn(fn -> :timer.sleep(10000) end)
  end

Bench.await_many(tasks)
```
We have already removed one problem (well, two in fact, but we will go into the
details in the next section).
## All your lists belongs to us now

Erlang, like most functional programming languages, has 2 built-in composite
data types:

- Tuples - a non-growable product type, so you can access any field quite
  fast, but adding more values is a performance no-no
- (Singly) linked lists - a growable type (in most cases it will hold values
  of a single type, but in Erlang that is not always the case), which is fast
  to prepend to or pop from at the beginning, but do not try to do anything
  else with it if you care about performance

In this case we will focus on the 2nd one, as tuples aren't important here at
all.
A singly linked list is a simple data structure. It's either the special value
`[]` (an empty list) or something called a "cons cell". Cons cells are also
simple structures - a 2-ary tuple (a tuple with 2 elements) where the first
value is the head (the value in that list cell) and the second is the "tail" of
the list (aka the rest of the list). In Elixir a cons cell is written as
`[head | tail]`.
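In other words, a literal list is just nested cons cells; a one-liner to
convince yourself:

```elixir
[1 | [2 | [3 | []]]] == [1, 2, 3] #=> true
```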
A super simple structure, as you can see, and perfect for functional
programming: you can add new values at the front of a list without modifying
the existing cells, so you can be immutable and fast. However, if you need to
construct a sequence of a lot of values (like our list of all tasks), then we
have a problem, because Elixir promises that the list returned from `for` will
be **in order** of the values passed to it. That means we either need to
process our data with body recursion:
```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```
Where we build up the call stack (as we cannot have tail call optimisation
there, of course sans compiler optimisations). Or we need to build our list in
reverse order, and then reverse it before returning (so we can have TCO):
```elixir
def map(list, func), do: do_map(list, func, [])

defp do_map([], _func, agg), do: :lists.reverse(agg)

defp do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```
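Both versions behave the same way; a quick sanity check, assuming the functions
live in a hypothetical `MyList` module:

```elixir
MyList.map([1, 2, 3], &(&1 * 2)) #=> [2, 4, 6]
```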
Which one of these approaches is more performant is irrelevant[^erlang-perf];
what is relevant is that we need to either build up the call stack or construct
our list *twice* to be able to conform to Elixir's promises (even though in
this case we do not care about the order of the list returned by the `for`).
[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster. It's impossible to tell without more benchmarking. For more info check
out the [superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).
Of course we could mitigate our problem by using the `Enum.reduce/3` function
(or writing our own) and end up with code like:
```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    # `Kernel` module is imported by default, so no need for the `Kernel.` prefix
    [spawn(fn -> :timer.sleep(10000) end) | agg]
  end)

Bench.await_many(tasks)
```
Even then we build a list of all the PIDs.
Here I can also go back to the "second problem" I mentioned above:
`Task.await_many/1` *also constructs a list* - the list of return values from
all the processes passed to it. So not only did we construct a list of the
tasks' PIDs, we also constructed a list of return values (which will be `:ok`
for all processes, as that is what `:timer.sleep/1` returns), and immediately
discarded all of that.
How can we do better? Notice that **all** we care about is that all `num_tasks`
processes have gone down. We do not care about any of the return values; all we
want is to know that every process we started went down. For that we can just
send messages from the spawned processes and count the received messages:
```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)
    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```
Now we do not have any lists involved, and we still do what the original task
was meant to do - spawn `num_tasks` processes and wait until all of them go
down.
## Arguments copying

One more thing that we can account for here is the lambda context and data
passing.
You see, we need to pass `this` (which is the PID of the parent) to each newly
spawned process. That is suboptimal, as we are looking for ways to reduce the
amount of memory used (while ignoring all other metrics at the same time). As
Erlang processes are meant to be "share nothing" processes, there is a
problem - we need to copy that PID into every process. It's just 1 word (which
means 8 bytes on 64-bit architectures, 4 bytes on 32-bit), but hey, we are
microbenchmarking, so we cut whatever we can (with 1M processes, this adds up
to about 8 MiB).
Hey, we can avoid that by using yet another feature of Erlang, called the
*process registry*. This is another simple feature: it allows us to assign the
PID of a process to an atom, which then lets us send messages to that process
using just the name we have given it. An atom is also 1 word, so it would not
make sense to pass it around either; instead we can do what any reasonable
microbenchmarker would do - *hardcode stuff*:
```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(fn -> worker() end)
    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```
Now we do not pass any arguments, and instead rely on the registry to dispatch
our messages to the respective processes.
As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we are paying a slight memory cost for
these processes to keep the old version of the module in memory. Instead we can
use either a fully qualified function capture or the `spawn/3` function that
accepts an MFA (module, function name, arguments list). We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement
```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)
    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```
Tested with the following Erlang and Elixir builds:

```
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```
> Note: no JIT, as Nix on macOS currently[^currently] disables it and I didn't
> bother to enable it in the derivation (it was disabled because there were
> some issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`
The results are as follows (peak memory footprint in bytes, as reported by
`/usr/bin/time` on macOS):
| Implementation | 1k | 100k | 1M |
| -------------- | -------: | --------: | ---------: |
| Original | 45047808 | 452837376 | 4227715072 |
| Spawn | 43728896 | 318230528 | 2869723136 |
| Reduce | 43552768 | 314798080 | 2849304576 |
| Count | 43732992 | 313507840 | 2780540928 |
| Registry | 44453888 | 311988224 | 2787237888 |
| RemoteCall | 43597824 | 310595584 | 2771525632 |
As we can see, we reduced memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced memory usage
slightly, but with no such drastic changes.
Can we do better? Well, with some VM flag tinkering - of course.

You see, by default the Erlang VM creates not only the data required for
handling the process itself, but also a preallocated heap[^word]:
[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.
> | Data Type      | Memory Size                                             |
> | -------------- | ------------------------------------------------------- |
> | Erlang process | 338 words when spawned, including a heap of 233 words.  |
>
> -- <https://erlang.org/doc/efficiency_guide/advanced.html#Advanced>
As we can see, there are 105 words that are always required and 233 words that
are used for the preallocated heap. But this is microbenchmarking, so as we do
not need that much memory (because our processes basically do nothing), we can
reduce it. We do not care about time performance anyway. For that we can use
the `+hms` flag and set it to some small value, for example `1`.
In addition to the heap size, Erlang by default loads some additional data from
the BEAM files. That data is used for debugging and error reporting, but again,
we are microbenchmarking, and who needs debugging support anyway (answer:
everyone, so **do not** do this in production). Luckily for us, the VM has yet
another flag for that purpose: `+L`.
Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example to support the process registry we mentioned above). ETS tables can be
compressed, but by default they are not, as compression can slow down some
kinds of operations on such tables. Fortunately there is another flag, `+ec`,
which is documented as:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.
[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.
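Putting the three flags together, they can be passed through to the VM like
this (assuming the benchmark lives in a hypothetical `bench.exs` script):

```sh
# +hms 1 - shrink the default minimal heap size of processes to 1 word
# +L     - do not load source filename/line-number tables from BEAM files
# +ec    - force the `compressed` option on all ETS tables
elixir --erl "+hms 1 +L +ec" bench.exs
```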
With all these flags enabled we get a peak memory footprint of 996257792 bytes.
Let's compare that in more human-readable units:
| | Peak Memory Footprint for 1M processes |
| ------------------------ | -------------------------------------- |
| Original code | 3.94 GiB |
| Improved code | 2.58 GiB |
| Improved code with flags | 0.93 GiB |
Result - about a 76% reduction in peak memory usage. Not bad.

> Please, do not use ChatGPT for writing code for microbenchmarks.
The thing about *micro*benchmarking is that we write code that does as little
as possible, to show off (mostly) meaningless features of the given technology
in an abstract environment. ChatGPT cannot do that, not out of malice or
incompetence, but because it taught itself on (mostly) *good* and idiomatic
code, and microbenchmarks are rarely something that people would consider to
have these qualities. It also cannot consider other factors that [wetware][]
can take into account (like our "we do not need lists there" realisation).
[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)