+++
date = 2023-06-10
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"

[taxonomies]
tags = [
  "beam",
  "elixir",
  "erlang",
  "benchmarks",
  "programming"
]
+++

Recently, a [benchmark of concurrency implementations in different
languages][benchmark] made the rounds. In that article [Piotr Kołaczkowski][]
used ChatGPT to generate the examples in the different languages and
benchmarked them. This was a poor choice, as I found the article and read the
Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk

```elixir
tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)
```

And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not talking about the fact that it uses 4 spaces for indentation.

For 1 million processes, this code used 3.94 GiB of memory in Piotr's
benchmark, but with a little work I managed to reduce that about 4 times, to
around 0.93 GiB of RAM usage. In this article I will describe:

- how I did that
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd-snipe people like me)

## What are Erlang processes?

Erlang is ~~well~~ known for being a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang these have a small, neat twist -
each process is isolated from the rest, and processes can communicate only via
message passing. That gives Erlang processes 2 features that are rarely
spotted in other implementations:

- Failure isolation - a bug, unhandled case, or other issue in a single
  process will not directly affect any other process in the system. The VM can
  send some messages (such as exit signals) due to a process shutdown, and
  other processes may be killed because of that, but by itself, shutting down
  a single process will not cause problems in any process unrelated to it.
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer, there is no difference.

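A minimal sketch of that message-passing model (nothing here is specific to
the benchmark yet):

```elixir
parent = self()

# The spawned process shares no memory with us - the only way to get data
# back out of it is for it to send us a message.
spawn(fn -> send(parent, {:hello, self()}) end)

receive do
  {:hello, child} -> IO.puts("heard from #{inspect(child)}")
end
```
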
The above features and requirements result in some design choices, but for our
purposes only one is truly relevant today - each process has its own separate
and (almost) independent memory (heap and stack), not shared with any other
process.

### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal uses. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a
bad carpenter could count the cases when I needed it on a single hand). In
general it's *here be dragons* territory.

How is this relevant to us?

Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about a given process that can later be used for debugging purposes.
Some of the data it stores:

- the initial function that was run by the given process
- the PIDs of all ancestors of the given process

Different process abstractions (like `gen_server`/`GenServer`, Elixir's
`Task`, etc.) can store even more metadata there. `logger` stores process
metadata in the process dictionary, `rand` stores the state of its PRNG in the
process dictionary - it's used quite extensively by some OTP features.

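You can see this metadata for yourself. A minimal sketch (the exact keys, like
`:"$initial_call"` and `:"$ancestors"`, are an implementation detail and may
differ between OTP/Elixir versions):

```elixir
# Start a task and peek into its process dictionary.
task = Task.async(fn -> :timer.sleep(10_000) end)

{:dictionary, dict} = Process.info(task.pid, :dictionary)
IO.inspect(dict)
# Expect entries like `:"$initial_call"` and `:"$ancestors"` alongside
# Task-specific metadata - none of which a bare `spawn/1` would allocate.
```
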
93### "Well behaved" OTP process
94
In addition to the above metadata, if the process is meant to be a "well
behaved" process in an OTP system, i.e. a process that can be observed and
debugged using OTP facilities, it must respond to some additional messages
defined by the [`sys`][] module. Without that, features like [`observer`][]
would not be able to "see" the contents of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html

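To see these facilities in action, here is a minimal sketch using an `Agent`
(which is built on top of `GenServer`, so it speaks the `sys` protocol; a
plain `spawn`ed process would not respond):

```elixir
{:ok, pid} = Agent.start_link(fn -> %{count: 0} end)

# `:sys.get_state/1` asks the process for its state over the `sys` protocol -
# the same mechanism that `observer` uses to peek inside processes.
IO.inspect(:sys.get_state(pid))
# => %{count: 0}
```
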
## Process memory usage

As we have seen above, Elixir's `Task.async/1` function **must** do much more
than just "start a process and be done with it". That was one of the most
important problems with the original code: it was using machinery that
allocates quite substantial memory alongside the process itself, just to
operate that process. In general, that is the desirable approach (as you
**really, really want the debugging facilities**), but in a synthetic
benchmark it skews the results.

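You can observe this difference directly (a minimal sketch; the exact byte
counts depend on the OTP release and VM flags):

```elixir
task = Task.async(fn -> :timer.sleep(10_000) end)
plain = spawn(fn -> :timer.sleep(10_000) end)

# Per-process memory in bytes - heap, stack, and internal structures included.
IO.inspect(Process.info(task.pid, :memory), label: "Task.async")
IO.inspect(Process.info(plain, :memory), label: "spawn")
```
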
If we want to avoid that additional memory overhead in our spawned processes,
we need to go back to more primitive functions in Erlang, namely
`erlang:spawn/1` (`Kernel.spawn/1` in Elixir). But that means that we cannot
use `Task.await_many/2` anymore, so we need to work around it with a custom
function:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # A monitor is a built-in feature of Erlang that will inform you (by
    # sending a message) when the process you monitor dies. The returned value
    # has a type called "reference", which is simply a unique value created by
    # the VM. If the process is already dead, then the message will be
    # delivered immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    spawn(fn ->
      :timer.sleep(10000)
    end)
  end

Bench.await_many(tasks)
```

We have already removed one problem (well, two in fact, but we will go into
the details in the next section).

## All your lists belongs to us now

Erlang, like most functional programming languages, has 2 built-in sequence
types:

- Tuples - a non-growable product type; you can access any field quite fast,
  but adding more values is a performance no-no.
- (Singly) linked lists - a growable type (in most cases it will hold values
  of a single type, but in Erlang that is not always the case), which is fast
  to prepend to or pop data from at the beginning, but do not try to do
  anything else with it if you care about performance.

In this case we will focus on the 2nd one, as tuples aren't important here at
all.

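A quick illustration of both in Elixir (a sketch; `elem/2` and `hd/1` are
plain `Kernel` functions):

```elixir
tuple = {1, 2, 3}
elem(tuple, 1) # constant-time access to any field => 2

list = [1, 2, 3]
[0 | list]     # constant-time prepend => [0, 1, 2, 3]
hd(list)       # constant-time access to the head => 1
```
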
A singly linked list is a simple data structure. It's either the special value
`[]` (an empty list) or something called a "cons cell". Cons cells are also
simple structures - a 2-ary tuple (a tuple with 2 elements) where the first
value is the head (the value in the list cell), and the other one is the
"tail" of the list (aka the rest of the list). In Elixir a cons cell is
denoted like this: `[head | tail]`. A super simple structure, as you can see,
and perfect for functional programming, as you can add new values to the list
without modifying existing values, so you can be immutable and fast. However,
if you need to construct a sequence of a lot of values (like our list of all
tasks), then we have a problem, because Elixir promises that the list returned
from `for` will be **in order** of the values passed to it. That means that we
either need to process our data like this:

```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```

Here we build up the call stack (as we cannot have tail-call optimisation
there, of course sans compiler optimisations). Or we need to build our list in
reverse order, and then reverse it before returning (so we can have TCO):

```elixir
def map(list, func), do: do_map(list, func, [])

defp do_map([], _func, agg), do: :lists.reverse(agg)

defp do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```

Which of these approaches is more performant is irrelevant[^erlang-perf]; what
is relevant is that we need to either build up the call stack or construct our
list *twice* to be able to conform to Elixir's promises (even though in this
case we do not care about the order of the list returned by `for`).

[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster. It's impossible to tell without more benchmarking. For more info check
out the [superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).

Of course we could mitigate our problem by using the `Enum.reduce/3` function
(or writing it on our own) and end up with code like:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    pid = spawn(fn -> :timer.sleep(10000) end)

    [pid | agg]
  end)

Bench.await_many(tasks)
```

Even then, we build a list of all the PIDs.

Here I can also go back to the "second problem" I mentioned above.
`Task.await_many/2` *also constructs a list* - the list of return values from
all the processes. So not only did we construct a list of the tasks' PIDs, we
also constructed a list of return values (which will be `:ok` for all
processes, as that is what `:timer.sleep/1` returns), and immediately
discarded all of that.

How can we do better? Notice that **all** we care about is that all
`num_tasks` processes have gone down. We do not care about any of the return
values; all we want to know is that every process we started went down. For
that we can just send messages from the spawned processes and count the
received messages:

```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```

284meant to do - spawn `num_tasks` processes and wait till all go down.
285
286## Arguments copying
287
288One another thing that we can account there - lambda context and data passing
289between processes.
290
291You see, we need to pass `this` (which is PID of the parent) to our newly
292spawned process. That is suboptimal, as we are looking for the way to reduce
293amount of the memory (and ignore all other metrics at the same time). As Erlang
294processes are meant to be "share nothing" type of processes there is problem -
295we need to copy that PID to all processes. it's just 1 word (which mean 8 bytes
296on 64-bit architectures, 4 bytes on 32-bit), but hey, we are microbenchmarking,
297so we cut whatever we can (with 1M processes, this adds up to 8 MiBs).
298
We can avoid that by using yet another feature of Erlang, called the
*registry*. This is another simple feature that allows us to assign a PID to
an atom, which then allows us to send messages to that process using just the
name we have given it. An atom is also 1 word, so it would not make sense to
pass the name as an argument either; instead we can do what any reasonable
microbenchmarker would do - *hardcode stuff*:

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(fn -> worker() end)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not pass any arguments, and instead rely on the registry to dispatch
our messages to the parent process.

## One more thing

As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we are paying a slight memory cost for
these processes keeping the old module version in memory. Instead we can use
either a fully qualified function capture or the `spawn/3` function that
accepts an MFA (module, function name, arguments list). We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

## Results

With the following Erlang build:

```txt
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```

> Note no JIT, as Nix on macOS currently[^currently] disables it and I didn't
> bother to enable it in the derivation (it was disabled because there were
> some issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`

The results are as follows (in bytes of peak memory footprint returned by
`/usr/bin/time` on macOS):

| Implementation | 1k       | 100k      | 1M         |
| -------------- | -------: | --------: | ---------: |
| Original       | 45047808 | 452837376 | 4227715072 |
| Spawn          | 43728896 | 318230528 | 2869723136 |
| Reduce         | 43552768 | 314798080 | 2849304576 |
| Count          | 43732992 | 313507840 | 2780540928 |
| Registry       | 44453888 | 311988224 | 2787237888 |
| RemoteCall     | 43597824 | 310595584 | 2771525632 |

As we can see, we reduced memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced memory usage
slightly, but with no such drastic changes.

Can we do better?

Well, with some tinkering with VM flags - of course.

You see, by default the Erlang VM allocates more than just the data required
for handling the process itself[^word]:

[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.

> | Data Type      | Memory Size                                            |
> | -------------- | ------------------------------------------------------ |
> | …              | …                                                      |
> | Erlang process | 338 words when spawned, including a heap of 233 words. |
>
> -- <https://erlang.org/doc/efficiency_guide/advanced.html#Advanced>

As we can see, there are 105 words that are strictly required and 233 words
used for the preallocated heap. But this is microbenchmarking, so as we do not
need that much memory (because our processes basically do nothing), we can
reduce it. We do not care about time performance anyway. For that we can use
the `+hms` flag and set it to some small value, for example `1`.

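You can check the default that `+hms` overrides from inside the VM (a sketch):

```elixir
# The minimum heap size (in words) that every new process starts with -
# the 233 words from the table above.
IO.inspect(:erlang.system_info(:min_heap_size))
# => {:min_heap_size, 233}
```
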
In addition to the heap size, Erlang by default loads some additional data
from the BEAM files. That data is used for debugging and error reporting, but
again, we are microbenchmarking, and who needs debugging support anyway
(answer: everyone, so **do not** do this in production). Luckily for us, the
VM has yet another flag for that purpose: `+L`.

Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example to support the process registry we mentioned above). ETS tables can be
compressed, but by default they are not, as compression can slow down some
kinds of operations on such tables. Fortunately there is another flag, `+ec`,
described as:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.

[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.

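For reference, a sketch of how these emulator flags can be passed to an Elixir
script (`bench.exs` is a hypothetical file name; `--erl` forwards the flags to
the underlying BEAM):

```sh
elixir --erl "+hms 1 +L +ec" bench.exs
```
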
With all these flags enabled, we get a peak memory footprint of 996257792
bytes.

Let's compare these in more human-readable units:

|                          | Peak Memory Footprint for 1M processes |
| ------------------------ | --------------------------------------- |
| Original code            | 3.94 GiB                                 |
| Improved code            | 2.58 GiB                                 |
| Improved code with flags | 0.93 GiB                                 |

Result: about a 76% reduction in peak memory usage. Not bad.

## Summary

First of all:

> Please, do not use ChatGPT for writing code for microbenchmarks.

The thing about *micro*benchmarking is that we write code that does as little
as possible to show off (mostly) meaningless features of a given technology in
an abstract environment. ChatGPT cannot do that, not out of malice or
incompetence, but because it taught itself on (mostly) *good* and idiomatic
code, and microbenchmarks are rarely something that people would consider to
have these qualities. It also cannot consider other factors that [wetware][]
can take into account (like our "we do not need lists there" thing).

[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)