+++
date = 2023-06-10
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"

[taxonomies]
tags = [
  "beam",
  "performance"
]
+++

Recently I came across a [benchmark of concurrency implementations in different
languages][benchmark]. In that article [Piotr Kołaczkowski][] used ChatGPT to
generate the examples in the different languages and benchmarked them. This was
a poor choice, as I found the article and read the Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk

```elixir
tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)
```

And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not talking about the fact that it uses 4 spaces for indentation.

For 1 million processes this code reported 3.94 GiB of memory used by the whole
OS process in Piotr's benchmark, but with a little work I managed to reduce
that about 4 times, to around 0.93 GiB of RAM usage. In this article I will
describe:

- how I did that
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd-snipe people like me)

## What are Erlang processes?

Erlang is ~~well~~ known for being a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang they come with a small, neat twist -
each process is isolated from the rest, and processes can communicate only
via message passing.
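
Message passing is worth a quick illustration (my own minimal sketch, not from
the benchmark; the `:ping`/`:pong` protocol here is made up for the example):

```elixir
# Spawn a child process that waits for a ping and replies with a pong.
parent = self()

pid =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

send(pid, {:ping, parent})

# The processes share no memory - the reply arrives as a copied message
# in our mailbox (or we give up after a second).
reply =
  receive do
    :pong -> :pong
  after
    1000 -> :timeout
  end
```

The child knows nothing about the parent except the PID it received inside a
message.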
That gives Erlang processes 2 features that are rarely
spotted in other implementations:

- Failure isolation - a bug, an unhandled case, or any other issue in a single
  process will not directly affect any other process in the system. The VM can
  send some messages due to a process shutdown, and other processes may be
  killed because of that, but by itself shutting down a single process will not
  cause problems in any process not related to it.
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer there is no difference.

The above features and requirements result in some design choices, but for our
purpose only one is truly needed today - each process has its own heap and
stack, separate and (almost) independent from any other process.

### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal uses. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a bad
carpenter could count the cases when I needed it on a single hand). In general
it's a *here be dragons* area.

How is it relevant to us?

Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about a given process that can later be used for debugging purposes.
Some of the data it stores:

- the initial function that was run by the given process
- the PIDs of all ancestors of the given process

Different process abstractions (like `gen_server`/`GenServer`, Elixir's `Task`,
etc.) can store even more metadata there: `logger` stores process metadata in
the process dictionary, `rand` stores the state of its PRNG in the process
dictionary. It's used quite extensively by some OTP features.

### "Well behaved" OTP process

In addition to the above metadata, if a process is meant to be a "well behaved"
process in an OTP system, i.e.
a process that can be observed and debugged using OTP
facilities, it must respond to some additional messages defined by the [`sys`][]
module. Without that, features like [`observer`][] would not be able to "see"
the content of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html

## Process memory usage

As we have seen above, the `Task.async/1` function from Elixir **must** do much
more than just "start a process and live with it". That was one of the most
important problems with the original code: it was using a system that allocated
quite substantial memory alongside the process itself, just to operate that
process. In general that is a desirable approach (you **really, really, want
the debugging facilities**), but in a synthetic benchmark it skews the numbers.

If we want to avoid that additional memory overhead in our spawned processes, we
need to go back to more primitive functions in Erlang, namely `erlang:spawn/1`
(`Kernel.spawn/1` in Elixir). But that means that we cannot use
`Task.await_many/2` anymore, so we need to work around it with a custom
function:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # A monitor is a built-in feature of Erlang that will inform you (by
    # sending a message) when the process you monitor dies. The returned value
    # is a type called a "reference", which is simply a unique value produced
    # by the VM. If the process is already dead, then the message will be
    # delivered immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    spawn(fn ->
      :timer.sleep(10000)
    end)
  end

Bench.await_many(tasks)
```

We have already removed one problem (well, two in fact, but we will go into the
details in the next section).

## All your lists are belong to us now

Erlang, like most functional programming languages, has 2 built-in sequence
types:

- Tuples - a non-growable product type, so you can access any field quite fast,
  but adding more values is a performance no-no
- (Singly) linked lists - a growable type (in most cases it will hold values of
  a single type, but in Erlang that is not always the case), which is fast to
  prepend to or pop data from the beginning, but do not try to do anything else
  if you care about performance.

Here we will focus on the 2nd one, as tuples aren't important for us at all.

A singly linked list is a simple data structure. It's either the special value
`[]` (an empty list) or something called a "cons cell". Cons cells are also
simple structures - a 2-tuple (a tuple with 2 elements) where the first value is
the head - the value stored in that list cell - and the second one is the "tail"
of the list (aka the rest of the list). In Elixir a cons cell is written as
`[head | tail]`. A super simple structure, as you can see, and perfect for
functional programming, as you can add new values to a list without modifying
existing values, so you can be immutable and fast. However, if you need to
construct a sequence of a lot of values (like our list of all tasks), then we
have a problem.
That's because Elixir promises that the list returned from `for` will be
**in order** of the values passed to it. That means that we either need to
process our data like this:

```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```

where we build up the call stack (as we cannot have tail-call optimisation
there, of course sans compiler optimisations), or we need to build our list in
reverse order and then reverse it before returning (so we can have TCO):

```elixir
def map(list, func), do: do_map(list, func, [])

defp do_map([], _func, agg), do: :lists.reverse(agg)

defp do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```

Which one of these approaches is more performant is irrelevant[^erlang-perf];
what is relevant is that we need to either build up the call stack or construct
our list *twice* to be able to conform to Elixir's promises (even though in this
case we do not care about the order of the list returned by `for`).

[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster. It's impossible to tell without more benchmarking. For more info check
out [a superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).
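
As a quick sanity check (the module and function names below are mine, purely
for illustration), both shapes produce the same, in-order result:

```elixir
defmodule MapDemo do
  # Body-recursive version - the result is assembled on the call stack.
  def map_body([], _func), do: []
  def map_body([head | tail], func), do: [func.(head) | map_body(tail, func)]

  # Tail-recursive version - the result is built reversed, then reversed back.
  def map_tco(list, func), do: do_map(list, func, [])

  defp do_map([], _func, agg), do: :lists.reverse(agg)
  defp do_map([head | tail], func, agg), do: do_map(tail, func, [func.(head) | agg])
end

doubled_body = MapDemo.map_body([1, 2, 3], &(&1 * 2))
doubled_tco = MapDemo.map_tco([1, 2, 3], &(&1 * 2))
```

Both calls return `[2, 4, 6]`.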

Of course we could mitigate our problem by using the `Enum.reduce/3` function
(or writing it on our own) and end up with code like:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    pid = spawn(fn -> :timer.sleep(10000) end)

    [pid | agg]
  end)

Bench.await_many(tasks)
```

Even then, we build a list of all the PIDs.

Here I can also go back to the "second problem" I mentioned above.
`Task.await_many/2` *also constructs a list* - the list of return values from
all the processes in the list. So not only did we construct a list of the tasks'
PIDs, we also constructed a list of return values (which will be `:ok` for all
processes, as that is what `:timer.sleep/1` returns), and immediately discarded
all of that.

How can we do better? Notice that **all** we care about is that all `num_tasks`
processes have gone down. We do not care about any of the return values; all we
want is to know that all the processes that we started went down.
For that we can just send messages from the spawned processes and count the
received messages:

```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not have any lists involved, and we still do what the original task
was meant to do - spawn `num_tasks` processes and wait till all of them go down.

## Arguments copying

One more thing that we can account for here - lambda context and data passing
between processes.

You see, we need to pass `this` (which is the PID of the parent) to our newly
spawned process. That is suboptimal, as we are looking for ways to reduce the
amount of memory used (while ignoring all other metrics at the same time). As
Erlang processes are meant to be "share nothing" processes, there is a problem -
we need to copy that PID into every process. It's just 1 word (which means 8
bytes on 64-bit architectures, 4 bytes on 32-bit), but hey, we are
microbenchmarking, so we cut whatever we can (with 1M processes, this adds up to
roughly 8 MiB).

Hey, we can avoid that by using yet another feature of Erlang, called the
*registry*. This is another simple feature that allows us to assign the PID of a
process to an atom, which then allows us to send messages to that process using
just the name we have given it.
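
The mechanics can be sketched in isolation (the `:demo_parent` name is mine,
just for this example):

```elixir
# Register the current process under an atom name...
Process.register(self(), :demo_parent)

# ...so a spawned process can reach it by name alone - the closure
# captures nothing, as the atom is a compile-time literal.
spawn(fn -> send(:demo_parent, :hello_by_name) end)

msg =
  receive do
    :hello_by_name -> :hello_by_name
  after
    1000 -> :timeout
  end
```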
Atoms are also 1 word, so it wouldn't make sense
to send the name instead of the PID; what we can do instead is what any
reasonable microbenchmarker would do - *hardcode stuff*:

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(fn -> worker() end)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not pass any arguments, and instead rely on the registry to deliver
our messages to the parent process.

## One more thing

As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we are paying a slight memory cost for
these processes to keep the old module version in memory. Instead we can use
either a fully qualified function capture or the `spawn/3` function, which
accepts an MFA (module, function name, arguments list) argument.
We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

## Results

With the following Erlang build:

```txt
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```

> Note: no JIT, as Nix on macOS currently[^currently] disables it and I didn't
> bother to enable it in the derivation (it was disabled because there were some
> issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`

The results are as follows (in bytes of peak memory footprint returned by
`/usr/bin/time` on macOS):

| Implementation |       1k |      100k |         1M |
| -------------- | -------: | --------: | ---------: |
| Original       | 45047808 | 452837376 | 4227715072 |
| Spawn          | 43728896 | 318230528 | 2869723136 |
| Reduce         | 43552768 | 314798080 | 2849304576 |
| Count          | 43732992 | 313507840 | 2780540928 |
| Registry       | 44453888 | 311988224 | 2787237888 |
| RemoteCall     | 43597824 | 310595584 | 2771525632 |

As we can see, we reduced the memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced the memory usage
slightly, but with no such drastic changes.

Can we do better?

Well, with some VM flags tinkering - of course.
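
As a side note (my own quick check, not one of the measurements above), the VM
can report per-process memory directly, which hints at where the remaining
bytes live:

```elixir
pid = spawn(fn -> :timer.sleep(10_000) end)

# :memory covers the stack, the heap, and internal structures, in bytes.
{:memory, bytes} = Process.info(pid, :memory)
IO.puts("a freshly spawned process uses #{bytes} bytes")
```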

You see, by default the Erlang VM allocates a fair amount of data just for
handling the process itself[^word]:

[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.

> | Data Type | Memory Size |
> | - | - |
> | … | … |
> | Erlang process | 338 words when spawned, including a heap of 233 words. |
>
> -- [Erlang Efficiency Guide: 11. Advanced](https://erlang.org/doc/efficiency_guide/advanced.html#Advanced)

As we can see, 105 of those words are strictly required and 233 words are used
for the preallocated heap. But this is microbenchmarking, so as we do not need
that much memory (our processes basically do nothing), we can reduce it. We do
not care about time performance anyway. For that we can use the `+hms` flag and
set it to some small value, for example `1`.

In addition to the heap size, Erlang by default loads some additional data from
the BEAM files. That data is used for debugging and error reporting but, again,
we are microbenchmarking, and who needs debugging support anyway (answer:
everyone, so **do not** do this in production). Luckily for us, the VM has yet
another flag for that purpose: `+L`.

Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example to support the process registry we mentioned above). ETS tables can be
compressed, but by default they are not, as compression can slow down some kinds
of operations on such tables. Fortunately there is another flag, `+ec`, with
this description:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.

[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.

With all these flags enabled we get a peak memory footprint of 996257792 bytes.

Let's compare that in more human-readable units.

|                          | Peak Memory Footprint for 1M processes |
| ------------------------ | -------------------------------------- |
| Original code            | 3.94 GiB                               |
| Improved code            | 2.58 GiB                               |
| Improved code with flags | 0.93 GiB                               |

The result - about a 76% reduction of the peak memory usage. Not bad.

## Summary

First of all:

> Please, do not use ChatGPT for writing code for microbenchmarks.

The thing about *micro*benchmarking is that we write code that does as little as
possible, to show off (mostly) meaningless traits of a given technology in an
abstract environment. ChatGPT cannot do that, not out of malice or incompetence,
but because it was trained on (mostly) *good* and idiomatic code, and
microbenchmarks are rarely something that people would consider to have these
qualities. It also cannot take into account other factors that [wetware][] can
(like our "we do not need lists there" realisation).

[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)