+++
date = 2023-06-10
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"

[taxonomies]
tags = [
  "beam",
  "elixir",
  "erlang",
  "benchmarks",
  "programming"
]
+++

Recently, a [benchmark of concurrency implementations in different
languages][benchmark] was published. In that article [Piotr Kołaczkowski][]
used ChatGPT to generate the examples in the different languages and
benchmarked them. This was a poor choice, as I found the article and read the
Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk

```elixir
tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)
```

And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not talking about the fact that it uses 4 spaces for indentation.

For 1 million processes this code used 3.94 GiB of memory in Piotr's
benchmark, but with a little work I managed to reduce that about 4 times, to
around 0.93 GiB of RAM. In this article I will describe:

- how I did that
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd snipe people like me)

## What are Erlang processes?

Erlang is ~~well~~ known for being a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang these have a small, neat twist -
each process is isolated from the rest, and processes can communicate only via
message passing. That gives Erlang processes 2 features that are rarely
spotted in other implementations:

- Failure isolation - a bug, unhandled case, or other issue in a single
  process will not directly affect any other process in the system. The VM can
  send some messages due to a process shutdown, and other processes may be
  killed because of that, but by itself shutting down a single process will
  not cause problems in any process not related to it.
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer, there is no difference.

The above features and requirements result in some design choices, but for our
purposes only one truly matters today - each process has a separate and
(almost) independent memory stack from any other process.

### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal use. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a
bad carpenter could count the times I needed it on one hand). In general it's
a *here be dragons* area.

How is this relevant to us?

Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about a given process that can later be used for debugging purposes.
Some of the data it stores:

- Initial function that was run by the given process
- PIDs of all ancestors of the given process

Different process abstractions (like `gen_server`/`GenServer`, Elixir's
`Task`, etc.) can store even more metadata there; `logger` stores process
metadata in the process dictionary, and `rand` stores the state of its PRNGs
there. It's used quite extensively by some OTP features.
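You can peek at this metadata yourself. Below is a minimal sketch that
inspects the process dictionary of a freshly started task; the exact keys and
values depend on the OTP and Elixir versions:

```elixir
# Start a task and look at its process dictionary.
task = Task.async(fn -> :timer.sleep(10_000) end)

{:dictionary, dict} = Process.info(task.pid, :dictionary)
IO.inspect(dict)
# Expect entries along the lines of:
#   {:"$initial_call", {...}} - the function the process was started with
#   {:"$ancestors", [...]}    - PIDs of the process' ancestors
```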
### "Well behaved" OTP process

In addition to the above metadata, if a process is meant to be a "well
behaved" process in an OTP system, i.e. a process that can be observed and
debugged using OTP facilities, it must respond to some additional messages
defined by the [`sys`][] module. Without that, features like [`observer`][]
would not be able to "see" the contents of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html

## Process memory usage

As we have seen above, the `Task.async/1` function from Elixir **must** do
much more than simply "start a process and live with it". That was one of the
most important problems with the original code: it was using a system that
allocates quite substantial memory alongside the process itself, just to
operate that process. In general, that is the desirable approach (as you
**really, really, want the debugging facilities**), but in a synthetic
benchmark it reduces the usefulness of the results.

If we want to avoid that additional memory overhead in our spawned processes,
we need to go back to the more primitive functions in Erlang, namely
`erlang:spawn/1` (`Kernel.spawn/1` in Elixir). But that means we cannot use
`Task.await_many/2` anymore, so we need to work around it with a custom
function:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # A monitor is a built-in feature of Erlang that will inform you (by
    # sending a message) when the process you monitor dies. The returned
    # value is a "reference", which is simply a unique value created by the
    # VM. If the process is already dead, the message is delivered
    # immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    spawn(fn ->
      :timer.sleep(10000)
    end)
  end

Bench.await_many(tasks)
```

We have already removed one problem (well, two in fact, but we will go into
the details in the next section).
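To get a feel for the per-process difference, you can ask the VM how much
memory a single process of each kind uses. A minimal sketch; the exact
numbers will vary between OTP and Elixir versions:

```elixir
# A bare process that does (almost) nothing...
plain = spawn(fn -> :timer.sleep(10_000) end)

# ...versus a Task, which carries the OTP machinery described above.
task = Task.async(fn -> :timer.sleep(10_000) end)

# `:memory` reports the total size of the process in bytes, including
# stack, heap, and internal structures.
IO.inspect(Process.info(plain, :memory))
IO.inspect(Process.info(task.pid, :memory))
```

The second number should come out noticeably larger, for the reasons
described above.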
## All your lists are belong to us now

Erlang, like most functional programming languages, has 2 built-in sequence
types:

- Tuples - a non-growable product type of values, so you can access any field
  quite fast, but adding more values is a performance no-no
- (Singly) linked lists - a growable type (in most cases it will contain
  values of a single type, but in Erlang that is not always the case), which
  is fast to prepend to or pop data from the beginning, but do not try to do
  anything else with it if you care about performance

In this case we will focus on the second one, as tuples aren't important here
at all.

A singly linked list is a simple data structure. It's either the special value
`[]` (an empty list) or something called a "cons-cell". Cons-cells are also
simple structures - a 2-ary tuple (a tuple with 2 elements) where the first
value is the head (the value stored in that list cell) and the second one is
the "tail" of the list (aka the rest of the list). In Elixir a cons-cell is
written as `[head | tail]`. A super simple structure, as you can see, and
perfect for functional programming, as you can add new values to the list
without modifying the existing ones - so you can be immutable and fast.
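For illustration, here is a minimal sketch of how prepending shares structure
instead of copying:

```elixir
tail = [2, 3]

# Prepending allocates a single new cons-cell pointing at the existing
# list - nothing is copied, the tail is shared.
list = [1 | tail]
#=> [1, 2, 3]

# Pattern matching takes a cons-cell apart the same way.
[head | rest] = list
head #=> 1
rest #=> [2, 3], the very same list as `tail` above
```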
However, if you need to construct a sequence of a lot of values (like our list
of all tasks), then we have a problem, because Elixir promises that the list
returned from `for` will be **in order** of the values passed to it. That
means that we either need to process our data like this:

```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```

Where we build up the call stack (as we cannot have tail call optimisation
there, compiler optimisations aside). Or we need to build our list in reverse
order and then reverse it before returning (so we can have TCO):

```elixir
def map(list, func), do: do_map(list, func, [])

defp do_map([], _func, agg), do: :lists.reverse(agg)

defp do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```

Which of these approaches is more performant is irrelevant[^erlang-perf]; what
is relevant is that we need to either build up the call stack or construct our
list *twice* to conform to Elixir's promises (even though in this case we do
not care about the order of the list returned by `for`).

[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster. It's impossible to tell without more benchmarking. For more info check
out this [superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).

Of course, we could mitigate our problem by using the `Enum.reduce/3` function
(or writing our own) and end up with code like:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    pid = spawn(fn -> :timer.sleep(10000) end)

    [pid | agg]
  end)

Bench.await_many(tasks)
```

Even then, we build a list of all the PIDs.

Here I can also go back to the "second problem" I mentioned above:
`Task.await_many/2` *also constructs a list* - the list of return values from
all the processes. So not only did we construct a list of the tasks' PIDs, we
also constructed a list of the return values (which will be `:ok` for every
process, as that is what `:timer.sleep/1` returns), and then immediately
discarded all of it.

How can we do better? Notice that **all** we care about is that all
`num_tasks` processes have gone down. We do not care about any of the return
values; all we want to know is that every process we started has finished. For
that we can just send messages from the spawned processes and count the
received messages:

```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not have any lists involved and we still do what the original code
was meant to do - spawn `num_tasks` processes and wait until all of them go
down.

## Arguments copying

One more thing that we can account for here - the lambda context and data
passing between processes.

You see, we need to pass `this` (which is the PID of the parent) to each newly
spawned process. That is suboptimal, as we are looking for ways to reduce the
amount of memory used (while ignoring all other metrics at the same time). As
Erlang processes are meant to be "share nothing" processes, there is a
problem - we need to copy that PID to every process. It's just 1 word (which
means 8 bytes on 64-bit architectures, 4 bytes on 32-bit), but hey, we are
microbenchmarking, so we cut whatever we can (with 1M processes, this adds up
to about 8 MB).
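You can actually see the captured PID in the closure's environment, using the
`:erlang.fun_info/2` debugging helper (a minimal sketch; the exact shape of
the output is version-dependent):

```elixir
this = self()
fun = fn -> send(this, :done) end

# The lambda closes over `this`, so the parent PID becomes part of the fun's
# environment and is copied into every process spawned with it.
:erlang.fun_info(fun, :env)
#=> {:env, [#PID<0.106.0>]} - the actual PID value will differ
```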
Hey, we can avoid that by using yet another feature of Erlang, called the
*registry*. This is another simple feature that allows us to assign the PID of
a process to an atom, which then allows us to send messages to that process
using just the name we have given it. As atoms are also 1 word, it wouldn't
make sense to pass one as an argument either; instead we can do what any
reasonable microbenchmarker would do - *hardcode stuff*:

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    spawn(fn -> worker() end)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not pass any arguments, and instead rely on the registry to dispatch
our messages to the right process.

## One more thing

As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we are paying a slight memory cost, as
these processes can keep the old version of the module in memory. Instead, we
can use either a fully qualified function capture or the `spawn/3` function,
which accepts an MFA (module, function name, arguments list). We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

## Results

With the given Erlang build:

```txt
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```

> Note there is no JIT, as Nix on macOS currently[^currently] disables it and
> I didn't bother to enable it in the derivation (it was disabled because
> there were some issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`

The results are as follows (in bytes of peak memory footprint, as returned by
`/usr/bin/time` on macOS):

| Implementation | 1k | 100k | 1M |
| -------------- | -------: | --------: | ---------: |
| Original | 45047808 | 452837376 | 4227715072 |
| Spawn | 43728896 | 318230528 | 2869723136 |
| Reduce | 43552768 | 314798080 | 2849304576 |
| Count | 43732992 | 313507840 | 2780540928 |
| Registry | 44453888 | 311988224 | 2787237888 |
| RemoteCall | 43597824 | 310595584 | 2771525632 |

As we can see, we reduced memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced memory usage a
little more, but with no such drastic changes.

Can we do better?

Well, with some tinkering with VM flags - of course.

You see, by default the Erlang VM will not only create the data required for
handling the process itself, it will also preallocate a heap for it[^word]:

[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.

> | Data Type | Memory Size |
> | - | - |
> | … | … |
> | Erlang process | 338 words when spawned, including a heap of 233 words. |
>
> -- [Erlang Efficiency Guide: 11. Advanced](https://erlang.org/doc/efficiency_guide/advanced.html#Advanced)

As we can see, 105 of those words are strictly required, and 233 words are
used for the preallocated heap. But this is microbenchmarking, so as we do not
need that much memory (because our processes do basically nothing), we can
reduce it - we do not care about time performance anyway. For that we can use
the `+hms` flag and set it to some small value, for example `1`.

In addition to the heap size, Erlang by default loads some additional data
from the BEAM files. That data is used for debugging and error reporting, but
again, we are microbenchmarking, and who needs debugging support anyway
(answer: everyone, so **do not** do this in production). Luckily for us, the
VM has yet another flag for that purpose: `+L`.

Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example, to support the process registry we mentioned above). ETS tables can
be compressed, but by default they are not, as compression can slow down some
kinds of operations on such tables. Fortunately there is, once again, a flag -
`+ec` - with this description:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.

[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.
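These are all emulator flags, so when running the benchmark as a standalone
script they can be passed through the `elixir` CLI with `--erl` (a sketch;
`bench.exs` is a hypothetical name for the benchmark script):

```sh
# +hms 1 - default minimum heap size of 1 word for new processes
# +L     - do not load line number and filename information from BEAM files
# +ec    - force the `compressed` option on all ETS tables
elixir --erl "+hms 1 +L +ec" bench.exs
```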
444 445[ETS]: https://erlang.org/doc/man/ets.html 446 447Sounds good enough for me. 448 449With all these flags enabled we get peak memory footprint at 996257792 bytes. 450 451Compare it in more human readable units. 452 453| | Peak Memory Footprint for 1M processes | 454| ------------------------ | -------------------------------------- | 455| Original code | 3.94 GiB | 456| Improved code | 2.58 GiB | 457| Improved code with flags | 0.93 GiB | 458 459Result - about 76% of the peak memory usage reduction. Not bad. 460 461## Summary 462 463First of all: 464 465> Please, do not use ChatGPT for writing code for microbenchmarks. 466 467The thing about *micro*benchmarking is that we write code that does as little as 468possible to show (mostly) meaningless features of the given technology in 469abstract environment. ChatGPT cannot do that, not out of malice or incompetence, 470but because it used (mostly) *good* and idiomatic code to teach itself, 471microbenchmarks rarely are something that people will consider to have these 472qualities. It also cannot consider other features that [wetware][] can take into 473account (like our "we do not need lists there" thing). 474 475[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)