+++
date = 2023-06-10
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"

[taxonomies]
tags = [
  "beam",
  "performance"
]
+++

Recently I came across a [benchmark of concurrency implementations in different
languages][benchmark]. In that article [Piotr Kołaczkowski][] used ChatGPT to
generate the examples in the different languages and benchmarked them. This was
a poor choice, as I found the article and read the Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk

```elixir
tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)
```

And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not talking about the fact that it uses 4 spaces for indentation.

For 1 million processes this code reported 3.94 GiB of memory used by the whole
OS process in Piotr's benchmark, but with a little work I managed to reduce
that about 4 times, to around 0.93 GiB of RAM usage. In this article I will
describe:

- how I did that
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd-snipe people like me)

## What are Erlang processes?

Erlang is ~~well~~ known for being a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang they come with a small, neat twist -
each process is isolated from the rest, and processes can communicate only
via message passing.
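
Message passing is worth a quick illustration (my own minimal sketch, not from
the benchmark; the `:ping`/`:pong` protocol here is made up for the example):

```elixir
# Spawn a child process that waits for a ping and replies with a pong.
parent = self()

pid =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

send(pid, {:ping, parent})

# The processes share no memory - the reply arrives as a copied message
# in our mailbox (or we give up after a second).
reply =
  receive do
    :pong -> :pong
  after
    1000 -> :timeout
  end
```

The child knows nothing about the parent except the PID it received inside a
message.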
That gives Erlang processes 2 features that are rarely
spotted in other implementations:

- Failure isolation - a bug, an unhandled case, or any other issue in a single
  process will not directly affect any other process in the system. The VM can
  send some messages due to a process shutdown, and other processes may be
  killed because of that, but by itself shutting down a single process will not
  cause problems in any process not related to it.
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer there is no difference.

The above features and requirements result in some design choices, but for our
purpose only one is truly needed today - each process has its own heap and
stack, separate and (almost) independent from any other process.

### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal uses. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a bad
carpenter could count the cases when I needed it on a single hand). In general
it's a *here be dragons* area.

How is it relevant to us?

Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about a given process that can later be used for debugging purposes.
Some of the data it stores:

- the initial function that was run by the given process
- the PIDs of all ancestors of the given process

Different process abstractions (like `gen_server`/`GenServer`, Elixir's `Task`,
etc.) can store even more metadata there: `logger` stores process metadata in
the process dictionary, `rand` stores the state of its PRNG in the process
dictionary. It's used quite extensively by some OTP features.

### "Well behaved" OTP process

In addition to the above metadata, if a process is meant to be a "well behaved"
process in an OTP system, i.e.
a process that can be observed and debugged using OTP
facilities, it must respond to some additional messages defined by the [`sys`][]
module. Without that, features like [`observer`][] would not be able to "see"
the content of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html

## Process memory usage

As we have seen above, the `Task.async/1` function from Elixir **must** do much
more than just "start a process and live with it". That was one of the most
important problems with the original code: it was using a system that allocated
quite substantial memory alongside the process itself, just to operate that
process. In general that is a desirable approach (you **really, really, want
the debugging facilities**), but in a synthetic benchmark it skews the numbers.

If we want to avoid that additional memory overhead in our spawned processes, we
need to go back to more primitive functions in Erlang, namely `erlang:spawn/1`
(`Kernel.spawn/1` in Elixir). But that means that we cannot use
`Task.await_many/2` anymore, so we need to work around it with a custom
function:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # A monitor is a built-in feature of Erlang that will inform you (by
    # sending a message) when the process you monitor dies. The returned value
    # is a type called a "reference", which is simply a unique value produced
    # by the VM. If the process is already dead, then the message will be
    # delivered immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    spawn(fn ->
      :timer.sleep(10000)
    end)
  end

Bench.await_many(tasks)
```

We have already removed one problem (well, two in fact, but we will go into the
details in the next section).

## All your lists are belong to us now

Erlang, like most functional programming languages, has 2 built-in sequence
types:

- Tuples - a non-growable product type, so you can access any field quite fast,
  but adding more values is a performance no-no
- (Singly) linked lists - a growable type (in most cases it will hold values of
  a single type, but in Erlang that is not always the case), which is fast to
  prepend to or pop data from the beginning, but do not try to do anything else
  if you care about performance.

Here we will focus on the 2nd one, as tuples aren't important for us at all.

A singly linked list is a simple data structure. It's either the special value
`[]` (an empty list) or something called a "cons cell". Cons cells are also
simple structures - a 2-tuple (a tuple with 2 elements) where the first value is
the head - the value stored in that list cell - and the second one is the "tail"
of the list (aka the rest of the list). In Elixir a cons cell is written as
`[head | tail]`. A super simple structure, as you can see, and perfect for
functional programming, as you can add new values to a list without modifying
existing values, so you can be immutable and fast. However, if you need to
construct a sequence of a lot of values (like our list of all tasks), then we
have a problem.
That's because Elixir promises that the list returned from `for` will be
**in order** of the values passed to it. That means that we either need to
process our data like this:

```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```

where we build up the call stack (as we cannot have tail-call optimisation
there, of course sans compiler optimisations), or we need to build our list in
reverse order and then reverse it before returning (so we can have TCO):

```elixir
def map(list, func), do: do_map(list, func, [])

defp do_map([], _func, agg), do: :lists.reverse(agg)

defp do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```

Which one of these approaches is more performant is irrelevant[^erlang-perf];
what is relevant is that we need to either build up the call stack or construct
our list *twice* to be able to conform to Elixir's promises (even though in this
case we do not care about the order of the list returned by `for`).

[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster. It's impossible to tell without more benchmarking. For more info check
out [a superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).
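
As a quick sanity check (the module and function names below are mine, purely
for illustration), both shapes produce the same, in-order result:

```elixir
defmodule MapDemo do
  # Body-recursive version - the result is assembled on the call stack.
  def map_body([], _func), do: []
  def map_body([head | tail], func), do: [func.(head) | map_body(tail, func)]

  # Tail-recursive version - the result is built reversed, then reversed back.
  def map_tco(list, func), do: do_map(list, func, [])

  defp do_map([], _func, agg), do: :lists.reverse(agg)
  defp do_map([head | tail], func, agg), do: do_map(tail, func, [func.(head) | agg])
end

doubled_body = MapDemo.map_body([1, 2, 3], &(&1 * 2))
doubled_tco = MapDemo.map_tco([1, 2, 3], &(&1 * 2))
```

Both calls return `[2, 4, 6]`.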

Of course we could mitigate our problem by using the `Enum.reduce/3` function
(or writing it on our own) and end up with code like:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    pid = spawn(fn -> :timer.sleep(10000) end)

    [pid | agg]
  end)

Bench.await_many(tasks)
```

Even then, we build a list of all the PIDs.

Here I can also go back to the "second problem" I mentioned above.
`Task.await_many/2` *also constructs a list* - the list of return values from
all the processes in the list. So not only did we construct a list of the tasks'
PIDs, we also constructed a list of return values (which will be `:ok` for all
processes, as that is what `:timer.sleep/1` returns), and immediately discarded
all of that.

How can we do better? Notice that **all** we care about is that all `num_tasks`
processes have gone down. We do not care about any of the return values; all we
want is to know that all the processes that we started went down.
For that we can just send messages from the spawned processes and count the
received messages:

```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not have any lists involved, and we still do what the original task
was meant to do - spawn `num_tasks` processes and wait till all of them go down.

## Arguments copying

One more thing that we can account for here - lambda context and data passing
between processes.

You see, we need to pass `this` (which is the PID of the parent) to our newly
spawned process. That is suboptimal, as we are looking for ways to reduce the
amount of memory used (while ignoring all other metrics at the same time). As
Erlang processes are meant to be "share nothing" processes, there is a problem -
we need to copy that PID into every process. It's just 1 word (which means 8
bytes on 64-bit architectures, 4 bytes on 32-bit), but hey, we are
microbenchmarking, so we cut whatever we can (with 1M processes, this adds up to
roughly 8 MiB).

Hey, we can avoid that by using yet another feature of Erlang, called the
*registry*. This is another simple feature that allows us to assign the PID of a
process to an atom, which then allows us to send messages to that process using
just the name we have given it.
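
The mechanics can be sketched in isolation (the `:demo_parent` name is mine,
just for this example):

```elixir
# Register the current process under an atom name...
Process.register(self(), :demo_parent)

# ...so a spawned process can reach it by name alone - the closure
# captures nothing, as the atom is a compile-time literal.
spawn(fn -> send(:demo_parent, :hello_by_name) end)

msg =
  receive do
    :hello_by_name -> :hello_by_name
  after
    1000 -> :timeout
  end
```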
Atoms are also 1 word, so it wouldn't make sense
to send the name instead of the PID; what we can do instead is what any
reasonable microbenchmarker would do - *hardcode stuff*:

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(fn -> worker() end)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not pass any arguments, and instead rely on the registry to deliver
our messages to the parent process.

## One more thing

As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we are paying a slight memory cost for
these processes to keep the old module version in memory. Instead we can use
either a fully qualified function capture or the `spawn/3` function, which
accepts an MFA (module, function name, arguments list) argument.
We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok

  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)

    start(n - 1)
  end

  def await(0), do: :ok

  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

## Results

With the following Erlang build:

```txt
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```

> Note: no JIT, as Nix on macOS currently[^currently] disables it and I didn't
> bother to enable it in the derivation (it was disabled because there were some
> issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`

The results are as follows (in bytes of peak memory footprint returned by
`/usr/bin/time` on macOS):

| Implementation |       1k |      100k |         1M |
| -------------- | -------: | --------: | ---------: |
| Original       | 45047808 | 452837376 | 4227715072 |
| Spawn          | 43728896 | 318230528 | 2869723136 |
| Reduce         | 43552768 | 314798080 | 2849304576 |
| Count          | 43732992 | 313507840 | 2780540928 |
| Registry       | 44453888 | 311988224 | 2787237888 |
| RemoteCall     | 43597824 | 310595584 | 2771525632 |

As we can see, we reduced the memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced the memory usage
slightly, but with no such drastic changes.

Can we do better?

Well, with some VM flags tinkering - of course.
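
As a side note (my own quick check, not one of the measurements above), the VM
can report per-process memory directly, which hints at where the remaining
bytes live:

```elixir
pid = spawn(fn -> :timer.sleep(10_000) end)

# :memory covers the stack, the heap, and internal structures, in bytes.
{:memory, bytes} = Process.info(pid, :memory)
IO.puts("a freshly spawned process uses #{bytes} bytes")
```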

You see, by default the Erlang VM allocates a fair amount of data just for
handling the process itself[^word]:

[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.

> | Data Type | Memory Size |
> | - | - |
> | … | … |
> | Erlang process | 338 words when spawned, including a heap of 233 words. |
>
> -- [Erlang Efficiency Guide: 11. Advanced](https://erlang.org/doc/efficiency_guide/advanced.html#Advanced)

As we can see, 105 of those words are strictly required and 233 words are used
for the preallocated heap. But this is microbenchmarking, so as we do not need
that much memory (our processes basically do nothing), we can reduce it. We do
not care about time performance anyway. For that we can use the `+hms` flag and
set it to some small value, for example `1`.

In addition to the heap size, Erlang by default loads some additional data from
the BEAM files. That data is used for debugging and error reporting but, again,
we are microbenchmarking, and who needs debugging support anyway (answer:
everyone, so **do not** do this in production). Luckily for us, the VM has yet
another flag for that purpose: `+L`.

Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example to support the process registry we mentioned above). ETS tables can be
compressed, but by default they are not, as compression can slow down some kinds
of operations on such tables. Fortunately there is another flag, `+ec`, with
this description:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.

[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.

With all these flags enabled we get a peak memory footprint of 996257792 bytes.

Let's compare that in more human-readable units.

|                          | Peak Memory Footprint for 1M processes |
| ------------------------ | -------------------------------------- |
| Original code            | 3.94 GiB                               |
| Improved code            | 2.58 GiB                               |
| Improved code with flags | 0.93 GiB                               |

The result - about a 76% reduction of the peak memory usage. Not bad.

## Summary

First of all:

> Please, do not use ChatGPT for writing code for microbenchmarks.

The thing about *micro*benchmarking is that we write code that does as little as
possible, to show off (mostly) meaningless traits of a given technology in an
abstract environment. ChatGPT cannot do that, not out of malice or incompetence,
but because it was trained on (mostly) *good* and idiomatic code, and
microbenchmarks are rarely something that people would consider to have these
qualities. It also cannot take into account other factors that [wetware][] can
(like our "we do not need lists there" realisation).

[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)