+++
date = 2023-06-10
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"

[taxonomies]
tags = [
  "beam",
  "elixir",
  "erlang",
  "benchmarks",
  "programming"
]
+++

Recently, a [benchmark of concurrency implementations in different
languages][benchmark] was published. In that article [Piotr Kołaczkowski][]
used ChatGPT to generate the examples in the different languages and
benchmarked them. This was a poor choice, as I found the article and read the
Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk

```elixir
tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)
```

And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not talking about the fact that it uses 4 spaces for indentation.

For 1 million processes this code used 3.94 GiB of memory in Piotr's
benchmark, but with a little work I managed to reduce that about 4 times, to
around 0.93 GiB of RAM. In this article I will describe:

- how I did that
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd snipe people like me)

## What are Erlang processes?

Erlang is ~~well~~ known for being a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang these have a small, neat twist -
each process is isolated from the rest, and processes can communicate only via
message passing. That gives Erlang processes 2 features that are rarely
spotted in other implementations:

- Failure isolation - a bug, unhandled case, or other issue in a single
  process will not directly affect any other process in the system. The VM can
  send some messages due to a process shutdown, and other processes may be
  killed because of that, but by itself shutting down a single process will
  not cause problems in any process not related to it.
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer, there is no difference.

The above features and requirements result in some design choices, but for our
purposes only one truly matters today - each process has a separate and
(almost) independent memory stack from any other process.

### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal use. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a
bad carpenter could count the times I needed it on one hand). In general it's
a *here be dragons* area.

How is this relevant to us?

Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about a given process that can later be used for debugging purposes.
Some of the data it stores:

- Initial function that was run by the given process
- PIDs of all ancestors of the given process

Different process abstractions (like `gen_server`/`GenServer`, Elixir's
`Task`, etc.) can store even more metadata there; `logger` stores process
metadata in the process dictionary, and `rand` stores the state of its PRNGs
there. It's used quite extensively by some OTP features.
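You can peek at this metadata yourself. Below is a minimal sketch that
inspects the process dictionary of a freshly started task; the exact keys and
values depend on the OTP and Elixir versions:

```elixir
# Start a task and look at its process dictionary.
task = Task.async(fn -> :timer.sleep(10_000) end)

{:dictionary, dict} = Process.info(task.pid, :dictionary)
IO.inspect(dict)
# Expect entries along the lines of:
#   {:"$initial_call", {...}} - the function the process was started with
#   {:"$ancestors", [...]}    - PIDs of the process' ancestors
```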
### "Well behaved" OTP process

In addition to the above metadata, if a process is meant to be a "well
behaved" process in an OTP system, i.e. a process that can be observed and
debugged using OTP facilities, it must respond to some additional messages
defined by the [`sys`][] module. Without that, features like [`observer`][]
would not be able to "see" the contents of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html

## Process memory usage

As we have seen above, the `Task.async/1` function from Elixir **must** do
much more than simply "start a process and live with it". That was one of the
most important problems with the original code: it was using a system that
allocates quite substantial memory alongside the process itself, just to
operate that process. In general, that is the desirable approach (as you
**really, really, want the debugging facilities**), but in a synthetic
benchmark it reduces the usefulness of the results.

If we want to avoid that additional memory overhead in our spawned processes,
we need to go back to the more primitive functions in Erlang, namely
`erlang:spawn/1` (`Kernel.spawn/1` in Elixir). But that means we cannot use
`Task.await_many/2` anymore, so we need to work around it with a custom
function:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # A monitor is a built-in feature of Erlang that will inform you (by
    # sending a message) when the process you monitor dies. The returned
    # value is a "reference", which is simply a unique value created by the
    # VM. If the process is already dead, the message is delivered
    # immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # `Kernel` is imported by default, so no need for the `Kernel.` prefix
    spawn(fn ->
      :timer.sleep(10000)
    end)
  end

Bench.await_many(tasks)
```

We have already removed one problem (well, two in fact, but we will go into
the details in the next section).
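To get a feel for the per-process difference, you can ask the VM how much
memory a single process of each kind uses. A minimal sketch; the exact
numbers will vary between OTP and Elixir versions:

```elixir
# A bare process that does (almost) nothing...
plain = spawn(fn -> :timer.sleep(10_000) end)

# ...versus a Task, which carries the OTP machinery described above.
task = Task.async(fn -> :timer.sleep(10_000) end)

# `:memory` reports the total size of the process in bytes, including
# stack, heap, and internal structures.
IO.inspect(Process.info(plain, :memory))
IO.inspect(Process.info(task.pid, :memory))
```

The second number should come out noticeably larger, for the reasons
described above.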
## All your lists are belong to us now

Erlang, like most functional programming languages, has 2 built-in sequence
types:

- Tuples - a non-growable product type of values, so you can access any field
  quite fast, but adding more values is a performance no-no
- (Singly) linked lists - a growable type (in most cases it will contain
  values of a single type, but in Erlang that is not always the case), which
  is fast to prepend to or pop data from the beginning, but do not try to do
  anything else with it if you care about performance

In this case we will focus on the second one, as tuples aren't important here
at all.

A singly linked list is a simple data structure. It's either the special value
`[]` (an empty list) or something called a "cons-cell". Cons-cells are also
simple structures - a 2-ary tuple (a tuple with 2 elements) where the first
value is the head (the value stored in that list cell) and the second one is
the "tail" of the list (aka the rest of the list). In Elixir a cons-cell is
written as `[head | tail]`. A super simple structure, as you can see, and
perfect for functional programming, as you can add new values to the list
without modifying the existing ones - so you can be immutable and fast.
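For illustration, here is a minimal sketch of how prepending shares structure
instead of copying:

```elixir
tail = [2, 3]

# Prepending allocates a single new cons-cell pointing at the existing
# list - nothing is copied, the tail is shared.
list = [1 | tail]
#=> [1, 2, 3]

# Pattern matching takes a cons-cell apart the same way.
[head | rest] = list
head #=> 1
rest #=> [2, 3], the very same list as `tail` above
```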
However, if you need to construct a sequence of a lot of values (like our list
of all tasks), then we have a problem, because Elixir promises that the list
returned from `for` will be **in order** of the values passed to it. That
means that we either need to process our data like this:

```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```

Where we build up the call stack (as we cannot have tail call optimisation
there, compiler optimisations aside). Or we need to build our list in reverse
order and then reverse it before returning (so we can have TCO):

```elixir
def map(list, func), do: do_map(list, func, [])

defp do_map([], _func, agg), do: :lists.reverse(agg)

defp do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```

Which of these approaches is more performant is irrelevant[^erlang-perf]; what
is relevant is that we need to either build up the call stack or construct our
list *twice* to conform to Elixir's promises (even though in this case we do
not care about the order of the list returned by `for`).

[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster. It's impossible to tell without more benchmarking. For more info check
out this [superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).

Of course, we could mitigate our problem by using the `Enum.reduce/3` function
(or writing our own) and end up with code like:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    pid = spawn(fn -> :timer.sleep(10000) end)

    [pid | agg]
  end)

Bench.await_many(tasks)
```

Even then, we build a list of all the PIDs.

Here I can also go back to the "second problem" I mentioned above:
`Task.await_many/2` *also constructs a list* - the list of return values from
all the processes. So not only did we construct a list of the tasks' PIDs, we
also constructed a list of the return values (which will be `:ok` for every
process, as that is what `:timer.sleep/1` returns), and then immediately
discarded all of it.

How can we do better? Notice that **all** we care about is that all
`num_tasks` processes have gone down. We do not care about any of the return
values; all we want to know is that every process we started has finished. For
that we can just send messages from the spawned processes and count the
received messages:

```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not have any lists involved and we still do what the original code
was meant to do - spawn `num_tasks` processes and wait until all of them go
down.

## Arguments copying

One more thing that we can account for here - the lambda context and data
passing between processes.

You see, we need to pass `this` (which is the PID of the parent) to each newly
spawned process. That is suboptimal, as we are looking for ways to reduce the
amount of memory used (while ignoring all other metrics at the same time). As
Erlang processes are meant to be "share nothing" processes, there is a
problem - we need to copy that PID to every process. It's just 1 word (which
means 8 bytes on 64-bit architectures, 4 bytes on 32-bit), but hey, we are
microbenchmarking, so we cut whatever we can (with 1M processes, this adds up
to about 8 MB).
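You can actually see the captured PID in the closure's environment, using the
`:erlang.fun_info/2` debugging helper (a minimal sketch; the exact shape of
the output is version-dependent):

```elixir
this = self()
fun = fn -> send(this, :done) end

# The lambda closes over `this`, so the parent PID becomes part of the fun's
# environment and is copied into every process spawned with it.
:erlang.fun_info(fun, :env)
#=> {:env, [#PID<0.106.0>]} - the actual PID value will differ
```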
Hey, we can avoid that by using yet another feature of Erlang, called the
*registry*. This is another simple feature that allows us to assign the PID of
a process to an atom, which then allows us to send messages to that process
using just the name we have given it. As atoms are also 1 word, it wouldn't
make sense to pass one as an argument either; instead we can do what any
reasonable microbenchmarker would do - *hardcode stuff*:

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    spawn(fn -> worker() end)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not pass any arguments, and instead rely on the registry to dispatch
our messages to the right process.

## One more thing

As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we are paying a slight memory cost, as
these processes can keep the old version of the module in memory. Instead, we
can use either a fully qualified function capture or the `spawn/3` function,
which accepts an MFA (module, function name, arguments list). We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

## Results

With the given Erlang build:

```txt
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```

> Note there is no JIT, as Nix on macOS currently[^currently] disables it and
> I didn't bother to enable it in the derivation (it was disabled because
> there were some issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`

The results are as follows (in bytes of peak memory footprint, as returned by
`/usr/bin/time` on macOS):

| Implementation | 1k | 100k | 1M |
| -------------- | -------: | --------: | ---------: |
| Original | 45047808 | 452837376 | 4227715072 |
| Spawn | 43728896 | 318230528 | 2869723136 |
| Reduce | 43552768 | 314798080 | 2849304576 |
| Count | 43732992 | 313507840 | 2780540928 |
| Registry | 44453888 | 311988224 | 2787237888 |
| RemoteCall | 43597824 | 310595584 | 2771525632 |

As we can see, we reduced memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced memory usage a
little more, but with no such drastic changes.

Can we do better?

Well, with some tinkering with VM flags - of course.

You see, by default the Erlang VM will not only create the data required for
handling the process itself, it will also preallocate a heap for it[^word]:

[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.

> | Data Type | Memory Size |
> | - | - |
> | … | … |
> | Erlang process | 338 words when spawned, including a heap of 233 words. |
>
> -- [Erlang Efficiency Guide: 11. Advanced](https://erlang.org/doc/efficiency_guide/advanced.html#Advanced)

As we can see, 105 of those words are strictly required, and 233 words are
used for the preallocated heap. But this is microbenchmarking, so as we do not
need that much memory (because our processes do basically nothing), we can
reduce it - we do not care about time performance anyway. For that we can use
the `+hms` flag and set it to some small value, for example `1`.

In addition to the heap size, Erlang by default loads some additional data
from the BEAM files. That data is used for debugging and error reporting, but
again, we are microbenchmarking, and who needs debugging support anyway
(answer: everyone, so **do not** do this in production). Luckily for us, the
VM has yet another flag for that purpose: `+L`.

Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example, to support the process registry we mentioned above). ETS tables can
be compressed, but by default they are not, as compression can slow down some
kinds of operations on such tables. Fortunately there is, once again, a flag -
`+ec` - with this description:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.

[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.
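These are all emulator flags, so when running the benchmark as a standalone
script they can be passed through the `elixir` CLI with `--erl` (a sketch;
`bench.exs` is a hypothetical name for the benchmark script):

```sh
# +hms 1 - default minimum heap size of 1 word for new processes
# +L     - do not load line number and filename information from BEAM files
# +ec    - force the `compressed` option on all ETS tables
elixir --erl "+hms 1 +L +ec" bench.exs
```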
444 445[ETS]: https://erlang.org/doc/man/ets.html 446 447Sounds good enough for me. 448 449With all these flags enabled we get peak memory footprint at 996257792 bytes. 450 451Compare it in more human readable units. 452 453| | Peak Memory Footprint for 1M processes | 454| ------------------------ | -------------------------------------- | 455| Original code | 3.94 GiB | 456| Improved code | 2.58 GiB | 457| Improved code with flags | 0.93 GiB | 458 459Result - about 76% of the peak memory usage reduction. Not bad. 460 461## Summary 462 463First of all: 464 465> Please, do not use ChatGPT for writing code for microbenchmarks. 466 467The thing about *micro*benchmarking is that we write code that does as little as 468possible to show (mostly) meaningless features of the given technology in 469abstract environment. ChatGPT cannot do that, not out of malice or incompetence, 470but because it used (mostly) *good* and idiomatic code to teach itself, 471microbenchmarks rarely are something that people will consider to have these 472qualities. It also cannot consider other features that [wetware][] can take into 473account (like our "we do not need lists there" thing). 474 475[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)