title = "Who Watches Watchmen? - Part 1"
date = 2022-01-17T21:22:18+01:00

A lot of applications use systems like Kubernetes for their deployment. In my
humble opinion this is often overkill, as a system that offers most of what such
tools provide is already present in your OS. In this article I will try to
present how to utilise the most popular system supervisor from Elixir.

I gave a talk about this topic at CODE Beam V Americas, but I wasn't really
satisfied with it. In this post I will try to describe what my presentation was
about.

If you are wondering about the presentation, [the slides are on SpeakerDeck][slides].

[slides]: https://speakerdeck.com/hauleth/who-supervises-supervisors

Most operating systems are multi-process and multi-user systems. This has a lot
of positive aspects, like being able to do more than one thing at a time on our
devices, but it also introduces a lot of complexity that in most cases is hidden
from users and developers. These things still need to be handled one way or
another. The most basic problems are:

- some processes need to be started before the user can interact with the OS
  in a meaningful (for them) way (for example mounting filesystems, logging, etc.)
- some processes require strict startup ordering, for example you may need
  logging to be started before starting an HTTP server
- the system operator somehow needs to know when a process is ready to do its
  work, which is often some time after the process starts
- the system operator should be able to check process state when debugging
  is needed, most commonly via logs
- shutdown of processes should be handled in a way that allows other
  processes to be shut down cleanly (for example an application that uses a DB
  should be down before the DB itself)

## Why do we need a system supervisor?

A system supervisor is a process started early in the OS boot that handles
starting and managing all other processes that will run on our system. It is
often the init process (the first process started by the OS, running with PID
1\) or the first (and sometimes only) process started by the init process.
Popular examples of such supervisors (often integrated with init systems) are:

- SysV init, which is the "traditional" implementation that originates from UNIX
  System V
- BSD init, which with some variations is used in BSD-based OSes (NetBSD,
  FreeBSD); it shares some similarities with SysV init and services are
  described by shell scripts
- OpenRC, which also uses shell-based scripts for service description, used by
  Linux distributions like Gentoo or Alpine
- `launchd`, which is used on Darwin (macOS, iPadOS, iOS, watchOS) systems and uses
  XML-based `plists` for service description
- `runit`, which is a small but quite capable init and supervisor, used for
  example by Void Linux
- Upstart, created by Canonical Ltd. as a replacement for the SysV-like init system
  in Ubuntu (no longer in use in Ubuntu), still used in some distributions like
  ChromeOS or Synology NAS
- `systemd` (this is the name, not "SystemD"), which was created by a Red Hat
  employee, the (in)famous Lennart Poettering, and was later adopted by almost all
  major Linux distributions, which spawned some heated discussions about it

In this article I will focus on systemd and its approach to "new-style" system
services.

Each of the solutions mentioned above has its strong and weak points. I do not
want to start another flame war about whether systemd is good or not. It has some
good in it and some bad in it, but we can say that it has "won" over the most used
distributions, and whether we love or hate it, we need to learn how to work with
it.

`systemd` became a thing because the SysV approach to ordering service startup was
mildly irritating and non-parallelizable. In short, SysV starts processes
exactly in the lexicographical order of the files in a given directory. This meant
that even if your service didn't need the DB at all, but it somehow ended up further
down the directory listing, you still ended up waiting for the DB to start. Additionally,
SysV wasn't really monitoring services; it just assumed that when a process forked
itself to the background, it was "done" with startup and we can
continue. This is obviously not true in many cases. For example, if your
previous shutdown wasn't clean because of a power outage or another issue, then
your DB probably needs a bit of time to rebuild its state from the journal. This causes
even more slowdown for processes further down the list. This is highly
undesirable in modern, cloud-based environments, where you often start
machines on demand during autoscaling actions. When there is a spike in
traffic that requires autoscaling, the sooner a new machine is in a usable state,
the sooner it can take load off the other machines.

Different tools take different approaches to solving that issue. `systemd`
takes an approach derived from `launchd`: do not do work that is not
needed. It achieves that by integrating D-Bus into `systemd` itself, then
making all services behave like D-Bus daemons (which are started on request), and
additionally providing a bunch of triggers for those daemons. We can trigger on
actions of other services (obviously), but also on things like socket activity,
path creation/modification, mounts, connection or disconnection of a device,
and more.

This is exactly the reason for `systemd`'s infamous "feature creep": it
"digested" services like Cron or `udev`. It is not that these are
"tightly" intertwined into `systemd`; you can still replace them with their
older counterparts, you will just lose the features the integrated versions bring.

Such a lazy approach sometimes requires changes to the service itself. For
example, to let the supervisor know that you are ready (not just started), you need
some way to communicate with the supervisor. In `systemd` you can do so via the UNIX
socket pointed to by the `NOTIFY_SOCKET` environment variable passed to your
application. With the same socket you can implement another useful feature
\- a watchdog/heartbeat process. This means that if for any reason your process
becomes unresponsive (but refuses to die), then the supervisor will
forcefully bring the process down and restart it, assuming that the error was
transient.

Speaking of restarting, we can define the behaviour of the service after its main
process dies. It can be restarted regardless of the exit code, restarted only on
abnormal exit, left stopped, etc. Does this ring a bell? It works similarly to
OTP supervisors, but "one level above". If your service utilizes the system
supervisor right, you can make your application almost ultimately self-healing.

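For illustration, this is what such restart policies look like as unit-file
directives; the values below are just an example, not part of our final service
definition:

```
[Service]
# Restart on non-zero exit codes and signals, but not after a clean exit
Restart=on-failure
# Wait a bit between restarts to avoid tight crash loops
RestartSec=5s
```
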
Now that we know a little about how and why `systemd` works the way it works, we
can go into the details of how to utilize that with services written in Elixir.

As a base we will implement a super simple Plug application:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      {Plug.Cowboy.Drainer, refs: :all}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [port: String.to_integer(System.get_env("PORT", "4000"))]
  end
end
```

```elixir
# hello/router.ex
defmodule Hello.Router do
  use Plug.Router

  plug :match
  plug :dispatch

  get "/" do
    send_resp(conn, 200, "Hello World!")
  end
end
```

I will also assume that we are using a [Mix release][mix-release] named `hello`
that we later copy to `/opt/hello`.

[mix-release]: https://hexdocs.pm/mix/Mix.Tasks.Release.html

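For completeness, a minimal sketch of how such a release could be declared in
`mix.exs`; the surrounding project settings are assumptions, only the `releases:`
entry matters here:

```elixir
# mix.exs (fragment)
def project do
  [
    app: :hello,
    version: "0.1.0",
    elixir: "~> 1.12",
    # declares a release named `hello`, built with `MIX_ENV=prod mix release hello`
    releases: [hello: []]
  ]
end
```

The built release ends up in `_build/prod/rel/hello`, which is what gets copied
to `/opt/hello`.
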
### systemd unit file

We have only one thing left: we need to define our [`hello.service`][systemd.service]:

```
[Unit]
Description=Hello World service

[Service]
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
```

Now you can create a file with that content at
`/usr/local/lib/systemd/system/hello.service` and then start it with:

```
# systemctl start hello.service
```

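If systemd was already running when you created the file, it may need to re-read
its unit files first; checking the result afterwards does not hurt either:

```
# systemctl daemon-reload
# systemctl status hello.service
```
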
This is the simplest service imaginable; however, from the start we have a few
problems with it:

- It will run the service as the user running the supervisor, so if it is run by the
  global supervisor, it will run as `root`. You do not want to run anything as
  `root` unless it is absolutely necessary.
- On error it will produce a (BEAM) crash dump, which may contain sensitive data.
- It can read (and, due to being run as `root`, write) everything in the system,
  like private data of other processes.

[systemd.service]: https://www.freedesktop.org/software/systemd/man/systemd.service.html#

## Service readiness

The Erlang VM isn't really the best tool out there with respect to startup times. In
addition to that, our application may need some preparation steps before it can
be marked as "ready". This is a problem that I sometimes encounter with Docker,
where some containers do not really have any health check, and then I need to
add a check loop to some of the containers that depend on another one. This
"workaround" is frustrating, error prone, and can cause nasty Heisenbugs when
the timing is wrong.

Two possible solutions for this problem are:

- Readiness probe - another program that is run after the main process is
  started and checks whether our application is ready to work.
- Notification system - our application uses some common protocol to inform
  the supervisor that it finished its setup and is ready for work.

systemd supports the second approach via [`sd_notify`][sd_notify]. The idea
is simple: we have the `NOTIFY_SOCKET` environment variable that contains the path
to a Unix datagram socket, which we can use to send information about the state of
our application. This socket accepts a set of different messages, but right now,
for our purposes, we will focus only on a few of them:

- `READY=1` - marks our service as ready, i.e. it is ready to do its work (for
  example accept incoming HTTP connections in our case). It needs to be sent
  within a given timespan after the start of the VM, otherwise the process will be
  killed and possibly restarted.
- `STATUS=name` - sets the status of our application that can be checked via
  `systemctl status hello.service`; this allows us to have better insight into
  the high-level state without manually traversing the logs.
- `RELOADING=1` - marks that our application is reloading, which in general may
  mean a lot of things, but here it will be used to mark `:init.restart/0`-like
  behaviour (due to [erlang/otp#4698][] there is a wrapper for that function in
  the `systemd` library). The process then needs to send `READY=1` within a given
  timespan, or it will be marked as malfunctioning and will be
  forcefully killed and possibly restarted.
- `STOPPING=1` - marks that our application began its shutdown process and
  will be closing soon. If the process does not close within a given timespan, it
  will be forcefully killed.

These messages give us enough power not only to mark the service as ready,
but also to provide additional information about the system state, so even an
operator who knows little about Erlang or our application runtime will be able to
understand what is going on.

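Under the hood the protocol is just datagrams containing `KEY=value` lines sent to
that socket. The `systemd` library handles this for us, but a rough, hand-rolled
sketch (assuming OTP's `:gen_udp` support for local sockets, and ignoring details
like abstract socket addresses) could look like this:

```elixir
defmodule Notify do
  @moduledoc "Illustrative sd_notify client - in real code use the `systemd` library."

  def notify(message) do
    case System.get_env("NOTIFY_SOCKET") do
      # Not running under systemd, so there is nobody to notify
      nil ->
        :ok

      path ->
        {:ok, socket} = :gen_udp.open(0, [:local])
        :gen_udp.send(socket, {:local, path}, 0, message)
        :gen_udp.close(socket)
    end
  end
end

# Notify.notify("READY=1")
# Notify.notify("STATUS=Waiting for connections")
```
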
The main point is that systemd will delay activation of the dependants of
our service, and the `systemctl start` and `systemctl restart` commands
will block, until our service declares that it is ready.

Using this feature is quite simple:

```
[Unit]
Description=Hello World service

[Service]
# Define `Type=` as `notify`
Type=notify
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
```

Then in our supervision tree we need to add `:systemd.ready()` **after** the last
process needed for the proper functioning of our application; in our simple example
it goes after `Plug.Cowboy`:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      :systemd.ready(), # <- it is a function call, as it returns a proper child spec
      {Plug.Cowboy.Drainer, refs: :all}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [port: String.to_integer(System.get_env("PORT", "4000"))]
  end
end
```

Now restarting our service will not return immediately, but will wait until the
service declares that it is ready:

```
# systemctl restart hello.service
```

As for `STOPPING=1`, the nice thing is that the `systemd` library takes care of
it for you. As soon as the system shutdown is scheduled, this message will be
sent automatically, and the operator will be notified about that fact.

We can also provide more information about the state of our application. As you may
have already noticed, we have [`Plug.Cowboy.Drainer`][] in the tree. It is a process that
will delay the shutdown of our application while there are still open connections.
This can take some time, so it would be handy if the operator could see that the
draining is in progress. We can easily achieve that by again changing our
supervision tree to:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      :systemd.ready(),
      # `down:` statuses are reported when the given child terminates during shutdown
      :systemd.set_status(down: [status: "drained"]),
      {Plug.Cowboy.Drainer, refs: :all, shutdown: 10_000},
      :systemd.set_status(down: [status: "draining"])
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [port: String.to_integer(System.get_env("PORT", "4000"))]
  end
end
```

Now when we shut down our application with:

```
# systemctl stop hello.service
```

and there are some open connections to our service (you can simulate that with
`wrk`), then running `systemctl status hello.service` in a separate terminal
(the previous one will be blocked until our service shuts down) will show
something like:

```
● hello.service - Example Plug application
     Loaded: loaded (/usr/local/lib/systemd/system/hello.service; static; vendor preset: enabled)
     Active: deactivating (stop-sigterm) since Sat 2022-01-15 17:46:30 CET;
   Main PID: 1327 (beam.smp)
     Status: "draining"
      Tasks: 19 (limit: 1136)
```

You can notice that the `Status` is set to `"draining"`. As soon as all
connections are drained it will change to `"drained"`, and then the
application will shut down and the service will be marked as `inactive`.

[sd_notify]: https://www.freedesktop.org/software/systemd/man/sd_notify.html
[erlang/otp#4698]: https://github.com/erlang/otp/issues/4698
[`Plug.Cowboy.Drainer`]: https://hexdocs.pm/plug_cowboy/2.5.2/Plug.Cowboy.Drainer.html

The watchdog allows us to monitor our application for responsiveness (as mentioned
above). It is a simple feature that requires our application to ping systemd
within a specified interval, otherwise the application will be forcibly shut down
as malfunctioning. Fortunately for us, the `systemd` library that provides our
integration has that feature out of the box, so all we need to do to achieve the
expected result is to set the `WatchdogSec=` option in our `systemd.service` file:

```
[Unit]
Description=Hello World service

[Service]
Type=notify
WatchdogSec=1min
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
```

This configuration says that if the VM does not send a healthy message within each
1-minute interval, then the service will be marked as malfunctioning. From the
application side we can manage the state of the watchdog in several ways:

- By setting the `systemd.watchdog_check` configuration option we can configure a
  function that will be called on each check; if that function returns `true`,
  the application is considered healthy and systemd will be notified
  with a ping, if it returns `false` or fails, the ping will be omitted (see the
  sketch after this list).
- By manually sending the trigger message when a problem is detected, via
  `:systemd.watchdog(:trigger)`; it will immediately mark the service as
  malfunctioning and will trigger the action defined in the service unit file (by
  default it will restart the application).
- By disabling the built-in watchdog process via `:systemd.watchdog(:disable)` and then
  manually sending `:systemd.watchdog(:ping)` within the expected intervals.

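A minimal sketch of the first option; the exact shape of the `watchdog_check`
value (here a zero-arity function returning a boolean) and the `Hello.HealthCheck`
module are my assumptions, so consult the `systemd` library documentation for the
authoritative form:

```elixir
# config/runtime.exs
import Config

config :systemd,
  # Assumed format: called before each watchdog ping; returning `true` means
  # "healthy, send the ping", anything else means "skip the ping".
  watchdog_check: fn -> Hello.HealthCheck.healthy?() end
```
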
We should start by changing the default user and group assigned to our
process. We can do so in 2 different ways (sketched after this list):

1. Use some existing user and group by defining the `User=` and `Group=` directives
   in our service definition; or
2. Create an ephemeral user on demand before our service starts, by using the
   `DynamicUser=true` directive in the service definition.

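In unit-file terms the two options look like this (pick one; the account names are
placeholders):

```
# Option 1: run as a pre-existing account
User=hello
Group=hello

# Option 2: let systemd create an ephemeral account at service start
DynamicUser=true
```
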
I prefer the second option, as it additionally enables a lot of other security
related options, like creating a private `/tmp` directory, making the system
read-only, etc. It also has some disadvantages, like removing all of the service's
data on shutdown, however there are options (such as `StateDirectory=`) to keep
some data between runs.

In addition to that we can add `PrivateDevices=true`, which will hide all
physical devices from `/dev`, leaving only pseudo devices like `/dev/null` or
`/dev/urandom` (so you will still be able to use the `:crypto` and `:ssl` modules
without problems).

The next thing we can do is to [disable crash dumps generated by the BEAM][crash].
While not strictly needed in this case, it is worth remembering that it isn't
hard to achieve - it is just a matter of adding `Environment=ERL_CRASH_DUMP_SECONDS=0`.

Our new, more secure, `hello.service` will look like:

```
[Unit]
Description=Hello World service
Requires=network.target

[Service]
Type=notify
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
# Run as an ephemeral, unprivileged user
DynamicUser=true
# We need to add the capability to be able to bind on port 80
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
PrivateDevices=true
Environment=ERL_CRASH_DUMP_SECONDS=0
```

The problem with that configuration is that our service is now capable of
binding **any** port under 1024, so, for example, if there is some security
issue, then a malicious party can open any of the restricted ports and
serve whatever data they want there. This can be quite problematic, and the
solution for that problem will be covered in Part 2, where we will look at socket
passing and socket activation for our service.

With that we achieved a level of isolation quite close to what Docker (or another
container runtime) provides, but it does not require `overlayfs` or anything
more than what you already have on your machine. That also means that updates done
by your system package manager will be applied to all running services, so
you do not need to rebuild all your containers when a security patch is
issued for any of your dependencies.

Of course this only scratches the surface of what is possible with systemd when it
comes to hardening services. More information can be found in the [Red Hat
article][rh-systemd-hardening] and in the output of the [`systemd-analyze security`
command][systemd-analyze-security]. Possible features include (see the sketch after
this list):

- creation of private networks for your services
- disallowing creation of socket connections outside of the specified
  address families
- making only some paths readable
- hiding some paths from the process

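These map roughly to directives like the ones below; treat this as an illustrative,
non-exhaustive sketch (the path is a placeholder) rather than a recommended
configuration:

```
# Give the service its own, loopback-only network namespace
PrivateNetwork=true
# Allow sockets only in the listed address families
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# Mount most of the filesystem read-only for this service
ProtectSystem=strict
# Hide selected paths from the process entirely
InaccessiblePaths=/srv/secrets
```
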
Covering just that topic is a little bit out of scope for this blog post, so
I encourage you to read the documentation of [`systemd.exec`][systemd.exec] and
the articles mentioned above for more details.

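A quick way to see where our unit stands is to ask systemd for its exposure report:

```
$ systemd-analyze security hello.service
```

It lists the sandboxing options the unit does (and does not) use and computes an
overall exposure score.
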
[crash]: https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/crash_dumps
[rh-systemd-hardening]: https://www.redhat.com/sysadmin/mastering-systemd
[systemd-analyze-security]: https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20security%20%5BUNIT...%5D
[systemd.exec]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html

This blog post is already quite lengthy, so I have split the material into separate
parts. There will probably be 3 of them:

- [Part 1 - Basics, security, and FD passing (this one)](?1)
- Part 2 - Socket activation