# File set library

This is the internal contributor documentation.
The user documentation is [in the Nixpkgs manual](https://nixos.org/manual/nixpkgs/unstable/#sec-fileset).

## Goals

The main goal of the file set library is to be able to select local files that should be added to the Nix store.
It should have the following properties:
- Easy:
  The functions should have obvious semantics, be low in number and be composable.
- Safe:
  Throw early and helpful errors when mistakes are detected.
- Lazy:
  Only compute values when necessary.

Non-goals are:
- Efficient:
  If the abstraction proves itself worthwhile but too slow, it can still be optimized further.

## Tests

Tests are declared in [`tests.sh`](./tests.sh) and can be run using
```
./tests.sh
```

## Benchmark

A simple benchmark against the HEAD commit can be run using
```
./benchmark.sh HEAD
```

This is intended to be run manually and is not checked by CI.

## Internal representation

The internal representation is versioned in order to allow file sets from different Nixpkgs versions to be composed with each other; see [`internal.nix`](./internal.nix) for the versions and conversions between them.
This section describes only the current representation, but past versions will have to be supported by the code.

### `fileset`

An attribute set with these values:

- `_type` (constant string `"fileset"`):
  Tag to indicate this value is a file set.

- `_internalVersion` (constant `3`, the current version):
  Version of the representation.

- `_internalIsEmptyWithoutBase` (bool):
  Whether this file set is the empty file set without a base path.
  If `true`, `_internalBase*` and `_internalTree` are not set.
  This is the only way to represent an empty file set without needing a base path.

  Such a value can be used as the identity element for `union` and the return value of `unions []` and co.

- `_internalBase` (path):
  Any files outside of this path cannot influence the set of files.
  This is always a directory and should be as long as possible.
  This is used by `lib.fileset.toSource` to check that all files are under the `root` argument.

- `_internalBaseRoot` (path):
  The filesystem root of `_internalBase`, same as `(lib.path.splitRoot _internalBase).root`.
  This is here because this needs to be computed anyway, and this computation shouldn't be duplicated.

- `_internalBaseComponents` (list of strings):
  The path components of `_internalBase`, same as `lib.path.subpath.components (lib.path.splitRoot _internalBase).subpath`.
  This is here because this needs to be computed anyway, and this computation shouldn't be duplicated.

- `_internalTree` ([filesetTree](#filesettree)):
  A tree representation of all included files under `_internalBase`.

- `__noEval` (error):
  An error indicating that directly evaluating file sets is not supported.

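As a rough illustration of the fields above, the following sketch shows what the value for a file set containing every file under one directory might look like.
The path `/home/user/project` and the error message are hypothetical; the real values are constructed in [`internal.nix`](./internal.nix).

```nix
# Sketch only: hypothetical values for a file set that includes
# all files under /home/user/project
{
  _type = "fileset";
  _internalVersion = 3;
  _internalIsEmptyWithoutBase = false;

  # The base path and its precomputed decomposition
  _internalBase = /home/user/project;
  _internalBaseRoot = /.;
  _internalBaseComponents = [ "home" "user" "project" ];

  # "directory" stands for everything under the base path, recursively
  _internalTree = "directory";

  # Placeholder message, the wording used by the library may differ
  __noEval = throw "file sets cannot be evaluated directly";
}
```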
### `filesetTree`

One of the following:

- `{ <name> = filesetTree; }`:
  A directory with a nested `filesetTree` value for directory entries.
  Entries not included may either be omitted or set to `null`, as necessary to improve efficiency or laziness.

- `"directory"`:
  A directory with all its files included recursively, allowing early cutoff for some operations.
  This specific string is chosen to be compatible with `builtins.readDir` for a simpler implementation.

- `"regular"`, `"symlink"`, `"unknown"` or any other non-`"directory"` string:
  A nested file with its file type.
  These specific strings are chosen to be compatible with `builtins.readDir` for a simpler implementation.
  Distinguishing between different file types is not strictly necessary for the functionality of this library,
  but it does allow nicer printing of file sets.

- `null`:
  A file or directory that is excluded from the tree.
  It may still exist on the file system.

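The four cases above can be exercised with a small lookup over a hand-written tree.
This is only a sketch: the tree contents are made up, and `isIncluded` is a hypothetical helper that does not exist in the library.

```nix
let
  # Hypothetical tree: one file, one fully included directory,
  # and one directory with an explicitly excluded entry
  tree = {
    "README.md" = "regular";
    src = "directory";
    tests = {
      "a.nix" = "regular";
      "b.nix" = null;
    };
  };

  # Illustrative helper (not part of the library): whether the entry
  # at the given path components is included in the tree
  isIncluded = tree: components:
    if tree == null then
      # Excluded, or omitted from the parent attribute set
      false
    else if builtins.isString tree then
      # "directory" includes everything below it (early cutoff);
      # a file type string only matches when no components remain
      tree == "directory" || components == [ ]
    else if components == [ ] then
      # A partially included directory is not itself an entry
      false
    else
      isIncluded (tree.${builtins.head components} or null) (builtins.tail components);
in
{
  readme = isIncluded tree [ "README.md" ];        # true
  nested = isIncluded tree [ "src" "lib" "x.c" ];  # true
  excluded = isIncluded tree [ "tests" "b.nix" ];  # false
  omitted = isIncluded tree [ "docs" "x.md" ];     # false
}
```

Note how an omitted entry and an entry set to `null` are treated the same way, matching the description of the attribute set case.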
## API design decisions

This section justifies API design decisions.

### Internal structure

The representation of the file set data type is internal and can be changed over time.

Arguments:
- (+) The point of this library is to provide high-level functions, users don't need to be concerned with how it's implemented
- (+) It allows adjustments to the representation, which is especially useful in the early days of the library.
- (+) It still allows the representation to be stabilized later if necessary and if it has proven itself

### Influence tracking

File set operations internally track the top-most directory that could influence the exact contents of a file set.
Specifically, `toSource` requires that the given `fileset` is completely determined by files within the directory specified by the `root` argument.
For example, even with `dir/file.txt` being the only file in `./.`, `toSource { root = ./dir; fileset = ./.; }` gives an error.
This is because `fileset` may as well be the result of filtering `./.` in a way that excludes `dir`.

Arguments:
- (+) This gives us the guarantee that adding new files to a project never breaks a file set expression.
  This is also true in a lesser form for removed files:
  only removing files explicitly referenced by paths can break a file set expression.
- (+) This can be removed later, if we discover it's too restrictive
- (-) It leads to errors when a sensible result could sometimes be returned, such as in the above example.

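The example from this section can be written out as follows.
This is a sketch that assumes `<nixpkgs>` resolves to a Nixpkgs version providing `lib.fileset`, and that a directory `dir` exists next to the file containing the expression.

```nix
let
  # Assumption: <nixpkgs> points to a Nixpkgs that includes lib.fileset
  lib = import <nixpkgs/lib>;
in
{
  # Gives an error: the file set is determined by ./.,
  # which is not within the root ./dir
  rejected = lib.fileset.toSource {
    root = ./dir;
    fileset = ./.;
  };

  # Works: the file set is determined by ./dir,
  # which is within the root ./.
  accepted = lib.fileset.toSource {
    root = ./.;
    fileset = ./dir;
  };
}
```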
### Empty file set without a base

There is a special representation for an empty file set without a base path.
This is used for return values that should be empty when there is no base path that would make sense.

Arguments:
- Alternative: This could also be represented using `_internalBase = /.` and `_internalTree = null`.
  - (+) Removes the need for a special representation.
  - (-) Due to [influence tracking](#influence-tracking),
    `union empty ./.` would have `/.` as the base path,
    which would then prevent `toSource { root = ./.; fileset = union empty ./.; }` from working,
    which is not as one would expect.
  - (-) With the assumption that there can be multiple filesystem roots (as established with the [path library](../path/README.md)),
    this would have to cause an error with `union empty pathWithAnotherFilesystemRoot`,
    which is not as one would expect.
- Alternative: Do not have such a value and error when it would be needed as a return value
  - (+) Removes the need for a special representation.
  - (-) Leaves us with no identity element for `union` and no reasonable return value for `unions []`.
    From a set theory perspective, which has a well-known notion of empty sets, this is unintuitive.

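A minimal sketch of the behavior this representation is meant to allow, using `unions [ ]` to obtain the empty file set without a base path.
It assumes `<nixpkgs>` resolves to a Nixpkgs version providing `lib.fileset`.

```nix
let
  # Assumption: <nixpkgs> points to a Nixpkgs that includes lib.fileset
  lib = import <nixpkgs/lib>;
  inherit (lib.fileset) toSource union unions;
in
toSource {
  root = ./.;
  # `unions [ ]` is empty and has no base path, so it acts as the identity
  # element here: the union keeps ./. as its base path rather than /.
  fileset = union (unions [ ]) ./.;
}
```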
### No intersection for lists

While there is `intersection a b`, there is no function `intersections [ a b c ]`.

Arguments:
- (+) There is no known use case for such a function; it can be added later if a use case arises
- (+) There is no suitable return value for `intersections [ ]`, see also "Nullary intersections" [here](https://en.wikipedia.org/w/index.php?title=List_of_set_identities_and_relations&oldid=1177174035#Definitions)
  - (-) Could throw an error for that case
  - (-) Create a special value to represent "all the files" and return that
    - (+) Such a value could then not be used with `fileFilter` unless the internal representation is changed considerably
  - (-) Could return the empty file set
    - (+) This would be wrong in set theory
- (-) Inconsistent with `union` and `unions`

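If intersecting a non-empty list is ever needed, it can be expressed on the caller's side by folding `intersection` over the list.
The helper name `intersectNonEmpty` and the directories are hypothetical, and the sketch assumes `<nixpkgs>` resolves to a Nixpkgs version providing `lib.fileset`.

```nix
let
  # Assumption: <nixpkgs> points to a Nixpkgs that includes lib.fileset
  lib = import <nixpkgs/lib>;

  # Hypothetical helper, not part of lib.fileset.
  # Fails on an empty list because `builtins.head` has nothing to return,
  # which mirrors the lack of a suitable value for the nullary case.
  intersectNonEmpty = filesets:
    lib.foldl' lib.fileset.intersection (builtins.head filesets) (builtins.tail filesets);
in
lib.fileset.toSource {
  root = ./.;
  fileset = intersectNonEmpty [ ./. ./src ./src/lib ];
}
```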
### Intersection base path

The base path of the result of an `intersection` is the longest base path of the arguments.
E.g. the base path of `intersection ./foo ./foo/bar` is `./foo/bar`.
Meanwhile `intersection ./foo ./bar` returns the empty file set without a base path.

Arguments:
- Alternative: Use the common prefix of all base paths as the resulting base path
  - (-) This is unnecessarily strict, because the purpose of the base path is to track the directory under which files _could_ be in the file set. It should be as long as possible.
    All files contained in `intersection ./foo ./foo/bar` will be under `./foo/bar` (never just under `./foo`), and `intersection ./foo ./bar` will never contain any files (never under `./.`).
    This would lead to `toSource` having to unexpectedly throw errors for cases such as `toSource { root = ./foo; fileset = intersection ./foo base; }`, where `base` may be `./bar` or `./.`.
  - (-) There is no benefit to the user, since the base path is not directly exposed in the interface

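The practical effect of choosing the longest base path can be sketched as follows.
The directories `./foo` and `./foo/bar` are assumed to exist, and `<nixpkgs>` is assumed to resolve to a Nixpkgs version providing `lib.fileset`.

```nix
let
  # Assumption: <nixpkgs> points to a Nixpkgs that includes lib.fileset
  lib = import <nixpkgs/lib>;
  inherit (lib.fileset) toSource intersection;
in
toSource {
  # Accepted because the base path of the intersection is ./foo/bar.
  # With the common-prefix alternative the base path would be ./foo,
  # which is not within this root.
  root = ./foo/bar;
  fileset = intersection ./foo ./foo/bar;
}
```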
### Empty directories

File sets can only represent a _set_ of local files.
Directories on their own are not representable.

Arguments:
- (+) There does not seem to be a sensible set of combinators when directories can be represented on their own.
  Here are some possibilities:
  - `./.` represents the files in `./.` _and_ the directory itself including its subdirectories, meaning that even if there are no files, the entire structure of `./.` is preserved

    In that case, what should `fileFilter (file: false) ./.` return?
    It could return the entire directory structure unchanged, but with all files removed, which would not be what one would expect.

    Trying to have a filter function that also supports directories will lead to the question of:
    What should the behavior be if `./foo` itself is excluded but all of its contents are included?
    It leads to having to define when directories are recursed into, but then we're effectively back at how the `builtins.path`-based filters work.

  - `./.` represents all files in `./.` _and_ the directory itself, but not its subdirectories, meaning that at least `./.` will be preserved even if it's empty.

    In that case, `intersection ./. ./foo` should only include files and no directories themselves, since `./.` includes only `./.` as a directory, and same for `./foo`, so there's no overlap in directories.
    But intuitively this operation should result in the same as `./foo`; everything else is just confusing.
- (+) This matches how Git only supports files, so developers should already be used to it.
- (-) Empty directories (even if they contain nested directories) are neither representable nor preserved when coercing from paths.
  - (+) It is very rare that empty directories are necessary.
  - (+) We can implement a workaround, allowing `toSource` to take an extra argument for ensuring certain extra directories exist in the result.
- (-) It slows down store imports, since the evaluator needs to traverse the entire tree to remove any empty directories
  - (+) This can still be optimized by introducing more Nix builtins if necessary

### String paths

File sets do not support Nix store paths in strings such as `"/nix/store/...-source"`.

Arguments:
- (+) Such paths are usually produced by derivations, which means `toSource` would either:
  - Require [Import From Derivation](https://nixos.org/manual/nix/unstable/language/import-from-derivation) (IFD) if `builtins.path` is used as the underlying primitive
  - Require importing the entire `root` into the store such that derivations can be used to do the filtering
- (+) The convenient path coercion like `union ./foo ./bar` wouldn't work for absolute paths, requiring more verbose alternate interfaces:
  - `let root = "/nix/store/...-source"; in union "${root}/foo" "${root}/bar"`

    Verbose and dangerous because if `root` was a path, the entire path would get imported into the store.

  - `toSource { root = "/nix/store/...-source"; fileset = union "./foo" "./bar"; }`

    Does not allow debug printing intermediate file set contents, since we don't know the paths' contents before having a `root`.

  - `let fs = lib.fileset.withRoot "/nix/store/...-source"; in fs.union "./foo" "./bar"`

    Makes library functions impure since they depend on the contextual root path, questionable composability.

- (+) The point of the file set abstraction is to specify which files should get imported into the store.

  This use case makes little sense for files that are already in the store.
  This should be a separate abstraction as e.g. `pkgs.drvLayout` instead, which could have a similar interface but be specific to derivations.
  Additional capabilities could be supported that can't be done at evaluation time, such as renaming files, creating new directories, setting executable bits, etc.
- (+) An API for filtering/transforming Nix store paths could be much more powerful,
  because it's not limited to just what is possible at evaluation time with `builtins.path`.
  Operations such as moving and adding files would be supported.

### Single files

File sets cannot add single files to the store, they can only import files under directories.

Arguments:
- (+) There's no point in using this library for a single file, since you can't do anything other than add it to the store or not.
  And it would be unclear how the library should behave if the one file wouldn't be added to the store:
  `toSource { root = ./file.nix; fileset = <empty>; }` has no reasonable result because returning an empty store path wouldn't match the file type, and there's no way to have an empty file store path, whatever that would mean.

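What remains possible is importing a directory that contains only the one file.
A minimal sketch, assuming a file `file.nix` next to the expression and `<nixpkgs>` resolving to a Nixpkgs version providing `lib.fileset`:

```nix
let
  # Assumption: <nixpkgs> points to a Nixpkgs that includes lib.fileset
  lib = import <nixpkgs/lib>;
in
lib.fileset.toSource {
  # `root` has to be a directory; the result is a store directory
  # whose only entry is file.nix
  root = ./.;
  fileset = ./file.nix;
}
```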
### `fileFilter` takes a path

The `fileFilter` function takes a path, and not a file set, as its second argument.

- (-) Makes it harder to compose functions, since the file set type, the return value, can't be passed to the function itself like `fileFilter predicate fileset`
  - (+) It's still possible to use `intersection` to filter on file sets: `intersection fileset (fileFilter predicate ./.)`
    - (-) This does need an extra `./.` argument that's not obvious
      - (+) This could always be `/.` or the project directory, `intersection` will make it lazy
- (+) In the future this will allow `fileFilter` to support a predicate property like `subpath` and/or `components` in a reproducible way.
  This wouldn't be possible if it took a file set, because file sets don't have a predictable absolute path.
  - (-) What about the base path?
    - (+) That can change depending on which files are included, so if it's used for `fileFilter`
      it would change the `subpath`/`components` value depending on which files are included.
- (+) If necessary, this restriction can be relaxed later, the opposite wouldn't be possible

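The `intersection` pattern mentioned above could look like the following sketch.
The directories `./src` and `./tests` and the predicate are made up for illustration, and `<nixpkgs>` is assumed to resolve to a Nixpkgs version providing `lib.fileset`.

```nix
let
  # Assumption: <nixpkgs> points to a Nixpkgs that includes lib.fileset
  lib = import <nixpkgs/lib>;
  inherit (lib.fileset) toSource union intersection fileFilter;

  # Some previously constructed file set (hypothetical directories)
  fileset = union ./src ./tests;
in
toSource {
  root = ./.;
  # Filter an existing file set by intersecting it with a filtered path
  fileset = intersection fileset (fileFilter (file: file.name == "default.nix") ./.);
}
```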
## To update in the future

Here's a list of places in the library that need to be updated in the future:
- If/Once a function exists that can optionally include a path depending on whether it exists, the error message for the path not existing in `_coerce` should mention the new function