Thicket data repository for the EEG
{
  "id": "https://lucasma8795.github.io/blog/2025/08/01/effects-scheduling-w05",
  "title": "Effects-based scheduling for the OCaml compiler - w05",
  "link": "https://lucasma8795.github.io/blog/2025/08/01/effects-scheduling-w05.html",
  "updated": "2025-08-01T08:00:00",
  "published": "2025-08-01T08:00:00",
  "summary": "I started the week off by fixing the parallel scheduler that I started writing at the end of last week. There was one bug that simply refused to budge, no matter how many things I threw at it (you can find the setup in last week\u2019s notes here):",
  "content": "<p>I started the week off by fixing the parallel scheduler that I started writing at the end of last week. There was one bug that simply refused to budge, no matter how many things I threw at it (you can find the setup in <a href=\"https://lucasma8795.github.io/blog/2025/07/25/effects-scheduling-w04.html\">last week\u2019s notes here</a>):</p>\n\n<div><div><pre><code>>> Fatal error: Cannot find address for: C.baz\nFatal error: exception Misc.Fatal_error\nRaised at Custom_ocamlc.handle.(fun) in file \"custom_ocamlc.ml\", line 365, characters 45-55\nCalled from Custom_ocamlc in file \"custom_ocamlc.ml\", line 685, characters 2-12\n</code></pre></div></div>\n\n<p>This happened after step 10 of the diagram from last week, during compilation of <code>A.ml</code>.</p>\n\n<p>Continuations capture everything on the call stack, but what they don\u2019t capture is the <em>global state</em> of the compiler. Thankfully, some <a href=\"https://github.com/ocaml/ocaml/pull/9963\">people</a> over at <a href=\"https://github.com/ocaml/merlin\">Merlin</a> have already added a module (<a href=\"https://ocaml.org/manual/5.2/api/compilerlibref/Local_store.html\">Local_store</a>) to the compiler, which lets them \u201csnapshot\u201d the global state of the type-checker and switch back and forth between typing different files. They do this by explicitly registering all global state with <code>s_ref: 'a -> 'a ref</code> in place of <code>ref</code>, which adds the reference to a list of global bindings. Before we start any compilation, we call <code>fresh: unit -> store</code> once, which <em>snapshots</em> the current global state as the \u201cinitial state\u201d and returns a value of the opaque <code>store</code> type, capable of holding a set of global states and initialized to that snapshot. 
The <code>store</code> is then passed to <code>with_store : store -> (unit -> 'a) -> 'a</code>, which restores the global state to the state held in the <code>store</code> for the duration of the function call and saves any changes back to the <code>store</code>. Subsequent calls to <code>fresh</code> return a fresh <code>store</code> with values obtained from the snapshot taken at the first call to <code>fresh ()</code>.</p>\n\n<p>This is huge news, because all the missing dependencies will already have been discovered by the time a file has finished type-checking, so most if not all of the global state has already been registered for us. This is what my scheduler looked like, stripping away all unnecessary details:</p>\n\n<div><div><pre><code><span>let</span> <span>suspended_tasks</span> <span>=</span> <span>Queue</span><span>.</span><span>create</span> <span>()</span>\n<span>type</span> <span>_</span> <span>Effect</span><span>.</span><span>t</span> <span>+=</span> <span>Load_path</span> <span>:</span> <span>string</span> <span>-></span> <span>string</span> <span>Effect</span><span>.</span><span>t</span>\n\n<span>(* start compilation of all .ml files *)</span>\n<span>List</span><span>.</span><span>iter</span> <span>(</span><span>fun</span> <span>ml_file</span> <span>-></span>\n <span>let</span> <span>store</span> <span>=</span> <span>fresh</span> <span>()</span> <span>in</span>\n <span>match</span> <span>with_store</span> <span>store</span> <span>(</span><span>fun</span> <span>()</span> <span>-></span> <span>compile</span> <span>ml_file</span><span>)</span> <span>with</span>\n <span>|</span> <span>()</span> <span>-></span> <span>()</span> <span>(* file compiled successfully *)</span>\n <span>|</span> <span>effect</span> <span>(</span><span>Load_path</span> <span>dep</span><span>)</span><span>,</span> <span>cont</span> <span>-></span> <span>(* dep will be a .cmi file *)</span>\n <span>begin</span> <span>try</span>\n <span>continue</span> <span>cont</span> 
<span>(</span><span>resolve_full_filename</span> <span>dep</span><span>)</span>\n <span>with</span> <span>Not_found</span> <span>-></span>\n <span>(* we hit a missing dependency, suspend the task *)</span>\n <span>let</span> <span>full_mli_file</span> <span>=</span> <span>find_interface_source</span> <span>dep</span> <span>in</span>\n <span>let</span> <span>dep</span> <span>=</span> <span>(</span><span>remove_suffix</span> <span>full_mli_file</span> <span>\".mli\"</span><span>)</span> <span>^</span> <span>\".cmi\"</span> <span>in</span>\n <span>let</span> <span>pid</span> <span>=</span> <span>compile_process_parallel</span> <span>full_mli_file</span> <span>in</span>\n <span>Queue</span><span>.</span><span>add</span> <span>(</span><span>pid</span><span>,</span> <span>cont</span><span>,</span> <span>dep</span><span>,</span> <span>store</span><span>)</span> <span>suspended_tasks</span>\n <span>end</span>\n<span>)</span> <span>files_to_compile</span>\n\n<span>(* loop over suspended tasks until we are done *)</span>\n<span>while</span> <span>not</span> <span>(</span><span>Queue</span><span>.</span><span>is_empty</span> <span>suspended_tasks</span><span>)</span> <span>do</span>\n <span>let</span> <span>(</span><span>pid</span><span>,</span> <span>cont</span><span>,</span> <span>dep</span><span>,</span> <span>store</span><span>)</span> <span>=</span> <span>Queue</span><span>.</span><span>take</span> <span>suspended_tasks</span> <span>in</span>\n <span>if</span> <span>process_finished</span> <span>pid</span> <span>then</span> <span>begin</span>\n <span>(* dependency has finished compiling, we can resume the task *)</span>\n <span>add_to_load_path</span> <span>dep</span><span>;</span>\n <span>with_store</span> <span>store</span> <span>(</span><span>fun</span> <span>()</span> <span>-></span> <span>continue</span> <span>cont</span> <span>dep</span><span>)</span>\n <span>end</span> <span>else</span>\n <span>(* re-add the task to the queue *)</span>\n <span>Queue</span><span>.</span><span>add</span> 
<span>(</span><span>pid</span><span>,</span> <span>cont</span><span>,</span> <span>dep</span><span>,</span> <span>store</span><span>)</span> <span>suspended_tasks</span>\n<span>done</span>\n</code></pre></div></div>\n\n<p>I\u2019m sure this was necessary anyway, but somehow it did not fix the issue! I then spent the better part of two whole days adding print statements all over the type-checker and staring at ridiculously long call stacks, until I came across a fairly innocuous piece of code in <code>typing/env.ml</code>:</p>\n\n<div><div><pre><code><span>let</span> <span>find_same_module</span> <span>id</span> <span>tbl</span> <span>=</span>\n <span>match</span> <span>IdTbl</span><span>.</span><span>find_same</span> <span>id</span> <span>tbl</span> <span>with</span>\n <span>|</span> <span>x</span> <span>-></span> <span>x</span>\n <span>|</span> <span>exception</span> <span>Not_found</span>\n <span>when</span> <span>Ident</span><span>.</span><span>persistent</span> <span>id</span> <span>&&</span> <span>not</span> <span>(</span><span>Current_unit</span><span>.</span><span>Name</span><span>.</span><span>is_ident</span> <span>id</span><span>)</span> <span>-></span>\n <span>Mod_persistent</span>\n</code></pre></div></div>\n\n<p>By this point I had realized that <code>B</code> was being opened successfully in <code>A</code>, going through the <code>Mod_persistent</code> code path above, but somehow <code>C</code> kept on raising <code>Not_found</code> here no matter what I did. This was quite suspicious, as their behaviour should be virtually identical. The first predicate in line 5 couldn\u2019t have been the issue, so it must have been the second that was failing. <code>Current_unit.Name</code> sounds like some mutable global state, and surely something as simple as that must have been captured by <code>Local_store</code>.</p>\n\n<p>It wasn\u2019t! 
So when we resumed compilation of <code>A</code> (in step 10), the compiler still thought it was in <code>C</code>, which explains why it couldn\u2019t find <code>C</code>: as far as it was concerned, we were already inside the module <code>C</code>. The fix was:</p>\n\n<div><div><pre><code><span>- let current_unit : Unit_info.t option ref = ref None\n</span><span>+ let current_unit : Unit_info.t option ref = s_ref None\n</span></code></pre></div></div>\n\n<p>It took me two days to add two characters to the compiler! (<a href=\"https://github.com/dra27\">David</a> told me that he once took 5 days to fix a GC bug with a change of only a couple of characters, so I guess this was bound to happen at some point\u2026)</p>\n\n<p>At this point, the entry point of the compiler was turning into an 800-line monster, so I decided to spend the rest of the week refactoring and improving the logging, in preparation for using domains as the next step.</p>",
  "content_type": "html",
  "author": {
    "name": "",
    "email": null,
    "uri": null
  },
  "categories": [
    "ocaml-effects-scheduling"
  ],
  "source": "https://lucasma8795.github.io/blog/feed/ocaml-effects-scheduling.xml"
}