Thicket data repository for the EEG
{
  "id": "https://www.tunbury.org/2025/07/08/unix-or-sys",
  "title": "Sys.readdir or Unix.readdir",
  "link": "https://www.tunbury.org/2025/07/08/unix-or-sys/",
  "updated": "2025-07-08T00:00:00",
  "published": "2025-07-08T00:00:00",
  "summary": "When you recursively scan a massive directory tree, would you use Sys.readdir or Unix.readdir? My inclination is that Sys.readdir feels more convenient to use, and thus the lower-level Unix.readdir would have the performance edge. Is it significant enough to bother with?",
  "content": "<p>When you recursively scan a massive directory tree, would you use <code>Sys.readdir</code> or <code>Unix.readdir</code>? My inclination is that <code>Sys.readdir</code> feels more convenient to use, and thus the lower-level <code>Unix.readdir</code> would have the performance edge. Is it significant enough to bother with?</p>\n\n<p>I quickly coded up both options for comparison. Here’s the <code>Unix.readdir</code> version, which calls <code>Unix.opendir</code> and then calls <code>Unix.readdir</code> repeatedly until the <code>End_of_file</code> exception is raised.</p>\n\n<div><div><pre><code><span>let</span> <span>rec</span> <span>traverse_directory_unix</span> <span>path</span> <span>x</span> <span>=</span>\n <span>let</span> <span>stats</span> <span>=</span> <span>Unix</span><span>.</span><span>lstat</span> <span>path</span> <span>in</span>\n <span>match</span> <span>stats</span><span>.</span><span>st_kind</span> <span>with</span>\n <span>|</span> <span>Unix</span><span>.</span><span>S_REG</span> <span>-&gt;</span> <span>x</span> <span>+</span> <span>1</span>\n <span>|</span> <span>S_LNK</span> <span>|</span> <span>S_CHR</span> <span>|</span> <span>S_BLK</span> <span>|</span> <span>S_FIFO</span> <span>|</span> <span>S_SOCK</span> <span>-&gt;</span> <span>x</span>\n <span>|</span> <span>S_DIR</span> <span>-&gt;</span>\n <span>try</span>\n <span>let</span> <span>dir_handle</span> <span>=</span> 
<span>Unix</span><span>.</span><span>opendir</span> <span>path</span> <span>in</span>\n <span>let</span> <span>rec</span> <span>read_entries</span> <span>acc</span> <span>=</span>\n <span>try</span>\n <span>match</span> <span>Unix</span><span>.</span><span>readdir</span> <span>dir_handle</span> <span>with</span>\n <span>|</span> <span>\".\"</span> <span>|</span> <span>\"..\"</span> <span>-&gt;</span> <span>read_entries</span> <span>acc</span>\n <span>|</span> <span>entry</span> <span>-&gt;</span>\n <span>let</span> <span>full_path</span> <span>=</span> <span>Filename</span><span>.</span><span>concat</span> <span>path</span> <span>entry</span> <span>in</span>\n <span>read_entries</span> <span>(</span><span>traverse_directory_unix</span> <span>full_path</span> <span>acc</span><span>)</span>\n <span>with</span> <span>End_of_file</span> <span>-&gt;</span>\n <span>Unix</span><span>.</span><span>closedir</span> <span>dir_handle</span><span>;</span>\n <span>acc</span>\n <span>in</span>\n <span>read_entries</span> <span>x</span>\n <span>with</span> <span>_</span> <span>-&gt;</span> <span>x</span>\n</code></pre></div></div>\n\n<p>The <code>Sys.readdir</code> version nicely gives us an array so we can idiomatically use <code>Array.fold_left</code>.</p>\n\n<div><div><pre><code><span>let</span> <span>traverse_directory_sys</span> <span>source</span> <span>=</span>\n <span>let</span> <span>rec</span> <span>process_directory</span> <span>s</span> <span>current_source</span> <span>=</span>\n <span>let</span> <span>entries</span> <span>=</span> <span>Sys</span><span>.</span><span>readdir</span> <span>current_source</span> <span>in</span>\n <span>Array</span><span>.</span><span>fold_left</span>\n <span>(</span><span>fun</span> <span>acc</span> <span>entry</span> <span>-&gt;</span>\n <span>let</span> <span>source</span> <span>=</span> <span>Filename</span><span>.</span><span>concat</span> <span>current_source</span> <span>entry</span> <span>in</span>\n <span>try</span>\n 
<span>let</span> <span>stat</span> <span>=</span> <span>Unix</span><span>.</span><span>lstat</span> <span>source</span> <span>in</span>\n <span>match</span> <span>stat</span><span>.</span><span>st_kind</span> <span>with</span>\n <span>|</span> <span>Unix</span><span>.</span><span>S_REG</span> <span>-&gt;</span> <span>acc</span> <span>+</span> <span>1</span>\n <span>|</span> <span>Unix</span><span>.</span><span>S_DIR</span> <span>-&gt;</span> <span>process_directory</span> <span>acc</span> <span>source</span>\n <span>|</span> <span>S_LNK</span> <span>|</span> <span>S_CHR</span> <span>|</span> <span>S_BLK</span> <span>|</span> <span>S_FIFO</span> <span>|</span> <span>S_SOCK</span> <span>-&gt;</span> <span>acc</span>\n <span>with</span> <span>Unix</span><span>.</span><span>Unix_error</span> <span>_</span> <span>-&gt;</span> <span>acc</span><span>)</span>\n <span>s</span> <span>entries</span>\n <span>in</span>\n <span>process_directory</span> <span>0</span> <span>source</span>\n</code></pre></div></div>\n\n<p>The file system may have a big impact, so I tested NTFS, ReFS, and ext4, running each a couple of times to ensure the cache was primed.</p>\n\n<p><code>Sys.readdir</code> was quicker in my test cases up to 500,000 files. Reaching 750,000 files, <code>Unix.readdir</code> edged ahead. 
I was surprised by the outcome and wondered whether it was my code rather than the module I used.</p>\n\n<p>Pushing for the result I expected/wanted, I rewrote the function so it more closely mirrors the <code>Sys.readdir</code> version.</p>\n\n<div><div><pre><code><span>let</span> <span>traverse_directory_unix_2</span> <span>path</span> <span>=</span>\n <span>let</span> <span>rec</span> <span>process_directory</span> <span>s</span> <span>path</span> <span>=</span>\n <span>try</span>\n <span>let</span> <span>dir_handle</span> <span>=</span> <span>Unix</span><span>.</span><span>opendir</span> <span>path</span> <span>in</span>\n <span>let</span> <span>rec</span> <span>read_entries</span> <span>acc</span> <span>=</span>\n <span>try</span>\n <span>let</span> <span>entry</span> <span>=</span> <span>Unix</span><span>.</span><span>readdir</span> <span>dir_handle</span> <span>in</span>\n <span>match</span> <span>entry</span> <span>with</span>\n <span>|</span> <span>\".\"</span> <span>|</span> <span>\"..\"</span> <span>-&gt;</span> <span>read_entries</span> <span>acc</span>\n <span>|</span> <span>entry</span> <span>-&gt;</span>\n <span>let</span> <span>full_path</span> <span>=</span> <span>Filename</span><span>.</span><span>concat</span> <span>path</span> <span>entry</span> <span>in</span>\n <span>let</span> <span>stats</span> <span>=</span> <span>Unix</span><span>.</span><span>lstat</span> <span>full_path</span> <span>in</span>\n <span>match</span> <span>stats</span><span>.</span><span>st_kind</span> <span>with</span>\n <span>|</span> <span>Unix</span><span>.</span><span>S_REG</span> <span>-&gt;</span> <span>read_entries</span> <span>(</span><span>acc</span> <span>+</span> <span>1</span><span>)</span>\n <span>|</span> <span>S_LNK</span> <span>|</span> <span>S_CHR</span> <span>|</span> <span>S_BLK</span> <span>|</span> <span>S_FIFO</span> <span>|</span> <span>S_SOCK</span> <span>-&gt;</span> <span>read_entries</span> <span>acc</span>\n <span>|</span> <span>S_DIR</span> 
<span>-&gt;</span> <span>read_entries</span> <span>(</span><span>process_directory</span> <span>acc</span> <span>full_path</span><span>)</span>\n <span>with</span> <span>End_of_file</span> <span>-&gt;</span>\n <span>Unix</span><span>.</span><span>closedir</span> <span>dir_handle</span><span>;</span>\n <span>acc</span>\n <span>in</span>\n <span>read_entries</span> <span>s</span>\n <span>with</span> <span>_</span> <span>-&gt;</span> <span>s</span>\n <span>in</span>\n <span>process_directory</span> <span>0</span> <span>path</span>\n</code></pre></div></div>\n\n<p>This version is indeed faster than <code>Sys.readdir</code> in all cases I tested. However, at 750,000 files the speedup was under 0.5%.</p>",
  "content_type": "html",
  "author": {
    "name": "Mark Elvers",
    "email": "mark.elvers@tunbury.org",
    "uri": null
  },
  "categories": [
    "ocaml",
    "tunbury.org"
  ],
  "source": "https://www.tunbury.org/atom.xml"
}