Thicket data repository for the EEG
{
  "id": "https://www.tunbury.org/2025/04/04/opam-repo-ci",
  "title": "opam repo ci job timeouts",
  "link": "https://www.tunbury.org/2025/04/04/opam-repo-ci/",
  "updated": "2025-04-04T00:00:00",
  "published": "2025-04-04T00:00:00",
  "summary": "It’s Tuesday morning, and virtually all opam repo ci jobs are failing with timeouts. This comes at a critical time as these are the first jobs following the update of ocurrent/ocaml-version noted on 24th March.",
  "content": "<p>It’s Tuesday morning, and virtually all opam repo ci jobs are failing with timeouts. This comes at a critical time as these are the first jobs following the update of <a href=\"https://github.com/ocurrent/ocaml-version\">ocurrent/ocaml-version</a> <a href=\"https://www.tunbury.org/recent-ocaml-version/\">noted</a> on 24th March.</p>\n\n<p>The <a href=\"https://opam.ci.ocaml.org/github/ocaml/opam-repository\">opam repo ci</a> tests all PRs on <a href=\"https://github.com/ocaml/opam-repository\">opam-repository</a>. The pipeline downloads Docker images containing the root filesystem for various Linux distributions, architectures, and OCaml versions, and uses them as the base environment to run the tests. These base images are created by the <a href=\"https://images.ci.ocaml.org\">base image builder</a>. 
<a href=\"https://github.com/ocurrent/docker-base-images/pull/317\">PR#317</a> updated these base images in three ways:</p>\n\n<ul>\n <li>Images for OCaml &lt; 4.08 were removed.</li>\n <li>The <code>opam-repository-archive</code> overlay was removed as this contained the &lt; 4.08 opam packages.</li>\n <li>The <code>ocaml-patches-overlay</code> overlay was removed as this was only needed to build OCaml &lt; 4.08 on GCC 14.</li>\n</ul>\n\n<p>Given these changes, I immediately assumed that one of them was the culprit.</p>\n\n<p>Here’s an example of a failure as reported in the log.</p>\n\n<div><div><pre><code>2025-04-01 07:27.45 ---&gt; using \"9dd47386dd0565c83eac2e9d589d75bdd268a7f34f3c854d1db189e7a2e5f77b\" from cache\n\n/: (user (uid 1000) (gid 1000))\n\n/: (workdir /home/opam)\n\n/home/opam: (run (shell \"sudo ln -f /usr/bin/opam-dev /usr/bin/opam\"))\n2025-04-01 07:27.45 ---&gt; using \"132d861be153666fd67b2e16b21c4de16e15e26f8d7d42f3bcddf0360ad147be\" from cache\n\n/home/opam: (run (network host)\n (shell \"opam init --reinit --config .opamrc-sandbox -ni\"))\nConfiguring from /home/opam/.opamrc-sandbox, then /home/opam/.opamrc, and finally from built-in defaults.\nChecking for available remotes: rsync and local, git.\n - you won't be able to use mercurial repositories unless you install the hg command on your system.\n - you won't be able to use darcs repositories unless you install the darcs command on your system.\n\nThis development version of opam requires an update to the layout of /home/opam/.opam from version 2.0 to version 2.2, which can't be reverted.\nYou may want to back it up before going further.\n\nContinue? [Y/n] y\n[NOTE] The 'jobs' option was reset, its value was 39 and its new value will vary according to the current number of cores on your machine. 
You can restore the fixed value using:\n opam option jobs=39 --global\nFormat upgrade done.\n\n&lt;&gt;&lt;&gt; Updating repositories &gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;&lt;&gt;\n2025-04-01 09:27.34: Cancelling: Timeout (120.0 minutes)\nJob cancelled\n2025-04-01 09:27.40: Timeout (120.0 minutes)\n</code></pre></div></div>\n\n<p>With nearly all jobs taking 2 hours to run, the cluster was understandably backlogged!</p>\n\n<p>The issue could be reproduced with the following script, which writes a Dockerfile and builds it:</p>\n\n<div><div><pre><code>cd $(mktemp -d)\ngit clone --recursive \"https://github.com/ocaml/opam-repository.git\" &amp;&amp; cd \"opam-repository\" &amp;&amp; git fetch origin \"refs/pull/27696/head\" &amp;&amp; git reset --hard 46b8cc5a\ngit fetch origin master\ngit merge --no-edit 4d8fa0fb8fce3b6c8b06f29ebcfa844c292d4f3e\ncat &gt; ../Dockerfile &lt;&lt;'END-OF-DOCKERFILE'\nFROM ocaml/opam:debian-12-ocaml-4.09@sha256:13bd7f0979922adb13049eecc387d65d7846a3058f7dd6509738933e88bc8d4a\nUSER 1000:1000\nWORKDIR /home/opam\nRUN sudo ln -f /usr/bin/opam-dev /usr/bin/opam\nRUN opam init --reinit -ni\nRUN opam option solver=builtin-0install &amp;&amp; opam config report\nENV OPAMDOWNLOADJOBS=\"1\"\nENV OPAMERRLOGLEN=\"0\"\nENV OPAMPRECISETRACKING=\"1\"\nENV CI=\"true\"\nENV OPAM_REPO_CI=\"true\"\nRUN rm -rf opam-repository/\nCOPY --chown=1000:1000 . 
opam-repository/\nRUN opam repository set-url --strict default opam-repository/\nRUN opam update --depexts || true\nRUN opam pin add -k version -yn chrome-trace.3.18.0~alpha0 3.18.0~alpha0\nRUN opam reinstall chrome-trace.3.18.0~alpha0; \\\n res=$?; \\\n test \"$res\" != 31 &amp;&amp; exit \"$res\"; \\\n export OPAMCLI=2.0; \\\n build_dir=$(opam var prefix)/.opam-switch/build; \\\n failed=$(ls \"$build_dir\"); \\\n partial_fails=\"\"; \\\n for pkg in $failed; do \\\n if opam show -f x-ci-accept-failures: \"$pkg\" | grep -qF \"\\\"debian-12\\\"\"; then \\\n echo \"A package failed and has been disabled for CI using the 'x-ci-accept-failures' field.\"; \\\n fi; \\\n test \"$pkg\" != 'chrome-trace.3.18.0~alpha0' &amp;&amp; partial_fails=\"$partial_fails $pkg\"; \\\n done; \\\n test \"${partial_fails}\" != \"\" &amp;&amp; echo \"opam-repo-ci detected dependencies failing: ${partial_fails}\"; \\\n exit 1\n\nEND-OF-DOCKERFILE\ndocker build -f ../Dockerfile .\n</code></pre></div></div>\n\n<p>It was interesting to note which jobs still worked. For example, builds on macOS and FreeBSD ran normally. This makes sense as these platforms don’t use the Docker base images. Looking further, opam repo ci attempts builds on opam 2.0, 2.1, 2.2, and 2.3 on Debian. These builds succeeded. Interesting. All the other builds use the latest version of opam built from the head of the master branch.</p>\n\n<p>Taking the failing Dockerfile above and replacing <code>sudo ln -f /usr/bin/opam-dev /usr/bin/opam</code> with <code>sudo ln -f /usr/bin/opam-2.3 /usr/bin/opam</code> immediately fixed the issue!</p>\n\n<p>I pushed commit <a href=\"https://github.com/ocurrent/opam-repo-ci/commit/7174953145735a54ecf668c7387e57b3f2d2a411\">7174953</a> to force opam repo ci to use opam 2.3 and opened <a href=\"https://github.com/ocaml/opam/issues/6448\">issue#6448</a> on ocaml/opam. 
The working theory is that some change associated with <a href=\"https://github.com/ocaml/opam/pull/5892\">PR#5892</a>, which replaces GNU patch with the OCaml patch library, is the root cause.</p>\n\n<p>While I was musing on this issue with David, using the latest tag rather than the head commit seemed like a good compromise. This lets us specifically test pre-release versions of opam once they are tagged, without being at the cutting edge and risking an impact on a key service.</p>\n\n<p>We need the latest tag by version number, not by date, as we wouldn’t want to revert to testing on, for example, 2.1.7 if something caused a new release of the 2.1 series. The result was a function which runs <code>git tag --format %(objectname) %(refname:strip=2)</code> and semantically sorts the version numbers using <code>OpamVersion.compare</code>. See <a href=\"https://github.com/ocurrent/docker-base-images/pull/318\">PR#318</a>.</p>",
  "content_type": "html",
  "author": {
    "name": "Mark Elvers",
    "email": "mark.elvers@tunbury.org",
    "uri": null
  },
  "categories": [
    "opam",
    "tunbury.org"
  ],
  "source": "https://www.tunbury.org/atom.xml"
}