Thicket data repository for the EEG
{
  "id": "https://www.tunbury.org/2025/08/06/slurm-limits",
  "title": "Further investigations with Slurm",
  "link": "https://www.tunbury.org/2025/08/06/slurm-limits/",
  "updated": "2025-08-06T00:00:00",
  "published": "2025-08-06T00:00:00",
  "summary": "Slurm uses cgroups to constrain jobs with the specified parameters and an accounting database to track job statistics.",
  "content": "<p>Slurm uses cgroups to constrain jobs with the specified parameters and an accounting database to track job statistics.</p>\n\n<p>After the initial <a href=\"https://www.tunbury.org/2025/04/14/slurm-workload-manager/\">configuration</a> and ensuring everything is at the same <a href=\"https://www.tunbury.org/2025/07/29/slurm-versions/\">version</a>, what we really need is some shared storage between the head node and the cluster machine(s). I’m going to quickly share <code>/home</code> over NFS.</p>\n\n<p>Install an NFS server on the head node with <code>apt install nfs-kernel-server</code> and set up <code>/etc/exports</code>:</p>\n\n<div><div><pre><code>/home foo(rw,sync,no_subtree_check,no_root_squash)\n</code></pre></div></div>\n\n<p>On the cluster worker, install the NFS client with <code>apt install nfs-common</code>, and mount the home directory:</p>\n\n<div><div><pre><code>mount -t nfs head:/home/mte24 /home/mte24\n</code></pre></div></div>\n\n<p>I have deleted my user account on the cluster worker and set my UID/GID on the head node to values that do not conflict with any of those on the worker.</p>\n\n<p>With the directory shared, and signed into the head node as my user, I can run <code>sbatch ./myscript</code>.</p>\n\n<p>To configure Slurm to use cgroups, create <code>/etc/slurm/cgroup.conf</code> containing the following:</p>\n\n<div><div><pre><code>ConstrainCores=yes\nConstrainDevices=yes\nConstrainRAMSpace=yes\nConstrainSwapSpace=yes\n</code></pre></div></div>\n\n<p>Set these values in 
<code>/etc/slurm/slurm.conf</code>:</p>\n\n<div><div><pre><code>ProctrackType=proctrack/cgroup\nTaskPlugin=task/cgroup,task/affinity\nJobAcctGatherType=jobacct_gather/cgroup\nDefMemPerNode=16384\n</code></pre></div></div>\n\n<p>For accounting, we need to install a database and another Slurm daemon.</p>\n\n<div><div><pre><code>apt install mariadb-server\n</code></pre></div></div>\n\n<p>Then install <code>slurmdbd</code> with:</p>\n\n<div><div><pre><code>dpkg -i slurm-smd-slurmdbd_25.05.1-1_amd64.deb\n</code></pre></div></div>\n\n<p>Set up a database in MariaDB:</p>\n\n<div><div><pre><code>mysql -e \"CREATE DATABASE slurm_acct_db; CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'password'; GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';\"\n</code></pre></div></div>\n\n<p>Create <code>/etc/slurm/slurmdbd.conf</code>:</p>\n\n<div><div><pre><code>DbdHost=localhost\nSlurmUser=slurm\nStorageType=accounting_storage/mysql\nStorageHost=localhost\nStorageUser=slurm\nStoragePass=password\nStorageLoc=slurm_acct_db\nLogFile=/var/log/slurm/slurmdbd.log\nPidFile=/var/run/slurmdbd/slurmdbd.pid\n</code></pre></div></div>\n\n<p>Secure the file, as the password is in plain text:</p>\n\n<div><div><pre><code>chown slurm:slurm /etc/slurm/slurmdbd.conf\nchmod 600 /etc/slurm/slurmdbd.conf\n</code></pre></div></div>\n\n<p>Then add these lines to <code>slurm.conf</code>:</p>\n\n<div><div><pre><code>AccountingStorageType=accounting_storage/slurmdbd\nAccountingStoragePort=6819\nAccountingStorageEnforce=limits,qos,safe\n</code></pre></div></div>\n\n<p>Finally, we need to configure a cluster with a name that matches the name in <code>slurm.conf</code>. An account is a logical grouping, such as a department name; it is not a user account. Actual user accounts are associated with a cluster and an account. 
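As a sanity check (a sketch, assuming <code>slurmdbd</code> is up; the <code>format=</code> fields just trim the output), the resulting cluster–account–user associations can later be listed with:</p>\n\n<div><div><pre><code>sacctmgr show associations format=cluster,account,user\n</code></pre></div></div>\n\n<p>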
Therefore, a minimum configuration might be:</p>\n\n<div><div><pre><code>sacctmgr add cluster cluster\nsacctmgr add account name=eeg Organization=EEG\nsacctmgr -i create user name=mte24 cluster=cluster account=eeg\n</code></pre></div></div>\n\n<p>To test this out, create <code>script1</code> as follows:</p>\n\n<div><div><pre><code>#!/bin/bash\n# Test script\ndate\necho \"I am now running on compute node:\"\nhostname\nsleep 120\ndate\necho \"Done...\"\nexit 0\n</code></pre></div></div>\n\n<p>Then submit the job with a timeout of 30 seconds:</p>\n\n<div><div><pre><code>~$ sbatch -t 00:00:30 script1\nSubmitted batch job 10\n</code></pre></div></div>\n\n<p>The job output is in <code>slurm-10.out</code>, and we can see the completion state with <code>sacct</code>:</p>\n\n<div><div><pre><code>~$ sacct -j 10\nJobID JobName Partition Account AllocCPUS State ExitCode \n------------ ---------- ---------- ---------- ---------- ---------- -------- \n10 script1 eeg eeg 2 TIMEOUT 0:0 \n10.batch batch eeg 2 COMPLETED 0:0 \n</code></pre></div></div>\n\n<p>To run a job with specific memory and CPU limits:</p>\n\n<div><div><pre><code>sbatch --mem=32768 --cpus-per-task=64 script1\n</code></pre></div></div>\n\n<p>To cancel a job, use <code>scancel</code>.</p>\n\n<p>Slurm queues up jobs when the required resources can’t be satisfied. What remains unclear is what stops users from requesting excessive RAM and CPU per job.</p>",
  "content_type": "html",
  "author": {
    "name": "Mark Elvers",
    "email": "mark.elvers@tunbury.org",
    "uri": null
  },
  "categories": [
    "Slurm",
    "tunbury.org"
  ],
  "source": "https://www.tunbury.org/atom.xml"
}