Thicket data repository for the EEG
{
  "id": "https://www.tunbury.org/2025/08/06/slurm-limits",
  "title": "Further investigations with Slurm",
  "link": "https://www.tunbury.org/2025/08/06/slurm-limits/",
  "updated": "2025-08-06T00:00:00",
  "published": "2025-08-06T00:00:00",
  "summary": "Slurm uses cgroups to constrain jobs with the specified parameters and an accounting database to track job statistics.",
  "content": "<p>Slurm uses cgroups to constrain jobs with the specified parameters and an accounting database to track job statistics.</p>\n\n<p>After the initial <a href=\"https://www.tunbury.org/2025/04/14/slurm-workload-manager/\">configuration</a> and ensuring everything is at the same <a href=\"https://www.tunbury.org/2025/07/29/slurm-versions/\">version</a>, what we really need is some shared storage between the head node and the cluster machine(s). I’m going to quickly share <code>/home</code> over NFS.</p>\n\n<p>Install an NFS server on the head node with <code>apt install nfs-kernel-server</code> and set up <code>/etc/exports</code>:</p>\n\n<div><div><pre><code>/home foo(rw,sync,no_subtree_check,no_root_squash)\n</code></pre></div></div>\n\n<p>On the cluster worker, install the NFS client with <code>apt install nfs-common</code>, and mount the home directory:</p>\n\n<div><div><pre><code>mount -t nfs head:/home/mte24 /home/mte24\n</code></pre></div></div>\n\n<p>I have deleted my user account on the cluster worker and set my UID/GID on the head node to values that do not conflict with any of those on the worker.</p>\n\n<p>With the directory shared, and signed into the head node as my user, I can run <code>sbatch ./myscript</code>.</p>\n\n<p>To configure Slurm to use cgroups, create <code>/etc/slurm/cgroup.conf</code> containing the following:</p>\n\n<div><div><pre><code>ConstrainCores=yes\nConstrainDevices=yes\nConstrainRAMSpace=yes\nConstrainSwapSpace=yes\n</code></pre></div></div>\n\n<p>Set these values in 
<code>/etc/slurm/slurm.conf</code>:</p>\n\n<div><div><pre><code>ProctrackType=proctrack/cgroup\nTaskPlugin=task/cgroup,task/affinity\nJobAcctGatherType=jobacct_gather/cgroup\nDefMemPerNode=16384\n</code></pre></div></div>\n\n<p>For accounting, we need to install a database and another Slurm daemon.</p>\n\n<div><div><pre><code>apt install mariadb-server\n</code></pre></div></div>\n\n<p>Then install <code>slurmdbd</code> with:</p>\n\n<div><div><pre><code>dpkg -i slurm-smd-slurmdbd_25.05.1-1_amd64.deb\n</code></pre></div></div>\n\n<p>Set up a database in MariaDB:</p>\n\n<div><div><pre><code>mysql -e \"CREATE DATABASE slurm_acct_db; CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'password'; GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';\"\n</code></pre></div></div>\n\n<p>Create <code>/etc/slurm/slurmdbd.conf</code>:</p>\n\n<div><div><pre><code>DbdHost=localhost\nSlurmUser=slurm\nStorageType=accounting_storage/mysql\nStorageHost=localhost\nStorageUser=slurm\nStoragePass=password\nStorageLoc=slurm_acct_db\nLogFile=/var/log/slurm/slurmdbd.log\nPidFile=/var/run/slurmdbd/slurmdbd.pid\n</code></pre></div></div>\n\n<p>Secure the file, as the password is in plain text:</p>\n\n<div><div><pre><code>chown slurm:slurm /etc/slurm/slurmdbd.conf\nchmod 600 /etc/slurm/slurmdbd.conf\n</code></pre></div></div>\n\n<p>Then add these lines to <code>slurm.conf</code>:</p>\n\n<div><div><pre><code>AccountingStorageType=accounting_storage/slurmdbd\nAccountingStoragePort=6819\nAccountingStorageEnforce=limits,qos,safe\n</code></pre></div></div>\n\n<p>Finally, we need to configure a cluster with a name that matches the name in <code>slurm.conf</code>. An account is a logical grouping, such as a department name; it is not a user account. Actual user accounts are associated with a cluster and an account. 
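As a sanity check (a sketch, assuming <code>slurmdbd</code> is up; the <code>format=</code> fields just trim the output), the resulting cluster–account–user associations can later be listed with:</p>\n\n<div><div><pre><code>sacctmgr show associations format=cluster,account,user\n</code></pre></div></div>\n\n<p>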
Therefore, a minimum configuration might be:</p>\n\n<div><div><pre><code>sacctmgr add cluster cluster\nsacctmgr add account name=eeg Organization=EEG\nsacctmgr -i create user name=mte24 cluster=cluster account=eeg\n</code></pre></div></div>\n\n<p>To test this out, create <code>script1</code> as follows:</p>\n\n<div><div><pre><code>#!/bin/bash\n# Test script\ndate\necho \"I am now running on compute node:\"\nhostname\nsleep 120\ndate\necho \"Done...\"\nexit 0\n</code></pre></div></div>\n\n<p>Then submit the job with a timeout of 30 seconds:</p>\n\n<div><div><pre><code>~$ sbatch -t 00:00:30 script1\nSubmitted batch job 10\n</code></pre></div></div>\n\n<p>The job output is in <code>slurm-10.out</code>, and we can see the completion state with <code>sacct</code>:</p>\n\n<div><div><pre><code>~$ sacct -j 10\nJobID JobName Partition Account AllocCPUS State ExitCode \n------------ ---------- ---------- ---------- ---------- ---------- -------- \n10 script1 eeg eeg 2 TIMEOUT 0:0 \n10.batch batch eeg 2 COMPLETED 0:0 \n</code></pre></div></div>\n\n<p>To run a job with specific memory and CPU limits:</p>\n\n<div><div><pre><code>sbatch --mem=32768 --cpus-per-task=64 script1\n</code></pre></div></div>\n\n<p>To cancel a job, use <code>scancel</code>.</p>\n\n<p>Slurm queues up jobs when the required resources can’t be satisfied. What remains unclear is what stops users from requesting excessive RAM and CPU per job.</p>",
  "content_type": "html",
  "author": {
    "name": "Mark Elvers",
    "email": "mark.elvers@tunbury.org",
    "uri": null
  },
  "categories": [
    "Slurm",
    "tunbury.org"
  ],
  "source": "https://www.tunbury.org/atom.xml"
}