1{ 2 "id": "https://www.tunbury.org/2025/04/14/slurm-workload-manager", 3 "title": "Slurm Workload Manager", 4 "link": "https://www.tunbury.org/2025/04/14/slurm-workload-manager/", 5 "updated": "2025-04-14T00:00:00", 6 "published": "2025-04-14T00:00:00", 7 "summary": "Sadiq mentioned slurm as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that slurmd and slurmctld can run on the same machine.", 8 "content": "<p>Sadiq mentioned <code>slurm</code> as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that <code>slurmd</code> and <code>slurmctld</code> can run on the same machine.</p>\n\n<p>Create three VMs: <code>node1</code>, <code>node2</code> and <code>head</code>.</p>\n\n<p>On <code>head</code>, install these components.</p>\n\n<div><div><pre><code>apt <span>install </span>munge slurmd slurmctld\n</code></pre></div></div>\n\n<p>On <code>node1</code> and <code>node2</code> install.</p>\n\n<div><div><pre><code>apt <span>install </span>munge slurmd\n</code></pre></div></div>\n\n<p>Copy <code>/etc/munge/munge.key</code> from <code>head</code> to the same location on <code>node1</code> and <code>node2</code>. Then restart <code>munge</code> on the other nodes with <code>service munge restart</code>.</p>\n\n<p>You should now be able to <code>munge -n | unmunge</code> without error. This should also work via SSH. i.e. <code>ssh head munge -n | ssh node1 unmunge</code></p>\n\n<p>If you don’t have DNS, add <code>node1</code> and <code>node2</code> to the <code>/etc/hosts</code> file on <code>head</code> and add <code>head</code> to the <code>/etc/hosts</code> on <code>node1</code> and <code>node2</code>.</p>\n\n<p>On <code>head</code>, create the daemon spool directory:</p>\n\n<div><div><pre><code><span>mkdir</span> /var/spool/slurmctld\n<span>chown</span> <span>-R</span> slurm:slurm /var/spool/slurmctld/\n<span>chmod </span>775 /var/spool/slurmctld/\n</code></pre></div></div>\n\n<p>Create <code>/etc/slurm/slurm.conf</code>, as below. Update the compute node section by running <code>slurmd -C</code> on each node to generate the configuration line. This file should be propagated to all the machines. 
The configuration file can be created using this <a href=\"https://slurm.schedmd.com/configurator.html\">tool</a>.</p>\n\n<div><div><pre><code>ClusterName=cluster\nSlurmctldHost=head\nProctrackType=proctrack/linuxproc\nReturnToService=1\nSlurmctldPidFile=/var/run/slurmctld.pid\nSlurmctldPort=6817\nSlurmdPidFile=/var/run/slurmd.pid\nSlurmdPort=6818\nSlurmdSpoolDir=/var/spool/slurmd\nSlurmUser=slurm\nStateSaveLocation=/var/spool/slurmctld\nTaskPlugin=task/affinity,task/cgroup\n\n# TIMERS\nInactiveLimit=0\nKillWait=30\nMinJobAge=300\nSlurmctldTimeout=120\nSlurmdTimeout=300\nWaittime=0\n\n# SCHEDULING\nSchedulerType=sched/backfill\nSelectType=select/cons_tres\n\n# LOGGING AND ACCOUNTING\nJobCompType=jobcomp/none\nJobAcctGatherFrequency=30\nSlurmctldDebug=info\nSlurmctldLogFile=/var/log/slurmctld.log\nSlurmdDebug=info\nSlurmdLogFile=/var/log/slurmd.log\n\n# COMPUTE NODES\nNodeName=node1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1963\nNodeName=node2 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1963\nPartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP\n</code></pre></div></div>\n\n<p>On <code>head</code>, start the control daemon.</p>\n\n<div><div><pre><code>service slurmctld start\n</code></pre></div></div>\n\n<p>And on the nodes, start the slurm daemon.</p>\n\n<div><div><pre><code>service slurmd start\n</code></pre></div></div>\n\n<p>From <code>head</code>, you can now run a command simultaneously on both nodes.</p>\n\n<div><div><pre><code><span># srun -N2 -l /bin/hostname</span>\n0: node1\n1: node2\n</code></pre></div></div>\n\n<p>The optional <code>Gres</code> parameter on <code>NodeName</code> allows nodes to be configured with extra resources such as GPUs.</p>\n\n<p>Typical configurations use an NFS server to make /home available on all the nodes. Note that users only need to be created on the head node and don’t need SSH access to the compute nodes.</p>", 9 "content_type": "html", 10 "author": { 11 "name": "Mark Elvers", 12 "email": "mark.elvers@tunbury.org", 13 "uri": null 14 }, 15 "categories": [ 16 "Slurm", 17 "tunbury.org" 18 ], 19 "source": "https://www.tunbury.org/atom.xml" 20}
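The post mentions the Gres parameter without showing it in the example slurm.conf. As a rough sketch only, assuming node1 carried a single NVIDIA GPU exposed as /dev/nvidia0 (a device path chosen for illustration, not taken from the post), the GPU could be declared with a gres.conf on the node plus a Gres entry on the node's existing NodeName line:

    # /etc/slurm/gres.conf on node1 (illustrative device path)
    NodeName=node1 Name=gpu File=/dev/nvidia0

    # in the shared /etc/slurm/slurm.conf: declare the GRES type
    # and extend the existing node1 line with Gres=gpu:1
    GresTypes=gpu
    NodeName=node1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1963 Gres=gpu:1

A job would then request the device with something like srun --gres=gpu:1 nvidia-smi.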
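The closing note about serving /home over NFS can be sketched in the same spirit. Assuming head doubles as the NFS server (the package names and export options below are stock Ubuntu defaults, not taken from the post):

    # on head: export /home to the compute nodes
    apt install nfs-kernel-server
    echo '/home node1(rw,sync,no_subtree_check) node2(rw,sync,no_subtree_check)' >> /etc/exports
    exportfs -ra

    # on node1 and node2: mount it
    apt install nfs-common
    echo 'head:/home /home nfs defaults 0 0' >> /etc/fstab
    mount /home

With that in place, files written by a job on either compute node appear in the user's home directory on head.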
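Beyond the srun check in the post, a small batch job is a handy end-to-end test of the new cluster. A sketch, with the script name, job name and output pattern chosen for illustration:

    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --partition=debug
    #SBATCH --nodes=1
    #SBATCH --output=hello-%j.out

    # record which compute node ran the job
    hostname

Submitting it with sbatch hello.sh and watching squeue should show it run on one of the compute nodes, with the output file written back to the submission directory, which is one reason the shared /home matters.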