1{ 2 "id": "https://www.tunbury.org/2025/04/14/slurm-workload-manager", 3 "title": "Slurm Workload Manager", 4 "link": "https://www.tunbury.org/2025/04/14/slurm-workload-manager/", 5 "updated": "2025-04-14T00:00:00", 6 "published": "2025-04-14T00:00:00", 7 "summary": "Sadiq mentioned slurm as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that slurmd and slurmctld can run on the same machine.", 8 "content": "<p>Sadiq mentioned <code>slurm</code> as a possible way to better schedule the group’s compute resources. Many resources are available showing how to create batch jobs for Slurm clusters but far fewer on how to set up a cluster. This is a quick walkthrough of the basic steps to set up a two-node compute cluster on Ubuntu 24.04. Note that <code>slurmd</code> and <code>slurmctld</code> can run on the same machine.</p>\n\n<p>Create three VMs: <code>node1</code>, <code>node2</code> and <code>head</code>.</p>\n\n<p>On <code>head</code>, install these components.</p>\n\n<div><div><pre><code>apt <span>install </span>munge slurmd slurmctld\n</code></pre></div></div>\n\n<p>On <code>node1</code> and <code>node2</code> install.</p>\n\n<div><div><pre><code>apt <span>install </span>munge slurmd\n</code></pre></div></div>\n\n<p>Copy <code>/etc/munge/munge.key</code> from <code>head</code> to the same location on <code>node1</code> and <code>node2</code>. Then restart <code>munge</code> on the other nodes with <code>service munge restart</code>.</p>\n\n<p>You should now be able to <code>munge -n | unmunge</code> without error. This should also work via SSH. i.e. <code>ssh head munge -n | ssh node1 unmunge</code></p>\n\n<p>If you don’t have DNS, add <code>node1</code> and <code>node2</code> to the <code>/etc/hosts</code> file on <code>head</code> and add <code>head</code> to the <code>/etc/hosts</code> on <code>node1</code> and <code>node2</code>.</p>\n\n<p>On <code>head</code>, create the daemon spool directory:</p>\n\n<div><div><pre><code><span>mkdir</span> /var/spool/slurmctld\n<span>chown</span> <span>-R</span> slurm:slurm /var/spool/slurmctld/\n<span>chmod </span>775 /var/spool/slurmctld/\n</code></pre></div></div>\n\n<p>Create <code>/etc/slurm/slurm.conf</code>, as below. Update the compute node section by running <code>slurmd -C</code> on each node to generate the configuration line. This file should be propagated to all the machines. 
The configuration file can be created using this <a href=\"https://slurm.schedmd.com/configurator.html\">tool</a>.</p>\n\n<div><div><pre><code>ClusterName=cluster\nSlurmctldHost=head\nProctrackType=proctrack/linuxproc\nReturnToService=1\nSlurmctldPidFile=/var/run/slurmctld.pid\nSlurmctldPort=6817\nSlurmdPidFile=/var/run/slurmd.pid\nSlurmdPort=6818\nSlurmdSpoolDir=/var/spool/slurmd\nSlurmUser=slurm\nStateSaveLocation=/var/spool/slurmctld\nTaskPlugin=task/affinity,task/cgroup\n\n# TIMERS\nInactiveLimit=0\nKillWait=30\nMinJobAge=300\nSlurmctldTimeout=120\nSlurmdTimeout=300\nWaittime=0\n\n# SCHEDULING\nSchedulerType=sched/backfill\nSelectType=select/cons_tres\n\n# LOGGING AND ACCOUNTING\nJobCompType=jobcomp/none\nJobAcctGatherFrequency=30\nSlurmctldDebug=info\nSlurmctldLogFile=/var/log/slurmctld.log\nSlurmdDebug=info\nSlurmdLogFile=/var/log/slurmd.log\n\n# COMPUTE NODES\nNodeName=node1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1963\nNodeName=node2 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1963\nPartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP\n</code></pre></div></div>\n\n<p>On <code>head</code>, start the control daemon.</p>\n\n<div><div><pre><code>service slurmctld start\n</code></pre></div></div>\n\n<p>And on the nodes, start the slurm daemon.</p>\n\n<div><div><pre><code>service slurmd start\n</code></pre></div></div>\n\n<p>From <code>head</code>, you can now run a command simultaneously on both nodes.</p>\n\n<div><div><pre><code><span># srun -N2 -l /bin/hostname</span>\n0: node1\n1: node2\n</code></pre></div></div>\n\n<p>The optional <code>Gres</code> parameter on <code>NodeName</code> allows nodes to be configured with extra resources such as GPUs.</p>\n\n<p>Typical configurations use an NFS server to make /home available on all the nodes. Note that users only need to be created on the head node and don’t need SSH access to the compute nodes.</p>", 9 "content_type": "html", 10 "author": { 11 "name": "Mark Elvers", 12 "email": "mark.elvers@tunbury.org", 13 "uri": null 14 }, 15 "categories": [ 16 "Slurm", 17 "tunbury.org" 18 ], 19 "source": "https://www.tunbury.org/atom.xml" 20}
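The post mentions the Gres parameter without showing it in the example slurm.conf. As a rough sketch only, assuming node1 carried a single NVIDIA GPU exposed as /dev/nvidia0 (a device path chosen for illustration, not taken from the post), the GPU could be declared with a gres.conf on the node plus a Gres entry on the node's existing NodeName line:

    # /etc/slurm/gres.conf on node1 (illustrative device path)
    NodeName=node1 Name=gpu File=/dev/nvidia0

    # in the shared /etc/slurm/slurm.conf: declare the GRES type
    # and extend the existing node1 line with Gres=gpu:1
    GresTypes=gpu
    NodeName=node1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1963 Gres=gpu:1

A job would then request the device with something like srun --gres=gpu:1 nvidia-smi.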
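The closing note about serving /home over NFS can be sketched in the same spirit. Assuming head doubles as the NFS server (the package names and export options below are stock Ubuntu defaults, not taken from the post):

    # on head: export /home to the compute nodes
    apt install nfs-kernel-server
    echo '/home node1(rw,sync,no_subtree_check) node2(rw,sync,no_subtree_check)' >> /etc/exports
    exportfs -ra

    # on node1 and node2: mount it
    apt install nfs-common
    echo 'head:/home /home nfs defaults 0 0' >> /etc/fstab
    mount /home

With that in place, files written by a job on either compute node appear in the user's home directory on head.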
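Beyond the srun check in the post, a small batch job is a handy end-to-end test of the new cluster. A sketch, with the script name, job name and output pattern chosen for illustration:

    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --partition=debug
    #SBATCH --nodes=1
    #SBATCH --output=hello-%j.out

    # record which compute node ran the job
    hostname

Submitting it with sbatch hello.sh and watching squeue should show it run on one of the compute nodes, with the output file written back to the submission directory, which is one reason the shared /home matters.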