The process is very simple but quite time-consuming; a fast internet connection will help speed it up.
Tuesday, October 22, 2019
Upgrading ubuntu server 16.04 to 18.04
Thursday, October 17, 2019
SSH Too many authentication failures error
I encountered this error one day while trying to ssh into one of my client's machines.
$ ssh web
Received disconnect from 192.168.0.36 port 22:2: Too many authentication failures
Disconnected from 192.168.0.36 port 22
Tuesday, October 1, 2019
Adding GPU as Resource for Slurm
To make a GPU a resource that Slurm can manage, create the /etc/slurm-llnl/gres.conf file with definitions of the GPUs available on the node. GRES stands for generic resources, which need to be declared so that Slurm can manage them.
The example below is for a node with an NVIDIA Tesla V100 GPU.
Name - name of the resource, can be gpu, nic or mic
Type - arbitrary string identifying the type of device
File - Fully qualified pathname of the device files associated with a resource
Cores - the specific CPU core numbers that can use this resource
$ sudo cat /etc/slurm-llnl/gres.conf
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,1
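On a node with several GPUs, writing one line per device by hand gets tedious. The sketch below prints a gres.conf line for each NVIDIA device file it finds. The function name gen_gres_lines and its arguments are made up for this example, and the Cores field is left out on purpose because the right core numbers depend on the machine.

```shell
# Print one gres.conf line per NVIDIA device file found in a directory.
# gen_gres_lines, dev_dir and gpu_type are hypothetical names for this sketch.
gen_gres_lines() {
    dev_dir=$1    # directory to scan, normally /dev
    gpu_type=$2   # arbitrary Type string, e.g. v100
    for dev in "$dev_dir"/nvidia[0-9]*; do
        # Skip the unexpanded pattern when no device file matches
        [ -e "$dev" ] || continue
        echo "Name=gpu Type=${gpu_type} File=${dev}"
    done
}

# Usage: review the output before copying it into gres.conf
#   gen_gres_lines /dev v100
```

The pattern only matches names that start with nvidia followed by a digit, so control files such as /dev/nvidiactl are ignored.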
Add GresTypes and gres resources in slurm.conf.
The format for gres resources is grestype:optional-type:number-of-resource
$ sudo cat /etc/slurm-llnl/slurm.conf
...
GresTypes=gpu
NodeName=mynode CPUs=12 RealMemory=64091 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:v100:1
...
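The grestype:optional-type:number-of-resource format can be illustrated with a small helper that assembles the string, with or without the optional type. The name gres_spec is hypothetical and only serves this sketch.

```shell
# Build a Gres string in the grestype:optional-type:number-of-resource format.
# gres_spec is a hypothetical helper name.
gres_spec() {
    gres_type=$1      # e.g. gpu
    gres_subtype=$2   # e.g. v100, may be empty
    gres_count=$3     # e.g. 1
    if [ -n "$gres_subtype" ]; then
        echo "${gres_type}:${gres_subtype}:${gres_count}"
    else
        echo "${gres_type}:${gres_count}"
    fi
}

# Usage:
#   gres_spec gpu v100 1    prints gpu:v100:1
#   gres_spec gpu "" 2      prints gpu:2
```

The same string is what a job passes to the --gres option of srun or sbatch when requesting the resource, for example srun --gres=gpu:v100:1 followed by the command to run.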
Restart the slurm services for the changes to take effect (slurmctld on the controller, slurmd on the node)
$ sudo systemctl restart slurmctld
$ sudo systemctl restart slurmd
Check the availability of the gres
$ scontrol show node
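The full node listing is long, so it can be easier to filter it down to the Gres field. The sketch below assumes the usual layout of the output, where each node block starts with NodeName= and contains a Gres= field. The function name show_gres is made up for this example.

```shell
# Read `scontrol show node` output on stdin and print "node gres" pairs.
# show_gres is a hypothetical helper name.
show_gres() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Remember the node whose block we are currently in
            if ($i ~ /^NodeName=/) node = substr($i, 10)
            # Print the node together with its Gres value
            if ($i ~ /^Gres=/) print node, substr($i, 6)
        }
    }'
}

# Usage:
#   scontrol show node | show_gres
```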
Installing Slurm Workload Manager & Job Scheduler on Ubuntu 18.04
Enable universe repository
$ echo "deb http://archive.ubuntu.com/ubuntu bionic universe" | sudo tee -a /etc/apt/sources.list
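Note that tee -a appends the line every time it is run, so repeating the command leaves duplicate entries in sources.list. A minimal sketch of a guard against that is shown below. The helper name append_once is hypothetical.

```shell
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper name.
append_once() {
    line=$1
    file=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Usage (as root, since sources.list is not writable by normal users):
#   append_once "deb http://archive.ubuntu.com/ubuntu bionic universe" /etc/apt/sources.list
```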
Update package list
$ sudo apt update
Install slurm-wlm
$ sudo apt install slurm-wlm -y
Install the slurm documentation. It includes the configurator.easy.html page, which is useful for generating slurm.conf
$ sudo apt install slurm-wlm-doc -y
On a machine with a web browser, open /usr/share/doc/slurm-wlm-doc/html/configurator.easy.html to easily generate slurm.conf.
You can also access the configurator online at https://slurm.schedmd.com/configurator.easy.html, but depending on your slurm version, the online version might not be suitable.
Fill in the form. Some of the information can be retrieved with the following command
$ slurmd -C
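The command prints the node hardware as KEY=VALUE pairs on one line, in the same style as the NodeName line in slurm.conf. To pull out a single value for the form, something like the sketch below can be used. The helper name conf_field is hypothetical.

```shell
# Print the value of one KEY=VALUE field read from stdin.
# conf_field is a hypothetical helper name.
conf_field() {
    key=$1
    # Put each KEY=VALUE pair on its own line, then strip the matching key
    tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}

# Usage:
#   slurmd -C | conf_field RealMemory
#   slurmd -C | conf_field CPUs
```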
Some of the settings that I changed from the defaults
- Make sure ControlMachine and NodeName are set to the hostname of the system
- State Preservation: set StateSaveLocation to /var/spool/slurm-llnl
- Process tracking: use Pgid instead of Cgroup
- Process ID logging: set this to /var/run/slurm-llnl/slurmctld.pid and /var/run/slurm-llnl/slurmd.pid
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=myserver
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=myserver CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=myserver Default=YES MaxTime=INFINITE State=UP
DebugFlags=NO_CONF_HASH
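Since a mismatch between the hostname and the ControlMachine or NodeName values is an easy mistake to make, a quick check can be sketched as below. It assumes a single-node setup like the one above and only looks at the first NodeName line. The function name check_conf_hostname is made up for this example, and the path in the usage line assumes the file is saved as /etc/slurm-llnl/slurm.conf.

```shell
# Succeed only if ControlMachine and the first NodeName in a slurm.conf
# both equal the given hostname.
# check_conf_hostname is a hypothetical helper name.
check_conf_hostname() {
    conf=$1
    host=$2
    # Commented-out lines start with '#', so the anchored patterns skip them
    ctl=$(sed -n 's/^ControlMachine=//p' "$conf" | head -n 1)
    node=$(sed -n 's/^NodeName=\([^ ]*\).*/\1/p' "$conf" | head -n 1)
    [ "$ctl" = "$host" ] && [ "$node" = "$host" ]
}

# Usage:
#   check_conf_hostname /etc/slurm-llnl/slurm.conf "$(hostname -s)" && echo "hostname matches"
```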
Create the slurm state save directory
$ sudo mkdir /var/spool/slurm-llnl
$ sudo chown -R slurm:slurm /var/spool/slurm-llnl
Create the slurm pid directory
$ sudo mkdir /var/run/slurm-llnl/
$ sudo chown -R slurm:slurm /var/run/slurm-llnl
Start the slurm controller and enable it on boot
$ sudo systemctl start slurmctld
$ sudo systemctl enable slurmctld
Start slurmd and enable it on boot
$ sudo systemctl start slurmd
$ sudo systemctl enable slurmd
If slurmctld or slurmd fails to start, run the daemon interactively with debug options to check for errors, then adjust slurm.conf accordingly.
$ sudo -u slurm slurmctld -Dcvvv
$ sudo slurmd -Dcvvv
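To see which of the two daemons is actually down before running one in the foreground, a small status loop like the sketch below may help. It relies on systemctl is-active, and the function name slurm_status is hypothetical.

```shell
# Report whether each slurm daemon is active according to systemd.
# slurm_status is a hypothetical helper name.
slurm_status() {
    for svc in slurmctld slurmd; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            # Point at the service log for the daemon that is down
            echo "$svc: NOT active, check: journalctl -u $svc"
        fi
    done
}

# Usage:
#   slurm_status
```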
Check the slurm nodes using the scontrol command
$ scontrol show node