Tuesday, October 22, 2019

Upgrading Ubuntu Server 16.04 to 18.04

The process is very simple but quite time consuming; a fast internet connection will help speed it up.

To start, upgrade all packages to their latest versions, and reboot if necessary
$ sudo apt update
$ sudo apt upgrade -y
$ sudo reboot

Once the machine has rebooted and all packages are up to date, start the release upgrade
$ sudo do-release-upgrade
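If the do-release-upgrade command is missing (which can happen on a minimal install), it is provided by the update-manager-core package:
$ sudo apt install update-manager-core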

Answer yes (y) to all questions, and choose "keep the local version currently installed" whenever you are asked about replacing a configuration file, to avoid breaking currently installed applications.

Once completed, press y to restart

Log in and check the new version
$ cat /etc/os-release
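If the upgrade succeeded, the output should look roughly like this (the exact point release and fields will vary):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"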


Thursday, October 17, 2019

SSH "Too many authentication failures" error

I encountered this error one day while trying to SSH into one of my client's machines.

$ ssh web
Received disconnect from 192.168.0.36 port 22:2: Too many authentication failures
Disconnected from 192.168.0.36 port 22

After searching around, I found an article showing that I could solve this issue just by adding one option to my ssh command
$ ssh web -o IdentitiesOnly=yes
sam@192.168.0.36's password:

Now we're talking. It turns out the reason for this behavior is that my ssh client offered too many private keys to the server, exceeding the server's MaxAuthTries limit and causing it to terminate the connection. You can see this happening by adding -v (for verbose) to the ssh command.
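For example, a verbose run shows each key being offered before the server gives up; the exact wording varies by OpenSSH version, and the key paths below are just examples:
$ ssh -v web
debug1: Offering public key: /home/sam/.ssh/id_rsa
debug1: Offering public key: /home/sam/.ssh/old_key
...
Received disconnect from 192.168.0.36 port 22:2: Too many authentication failures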

To solve this for a single session, just add the IdentitiesOnly=yes option to your ssh command
$ ssh -o IdentitiesOnly=yes web

To make it permanent, edit the ~/.ssh/config file and add the lines below
$ cat >> ~/.ssh/config <<EOF
Host web
  IdentitiesOnly yes
EOF
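If the host needs one specific key, you can also pin it in the same file; the hostname and user below match the session above, but the key path is just a placeholder for illustration:
Host web
  HostName 192.168.0.36
  User sam
  IdentityFile ~/.ssh/id_rsa
  IdentitiesOnly yes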

And you are good to go :)

Tuesday, October 1, 2019

Adding GPU as Resource for Slurm

To make a GPU a resource that can be managed by Slurm, create the /etc/slurm-llnl/gres.conf file with definitions of the GPUs available on the node. GRES stands for generic resources, which need to be declared so that Slurm can manage them.


The example below is for a node with an NVIDIA Tesla V100 GPU. The fields are:
Name - name of the resource; can be gpu, nic or mic
Type - an arbitrary string identifying the type of the device
File - fully qualified pathname of the device file associated with the resource
Cores - the specific CPU core numbers that can use this resource
$ sudo cat /etc/slurm-llnl/gres.conf
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,1


Add GresTypes and the node's Gres definition in slurm.conf.
The format for a Gres entry is name:type:count, where the type is optional
$ sudo cat /etc/slurm-llnl/slurm.conf
...
GresTypes=gpu
NodeName=mynode CPUs=12 RealMemory=64091 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:v100:1
...


Restart the Slurm services for the changes to take effect
$ sudo systemctl restart slurmd
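Since slurm.conf was modified as well, restart the controller daemon too (or run scontrol reconfigure)
$ sudo systemctl restart slurmctld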


Check the availability of the gres
$ scontrol show node
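If everything is configured correctly, the node entry should now include the GRES, roughly like this (most fields omitted):
NodeName=mynode CoresPerSocket=6 ...
   Gres=gpu:v100:1
   ...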

Installing Slurm Workload Manager & Job Scheduler on Ubuntu 18.04

Enable universe repository
$ echo "deb http://archive.ubuntu.com/ubuntu bionic universe" | sudo tee -a /etc/apt/sources.list

Update package list
$ sudo apt update

Install slurm-wlm
$ sudo apt install slurm-wlm -y

Install the Slurm documentation. This is useful for generating slurm.conf using the configurator.easy.html page
$ sudo apt install slurm-wlm-doc -y

On a machine with a web browser, open /usr/share/doc/slurm-wlm-doc/html/configurator.easy.html to easily generate slurm.conf.

You can also access the configurator online at https://slurm.schedmd.com/configurator.easy.html, but depending on your slurm version, the online version might not be suitable.

Fill in the form; some of the information can be retrieved using this command
$ slurmd -C
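The output is a ready-made node definition, roughly like this, with the values matching your hardware:
NodeName=myserver CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=983
UpTime=0-00:14:06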

Some of the configuration values I changed from the defaults
- Make sure ControlMachine and NodeName match the hostname of the system
- State Preservation: set StateSaveLocation to /var/spool/slurm-llnl
- Process tracking: use Pgid instead of Cgroup
- Process ID logging: set the PID files to /var/run/slurm-llnl/slurmctld.pid and /var/run/slurm-llnl/slurmd.pid

Once done, click Submit, then copy the generated config to /etc/slurm-llnl/slurm.conf. Below is my sample config, with only one node
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=myserver
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=myserver CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=myserver Default=YES MaxTime=INFINITE State=UP
DebugFlags=NO_CONF_HASH

Create slurm spool directory
$ sudo mkdir /var/spool/slurm-llnl
$ sudo chown -R slurm:slurm /var/spool/slurm-llnl

Create slurm pid directory
$ sudo mkdir /var/run/slurm-llnl/
$ sudo chown -R slurm:slurm /var/run/slurm-llnl
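Note that /var/run lives on a tmpfs on Ubuntu 18.04, so this directory disappears on reboot. One way to recreate it automatically at boot is a systemd-tmpfiles entry, for example:
$ echo "d /var/run/slurm-llnl 0755 slurm slurm -" | sudo tee /etc/tmpfiles.d/slurm.conf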

Start the Slurm controller (slurmctld) and enable it on boot
$ sudo systemctl start slurmctld
$ sudo systemctl enable slurmctld

Start slurmd and enable on boot
$ sudo systemctl start slurmd
$ sudo systemctl enable slurmd

If slurmctld or slurmd somehow fails to start, run the daemons interactively with debug options to check for errors. If there are any, adjust slurm.conf accordingly.
$ sudo -u slurm slurmctld -Dcvvv
$ sudo slurmd -Dcvvv

Check Slurm nodes using the scontrol command
$ scontrol show node
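To confirm that jobs actually run, submit a quick test to the debug partition defined above; it should print the node's hostname:
$ srun hostname
myserver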