Documentation and Information:SGE

Introduction

In the Math Dept./PACM we use the Sun Grid Engine (SGE from now on) for job submission and management on various clusters. This page describes how to use SGE on those clusters.

All jobs on the cluster have to be submitted through SGE. SGE will queue your job and then choose free node(s) on which to run it. If there are no free nodes, or not enough of them, your job waits in the queue until appropriate resources become available and is then executed.

Before proceeding you may want to read the documentation about modules, because you will likely need them if you use MPI or compilers other than gcc (such as PGI or Intel). The examples below also use modules.

Basic SGE usage

When submitting a job you will first have to create a submission script that will, when executed, launch your actual computation. The submission script can also contain various options that will be interpreted by SGE and that will influence how your job is executed.

Serial jobs/qsub

We will begin with a serial job, i.e. a job that runs on only one processor. Create a submission script and call it, for example, myjob.sh (the .sh extension indicates a bash/sh script, but it is not required; you can choose any name). The script will run myjobexecutable located in myjobdir:

#!/bin/sh
# following option makes sure the job will run in the current directory
#$ -cwd
# following option makes sure the job has the same environment variables as the submission shell
#$ -V

# this executable was compiled with intel compiler so we need to load the intel module so that all the libraries will work and be found
module load intel
# and now the actual executable
$HOME/myjobdir/myjobexecutable option1 option2

This job can then be submitted with the qsub command. We will name this job run "Job_name":

qsub -N Job_name myjob.sh

SGE will queue up the job and assign it a number (say 3233, as in the 3233rd job). From then on you can refer to this job either by the name you used during submission (the "-N" option) or by its number (3233 in this case).
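
When submission is scripted it is handy to capture the job number. The following is a minimal sketch, assuming qsub prints its usual confirmation line of the form `Your job 3233 ("Job_name") has been submitted`; the helper name extract_job_id is made up for this illustration:

```sh
#!/bin/sh
# extract_job_id (hypothetical helper): read qsub's confirmation message on
# stdin and print only the job number.
# Assumes a message like: Your job 3233 ("Job_name") has been submitted
extract_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Demonstration on a sample message (no cluster needed):
printf 'Your job 3233 ("Job_name") has been submitted\n' | extract_job_id

# On the cluster the same helper would be used as:
#   jobid=$(qsub -N Job_name myjob.sh | extract_job_id)
```

If the confirmation message on your cluster is worded differently, the sed pattern has to be adjusted accordingly.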

If the job, i.e. myjobexecutable, prints anything to the terminal, SGE redirects that output (stdout) and any errors (stderr) into files named Job_name.o3233 (for stdout) and Job_name.e3233 (for stderr). Because the script uses "-cwd", these files are placed in the directory from which the job was submitted. They should be the first place to look when you need to debug errors in your program or the submission script.

Basic qsub options

We've already seen the "-N" option, but the sample script also contained two other options that were placed in the script itself instead of on the command line. Any option that qsub understands on the command line can also be specified in the submission script. Put each such option on a line of its own that begins with "#$". For example, instead of specifying "-N Job_name" we could have added the following line to the above script and submitted the job with just "qsub myjob.sh":

#$ -N Job_name

The "-cwd" and "-V" options were already used in the myjob.sh sample script.

"-cwd" makes the job execute in the directory from which it was submitted. If this option is missing, the job is executed in your home directory. You will almost always want this option, which is why it is convenient to place it in your submission scripts. It is most useful when the job reads input files (say initial conditions from a file INPUT) and/or creates output files (say OUTPUT) in the current working directory: you can then create a separate directory for each of your runs and submit each job from its own directory.

The "-V" option makes sure that your job has the same environment variables as the shell in which you submit the job. Again, it is prudent to always include this option, but do not depend on it completely: in the MPI case the slave jobs might not actually respect this option, unlike the master node, which always will.

There are numerous other options that you can use - some are listed below and others can be found on qsub's man page.
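
The one-directory-per-run layout that "-cwd" is meant for can be sketched as follows. This is only an illustration: the helper prepare_run and the names run01 and initial_conditions_01 are hypothetical, and the INPUT file name follows the example used above.

```sh
#!/bin/sh
# prepare_run (hypothetical helper): create the run directory $1 and copy the
# submission script $2 and the input file $3 into it. The input is stored as
# INPUT, the file name the job is assumed to read from its working directory.
prepare_run() {
    rundir=$1
    script=$2
    input=$3
    mkdir -p "$rundir" &&
    cp "$script" "$rundir/" &&
    cp "$input" "$rundir/INPUT"
}

# On the cluster, each run is then submitted from inside its own directory,
# so that "-cwd" keeps its INPUT, OUTPUT and .o/.e files separate:
#   prepare_run run01 myjob.sh initial_conditions_01
#   (cd run01 && qsub -N run01 myjob.sh)
```

The subshell around cd keeps the submitting shell in its original directory, which makes it easy to loop over many runs.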

Cluster/job status

Now that you know how to submit a job, you will also want to know how to check on its status and on the status of the cluster.

"qstat" will show you the status of Grid Engine jobs and queues. For example:

[mathuser@comp01 mathuser]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 232629 0.51000 IMAGE005   mathuser     r     06/18/2006 14:29:48 all.q@comp-02                      4
 231554 0.52111 Pt_Al_vac  student      r     06/16/2006 14:47:32 all.q@comp-04                      3
 232626 0.51000 IMAGE002   professor    r     06/18/2006 14:29:33 all.q@comp-11                      1
 232597 0.52333 O_img3     someoneelse  r     06/18/2006 13:16:48 all.q@comp-16                      6

If you type "qstat -f" you will get a detailed list of the queues (on each host) and the jobs in each queue.

You can get extensive details about a job with "qstat -j jobname/jobnumber". This might also be useful to find out why a job is still waiting to be executed (especially when you have submitted the job with some requirements, like large memory).
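
For a quick overview of how many jobs are running and how many are still waiting, the plain qstat listing can be summarized with awk. This is a sketch that assumes the default column layout shown above (two header lines, job state in the fifth column); count_by_state is a made-up helper name.

```sh
#!/bin/sh
# count_by_state (hypothetical helper): read plain qstat output on stdin and
# print the number of jobs in each state (e.g. r = running, qw = waiting).
# Assumes the first two lines are the header and the separator, and that the
# state is the fifth whitespace-separated column.
count_by_state() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster, for your own jobs only:
#   qstat -u "$USER" | count_by_state
```

For a job that stays in the waiting state, "qstat -j" remains the place to look for the reason.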

You can get a general picture of how busy the cluster really is by typing "qstat -g c":

[mathuser@comp01 mathuser]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q                             0.31     14      1    16      0      1
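
The USED and TOTAL columns can be turned into a utilization figure. The following sketch assumes the column layout shown above (USED in column 3, TOTAL in column 5); other Grid Engine versions may print additional columns, so check the header first. The helper name queue_usage is hypothetical.

```sh
#!/bin/sh
# queue_usage (hypothetical helper): read "qstat -g c" output on stdin and
# print, per cluster queue, used/total slots and the percentage in use.
# Assumes two header lines, USED in column 3 and TOTAL in column 5.
queue_usage() {
    awk 'NR > 2 && $5 > 0 { printf "%s %d/%d %.1f%%\n", $1, $3, $5, 100 * $3 / $5 }'
}

# On the cluster:
#   qstat -g c | queue_usage
```

For the sample output above, 14 of 16 slots are in use, which this helper reports as 87.5%.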

Finally, you can get a quick view of the status of the cluster nodes by running "qhost":

[mathuser@comp01 mathuser]$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
comp01                  lx26-x86        1  1.00 1011.1M  164.2M 1024.0M   86.4M
comp02                  lx26-x86        1  1.04  503.5M  491.7M 1024.0M  628.2M
comp03                  lx26-x86        1  2.04  503.6M  334.5M 1024.0M  175.1M
comp04                  lx26-x86        1  1.12  503.6M  184.7M 1024.0M  169.6M
.                       .               .  .       .     .         .        .
.                       .               .  .       .     .         .        .
.                       .               .  .       .     .         .        .
comp16                  lx26-x86        1  0.00 1011.1M   92.9M 1024.0M     0.0
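
To pick out lightly loaded nodes from this listing, the LOAD column can be filtered with awk. This is a sketch under the assumption that qhost prints the columns shown above (host name in column 1, LOAD in column 4); the helper name idle_hosts and the default threshold of 0.5 are arbitrary choices for illustration.

```sh
#!/bin/sh
# idle_hosts (hypothetical helper): read qhost output on stdin and print the
# hosts whose load is below the threshold given as the first argument
# (default 0.5). The "global" pseudo-host and hosts that report no numeric
# load (shown as "-") are skipped.
idle_hosts() {
    awk -v max="${1:-0.5}" '
        NR > 2 && $1 != "global" && $4 ~ /^[0-9]+(\.[0-9]+)?$/ && $4 + 0 < max + 0 { print $1 }'
}

# On the cluster:
#   qhost | idle_hosts        # hosts with load below 0.5
#   qhost | idle_hosts 1.5    # hosts with load below 1.5
```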

Cancel/modify jobs

If you decide to cancel/delete one of your jobs (or someone else's, if you have been designated as a cluster administrator) you can do it with the "qdel" command, using either job name(s) or job ID(s). You can also delete all jobs of a particular user:

[mathuser@comp01 mathuser]$ qdel job_name1
[mathuser@comp01 mathuser]$ qdel 33245 33246 33247
[mathuser@comp01 mathuser]$ qdel -u smith

If a job is already running and the regular qdel is not working, try forcing the removal with the "-f" option, e.g. "qdel -f job_name1".

The "qmod" command allows you to modify a job, e.g. you can suspend it, reschedule it, clear error states and so on.

Job statistics

After the job execution has ended you can ask SGE for its statistics, e.g. the CPU time and memory used during execution:

[mathuser@comp01 mathuser]$ qacct -j 232741
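
The qacct report is fairly long, so it can help to keep only the fields of interest. The sketch below assumes the report consists of "name value" lines and that it contains the fields jobname, ru_wallclock, cpu, maxvmem and exit_status; the helper name qacct_summary is hypothetical.

```sh
#!/bin/sh
# qacct_summary (hypothetical helper): read "qacct -j" output on stdin and
# print only a few commonly checked fields, one "name value" pair per line.
qacct_summary() {
    awk '$1 == "jobname" || $1 == "ru_wallclock" || $1 == "cpu" ||
         $1 == "maxvmem" || $1 == "exit_status" { print $1, $2 }'
}

# On the cluster:
#   qacct -j 232741 | qacct_summary
```

A non-zero exit_status is usually the first hint that the job did not finish cleanly.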

Parallel jobs (MPI - mpich)

The submission script for MPI parallel jobs has to contain a very specific mpirun command. This is because mpirun needs to be given the list of machines that SGE will reserve for the job. We also want mpirun to use SGE's rsh command, which ensures that the job can be properly monitored and controlled by SGE. In particular, we can then cancel it or view how much CPU time it used. Here is an example submission script for the MPI job myjobdir/myparallel.exe, compiled with mpich:

#!/bin/sh
# following option makes sure the job will run in the current directory
#$ -cwd
# following option makes sure the job has the same environment variables as the submission shell
#$ -V
# VERY IMPORTANT: load appropriate environment module
# in this case this program was compiled with mpich intel version
module load mpich/intel
# and now run the program
mpirun -np $NSLOTS -machinefile $TMPDIR/machines -rsh $TMPDIR/rsh $HOME/myjobdir/myparallel.exe param1 param2

Assuming the script was saved as mympijob.sh, this is how we submit the job to be executed on 10 processors with the job name Job_name:

qsub -N Job_name -pe mpich 10 mympijob.sh

The key option is "-pe", which accepts two parameters: the parallel environment and the number of processors you want to reserve for your job. The number of processors can also be given as a range, e.g. 10-20, and SGE will give you as many as are available in that range. A table describing the various parallel environment options follows.

The next example is for MPI executables compiled with openmpi. Note that the script differs from the one we use for mpich: aside from loading a different module, it also uses mpiexec instead of mpirun.

#!/bin/sh
# following option makes sure the job will run in the current directory
#$ -cwd
# following option makes sure the job has the same environment variables as the submission shell
#$ -V
# VERY IMPORTANT: load appropriate environment module
# in this case this program was compiled with openmpi pgi version
module load openmpi/pgi
# and now run the program
mpiexec -np $NSLOTS $HOME/myjobdir/myparallel.exe param1 param2

You would submit the above job with a line resembling:

qsub -N Job_name -pe openmpi 10 mympijob.sh
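
When a range such as 10-20 is requested, the job script may want to log how many processors it was actually given. $NSLOTS, used in the scripts above, holds that number. As a cross-check, the sketch below sums the slots listed in a host file. It assumes that inside a parallel job SGE sets $PE_HOSTFILE to a file whose lines look like "hostname slots queue processor-range"; the helper name granted_slots is hypothetical.

```sh
#!/bin/sh
# granted_slots (hypothetical helper): sum the second column (slot count) of
# a host file with lines of the assumed form
#   hostname slots queue processor-range
granted_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Inside a parallel submission script one could then log:
#   echo "NSLOTS reported by SGE: $NSLOTS"
#   echo "slots in host file:     $(granted_slots "$PE_HOSTFILE")"
```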

More advanced SGE usage

Request a node with lots of memory

If your job requires a lot of memory, you can ask SGE to assign it only to nodes with a minimum amount of free memory by specifying the job resource requirement mem_free. E.g.

qsub -l mem_free=1G testjob.sh

would ask for nodes with at least 1GB of free memory. Similarly, if you want to see which nodes match your requirements you can query for the same resource:

qhost -l mem_free=1G

The output should contain all the nodes that currently have at least 1GB of free memory.

Note that the job will wait in the queue until a host with enough memory is available, in other words until all of your requirements can be met. To check why a job is waiting just ask for its details with qstat -j jobnum.