Documentation and Information:Computational clusters in Fine Hall

The Fine Hall machine room currently hosts one mini computational cluster.

NewComp computing cluster

Description

The NewComp mini computational cluster consists of 4 nodes, each with 2 Xeon X5680 CPUs (6 cores each, 12 cores total per node) running at 3.33GHz. Each node has 96GB of memory, i.e. 8GB per core. The head node is equipped with one Intel Xeon X5650 CPU (6 cores total, running at 2.67GHz) and 12GB of memory.

Nodes are connected with gigabit ethernet networking as well as 4x Infiniband.

Configuration

The cluster is integrated into Fine Hall Math/PACM network and all the cluster machines mount Math/PACM home directories. The operating system used on these machines is a clone of RHEL 6.

For temporary storage, besides /tmp, one can also use /scratch, which has no quotas. It must be emphasized that neither /scratch nor /tmp can be used for permanent data storage and no crucial data should be stored there; use them only for data such as intermediate computational results. /tmp and /scratch are NOT backed up and can be erased at any time, especially if a reinstall of one or more machines is required or if one of these directories is full and other users need space. /tmp is also regularly cleaned up by a system job, and any file in /tmp that hasn't been accessed in the last 10 days will be deleted.
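
Because files in /tmp are deleted once they have gone unaccessed for 10 days, it can be useful to see which of your own files are at risk. The following is a minimal sketch, not a cluster-provided tool: the function name stale_files is made up for this illustration, and the exact timing of the cleanup job may differ from what find reports.

```shell
# List your own regular files under a directory whose last access is
# more than N whole days old (find rounds the file age down to days).
# stale_files is a hypothetical helper name, not an installed command.
stale_files() {
    dir=$1
    days=$2
    find "$dir" -type f -user "$(id -u)" -atime +"$days" 2>/dev/null
}
```

For example, stale_files /tmp 9 lists files whose last access is at least 10 full days old.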

The head node's /scratch is approximately 3TB and its subdirectory /scratch/network is exported to all nodes (as /scratch/network). Therefore, if you need to read or write temporary data from all nodes, create a subdirectory of /scratch/network (like /scratch/network/username) and read/write there.

Nodes also have a local /scratch space of approximately 700GB each. This local disk is also quite fast, so consider it for fast data writing and reading. Just like with /scratch/network, create a subdirectory (/scratch/username) and read/write from there. As mentioned above, /scratch/network on these nodes is mounted from the head node, and while it is bigger it is also a lot slower than the local disk.

It cannot be emphasized enough that /scratch (and /scratch/network) is for temporary data storage only. Data placed there will occasionally be purged (without notice, oldest first) as needed to ensure all users have enough space.
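
One way to follow this advice inside a job script is to give each job its own working directory under scratch, remove it when the job ends, and copy anything worth keeping back to your home directory. The sketch below is only an illustration: make_job_scratch and results.dat are hypothetical names, and the per-user directory layout is a suggestion rather than a site requirement.

```shell
# Create a private working directory under a scratch area and print its path.
# The base defaults to the node-local /scratch; pass /scratch/network to use
# the shared area instead. make_job_scratch is a hypothetical helper name.
make_job_scratch() {
    base=${1:-/scratch}
    user=${USER:-$(id -un)}
    mkdir -p "$base/$user" || return 1
    mktemp -d "$base/$user/job.XXXXXX"
}

# Possible use inside a job script (results.dat is a made-up file name):
#   workdir=$(make_job_scratch) || exit 1
#   trap 'rm -rf "$workdir"' EXIT          # clean up when the job ends
#   cd "$workdir" && "$HOME/my_serial_directory/myprogram"
#   cp results.dat "$HOME/my_serial_directory/"
```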

Access

At this time the cluster is open to all Math and PACM members.

How to connect

In order to connect to the NewComp cluster you first have to log in to math.princeton.edu, and from there you can:

ssh newcomp

Login should proceed without the need to enter any passwords.

Compiling your programs

You should compile and prepare your jobs on the head node. You can set up your environment to use one of the available compilers or MPI versions with the module command. Check how to use environment modules.

For MPI you should probably use the latest version of OpenMPI, as it can take advantage of the Infiniband interfaces on the nodes.
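
A typical module session on the head node might look like the sketch below. The module name openmpi is taken from the parallel job script on this page; which modules and versions are actually installed is an assumption, so check the output of module avail first. The wrapper function setup_mpi_env is a made-up name used only to keep the example self-contained.

```shell
# Sketch: select an MPI environment before compiling.
# setup_mpi_env is a hypothetical helper; the module subcommands
# (avail, load, list) are standard Environment Modules commands.
setup_mpi_env() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; are you on the cluster?" >&2
        return 1
    fi
    module avail            # show every module that can be loaded
    module load openmpi     # load the default OpenMPI version
    module list             # confirm what is loaded now
}

# After a successful setup_mpi_env, an MPI C program could be built with
# OpenMPI's compiler wrapper (myparallelprog.c is an assumed source file):
#   mpicc -O2 -o myparallelprog myparallelprog.c
```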

Scheduling/Running Jobs

No jobs/computations, except maybe very short test runs, should be run on the head node. Any other jobs will be terminated without prior notice.

All jobs have to be submitted to the scheduler, which will take care of assigning the necessary resources and running the job. Any computations found running without being submitted through the scheduler, or that were submitted incorrectly (e.g. if the job consumes more cores than allocated or runs after it was supposed to complete), will be terminated without prior notice.

The scheduler in use on newcomp is torque/maui.

Torque/Maui Queues

The scheduler will automatically place your job in one of the following queues. Here are their names and their current limits:

Short Length Queue

4 hour wall clock limit
48 max processes total (of all users together)
3 nodes max per job

Medium Length Queue

4-24 hour wall clock limit
24 max processes total (of all users together)
2 nodes max per job

Long Length Queue

24 hour-7 days wall clock limit
24 max processes total (of all users together)
12 max processes per user
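
To see how a requested wall clock time maps onto these limits, the rules can be written out as a small function. This is only an illustration of the table above: the labels short, medium and long are shorthand for the queue headings and are not confirmed Torque queue names, and placing a request of exactly 4 or 24 hours in the shorter queue is an assumption.

```shell
# Report which queue a walltime request (in whole hours) would fall into,
# based on the limits listed above. queue_for_hours is a hypothetical helper.
queue_for_hours() {
    hours=$1
    if [ "$hours" -le 4 ]; then
        echo short
    elif [ "$hours" -le 24 ]; then
        echo medium
    elif [ "$hours" -le 168 ]; then   # 7 days = 168 hours
        echo long
    else
        echo "request exceeds the 7 day limit" >&2
        return 1
    fi
}
```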

Job Submission Gotchas

Please take a look at the examples below - you absolutely have to specify how many nodes you need and how many cores per node, as well as the wall clock time. Make sure you specify enough time for your job to finish while staying close to the actual run time. The scheduler uses that information to fit your job in as well as it can, and requesting much more time than you actually need might make your jobs wait too long before they are scheduled to run.
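
If you have timed a short test run, one way to arrive at a reasonable request is to add a modest safety margin to the measured time and format the result the way the walltime option expects. The helper below is a sketch; the name walltime_request and the default margin of 20% are assumptions, not site policy.

```shell
# Turn a measured run time in seconds into an H:MM:SS walltime value,
# adding a safety margin given as a percentage (20% if omitted).
# walltime_request is a hypothetical helper name.
walltime_request() {
    secs=$1
    margin=${2:-20}
    total=$(( secs * (100 + margin) / 100 ))
    printf '%d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}
```

For a job that took 9000 seconds (2.5 hours), walltime_request 9000 prints 3:00:00, which can then be used as the walltime value in the #PBS -l line of the job script.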

Submitting Single Core/Serial Jobs

To run a single core program with an executable called, say, myprogram, compiled with the Intel 10.1 compiler, you will need to write a job script for Torque. Here is a sample command script, serial.cmd, which uses (of course) 1 core:

cd my_serial_directory
cat serial.cmd

# serial job using 1 node and 1 processor, and runs
# for 3 hours (max).
#PBS -l nodes=1:ppn=1,walltime=3:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
# load intel compiler settings before running the program
# since we compiled it with intel 10.1
module load intel/10.1
# go to the directory with the program
cd $HOME/my_serial_directory
# and run it
./myprogram

To submit the job to the scheduling system, use:

qsub serial.cmd

Submitting Parallel Jobs

To run a parallel/MPI executable called myparallelprog, you will need to create a job script for Torque. Here is a sample command script, parallel.cmd, which uses 16 cores total (8 cores per node on 2 nodes):

cd my_mpi_directory
cat parallel.cmd

#!/bin/bash
# parallel job using 2 nodes and 16 CPU cores, and runs
# for 4 hours (max).
#PBS -l nodes=2:ppn=8,walltime=4:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
module load openmpi
cd /u/username/my_mpi_directory
# the node file lists one line per allocated core
numprocs=$(wc -l < "${PBS_NODEFILE}")
mpiexec -np "$numprocs" ./myparallelprog

To submit the job to the batch system, use:

qsub parallel.cmd
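
Since the script above derives the process count from PBS_NODEFILE, it can help to print what the scheduler actually allocated before starting mpiexec. The sketch below relies on the node file containing one hostname per allocated core; summarize_nodefile is a hypothetical helper name.

```shell
# Summarize a Torque node file: distinct nodes and total cores.
# With no argument it reads the file named by PBS_NODEFILE.
# summarize_nodefile is a hypothetical helper, not a Torque command.
summarize_nodefile() {
    nodefile=${1:-${PBS_NODEFILE:-}}
    [ -r "$nodefile" ] || return 1
    cores=$(wc -l < "$nodefile")
    nodes=$(sort -u "$nodefile" | wc -l)
    echo "nodes=$((nodes)) cores=$((cores))"
}
```

For the request nodes=2:ppn=8 one would expect this to report nodes=2 cores=16; a different result suggests the job did not get the resources the script assumes.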

Submitting Multiple Parametrized Jobs

If you need to submit multiple, say 100, jobs you can submit them with

[username@newcomp] qsub -t 1-100 jobscript.cmd

That will submit 100 jobs and each will be assigned a unique number (from 1 to 100) available in environment variable PBS_ARRAYID. You can use that environment variable in the jobscript.cmd script, e.g. to process different data sets. For example the script could be

# serial job using 1 node and 1 processor, and runs
# for 3 hours (max).
#PBS -l nodes=1:ppn=1,walltime=3:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
cd $HOME/my_serial_directory
# and run it
./myprogram $PBS_ARRAYID
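
A common pattern is to let the array number select one entry from a list of inputs, so that task 1 processes the first data set, task 2 the second, and so on. The following is a sketch under that assumption; input_for_task and inputs.txt are hypothetical names used only for illustration.

```shell
# Print line N of a list file, where N is the array task number.
# With no second argument the number is taken from PBS_ARRAYID.
# input_for_task is a hypothetical helper name.
input_for_task() {
    list=$1
    task=${2:-${PBS_ARRAYID:-}}
    [ -n "$task" ] || return 1
    sed -n "${task}p" "$list"
}

# Possible use inside jobscript.cmd (inputs.txt is a made-up file with
# one data set name per line):
#   ./myprogram "$(input_for_task inputs.txt)"
```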

Useful Scheduler Tools

showbf - shows how many nodes are available and for how long. The wall clock limit of a job should be less than the duration reported by showbf, otherwise the job will not run before the next scheduled maintenance period.
diagnose -p - shows the priority assigned to queued jobs
showq or qstat - shows jobs in the queues
xpbs - a graphical display of the queues
pbstop - a text based view of the cluster nodes (e.g., pbstop -c 1 -m 8 -01234567)
qdel - to kill a job
qsig -s 0 <jobid> - alternate way to kill a job that will not be removed with qdel