Difference between revisions of "Documentation and Information:Computational clusters in Fine Hall"

From CompudocWiki
Latest revision as of 12:19, 5 August 2010

The Fine Hall machine room currently hosts one mini computational cluster.

NewComp computing cluster

Description

The NewComp mini computational cluster consists of 4 nodes, each with 2 Xeon X5680 CPUs (6 cores each, 12 cores per node) running at 3.33GHz. Each node has 96GB of memory, i.e. 8 GB per core. The head node is equipped with one Intel Xeon X5650 CPU (6 cores total, running at 2.67GHz) and 12 GB of memory.

The nodes are connected by gigabit Ethernet as well as 4x InfiniBand.

Configuration

The cluster is integrated into the Fine Hall Math/PACM network, and all the cluster machines mount the Math/PACM home directories. The operating system used on these machines is a clone of RHEL 6.

For temporary storage you can use /scratch, which has no quotas, in addition to /tmp. Neither /scratch nor /tmp may be used for permanent data storage, and no crucial data should be kept there; use them for things like intermediate computational results. /tmp and /scratch are NOT backed up and can be erased at any time, especially if one or more machines must be reinstalled or if one of these directories fills up and other users need space. /tmp is also cleaned regularly by a system job: any file in /tmp that has not been accessed in the last 10 days will be deleted.

The head node's /scratch is approximately 3 TB, and its subdirectory /scratch/network is exported to all nodes (as /scratch/network). If you need to read or write temporary data from all nodes, create a subdirectory of /scratch/network (such as /scratch/network/username) and work there.

The nodes also have local /scratch space of approximately 700 GB each. This local disk is quite fast, so consider it for data that must be written and read quickly. As with /scratch/network, create your own subdirectory (such as /scratch/username) and read/write there. As mentioned above, /scratch/network on the nodes is mounted from the head node; while it is bigger, it is also a lot slower than the local disk.

It cannot be emphasized enough that /scratch (and /scratch/network) is for temporary data storage only. Data placed there will occasionally be purged (without notice, oldest first) as needed to ensure all users have enough space.
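The per-user subdirectory convention described above can be scripted. The following is only a minimal sketch: the function name make_scratch_dir is hypothetical, and it assumes the scratch base directory you pass in (for example /scratch/network) is writable by you.

```shell
#!/bin/bash
# Hypothetical helper: create (if needed) and print a per-user directory
# under the given scratch base, e.g. /scratch/network/<username>.
make_scratch_dir() {
    local base="$1"
    # Fall back to `id -un` if USER is not set in the environment.
    local user="${USER:-$(id -un)}"
    local dir="$base/$user"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}
```

In a job script one might write `workdir=$(make_scratch_dir /scratch/network)` and then `cd "$workdir"` before producing intermediate files. Remember that anything left there can be purged.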

Access

At this time the cluster is open to all Math and PACM members.

How to connect

To connect to the NewComp cluster, first log in to math.princeton.edu; from there you can run:

ssh newcomp

Login should proceed without the need to enter any passwords.

Compiling your programs

Compile and prepare your jobs on the head node. You can set up your environment to use one of the available compilers or MPI versions with the module command. Check how to use environment modules.

For MPI you should probably use the latest version of OpenMPI, as it can take advantage of the InfiniBand interfaces on the nodes.

Scheduling/Running Jobs

No jobs or computations, except perhaps very short test runs, should be run on the head node. Any other jobs will be terminated without prior notice.

All jobs have to be submitted to the scheduler, which takes care of assigning the necessary resources and running the job. Any computations found running without having been submitted through the scheduler, or that were submitted incorrectly (e.g. if the job consumes more cores than allocated or runs past the time it was supposed to complete), will be terminated without prior notice.

The scheduler in use on newcomp is Torque/Maui.

Torque/Maui Queues

The scheduler will automatically place your job in one of the following queues. Here are their names and their current limits:

Short Length Queue
  • 4 hour wall clock limit
  • 48 max processes total (of all users together)
  • 3 nodes max per job
Medium Length Queue
  • 4-24 hour wall clock limit
  • 24 max processes total (of all users together)
  • 2 nodes max per job
Long Length Queue
  • 24 hour-7 days wall clock limit
  • 24 max processes total (of all users together)
  • 12 max processes per user
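As a rough illustration of how the wall clock limits above partition jobs, here is a small sketch. The function name queue_for_hours and the labels short/medium/long are only descriptive (the actual queue names on newcomp are not given here), and treating a job of exactly 4 or 24 hours as belonging to the shorter class is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: given a requested walltime in whole hours, print
# which class of queue the limits listed above would place the job in.
queue_for_hours() {
    local hours="$1"
    if [ "$hours" -le 4 ]; then
        echo short
    elif [ "$hours" -le 24 ]; then
        echo medium
    elif [ "$hours" -le 168 ]; then   # 7 days = 168 hours
        echo long
    else
        echo "walltime exceeds the 7 day limit" >&2
        return 1
    fi
}
```

For example, `queue_for_hours 10` prints `medium`, which per the list above means at most 2 nodes per job.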

Job Submission Gotchas

Please take a look at the examples below. You absolutely have to specify how many nodes you need, how many cores per node, and the wall clock time. Request enough time for your job to finish, but stay close to the actual run time. The scheduler uses this information to fit your job in, and requesting much more time than you actually need may leave your jobs waiting a long time before they are scheduled to run.
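One way to follow this advice is to time a short test run and add a modest safety margin when requesting the wall clock time. The sketch below is only an illustration: the function name walltime_with_margin and the default 20% margin are assumptions, not a site recommendation.

```shell
#!/bin/bash
# Hypothetical helper: turn an estimated run time in seconds into an
# HH:MM:SS walltime string, padded by a percentage margin (default 20).
walltime_with_margin() {
    local seconds="$1" percent="${2:-20}"
    local total=$(( seconds * (100 + percent) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}
```

For a job estimated at one hour, `walltime_with_margin 3600` prints `01:12:00`, which can be used as the walltime value in the `#PBS -l` line.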

Submitting Single Core/Serial Jobs

To run a single-core program with an executable called, say, myprogram, compiled with the Intel 10.1 compiler, you will need to write a job script for Torque. Here is a sample command script, serial.cmd, which uses 1 core:

cd my_serial_directory
cat serial.cmd

# serial job using 1 node and 1 processor, and runs
# for 3 hours (max).
#PBS -l nodes=1:ppn=1,walltime=3:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
# load intel compiler settings before running the program
# since we compiled it with intel 10.1
module load intel/10.1
# go to the directory with the program
cd $HOME/my_serial_directory
# and run it
./myprogram

To submit the job to the scheduling system, use:

qsub serial.cmd

Submitting Parallel Jobs

To run your parallel/MPI executable, called myparallelprog, you will need to create a job script for Torque. Here is a sample command script, parallel.cmd, which uses 16 cores total (8 cores per node on 2 nodes).

cd my_mpi_directory
cat parallel.cmd
#!/bin/bash
# parallel job using 2 nodes and 16 CPU cores, and runs
# for 4 hours (max).
#PBS -l nodes=2:ppn=8,walltime=4:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
module load openmpi
cd /u/username/my_mpi_directory
numprocs=`wc -l <${PBS_NODEFILE}`
mpiexec -np $numprocs ./myparallelprog

To submit the job to the batch system, use:

qsub parallel.cmd
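The parallel script above counts the lines of the file named by PBS_NODEFILE, which Torque fills with one line per allocated core. The following sketch shows how the same file can be inspected to see both the number of distinct nodes and the number of processes; the function name summarize_nodefile is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "<distinct nodes> <processes>" for a Torque
# node file (defaults to $PBS_NODEFILE when no argument is given).
summarize_nodefile() {
    local nodefile="${1:-$PBS_NODEFILE}"
    local procs nodes
    procs=$(wc -l < "$nodefile")              # one line per allocated core
    nodes=$(sort -u "$nodefile" | wc -l)      # unique host names
    # Arithmetic expansion strips any padding some wc versions emit.
    echo "$((nodes)) $((procs))"
}
```

Inside a job requested with nodes=2:ppn=8, calling `summarize_nodefile` should print `2 16`; logging this at the top of a job can help confirm that the allocation matches what was requested.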

Submitting Multiple Parametrized Jobs

If you need to submit multiple jobs, say 100, you can submit them with:

[username@newcomp] qsub -t 1-100 jobscript.cmd

That will submit 100 jobs, each assigned a unique number (from 1 to 100) that is available in the environment variable PBS_ARRAYID. You can use that variable in the jobscript.cmd script, e.g. to process different data sets. For example, the script could be:

# serial job using 1 node and 1 processor, and runs
# for 3 hours (max).
#PBS -l nodes=1:ppn=1,walltime=3:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
cd $HOME/my_serial_directory
# and run it
./myprogram $PBS_ARRAYID
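If the data sets are not simply numbered, one possible approach is to keep a text file with one input file name per line and let each array task pick the line matching its PBS_ARRAYID. This is only a sketch: the function name input_for_task and the list file inputs.txt are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print line N of a list file, where N is the array
# task number (defaults to $PBS_ARRAYID when no second argument is given).
input_for_task() {
    local list="$1" id="${2:-$PBS_ARRAYID}"
    sed -n "${id}p" "$list"
}
```

Inside the job script the last line could then become `./myprogram "$(input_for_task inputs.txt)"`, so that task 1 processes the first file in the list, task 2 the second, and so on.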

Useful Scheduler Tools

  • showbf - shows how many nodes are available and for how long. The wall clock limit of a job should be less than the duration reported by showbf, otherwise the job will not run before the next scheduled maintenance period.
  • diagnose -p - shows the priority assigned to queued jobs
  • showq or qstat - shows jobs in the queues
  • xpbs - a graphical display of the queues
  • pbstop - a text based view of the cluster nodes (e.g., pbstop -c 1 -m 8 -01234567)
  • qdel - to kill a job
  • qsig -s 0 <jobid> - alternate way to kill a job that will not be removed with qdel