Documentation and Information:Computational clusters in Fine Hall

From CompudocWiki
Revision as of 10:32, 26 July 2010 by Plazonic (talk | contribs)

The Fine Hall machine room currently hosts one mini computational cluster.

NewComp computing cluster

Description

The NewComp mini computational cluster consists of 4 nodes, each with 2 Xeon X5680 CPUs (6 cores per CPU, 12 cores per node) running at 3.33 GHz. Each node has 96 GB of memory, i.e. 8 GB per core. The head node is equipped with one Intel Xeon X5650 CPU (6 cores total, running at 2.67 GHz) and 12 GB of memory.

The nodes are connected by gigabit Ethernet as well as 4x InfiniBand.

Configuration

The cluster is integrated into the Fine Hall Math/PACM network, and all the cluster machines mount the Math/PACM home directories. The operating system used on these machines is a clone of RHEL 6.

For temporary storage, besides /tmp, you can also use /scratch, which has no quotas. Neither /scratch nor /tmp may be used for permanent data storage, and no crucial data should be kept there; use them only for things like intermediate computational results. /tmp and /scratch are NOT backed up and can be erased at any time, especially if one or more machines need to be reinstalled or if one of these directories is full and other users need space. /tmp is also cleaned regularly by a system job: any file in /tmp that has not been accessed in the last 10 days will be deleted.
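Because /tmp is cleaned based on last access time, it can help to check which of your files are close to being removed. The following is a minimal sketch, not the actual cleanup job: the helper name stale_files and the demo directory are made up for illustration, and it relies only on the standard find tests -type, -user, and -atime.

```shell
#!/bin/sh
# stale_files DIR DAYS (hypothetical helper, not part of the cluster setup):
# print the current user's files under DIR whose last access was at least
# DAYS full days ago.
stale_files() {
    # find's "-atime +K" matches files whose age in whole days exceeds K,
    # so "at least DAYS days" corresponds to K = DAYS - 1.
    find "$1" -type f -user "$(id -u)" -atime +"$(( $2 - 1 ))"
}

# Demo in a throwaway directory so the sketch runs anywhere.
demo_dir=$(mktemp -d)
touch "$demo_dir/fresh.dat"
touch "$demo_dir/old.dat"
touch -a -t 200901010000 "$demo_dir/old.dat"   # backdate the access time only
stale_files "$demo_dir" 10
```

On the cluster, the equivalent call would be stale_files /tmp 10.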

The head node's /scratch is approximately 3 TB, and its subdirectory /scratch/network is exported to all nodes (as /scratch/network). Therefore, if you need to read or write temporary data from all nodes, create a subdirectory of /scratch/network (like /scratch/network/username) and read/write there.

The nodes also have local /scratch space of approximately 700 GB each. This local disk is quite fast, so consider it for fast data writing and reading. Just as with /scratch/network, create your own subdirectory (/scratch/username) and read/write from there. As mentioned above, /scratch/network on the nodes is mounted from the head node; while bigger, it is also a lot slower than the local disk.

It cannot be emphasized enough that /scratch (and /scratch/network) is for temporary data storage only. Data placed there will occasionally be purged (without notice, oldest first) as needed to ensure all users have enough space.
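Creating a per-user scratch subdirectory can be sketched as follows. The helper name make_scratch_dir is hypothetical, and the demo uses a throwaway root so that it runs anywhere; on the cluster the root would be /scratch/network for data shared across nodes, or /scratch for fast node-local data.

```shell
#!/bin/sh
# make_scratch_dir ROOT (hypothetical helper): create ROOT/<username> if it
# does not exist yet and print its path.
make_scratch_dir() {
    user=${USER:-$(id -un)}
    dir="$1/$user"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Demo against a temporary root instead of the real /scratch.
demo_root=$(mktemp -d)
workdir=$(make_scratch_dir "$demo_root")
echo "intermediate results" > "$workdir/partial.dat"
ls "$workdir"
```

Anything written this way is still subject to purging, so copy final results back to your home directory when a job finishes.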

Access

At this time the cluster is open to all Math and PACM members.

How to connect

To connect to the NewComp cluster, first log in to math.princeton.edu; from there you can run:

ssh newcomp

Login should proceed without the need to enter any passwords.
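With an OpenSSH client of version 7.3 or later, the two hops can be combined using the ProxyJump option. The sketch below is a convenience built on assumptions rather than part of the cluster setup: it assumes your username is the same on both machines and that the name newcomp resolves from math.princeton.edu, and the helper name add_newcomp_host is made up. It only writes the configuration stanza to a file you name.

```shell
#!/bin/sh
# add_newcomp_host FILE (hypothetical helper): append an ssh client stanza
# that reaches newcomp through math.princeton.edu.
add_newcomp_host() {
    cat >> "$1" <<'EOF'
Host newcomp
    ProxyJump math.princeton.edu
EOF
}

# Demo: write to a temporary file rather than the real ~/.ssh/config.
demo_config=$(mktemp)
add_newcomp_host "$demo_config"
cat "$demo_config"
```

With such a stanza in ~/.ssh/config, ssh newcomp from your own machine would pass through the login host in one step; the one-off equivalent is ssh -J math.princeton.edu newcomp. Authentication from outside may still prompt for a password.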

Scheduling/Running Jobs

No jobs/computations, except maybe very short test runs, should be run on the head node. Any other jobs will be terminated without prior notice.

All jobs have to be submitted to the scheduler, which will take care of assigning the necessary resources and running the job. Any computations found running without having been submitted through the scheduler, or that were submitted incorrectly (e.g. if the job consumes more cores than allocated or runs after it was supposed to complete), will be terminated without prior notice.

The scheduler in use on newcomp is Torque/Maui.

Torque/Maui Queues

The scheduler will automatically place your job in one of the following queues. Here are their names and their current limits:

Short Length Queue
  • 4 hour wall clock limit
  • 48 max processes total (of all users together)
  • 3 nodes max per job
Medium Length Queue
  • 4-24 hour wall clock limit
  • 24 max processes total (of all users together)
  • 2 nodes max per job
Long Length Queue
  • 24 hour-7 days wall clock limit
  • 24 max processes total (of all users together)
  • 12 max processes per user
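
To see which queue a requested wall clock time would fall into, the limits above can be expressed as a small check. This is only a sketch of the listed limits, not the scheduler's own logic: the labels short, medium, and long are illustrative rather than the real Torque queue names, and treating exactly 4 hours as short and exactly 24 hours as medium is an assumption.

```shell
#!/bin/sh
# Convert a walltime of the form HH:MM:SS to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# queue_for_walltime HH:MM:SS (hypothetical helper): print an illustrative
# queue label based on the limits listed above.
queue_for_walltime() {
    secs=$(walltime_to_seconds "$1")
    if [ "$secs" -le 14400 ]; then        # up to 4 hours
        echo short
    elif [ "$secs" -le 86400 ]; then      # up to 24 hours
        echo medium
    elif [ "$secs" -le 604800 ]; then     # up to 7 days
        echo long
    else
        echo "walltime exceeds the 7-day limit" >&2
        return 1
    fi
}

queue_for_walltime 3:00:00
```

A request of 3:00:00, as in the sample job script on this page, lands in the short queue.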

Submitting Single Core/Serial Jobs

To run a single-core program whose executable is called, say, myprogram, you will need to write a job script for Torque. Here is a sample command script, serial.cmd, which uses (of course) 1 core:

cd my_serial_directory
cat serial.cmd

# serial job using 1 node and 1 processor, and runs
# for 3 hours (max).
#PBS -l nodes=1:ppn=1,walltime=3:00:00
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#
cd $HOME/my_serial_directory
./myprogram

To submit the job to the scheduling system, use:

qsub serial.cmd
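
If you run many similar serial jobs, the script can be generated rather than edited by hand. The following is a sketch under stated assumptions: write_serial_job is a made-up helper, its argument order is arbitrary, and the demo writes into a temporary directory instead of a real job directory. The #PBS directives are the same ones used in serial.cmd above.

```shell
#!/bin/sh
# write_serial_job FILE WORKDIR PROGRAM WALLTIME (hypothetical helper):
# write a Torque job script for a serial (1 node, 1 core) run.
write_serial_job() {
    cat > "$1" <<EOF
#!/bin/sh
# serial job using 1 node and 1 processor, wall clock limit $4
#PBS -l nodes=1:ppn=1,walltime=$4
# send mail on abort (a), begin (b), and end (e)
#PBS -m abe
cd "$2" || exit 1
./$3
EOF
}

# Demo: generate the sample script from this page into a temporary directory.
# The single quotes keep \$HOME unexpanded so it is resolved when the job runs.
job_dir=$(mktemp -d)
write_serial_job "$job_dir/serial.cmd" '$HOME/my_serial_directory' myprogram 3:00:00
cat "$job_dir/serial.cmd"
```

The generated file is submitted with qsub in the same way. After submission, the standard Torque commands qstat -u username and qdel jobid list your jobs and cancel one, respectively.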