Configuring a new machine
If your machine is not supported by SimFactory already, you will need to write your own option list, run script and (for a cluster) submit script.
When using SimFactory on a cluster, it needs to know a lot of information about the details of the cluster, provided in a "machine definition file" in simfactory/mdb/machines. For example, it needs to know the number of cores on each node. Copy one of the provided files, and adapt it to your machine. Getting this right is nontrivial.
When using SimFactory on a laptop or workstation (i.e. a machine for which SimFactory has no machine definition file), the "sim setup" command will write a suitable machine definition file automatically.
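On a workstation, a setup session typically looks like the following (run from the root of the Cactus checkout; `sim setup-silent` is the non-interactive variant used in the Einstein Toolkit tutorials):

```shell
# Interactively create a machine definition and defs.local.ini for
# this machine, prompting for user name, email address, etc.:
./simfactory/bin/sim setup

# Or accept all defaults without prompting:
./simfactory/bin/sim setup-silent
```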
The following is based on stampede.ini:
nickname gives the name of the machine, which can be used with simfactory's --machine option. It must be unique among all machine definition files.
    nickname    = stampede
    name        = Stampede
    location    = TACC
    description = A very large Linux cluster at TACC
    webpage     = http://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide
    status      = production
name, location, description, webpage and status describe the machine; they are used by simfactory when reporting on the machine, but are arbitrary otherwise.
    hostname     = stampede.tacc.utexas.edu
    rsynccmd     = /home1/00507/eschnett/rsync-3.0.9/bin/rsync
    envsetup     = <<EOT
        module load intel/15.0.2
        module load mvapich2/2.1
        module -q load hdf5
        module load fftw3
        module load gsl
        module load boost
        module load papi
    EOT
    aliaspattern = ^login(\.stampede\.tacc\.utexas\.edu)?$
hostname is the name of the login node that simfactory uses to log in to the machine, and rsynccmd is the rsync executable used when syncing the source tree to it. envsetup is executed before each simfactory command (in particular during building and when running a simulation) to ensure that a consistent set of libraries is loaded. Finally, aliaspattern is a regular expression used by simfactory to identify the machine; it must match all cluster login nodes.
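You can test an aliaspattern against candidate hostnames with a short snippet like the following (the hostnames are illustrative):

```python
import re

# aliaspattern from the machine definition above
aliaspattern = r"^login(\.stampede\.tacc\.utexas\.edu)?$"

def matches(hostname):
    """Return True if this pattern would identify the host as stampede."""
    return re.search(aliaspattern, hostname) is not None

print(matches("login"))                           # short login-node name
print(matches("login.stampede.tacc.utexas.edu"))  # fully qualified name
print(matches("stampede.tacc.utexas.edu"))        # not a login-node name
```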
    sourcebasedir  = /work/00507/@USER@
    disabled-thorns = <<EOT
        ExternalLibraries/BLAS
        ExternalLibraries/CGNS
        ExternalLibraries/curl
        LSUThorns/Twitter
        ExternalLibraries/flickcurl
        LSUThorns/Flickr
        ExternalLibraries/LAPACK
        ExternalLibraries/libxml2
        ExternalLibraries/Nirvana
        CarpetDev/CarpetIONirvana
        CarpetExtra/Nirvana
        ExternalLibraries/OpenSSL
    EOT
    enabled-thorns = <<EOT
        ExternalLibraries/OpenCL
        CactusExamples/HelloWorldOpenCL
        CactusExamples/WaveToyOpenCL
        CactusUtils/OpenCLRunTime
        CactusUtils/Accelerator
        McLachlan/ML_BSSN_CL
        McLachlan/ML_BSSN_CL_Helper
        McLachlan/ML_WaveToy_CL
        ExternalLibraries/OpenBLAS
        ExternalLibraries/pciutils
        ExternalLibraries/PETSc
        CactusElliptic/EllPETSc
        CactusElliptic/TATelliptic
        CactusElliptic/TATPETSc
    EOT
sourcebasedir is the root directory underneath which all Cactus source trees are located; it should be large enough to hold multiple compiled Cactus checkouts. Some clusters do not provide all the libraries needed to run every thorn in the Einstein Toolkit, or they require alternative libraries (e.g. OpenBLAS instead of LAPACK). The disabled-thorns and enabled-thorns lists let you choose which thorns to enable or disable in thornlists: enabled-thorns will remove a #DISABLED marker from matching lines in the thornlist, and disabled-thorns will add one.
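The effect on a thornlist can be sketched as follows; this is an illustration of the mechanism, not simfactory's actual implementation:

```python
def apply_thorn_settings(thornlist_lines, enabled, disabled):
    """Add or remove '#DISABLED' markers on thornlist lines, in the
    spirit of simfactory's enabled-thorns/disabled-thorns lists."""
    out = []
    for line in thornlist_lines:
        stripped = line.strip()
        thorn = stripped.replace("#DISABLED", "").strip()
        if thorn in disabled and not stripped.startswith("#DISABLED"):
            out.append("#DISABLED " + stripped)   # disable this thorn
        elif thorn in enabled and stripped.startswith("#DISABLED"):
            out.append(thorn)                     # re-enable this thorn
        else:
            out.append(line)
    return out

lines = ["ExternalLibraries/LAPACK", "#DISABLED ExternalLibraries/OpenBLAS"]
print(apply_thorn_settings(lines,
                           enabled={"ExternalLibraries/OpenBLAS"},
                           disabled={"ExternalLibraries/LAPACK"}))
# → ['#DISABLED ExternalLibraries/LAPACK', 'ExternalLibraries/OpenBLAS']
```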
    optionlist   = stampede-mvapich2.cfg
    submitscript = stampede.sub
    runscript    = stampede-mvapich2.run
    make         = make -j8
optionlist, submitscript and runscript are used when compiling and submitting simulations and are described in detail below. You can find examples in the subdirectories of simfactory/mdb. make is the command used to compile the code; it can contain extra arguments, for example to enable parallel compilation.
The final set of options deals with the queuing system and characteristics of the machine.
    basedir         = /scratch/00507/@USER@/simulations
    cpu             = Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
    cpufreq         = 2.7
    flop/cycle      = 8
    ppn             = 16
    spn             = 2
    mpn             = 2
    max-num-threads = 16
    num-threads     = 8
    memory          = 32768
    nodes           = 6400
    min-ppn         = 16
    allocation      = NO_ALLOCATION
    queue           = normal      # [normal, large, development]
    maxwalltime     = 48:00:00    # development has 4:0:0
    maxqueueslots   = 25          # there are 50, but jobs can take 2 slots
basedir is the root directory under which all simulations are created; it should live on a fast, parallel file system. cpu, cpufreq and flop/cycle describe the machine's processors and are currently unused by simfactory.
spn is the number of CPU sockets per node, mpn is the number of NUMA domains per node, and ppn is the number of cores per node (historically called processors, hence the name). ppn is passed by simfactory to the queuing system to request a certain number of cores per node.
max-num-threads is the maximum number of threads that can be used, typically the same as ppn.
num-threads is the default number of threads used, often the number
of cores in a single NUMA domain.
min-ppn is the minimum number of cores that need to be requested; often this is identical to ppn if the queuing system does not handle under-subscribing a node.
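These settings determine how processes and threads are laid out on the nodes. A sketch of the arithmetic, using the stampede values above (the number of nodes requested is a made-up example):

```python
# Values from the machine definition above
ppn = 16          # cores per node
num_threads = 8   # default OpenMP threads per MPI process
nodes_requested = 4   # hypothetical job size

procs_per_node = ppn // num_threads   # MPI processes per node
total_procs = procs_per_node * nodes_requested
total_cores = ppn * nodes_requested

print(procs_per_node, total_procs, total_cores)  # 2 8 64
```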
memory gives the amount of memory per node in MB and is currently only used by simfactory's utility scripts. nodes is the number of nodes in the cluster and is only used to abort if more nodes than are in the cluster are requested.
allocation gives the allocation to which to charge runs on clusters where computer time is accounted for (which is almost all clusters). queue is the default queue to submit to, often named "default", "batch", "production" or similar.
maxwalltime is the maximum allowed run time for a single job; if a longer-running simulation is requested, simfactory automatically splits it up into chunks no longer than this.
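The chunking can be sketched as follows; this illustrates the idea, not simfactory's actual code:

```python
def split_walltime(requested_hours, maxwalltime_hours):
    """Split a requested run time into job chunks, each no longer
    than the machine's maxwalltime."""
    chunks = []
    remaining = requested_hours
    while remaining > 0:
        chunk = min(remaining, maxwalltime_hours)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# A 120-hour simulation on a machine with maxwalltime = 48:00:00
print(split_walltime(120, 48))  # [48, 48, 24]
```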
maxqueueslots is the maximum
number of jobs that can be queued at the same time, a limit imposed on some
clusters to reduce the load on the queue scheduler.
Please see SimFactory's online documentation for the exact definition of the terms that SimFactory uses to refer to cores, nodes, CPUs, processing units etc.
    submit          = sbatch @SCRIPTFILE@; sleep 5   # sleep 60
    getstatus       = squeue -j @JOB_ID@
    stop            = scancel @JOB_ID@
    submitpattern   = Submitted batch job ([0-9]+)
    statuspattern   = '@JOB_ID@ '
    queuedpattern   = ' PD '
    runningpattern  = ' (CF|CG|R|TO) '
    holdingpattern  = ' S '
    #exechost        = head -n 1 SIMFACTORY/NODES
    #exechostpattern = ^(\S+)
    stdout          = cat @SIMULATION_NAME@.out
    stderr          = cat @SIMULATION_NAME@.err
    stdout-follow   = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err
submit, getstatus and stop are used by simfactory as the commands to submit a new job to the queuing system, query the status of a running job, and cancel a running job. You can use @JOB_ID@ to refer to the job's identifier in them. The respective pattern variables are regular expressions that simfactory matches against the output of the commands.
submitpattern must capture (enclose in parentheses) the actual job id so that it can be referred to as the first captured group ($1 in sed).
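The capturing behaviour can be checked against a typical sbatch response; the job id below is made up:

```python
import re

submitpattern = r"Submitted batch job ([0-9]+)"

output = "Submitted batch job 1234567"   # hypothetical sbatch output
m = re.search(submitpattern, output)
job_id = m.group(1)   # the first captured group, $1 in sed terms
print(job_id)         # 1234567
```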
statuspattern is used to select the line in getstatus's output that contains the actual job state information. The queuedpattern, runningpattern and holdingpattern patterns are used to identify job states; whichever matches first (in the order listed above) determines the job state.
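The first-match-wins classification can be sketched like this; the squeue output lines are hypothetical:

```python
import re

# Patterns from the machine definition above, in matching order
patterns = [
    ("queued",  r" PD "),
    ("running", r" (CF|CG|R|TO) "),
    ("holding", r" S "),
]

def job_state(squeue_line):
    """Return the first state whose pattern matches the line."""
    for state, pattern in patterns:
        if re.search(pattern, squeue_line):
            return state
    return "unknown"

# Hypothetical squeue output lines for job 1234567
print(job_state(" 1234567  normal  sim  user  PD  0:00  4 (Priority)"))
print(job_state(" 1234567  normal  sim  user   R  1:23  4 c123-001"))
```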
stdout, stderr and stdout-follow are commands that simfactory executes during sim show-output to obtain the simulation's log output; stdout-follow is used when the --follow option is specified, to show the output of a running segment. On some clusters the log output for running segments is not directly accessible from the head nodes, and simfactory has to first log in to one of the cluster nodes to gain access to it. exechost and exechostpattern give the command and regular expression used to obtain the host to log in to. If they are set, stdout-follow is executed on that host, otherwise on the head node.
The options provided by Cactus are described in the Cactus documentation. This page provides additional information and recommendations.
The following is based on the ubuntu.cfg optionlist, which can be found in simfactory/mdb/optionlists.
    VERSION = 2012-09-28
Cactus will reconfigure when the VERSION string changes.
    CPP = cpp
    FPP = cpp
    CC  = gcc
    CXX = g++
    F77 = gfortran
    F90 = gfortran
The C and Fortran preprocessors, and the C, C++, Fortran 77 and Fortran 90 compilers, are specified by these options. You can specify a full path if the compiler you want to use is not available on your default path. Note that it is strongly recommended to use compilers from the same family; e.g. don't mix the Intel C Compiler with the GNU Fortran Compiler.
    CPPFLAGS = -DMPICH_IGNORE_CXX_SEEK
    FPPFLAGS = -traditional
    CFLAGS   = -g3 -march=native -std=gnu99
    CXXFLAGS = -g3 -march=native -std=gnu++0x
    F77FLAGS = -g3 -march=native -fcray-pointer -m128bit-long-double -ffixed-line-length-none
    F90FLAGS = -g3 -march=native -fcray-pointer -m128bit-long-double -ffixed-line-length-none
    LDFLAGS  = -rdynamic
Cactus thorns can be written in C or C++; Cactus supports the C99 and C++11 standards respectively. Additionally, the Einstein Toolkit requires the GNU extensions provided by the options gnu99 / gnu++0x (the pre-standard spelling of gnu++11 used by older compilers). If these extensions are not available, some Einstein Toolkit thorns will not compile.
-g3 ensures that debugging symbols are included in the object files. It is not necessary to set DEBUG = yes to get debugging symbols.
The rdynamic linker flag ensures that additional information is available in the executable for producing backtraces at runtime in the event of an internal error.
    LIBDIRS =
    C_LINE_DIRECTIVES = yes
    F_LINE_DIRECTIVES = yes
    DEBUG = no
    CPP_DEBUG_FLAGS = -DCARPET_DEBUG
    FPP_DEBUG_FLAGS = -DCARPET_DEBUG
    C_DEBUG_FLAGS   = -O0
    CXX_DEBUG_FLAGS = -O0
    F77_DEBUG_FLAGS = -O0
    F90_DEBUG_FLAGS = -O0
When DEBUG = yes is set (e.g. on the make command line or with SimFactory's --debug option), these debug flags are used. The intention here is to disable optimisation and enable additional code which may slow down execution but makes the code easier to debug.
    OPTIMISE = yes
    CPP_OPTIMISE_FLAGS = -DKRANC_VECTORS   # -DCARPET_OPTIMISE -DNDEBUG
    FPP_OPTIMISE_FLAGS =                   # -DCARPET_OPTIMISE -DNDEBUG
    C_OPTIMISE_FLAGS   = -O2 -ffast-math
    CXX_OPTIMISE_FLAGS = -O2 -ffast-math
    F77_OPTIMISE_FLAGS = -O2 -ffast-math
    F90_OPTIMISE_FLAGS = -O2 -ffast-math
    PROFILE = no
    CPP_PROFILE_FLAGS =
    FPP_PROFILE_FLAGS =
    C_PROFILE_FLAGS   = -pg
    CXX_PROFILE_FLAGS = -pg
    F77_PROFILE_FLAGS = -pg
    F90_PROFILE_FLAGS = -pg
    OPENMP = yes
    CPP_OPENMP_FLAGS = -fopenmp
    FPP_OPENMP_FLAGS = -fopenmp
    C_OPENMP_FLAGS   = -fopenmp
    CXX_OPENMP_FLAGS = -fopenmp
    F77_OPENMP_FLAGS = -fopenmp
    F90_OPENMP_FLAGS = -fopenmp
    WARN = yes
    CPP_WARN_FLAGS = -Wall
    FPP_WARN_FLAGS = -Wall
    C_WARN_FLAGS   = -Wall
    CXX_WARN_FLAGS = -Wall
    F77_WARN_FLAGS = -Wall
    F90_WARN_FLAGS = -Wall
The Einstein Toolkit thorns use a variety of third-party libraries such as MPI and HDF5. These are usually provided by helper thorns in the ExternalLibraries arrangement. As a general rule, to enable a capability FOO, add ExternalLibraries/FOO to your ThornList and set FOO_DIR to the directory where the include and lib directories are found.
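For example, to use a GSL installation in a non-standard location, the ThornList would contain the line ExternalLibraries/GSL, and the option list would set (the path below is illustrative):

```
GSL_DIR = /opt/gsl-2.7
```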
If no HDF5 options are given, then HDF5 will be used if it can be automatically detected from standard locations, and will be built from a source package in the HDF5 thorn if not. Alternatively you can specify HDF5_DIR to point to an HDF5 installation, for example
    HDF5_DIR = /usr/local/hdf5-1.9.1
The following options disable support for Fortran and C++ when building HDF5, as it is not required by the Einstein Toolkit.
    HDF5_ENABLE_FORTRAN = no
    HDF5_ENABLE_CXX     = no
    MPI_DIR      = /usr
    MPI_INC_DIRS = /usr/include/mpich2
    MPI_LIB_DIRS = /usr/lib
    MPI_LIBS     = mpich fmpich mpl
    PTHREADS_DIR = NO_BUILD
The submission script is used to submit a job to the queueing system. See the examples in simfactory/mdb/submitscripts, and create a new one for your cluster that uses the same queueing system.
The most important part of the run script is usually the set of modules that need to be loaded, and the mpirun command to use on the machine. See the examples in simfactory/mdb/runscripts, and create a new one for your cluster that is similar to one that already exists.
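A minimal MPI run script might look like the following sketch; the @...@ placeholders are substituted by simfactory when a segment starts, and the exact mpirun invocation depends on the cluster's MPI stack:

```shell
#! /bin/bash

echo "Starting simulation at $(date)"

# Make the environment match what the simulation expects
export OMP_NUM_THREADS=@NUM_THREADS@

# Launch the Cactus executable on the requested number of processes
mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@

echo "Simulation finished at $(date)"
```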