ICSI Speech FAQ:
2.13 How do I use run-command to run programs on idle machines?

Answer by: gelbart+wooters - 2002-06-14, 2002-09-13

We are blessed with some nice software for distributing jobs over the network to idle machines. Accompanying this, we have a "farm" of compute servers on a fast (100Mbps) network.

The usual interface to this is "run-command", which allows you to distribute a particular command or script. The more sophisticated "pmake" interface parallelizes makefiles and is aware of dependencies, but usually run-command is a good fit for our needs. Actually, run-command is built on top of pmake. In this FAQ answer, I have stolen some text from emails I've received from David Johnson (ICSI head sysadmin) and Andreas Stolcke (who installed run-command here). At the bottom I have included verbatim some other useful e-mails.

Information on current jobs (all users)

/usr/local/etc/cctrl -jobs -q `reginfo -hosts -up -attr realwork`

Running jobs

man run-command
run-command <cmd>
run-command -attr noevict <cmd>
run-command -attr <machine-name> <cmd>
run-command -f <cmd.list.file>
run-command -attr noevict -f <cmd.list.file>

The "realwork" (on by default) and "noevict" attributes are described below. The 1024meg, 640meg, 512meg, 384meg, 320meg, 256meg, and 128meg attributes can be used to require that the job runs on a machine with at least that much memory.

With the -f option, run-command will run the commands read from <cmd.list.file> in parallel. This is roughly equivalent to running run-command separately for each line in <cmd.list.file>, except that with the -f option run-command will run several jobs simultaneously. The default number of simultaneous jobs is set to 8. The <cmd.list.file> can contain more than 8 lines, but only 8 will run simultaneously. (NOTE: It is possible to change the default number of jobs using the '-J' option to run-command, but this should not be done without first consulting with the other members of the speech group by sending an email to 'speech-users'.) Finally, the parallel jobs are started with a successive delay of 10 seconds to decrease the startup load on file servers (abbott, home directory and /usr/local/bin servers).

run-command supports saving the command's screen output to a log file. I usually prefer to use tee so that I see the output onscreen as well as saving it. (E.g. in csh or tcsh, "run-command -attr realwork my_script.csh |& tee logfile", using |& to pipe both standard output and standard error output to tee.) If I am running a lot of jobs and I want to keep track of them, I may run each one in a separate xterm window and use ~gelbart/bin/xt to change the title of the xterm to indicate what job is running in that window and whether it has finished yet. Then I can leave the window minimized and check whether the job has finished by checking the window title. The xterm creation and titling can be automated in scripts.

Information on machines

reginfo -showattr
reginfo -attr realwork
reginfo -attr noevict

The -showattr option will tell you what attributes are given to various hosts in the system. "realwork" designates machines owned by the Speech Group that have significant amounts of memory. Most are dedicated computational servers on the fast (100 Mbps) network; a few are desktop machines. Doing "finger @machine_name" is a (imperfect) way of checking whether a machine belongs to a user. The computational servers tend to be named after either alcoholic drinks or fish.

Jobs will only be exported to idle machines. Normally, if a machine stops being idle (for example, because a desktop user comes back from lunch), a job exported to that machine will be killed and automatically restarted on an available machine. The noevict attribute designates machines on which this will not happen--once started, a job will be allowed to run to conmpletion.

Some of the machines with the "noevict" attribute are users' desktop machines. Currently I believe these are mahimahi (wooters) and paprika (stolcke). On weekdays while these machines are being used as desktop machines, they are not ideal for exporting jobs to since exported jobs will compete with the desktop user. You can avoid them using the !machine_name attribute. (If you are running run-command from a shell you may need to use \ to escape the !, as in run-command -attr realwork -attr \!bass -attr \!ketchup -attr \!mahimahi ...)

In summary, you should almost always use the "realwork" attribute when you use run-command. If you don't want to take the chance that your jobs may get evicted and restarted, then you should add the "noevict" attribute.

Job time

run-command can be used for jobs that take hours or jobs that take seconds. However, jobs that take seconds are wasteful since exporting and starting the job takes time. If there are many such short jobs to be run (which may happen in feature calculation, for example) a good approach is to collected them into groups (say 100 jobs in each group) and then export the groups.

Networks and memory

The Speech group has a two-tiered network. Compute servers are on a 1000 Mbps net, as is our main file server abbott (?). Personal machines are on 100 Mbps nets. Jobs will run faster, and use a smaller share of network bandwidth, if the input and output files are local to the machine being run (the /scratch directories can help with this). What are these /scratch directories? On many machines, I think including all the computational servers, there is a /scratch drive (not visible from other machines) for local storage of temporary files. As part of a script, then, you can do " mkdir /scratch/tmp/your-user-name " (which will create a temporary directory for your use under the /scratch directory of whatever machine is running the script), have the script store its temporary files to that directory while running, and delete those temporary files when you are done.

Storing gigabytes of temporary files to your network-mounted home directory, on the other hand, will slow down performance for other users whose home directories are on the same server, and may annoy the sysadmins.

Below is an email I once received from David J about intelligent use of run-command/pmake. This was written when the main network was 10 Mbps instead of 100 Mbps, so the proper thresholds may be different now.

From davidj@ICSI.Berkeley.EDU Fri Jun 14 14:36:00 2002
Date: Thu, 7 Mar 2002 14:52:21 -0800
From: David Johnson 
To: David Gelbart 
Cc: David Johnson , abbot-users@ICSI.Berkeley.EDU
Subject: Re: abbott grinding ot a halt

>>>>> "David" == David Gelbart  writes:

    David> Regarding feature processing...
    >> I would also suggest that folks not be allowed to pmake jobs
    >> without understanding the implications.  I have already had to
    >> stop one person who was doing pmaked feature processing (and
    >> using <1% of the CPU due to the time spent in I/O).

    David> Yesterday I was running a script of Stephane's that
    David> calculated features for the Aurora evaluation (this is a
    David> lot of data) using pmake.  The source data was read off
    David> abbott.  The script spawned a separate pmake job for each
    David> utterance.

    David> Given the catastrophe yesterday and Chuck's request, I will
    David> make this much less parallelized for the next few weeks.

Jobs like feature calculation are the sorts of things that cripple the
network and file servers.

Before pmaking a job it's important to work out at what point it will
become I/O limited.  Time it on a representative sample of data with
all data on local disk.  Do a "top" while it's running to make sure
it's getting >95% CPU on an idle machine.  This'll indicate that it's
CPU limited, not I/O limited.  If you can't get it very close to 100%
on one machine for most of the run then don't even think about pmaking

Then work out, for the example you run, how much data is read or
written.  This includes input data, output data, paramater files
(e.g. weights for an MLP forward pass) and the size of the executable.

>From the time and the aggregate I/O required, you can work out the
average I/O bandwidth required for the job in MB/s.

For jobs on the general network, the average I/O throughput required
should be <0.5MB/s.  For the "realwork" pool, something like 
5MB/s might work.  But you also need to work out where your I/O
bottlenecks are.  If everything's being read from or written to one
machine (e.g. abbott), you have to make sure that you are only using
your fair share of that machine.  If there are say 6 pmake users and
you want to leave some bandwidth spare e.g. for interactive users, you
get 1MB/s on abbott to share between all of your jobs.  If you're
using a parallelism of 10, you'd better be using less than 100KB/s for
each job.

Note that copying data to other machines to relieve the load on abbot
has it's own set of problems.  If you're only using the data once, the
cost of the copy negates any savings.  And putting it on a desktop
machine means that you're limited to <0.5MB/s to the data, and they're
really isn't much free disk in the machine room.  Not to mention that
your random-machine file server is probably a computational server too
- your server-side NFS activity is getting in the way of someone elses
compute jobs client-side NFS requests.

Also, when profiling your typical job, run a job that is likely to use
the maximum amount of memory (if you don't know what parameter
settings use maximum memory, run a sequence in the paramter space and
work out the distribution of memory requirements).  Then make sure you
set a pmake memory attribute of at least 128MB more than your
worst-case job needs.  This is very important - not only will your
jobs run faster if you do this, but evictions are much, much faster
and you're less likely to annoy users who return to their
temporarily-idle machines.

In the end of the day, ICSI hasn't spent a fortune on computational
hardware, networks or high-bandwidth disk. Trying to pretend that we
have just doesn't work!


Previous: 2.12 Which C/C++ compilers are available? - Next: 3.1 I found this file. What is it?
Back to ICSI Speech FAQ index

Generated by build-faq-index on Tue Mar 24 16:18:14 PDT 2009