We are blessed with some nice software for distributing jobs over the network to idle machines. Accompanying this, we have a "farm" of compute servers on a fast (100 Mbps) network.
The usual interface to this is "run-command", which allows you to distribute a particular command or script. The more sophisticated "pmake" interface parallelizes makefiles and is aware of dependencies, but usually run-command is a good fit for our needs. Actually, run-command is built on top of pmake. In this FAQ answer, I have stolen some text from emails I've received from David Johnson (ICSI head sysadmin) and Andreas Stolcke (who installed run-command here). At the bottom I have included verbatim some other useful e-mails.
/usr/local/etc/cctrl -jobs -q `reginfo -hosts -up -attr realwork`
man run-command
run-command <cmd>
run-command -attr noevict <cmd>
run-command -attr <machine-name> <cmd>
run-command -f <cmd.list.file>
run-command -attr noevict -f <cmd.list.file>
The "realwork" (on by default) and "noevict" attributes are described below. The 1024meg, 640meg, 512meg, 384meg, 320meg, 256meg, and 128meg attributes can be used to require that the job runs on a machine with at least that much memory.
With the -f option, run-command will run the commands read from <cmd.list.file> in parallel. This is roughly equivalent to running run-command separately for each line in <cmd.list.file>, except that with the -f option run-command will run several jobs simultaneously. The default number of simultaneous jobs is set to 8. The <cmd.list.file> can contain more than 8 lines, but only 8 will run simultaneously. (NOTE: It is possible to change the default number of jobs using the '-J' option to run-command, but this should not be done without first consulting with the other members of the speech group by sending an email to 'speech-users'.) Finally, the parallel jobs are started with a successive delay of 10 seconds to decrease the startup load on file servers (abbott, home directory and /usr/local/bin servers).
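For concreteness, a <cmd.list.file> holds exactly one command per line. A hypothetical three-line file (the script and utterance names here are made up) might look like:

```
compute_feats.csh utt001
compute_feats.csh utt002
compute_feats.csh utt003
```

Running "run-command -attr realwork -f cmd.list.file" would then export the three commands in parallel (up to the simultaneous-job limit).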
run-command supports saving the command's screen output to a log file. I usually prefer to use tee so that I see the output onscreen as well as saving it. (E.g. in csh or tcsh, "run-command -attr realwork my_script.csh |& tee logfile", using |& to pipe both standard output and standard error output to tee.) If I am running a lot of jobs and I want to keep track of them, I may run each one in a separate xterm window and use ~gelbart/bin/xt to change the title of the xterm to indicate what job is running in that window and whether it has finished yet. Then I can leave the window minimized and check whether the job has finished by checking the window title. The xterm creation and titling can be automated in scripts.
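For Bourne-style shells (sh/bash), the csh "|&" becomes "2>&1 |". Here is a minimal sketch of the tee pattern, with a made-up stand-in script and the run-command wrapper omitted so the pipeline itself can be tried anywhere:

```shell
# Create a stand-in job script (hypothetical) that writes to both
# standard output and standard error.
cat > my_script.sh <<'EOF'
echo "processing utt001"
echo "warning: low memory" >&2
EOF

# 2>&1 merges stderr into stdout so tee captures both streams;
# output appears on screen and is also saved to logfile.
sh my_script.sh 2>&1 | tee logfile
```

Afterwards logfile contains both the stdout and the stderr lines; with run-command in front, the csh form above ("... |& tee logfile") does the same job.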
reginfo -showattr
reginfo -attr realwork
reginfo -attr noevict
The -showattr option will tell you what attributes are given to various hosts in the system. "realwork" designates machines owned by the Speech Group that have significant amounts of memory. Most are dedicated computational servers on the fast (100 Mbps) network; a few are desktop machines. Doing "finger @machine_name" is an (imperfect) way of checking whether a machine belongs to a user. The computational servers tend to be named after either alcoholic drinks or fish.
Jobs will only be exported to idle machines. Normally, if a machine stops being idle (for example, because a desktop user comes back from lunch), a job exported to that machine will be killed and automatically restarted on an available machine. The noevict attribute designates machines on which this will not happen--once started, a job will be allowed to run to completion.
Some of the machines with the "noevict" attribute are users' desktop machines. Currently I believe these are mahimahi (wooters) and paprika (stolcke). On weekdays, while these machines are being used as desktop machines, they are not ideal for exporting jobs to, since exported jobs will compete with the desktop user. You can avoid a particular machine using the !machine_name attribute. (If you are running run-command from a shell, you may need a backslash to escape the !, as in "run-command -attr realwork -attr \!bass -attr \!ketchup -attr \!mahimahi ...".)
In summary, you should almost always use the "realwork" attribute when you use run-command. If you don't want to take the chance that your jobs may get evicted and restarted, then you should add the "noevict" attribute.
run-command can be used for jobs that take hours or jobs that take seconds. However, jobs that take only seconds are wasteful, since exporting and starting each job takes time. If there are many such short jobs to be run (which may happen in feature calculation, for example), a good approach is to collect them into groups (say, 100 jobs per group) and then export the groups.
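One way to sketch this grouping with standard tools (every file and program name below is made up): write the short commands one per line, split the list into chunks of 100, and wrap each chunk in a driver script that becomes a single exported job.

```shell
# 250 hypothetical short feature-calculation commands, one per line
# (calc_feats is a made-up program name).
seq 1 250 | awk '{ printf "calc_feats utt%03d\n", $1 }' > all_cmds.list

# Split into groups of 100 commands: group.aa, group.ab, group.ac.
split -l 100 all_cmds.list group.

# Wrap each group in a driver script that runs its commands serially.
for g in group.a?; do
  { echo '#!/bin/sh'; cat "$g"; } > "run_$g.sh"
  chmod +x "run_$g.sh"
done

# One driver per line; this file could be handed to "run-command -f".
ls run_group.*.sh > driver.list
```

Exporting driver.list (e.g. "run-command -attr realwork -f driver.list") then pays the job-startup cost 3 times instead of 250.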
The Speech group has a two-tiered network. Compute servers are on a 1000 Mbps net, as is our main file server abbott (?). Personal machines are on 100 Mbps nets. Jobs will run faster, and use a smaller share of network bandwidth, if the input and output files are local to the machine the job runs on (the /scratch directories can help with this). What are these /scratch directories? On many machines, I think including
Storing gigabytes of temporary files to your network-mounted home directory, on the other hand, will slow down performance for other users whose home directories are on the same server, and may annoy the sysadmins.
Below is an email I once received from David J about intelligent use of run-command/pmake. This was written when the main network was 10 Mbps instead of 100 Mbps, so the proper thresholds may be different now.
From davidj@ICSI.Berkeley.EDU Fri Jun 14 14:36:00 2002
Date: Thu, 7 Mar 2002 14:52:21 -0800
From: David Johnson
To: David Gelbart
Cc: David Johnson, abbot-users@ICSI.Berkeley.EDU
Subject: Re: abbott grinding to a halt

>>>>> "David" == David Gelbart writes:

David> Regarding feature processing...

>> I would also suggest that folks not be allowed to pmake jobs
>> without understanding the implications. I have already had to
>> stop one person who was doing pmaked feature processing (and
>> using <1% of the CPU due to the time spent in I/O).

David> Yesterday I was running a script of Stephane's that
David> calculated features for the Aurora evaluation (this is a
David> lot of data) using pmake. The source data was read off
David> abbott. The script spawned a separate pmake job for each
David> utterance.

David> Given the catastrophe yesterday and Chuck's request, I will
David> make this much less parallelized for the next few weeks.

Jobs like feature calculation are the sorts of things that cripple the network and file servers. Before pmaking a job it's important to work out at what point it will become I/O limited. Time it on a representative sample of data with all data on local disk. Do a "top" while it's running to make sure it's getting >95% CPU on an idle machine. This'll indicate that it's CPU limited, not I/O limited. If you can't get it very close to 100% on one machine for most of the run, then don't even think about pmaking it.

Then work out, for the example you ran, how much data is read or written. This includes input data, output data, parameter files (e.g. weights for an MLP forward pass) and the size of the executable. From the time and the aggregate I/O required, you can work out the average I/O bandwidth required for the job in MB/s. For jobs on the general network, the average I/O throughput required should be <0.5MB/s. For the "realwork" pool, something like 5MB/s might work. But you also need to work out where your I/O bottlenecks are.

If everything's being read from or written to one machine (e.g. abbott), you have to make sure that you are only using your fair share of that machine. If there are, say, 6 pmake users and you want to leave some bandwidth spare (e.g. for interactive users), you get 1MB/s on abbott to share between all of your jobs. If you're using a parallelism of 10, you'd better be using less than 100KB/s for each job.

Note that copying data to other machines to relieve the load on abbott has its own set of problems. If you're only using the data once, the cost of the copy negates any savings. And putting it on a desktop machine means that you're limited to <0.5MB/s to the data, and there really isn't much free disk in the machine room. Not to mention that your random-machine file server is probably a computational server too - your server-side NFS activity is getting in the way of someone else's compute jobs' client-side NFS requests.

Also, when profiling your typical job, run a job that is likely to use the maximum amount of memory (if you don't know what parameter settings use maximum memory, run a sequence in the parameter space and work out the distribution of memory requirements). Then make sure you set a pmake memory attribute of at least 128MB more than your worst-case job needs. This is very important - not only will your jobs run faster if you do this, but evictions are much, much faster and you're less likely to annoy users who return to their temporarily-idle machines.

At the end of the day, ICSI hasn't spent a fortune on computational hardware, networks or high-bandwidth disk. Trying to pretend that we have just doesn't work!

David.
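David's back-of-envelope bandwidth check can be sketched with made-up numbers: suppose a representative job runs for 200 seconds on local disk and moves 84 MB in total (50 MB input, 19 MB output, 10 MB parameter files, 5 MB executable). awk does the floating-point division:

```shell
# Hypothetical measurements; aggregate I/O divided by runtime
# gives the average per-job bandwidth demand.
awk 'BEGIN {
  mb   = 50 + 19 + 10 + 5   # total MB moved (input+output+params+binary)
  secs = 200                # measured runtime with all data local
  printf "%.2f MB/s\n", mb / secs
}'
```

This prints "0.42 MB/s", under the <0.5MB/s guideline for the general network; but remember that running N such jobs in parallel against one file server multiplies the demand on that server by N.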
Generated by build-faq-index on Tue Mar 24 16:18:14 PDT 2009