This file serves as a brief introduction to installing the distributed and fault-tolerance-enhanced version of the Condor package at your site.
The binary distribution is packaged in the following files:
INSTALL.html - this file
release.tar.gz - gzipped tar file of the "release directory", which contains the Condor binaries and libraries
src.tar.gz - gzipped tar file of the complete source tree of the Task Management Project
trial.tar.gz - gzipped tar file of a test program
view.tar.gz - gzipped tar file of the Java monitor program TaskMgmtView
The release directory contains two subdirectories: "bin" and "lib". bin contains all executable programs. lib contains the libraries used to link Condor user programs and the scripts used by the Condor system. You should install the release directory under a common directory in a well-known location so that your users can find the Condor programs and libraries in a consistent manner. The release directory should be shared by all machines of a common platform in your pool. Note that this release supports only Solaris 2.5.1. We may provide code for other platforms upon request.
1. Un-tar the release directory. For example, if you would like to install all the Condor-related files in "/usr/local/condor" (which we suggest), perform the following operations:
    mkdir /usr/local/condor
    cp release.tar.gz /usr/local/condor
    cd /usr/local/condor
    gzip -d < release.tar.gz | tar xvf -
    rm release.tar.gz
This will unpack Condor's release directory, which contains "bin" and "lib". Make the Condor binaries available to your users either by adding "bin" (in the example commands above, this would be /usr/local/condor/bin) to the default PATH for all users, or by putting soft links for all the Condor programs in a directory that is already in their PATH (such as /usr/local/bin).
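As a rough sketch of the soft-link option, the following Bourne shell function links every executable file in the release "bin" directory into a directory that is already in users' PATH. The function name link_condor_bins and its two arguments are our own illustration and are not part of the Condor distribution.

```shell
# link_condor_bins BIN_DIR TARGET_DIR
# Hypothetical helper (not shipped with Condor): create a soft link in
# TARGET_DIR for every executable regular file found in BIN_DIR.
# BIN_DIR should be an absolute path so that the links resolve correctly.
link_condor_bins() {
    bin_dir=$1
    target_dir=$2
    for prog in "$bin_dir"/*; do
        # Skip anything that is not an executable regular file.
        if [ ! -f "$prog" ] || [ ! -x "$prog" ]; then
            continue
        fi
        link=$target_dir/$(basename "$prog")
        if [ -e "$link" ] || [ -L "$link" ]; then
            # Never overwrite something that is already installed there.
            echo "skipping $link: already exists" >&2
        else
            ln -s "$prog" "$link"
        fi
    done
}

# Example (run as root):
#   link_condor_bins /usr/local/condor/bin /usr/local/bin
```

The sketch deliberately refuses to replace existing entries, so it can be re-run safely after an upgrade.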
2. Create a user and group "condor" for all machines in your pool. A home directory for user condor should be created and should be owned by user and group condor and have permission rwxr-xr-x (chmod 755).
The current binary release supports only an NFS configuration. That is, Condor's home directory will be shared via a file server by all the machines in your pool. You will need to create a subdirectory for each machine in the home directory of user "condor", where Condor can keep machine-specific data. Name each directory with the hostname of the machine whose data it will hold.
Condor's UID and GID must be consistent across all machines in your pool.
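The per-machine subdirectories could be created with something like the following sketch. The function name make_machine_dirs, the home directory /home/condor, and the hostnames in the example are assumptions made for illustration only.

```shell
# make_machine_dirs CONDOR_HOME HOST...
# Hypothetical helper: create one subdirectory per hostname under the
# shared home directory of user "condor", each with mode 755.
make_machine_dirs() {
    condor_home=$1
    shift
    for host in "$@"; do
        mkdir -p "$condor_home/$host" || return 1
        chmod 755 "$condor_home/$host" || return 1
    done
}

# Example (run as user condor; path and hostnames are placeholders):
#   make_machine_dirs /home/condor dolphin gorilla
```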
3. If Condor runs into a problem at your site, it will send mail describing what went wrong. You need to decide who should receive this mail. You may make the address an alias so that you can change the recipient later without changing the Condor configuration.
4. Once a week Condor will try to send a status report back to its developers. You can specify the email address of the developer who is interested in obtaining this information.
5. Check the dcondor_config file at PATH/lib/dcondor_config (PATH is the Condor release path, e.g., /usr/local/condor) and make the changes necessary for your site. Many of the settings are self-explanatory; the ones to review in particular are:
a. The local email address where you want Condor to notify you
regarding problems. See step 3 above.
b. The email address where Condor can send a weekly status report
back to its authors. See step 4 above.
c. The pathname of the directory which contains the "bin" and
"lib" directories, a.k.a the "release directory". See step 1
above.
d. The pathname of the directory which contains the machine
specific data. See step 2 above. Note: two macros are
available to simplify the specification of this directory.
The $(TILDE) macro translates to the name of Condor's home
directory on whatever machine it is evaluated on, and the
$(HOSTNAME) macro evaluates to the hostname of whatever
machine it is evaluated on. Thus you could specify
$(TILDE)/$(HOSTNAME) for this variable.
e. Will all machines in your pool participate in a common file
system via AFS? (Note: Not tested).
If you have AFS, you also need to answer the following
additional questions. See the condor_customize man page
for details.
1) Do you want to use AFS?
2) Pathname of the AFS "fs" command?
3) Pathname of the AFS "vos" command?
f. Where is your mail program located? (Usually this is
/bin/mail or /usr/bin/mail -- "which mail" should tell you).
g. Internet domain of machines sharing a common UID space. If
you specify a UID space, Condor will execute user jobs under
the UID of the submitting user - otherwise Condor will execute
user jobs with the UID "nobody". Specify "none" unless all
machines in your pool are guaranteed to have consistent UIDs.
h. Internet domain of all machines sharing a common NFS file space
i. At the end of the dcondor_config file, you need to specify the
sequence number and the name of each machine that should
join the Condor pool. The sequence numbers can be assigned
in arbitrary order but they have to be consecutive. For example,
for the list
    MACHINE0 = dolphin
    MACHINE1 = gorilla
    MACHINE12 = hyena
only dolphin and gorilla are recognized by Condor, because the
numbering jumps from 1 to 12 and hyena is therefore ignored.
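The numbering rule for the machine list can be illustrated with a small Bourne shell sketch. It assumes that numbering starts at 0 (as in the example above) and that each entry is written as "MACHINEn = name" on a line of its own. The function name recognized_machines is hypothetical, and the sketch is not how Condor itself reads the file.

```shell
# recognized_machines CONFIG_FILE
# Hypothetical check: print the machine names that form an unbroken
# sequence MACHINE0, MACHINE1, ... and stop at the first missing number.
recognized_machines() {
    config_file=$1
    n=0
    while :; do
        # Extract the value of MACHINE<n>, if such a line exists.
        name=$(sed -n "s/^[[:space:]]*MACHINE${n}[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p" "$config_file" | head -n 1)
        if [ -z "$name" ]; then
            break
        fi
        echo "$name"
        n=$((n + 1))
    done
}

# Example:
#   recognized_machines /usr/local/condor/lib/dcondor_config
```

Comparing this output with the list you intended to configure is a quick way to spot a gap in the numbering.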
7. After you have finished modifying dcondor_config, edit bin/dcondor_init, lib/config_on and lib/config_off to set the variable RELEASE properly (e.g., to /usr/local/condor/), and then run dcondor_init as root on every machine you wish to add to your pool. This will set up the home directory and machine-specific directories properly. Basically, dcondor_init will create a soft link in the condor home directory that points from ~condor/dcondor_config to <RELEASEDIR>/lib/dcondor_config. Changes made to this file will affect all the machines of that platform in your pool. Every machine in the pool also has a "dcondor_config.local" file in its machine-specific directory (see step 2 above). This file allows you to make local configuration changes that affect only one machine. The settings in dcondor_config.local always override the ones in the site-wide dcondor_config file.
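For reference, the soft link that dcondor_init is described as creating can be sketched by hand as follows. The function name link_site_config is our own, and dcondor_init itself may do more than this.

```shell
# link_site_config CONDOR_HOME RELEASE_DIR
# Hypothetical helper: make CONDOR_HOME/dcondor_config a soft link to
# RELEASE_DIR/lib/dcondor_config, unless something already exists there.
link_site_config() {
    condor_home=$1
    release_dir=$2
    link=$condor_home/dcondor_config
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "$link already exists; leaving it alone" >&2
        return 0
    fi
    ln -s "$release_dir/lib/dcondor_config" "$link"
}

# Example (the home directory path is a placeholder):
#   link_site_config /home/condor /usr/local/condor
```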
8. Start the Condor daemons by running "dcondor_on" on each machine you want in your pool. You must be "root" or "condor" when you run dcondor_on. However, if you start as "condor", Condor cannot switch UIDs, and therefore all daemons will run as "condor". See the manual for details about running Condor as a non-root user.
9. Ensure that condor is running. You can run:
    ps -elf | egrep dcondor_
On every machine you should have processes for:
    dcondor_master
    dcondor_collector
    dcondor_startd
    dcondor_schedd
    dcondor_monitor
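The check in step 9 can be automated with a small sketch like the one below, which reads a process listing on standard input and prints the name of each expected daemon that does not appear in it. The function name missing_daemons is hypothetical.

```shell
# missing_daemons
# Hypothetical helper: read "ps" output on stdin and print the name of
# each of the five expected daemons that is not mentioned in it.
missing_daemons() {
    ps_output=$(cat)
    for daemon in dcondor_master dcondor_collector dcondor_startd \
                  dcondor_schedd dcondor_monitor; do
        case $ps_output in
            *"$daemon"*) ;;             # found, nothing to report
            *) echo "$daemon" ;;        # not found in the listing
        esac
    done
}

# Example:
#   ps -elf | missing_daemons
```

No output means that all five daemons were found in the listing.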
10. Ensure that the condor daemons are communicating. You can run
"dcondor_status" to get a one line summary of the status of each
machine in your pool.
11. Add "dcondor_master" to your startup/bootup scripts (e.g., /etc/rc) so that
your machine runs "dcondor_master" upon bootup. dcondor_master will then start
the necessary Condor daemons whenever your machine is rebooted.
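A boot-script fragment for step 11 might look like the following sketch. The function name start_condor_master is hypothetical, and the exact script to edit depends on your system.

```shell
# start_condor_master RELEASE_DIR
# Hypothetical boot-script fragment: start dcondor_master if it is
# installed under RELEASE_DIR/bin, otherwise report the problem.
start_condor_master() {
    master=$1/bin/dcondor_master
    if [ -x "$master" ]; then
        "$master"
    else
        echo "dcondor_master not found in $1/bin" >&2
        return 1
    fi
}

# In a boot script:
#   start_condor_master /usr/local/condor
```

Testing for the binary first keeps the boot sequence from failing noisily on a machine where the release directory is not yet mounted.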
That's the end of the installation. If you want to know more about the original version of Condor, consult the Condor website at http://www.cs.wisc.edu/condor/; if you have problems with the distributed and fault-tolerance-enhanced version of Condor, let us know.
Finally, we include a section that describes the required ownerships and permissions for the various parts of the Condor installation. Following the installation procedures as above should properly set up all of these things. This section is only included as a reference so you know what those tools are doing.
Everything mentioned in this section should be owned by user and group "condor".
A. The Condor release directory.
The release directory contains two subdirectories: "bin" and "lib".
a) The bin directory should have permission rwxr-xr-x (755). All condor binaries in bin should have permission rwxr-xr-x (755) except:
    dcondor_globalq
    dcondor_history
    dcondor_jobqueue
    dcondor_preen
    dcondor_prio
    dcondor_q
    dcondor_rm
    dcondor_submit
    dcondor_summary
These programs must be setgid condor to work properly. They should therefore have permission rwxr-sr-x (2755).
b) The lib directory should also have permission rwxr-xr-x (755). All the files in lib should have permission rw-r--r-- (644) except:
    ld
    real-ld
These two files are just hard links to each other. They are used by the dcondor_compile script and need to have permission rwxr-xr-x (755).
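If the modes of the release directory ever need to be restored by hand, a sketch along the following lines applies the values listed above. The function name set_release_perms is hypothetical, and ownership (user and group condor) is deliberately left untouched because changing it normally requires root.

```shell
# set_release_perms RELEASE_DIR
# Hypothetical helper: apply the documented modes to bin and lib.
set_release_perms() {
    release_dir=$1
    chmod 755 "$release_dir/bin" "$release_dir/lib"

    # Every program in bin gets 755 first ...
    for f in "$release_dir"/bin/*; do
        if [ -f "$f" ]; then chmod 755 "$f"; fi
    done
    # ... and the setgid programs are then raised to 2755.
    for prog in dcondor_globalq dcondor_history dcondor_jobqueue \
                dcondor_preen dcondor_prio dcondor_q dcondor_rm \
                dcondor_submit dcondor_summary; do
        if [ -f "$release_dir/bin/$prog" ]; then
            chmod 2755 "$release_dir/bin/$prog"
        fi
    done

    # Every file in lib gets 644, except ld and real-ld, which get 755.
    for f in "$release_dir"/lib/*; do
        if [ ! -f "$f" ]; then continue; fi
        case $(basename "$f") in
            ld|real-ld) chmod 755 "$f" ;;
            *)          chmod 644 "$f" ;;
        esac
    done
}

# Example (run as user condor or root):
#   set_release_perms /usr/local/condor
```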
B. Condor's home directory.
Either this directory is shared between all the condor accounts in your pool, or it is a separate directory for each one. If you have a shared directory, then you must create a machine-specific directory for every machine in your pool. We will refer to these machine-specific directories as the "local dir".
a) The home directory of the condor user should contain a file
dcondor_config which is a soft link to
<RELEASEDIR>/lib/dcondor_config, where <RELEASEDIR> is the path
to the Condor release directory.
b) Every local dir should have permissions rwxr-xr-x (755). For
installations without a shared condor home directory, just
consider the home directory itself as the local dir.
c) Inside each local dir, there should be three subdirectories,
each with the given permission:
execute rwxrwxrwx (777)
log rwxrwxr-x (775)
spool rwxrwxr-x (775)
d) Every local dir should also contain a file
"dcondor_config.local", which should have permission rw-rw-r--
(664). This file can be empty, or it can contain
machine-specific configuration settings.
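The layout of a local dir described in items b) through d) can be reproduced with a sketch such as the following. The function name make_local_dir is hypothetical; dcondor_init is expected to do this for you during a normal installation.

```shell
# make_local_dir LOCAL_DIR
# Hypothetical helper: create the three subdirectories and the local
# configuration file with the documented modes.
make_local_dir() {
    local_dir=$1
    mkdir -p "$local_dir/execute" "$local_dir/log" "$local_dir/spool" || return 1
    chmod 755 "$local_dir"
    chmod 777 "$local_dir/execute"
    chmod 775 "$local_dir/log" "$local_dir/spool"
    # Create an empty local configuration file if none exists yet.
    if [ ! -f "$local_dir/dcondor_config.local" ]; then
        : > "$local_dir/dcondor_config.local"
    fi
    chmod 664 "$local_dir/dcondor_config.local"
}

# Example (run as user condor; the path is a placeholder):
#   make_local_dir /home/condor/dolphin
```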
Unpack src.tar.gz with:
    gzip -d < src.tar.gz | tar xvf -
This creates a directory called release_src, which contains the complete source tree of the Task Management Deliverables. To compile the source, run:
    cd release_src/src/
    ./dcondor_imake
    make release
This compiles everything and creates release_src/src/release_dir/, which contains the binaries and the libraries. All previously discovered and fixed problems are documented in the file release_src/Problem_AND_Fixed.txt. To generate the Makefiles successfully from Imakefile, you need to use the ansi_cpp program contained in the binary release, so set the following (csh syntax) before running dcondor_imake:
    setenv IMAKECPP /usr/local/condor/bin/ansi_cpp
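The setenv command above is csh syntax. For users of sh-style shells, an equivalent could be sketched as follows. The function name use_ansi_cpp is hypothetical; only the IMAKECPP variable and the ansi_cpp program come from this release.

```shell
# use_ansi_cpp RELEASE_DIR
# Hypothetical helper for sh-style shells: export IMAKECPP so that it
# points at the ansi_cpp shipped in RELEASE_DIR/bin.
use_ansi_cpp() {
    cpp=$1/bin/ansi_cpp
    if [ ! -x "$cpp" ]; then
        echo "ansi_cpp not found at $cpp" >&2
        return 1
    fi
    IMAKECPP=$cpp
    export IMAKECPP
}

# Example:
#   use_ansi_cpp /usr/local/condor &&
#       cd release_src/src/ && ./dcondor_imake && make release
```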
Unpack trial.tar.gz with:
    gzip -d < trial.tar.gz | tar xvf -
This creates a directory called trial, which contains a small program that can be compiled using dcondor_compile. To compile the program, run:
    cd trial/
    make
Then you can submit the job using:
    dcondor_submit loop.cmd
There are several command files that you can look through. They are well documented and you shouldn't have any problem writing a command file of your own using the framework.
Unpack view.tar.gz with:
    gzip -d < view.tar.gz | tar xvf -
This creates a directory TaskMgmtView, which contains the Java source code for the task management monitor program TaskMgmtView. You may want to look over the Config.java file and set PATH so that the Condor programs are invoked from your Condor release installation (e.g., PATH can be set to /usr/local/condor/bin). To compile the program, download a current JDK (JDK 1.1 or later) from the Sun website, install it, and run:
    cd TaskMgmtView/
    javac *.java
Then you can run it with:
    java TaskMgmtView [ -d DOMAIN_NAME ]
DOMAIN_NAME has the form eng.ohio-state.edu and should be set to the domain of your Condor machines. The current version of TaskMgmtView supports only a single-domain Condor pool. We recommend running TaskMgmtView on a machine that is not in the Condor pool, since it may take up a significant amount of CPU time; running it on a high-end PC usually gives excellent results. When TaskMgmtView starts up, you will need to change the monitor port number from 60002 to 60006 and set the monitor machine name to any machine with the Condor daemons up and running. Note that TaskMgmtView has not been fully tested in this release and may have some bugs.
Useful Condor programs (for more information, please consult the manual):
1. dcondor_status gives a one-line summary for every machine active
in the pool.
2. dcondor_q checks the local job queue.
3. dcondor_globalq checks the global job queues.
4. dcondor_history checks the local job history (completed jobs).
5. dcondor_globalh checks the global job history. Note that this
implementation of globalh reads history files directly
from ~condor/MACHINE_NAME/spool/history. Thus it can only
be used when NFS is properly installed and the Condor environment
is correctly configured following the instructions above.
6. dcondor_summary gives a brief summary of Condor jobs.
7. dcondor_reconfig_schedd sends the RECONFIG command to the schedd so
that it re-reads the dcondor_config file. This is useful when you
have finished modifying the configuration file and want the changes
to take effect immediately.
8. dcondor_on and dcondor_off turn Condor on and off.
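Because dcondor_globalh depends on the per-machine history files being visible over NFS, a quick sketch like the following lists the history files that can actually be seen from the current host. The function name list_history_files is hypothetical.

```shell
# list_history_files CONDOR_HOME
# Hypothetical helper: print every CONDOR_HOME/<machine>/spool/history
# file that exists and is visible from this host.
list_history_files() {
    for f in "$1"/*/spool/history; do
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
}

# Example (the home directory path is a placeholder):
#   list_history_files /home/condor
```

A machine that is missing from the output either has no completed jobs yet or has a directory that is not reachable from this host.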