DistCondor


This file serves as a brief introduction to installing the distributed and fault-tolerance-enhanced version of the Condor package at your site.

 

Inside the Package

The binary distribution is packaged in the following files:
INSTALL.html	- this file
release.tar.gz	- gzipped tar file of the "release directory", which contains
		  the Condor binaries and libraries
src.tar.gz   	- gzipped tar file of the complete source tree of
		  the Task Management Project
trial.tar.gz   	- gzipped tar file of a test program
view.tar.gz   	- gzipped tar file of the Java monitor program TaskMgmtView
The release directory contains two subdirectories: "bin" and "lib".  bin
contains all executable programs.  lib contains libraries to link Condor
user programs and scripts used by the Condor system.  You should install
the release directory under a common directory in a well known location so
that your users can find the Condor programs and libraries in a consistent
manner. The release directory should be shared by all machines of a common
platform in your pool.  Note that this release supports only Solaris
2.5.1.  We may provide code for other platforms upon request.

Binary Installation

1. Un-tar the release directory.  For example, if you would like to
   install all the Condor-related files in "/usr/local/condor" (which we
   suggest), perform the following operations:
	mkdir /usr/local/condor
	cp release.tar.gz /usr/local/condor
	cd /usr/local/condor
	gzip -d < release.tar.gz | tar xvf -
	rm release.tar.gz
   This unpacks Condor's release directory, which contains "bin"
   and "lib".  Make the Condor binaries available to your users
   either by adding "bin" (in the above example commands, this
   would be /usr/local/condor/bin) to the default PATH for all
   users, or by putting soft links to all the Condor programs in a
   directory that is already in their PATH (such as
   /usr/local/bin).
2. Create a user and group "condor" on all machines in your pool.
   Create a home directory for user condor, owned by user and group
   condor, with permission rwxr-xr-x (chmod 755).
   The current binary release supports only an NFS configuration; that is,
   Condor's home directory is shared via a file server among all
   the machines in your pool. You will need to create a subdirectory
   for each machine in user condor's home directory, where Condor can
   keep machine-specific data.  Name each directory with the hostname
   of the machine whose data it will hold.
   Condor's UID and GID must be consistent across all machines
   in your pool.
3. If Condor runs into a problem at your site, it will send mail
   describing what went wrong.  Decide who should receive such
   mail.  You may make this address an alias so that you can change
   the recipient later without changing the Condor configuration.
4. Once a week Condor will try to send a status report back to its
   developers. You can specify the email address of the developer
   who is interested in obtaining this information. 
5. Check the dcondor_config file at PATH/lib/dcondor_config (PATH is the
   Condor release path, e.g., /usr/local/condor) and make any changes
   necessary for your site. Most of the settings are
   self-explanatory. In particular, check the following:
     a.	The local email address where you want Condor to notify you
	regarding problems.  See step 3 above.
     b.	The email address where Condor can send a weekly status report
	back to its authors. See step 4 above.
     c.	The pathname of the directory which contains the "bin" and
       "lib" directories, a.k.a the "release directory".  See step 1
	above. 
     d. The pathname of the directory which contains the machine
	specific data.  See step 2 above.  Note: two macros are
	available to simplify the specification of this directory.
	The $(TILDE) macro translates to the name of Condor's home
	directory on whatever machine it is evaluated on, and the
	$(HOSTNAME) macro evaluates to the hostname of whatever
	machine it is evaluated on.  Thus you could specify
	$(TILDE)/$(HOSTNAME) for this variable.
     e.	Will all machines in your pool participate in a common file
	system via AFS?  (Note: not tested.)
	If you have AFS, you will be asked the following additional
	questions.  Read the condor_customize man page for details.
	1) Do you want to use AFS?  
	2) Pathname of the AFS "fs" command?
	3) Pathname of the AFS "vos" command?
     f.	Where is your mail program located?  (Usually this is
	/bin/mail or /usr/bin/mail -- "which mail" should tell you).
     g. Internet domain of machines sharing a common UID space.  If
	you specify a UID space, Condor will execute user jobs under
	the UID of the submitting user - otherwise Condor will execute
	user jobs with the UID "nobody".  Specify "none" unless all
	machines in your pool are guaranteed to have consistent UIDs. 
     h.	Internet domain of all machines sharing a common NFS file space.
     i. At the end of the dcondor_config file, you need to specify the
	sequence number and the name of each machine that is to
	join the Condor pool. The sequence numbers can be assigned
	in arbitrary order, but they have to be consecutive. For example,
	for the list
	MACHINE0 = dolphin
	MACHINE1 = gorilla
	MACHINE12 = hyena
	only dolphin and gorilla are recognized by Condor, because
	MACHINE12 breaks the sequence.
7. After you have finished modifying dcondor_config, edit bin/dcondor_init,
   lib/config_on and lib/config_off to set the variable RELEASE properly
   (e.g., to /usr/local/condor/), and then run dcondor_init as
   root on every machine you wish to add to your pool. This sets up
   the home directory and machine-specific directories properly.
   Basically, dcondor_init creates a soft link in the condor home
   directory, ~condor/dcondor_config, that points to
   <RELEASEDIR>/lib/dcondor_config.  Changes made to this file
   affect all the machines of that platform in your pool.  Every
   machine in the pool also has a "dcondor_config.local" file in
   its machine-specific directory (see step 2 above). This file allows you
   to make local configuration changes that affect only one machine.
   The settings in dcondor_config.local always override those in the
   site-wide dcondor_config file.
8. Start the Condor daemons by running "dcondor_on" on each machine you
   want in your pool. You must be "root" or "condor" when you run dcondor_on.
   However, if you start as "condor", Condor cannot switch UIDs,
   and therefore all daemons will run as "condor". See the manual for
   details about running Condor as a user other than root.
9. Ensure that condor is running.  You can run:
	ps -elf | egrep dcondor_
   On every machine you should have processes for:
	dcondor_master
	dcondor_collector
	dcondor_startd
	dcondor_schedd
	dcondor_monitor
10. Ensure that the condor daemons are communicating.  You can run
    "dcondor_status" to get a one line summary of the status of each
    machine in your pool.
11. Add "dcondor_master" to your startup/bootup scripts (e.g., /etc/rc) so that
    your machine runs "dcondor_master" at boot time.  dcondor_master will then start
    the necessary Condor daemons whenever your machine is rebooted.
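
The sequence-number rule in step 5(i) is easy to get wrong, so it can help to check the machine list before running dcondor_init. The following is a minimal sketch, not part of the distribution: recognized_machines is a hypothetical helper, and it assumes that numbering starts at 0 (as in the example) and that each entry sits on its own line in the form MACHINE<n> = <hostname>.

```shell
#!/bin/sh
# Print the machine names that form an unbroken MACHINE0, MACHINE1, ... run
# in a dcondor_config-style file, one name per line.
# Assumptions (not confirmed by the distribution): numbering starts at 0 and
# each entry is written on its own line as "MACHINE<n> = <hostname>".
recognized_machines() {
    config_file=$1
    n=0
    while :; do
        # Look up the entry for sequence number n, wherever it appears in the file.
        name=$(sed -n "s/^[[:space:]]*MACHINE${n}[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p" "$config_file" | head -n 1)
        # The first missing number ends the recognized list.
        [ -n "$name" ] || break
        echo "$name"
        n=$((n + 1))
    done
}

# Usage (the path is the example release path from this file):
#   recognized_machines /usr/local/condor/lib/dcondor_config
```

Run against the example list in step 5(i), this prints dolphin and gorilla but not hyena.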
That's the end of the installation. If you want to know more about the
original version of Condor, consult the Condor website at
http://www.cs.wisc.edu/condor/; if you have problems with the distributed
and fault-tolerance-enhanced version of Condor, let us know.

Finally, we include a section that describes the required ownerships and
permissions for the various parts of the Condor installation. Following
the installation procedures above should set all of these up
properly.  This section is included only as a reference so you know what
those tools are doing.

Everything mentioned in this section should be owned by user and
group "condor".
A. The Condor release directory.
   The release directory contains two subdirectories: "bin" and "lib". 
	a) The bin directory should have permission rwxr-xr-x (755). 
	   All condor binaries in bin should have permission rwxr-xr-x
	   (755) except: 
		dcondor_globalq
		dcondor_history
		dcondor_jobqueue
		dcondor_preen
		dcondor_prio
		dcondor_q
		dcondor_rm
		dcondor_submit
		dcondor_summary		
	   These programs must be setgid condor to work properly.  They
	   should therefore have permission rwxr-sr-x (2755).
	b) The lib directory should also have permission rwxr-xr-x (755).
	   All the files in lib should have permission rw-r--r--
	   (644) except:
		ld
		real-ld
	   These two files are just hard links to each other.  They
	   are used by the dcondor_compile script and need to have
	   permission rwxr-xr-x (755).
B. Condor's home directory.
   Either this directory is shared between all the condor accounts in
   your pool, or it is a separate directory for each one.  If you have
   a shared directory, then you must create a machine-specific
   directory for every machine in your pool.  We refer to these
   machine-specific directories as the "local dir".
     a)	The home directory of the condor user should contain a file
	dcondor_config which is a soft link to
	<RELEASEDIR>/lib/dcondor_config, where <RELEASEDIR> is the path
	to the Condor release directory.  
     b)	Every local dir should have permissions rwxr-xr-x (755).  For 
	installations without a shared condor home directory, just
	consider the home directory itself as the local dir.
     c) Inside each local dir, there should be three subdirectories,
	each with the given permission:
	   execute	rwxrwxrwx (777)
	   log		rwxrwxr-x (775)
	   spool	rwxrwxr-x (775)
     d) Every local dir should also contain a file
	"dcondor_config.local", which should have permission rw-rw-r--
	(664).  This file can be empty, or it can contain
	machine-specific configuration settings.
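
The layout described in B can also be checked or recreated by hand. The following sketch only illustrates the end state and is not the implementation of dcondor_init; make_local_dirs is a hypothetical helper name, and ownership is left out because changing it requires root.

```shell
#!/bin/sh
# Create one machine-specific "local dir" per hostname under a shared Condor
# home directory, with the subdirectories and modes listed in section B.
# Ownership (user and group "condor") is not set here; use chown as root.
make_local_dirs() {
    condor_home=$1
    shift
    for host in "$@"; do
        local_dir=$condor_home/$host
        mkdir -p "$local_dir/execute" "$local_dir/log" "$local_dir/spool" || return 1
        chmod 755 "$local_dir"
        chmod 777 "$local_dir/execute"
        chmod 775 "$local_dir/log" "$local_dir/spool"
        # An empty dcondor_config.local is valid; keep any existing settings.
        [ -f "$local_dir/dcondor_config.local" ] || : > "$local_dir/dcondor_config.local"
        chmod 664 "$local_dir/dcondor_config.local"
    done
}

# Usage (hostnames taken from the example machine list):
#   make_local_dirs ~condor dolphin gorilla
```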

Source Installation

  Unpack src.tar.gz with
	gzip -d < src.tar.gz | tar xvf -
  This creates a directory called release_src, which contains the
  complete source tree of the Task Management Deliverables.
  To compile the source, run
	cd release_src/src/
	./dcondor_imake
	make release
  This compiles everything and creates release_src/src/release_dir/,
  which contains the binaries and the libraries. All previously
  discovered and fixed problems are documented in the file
  release_src/Problem_AND_Fixed.txt. To generate the Makefiles
  successfully from the Imakefile, you need to use the ansi_cpp
  program contained in the binary release; before generating them, set
	setenv IMAKECPP /usr/local/condor/bin/ansi_cpp
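
The setenv line above is csh syntax. For Bourne-style shells, a rough equivalent might look like the following sketch; use_ansi_cpp is a hypothetical helper name, and the default path is just the /usr/local/condor example used in this file.

```shell
#!/bin/sh
# Bourne-shell counterpart of the csh "setenv IMAKECPP ..." line.
# use_ansi_cpp is a hypothetical helper; its argument is the release
# directory, defaulting to the /usr/local/condor example path.
use_ansi_cpp() {
    release=${1:-/usr/local/condor}
    if [ ! -x "$release/bin/ansi_cpp" ]; then
        echo "ansi_cpp not found in $release/bin; install the binary release first" >&2
        return 1
    fi
    IMAKECPP=$release/bin/ansi_cpp
    export IMAKECPP
}

# Typical use, assuming the Makefiles are generated by dcondor_imake:
#   use_ansi_cpp /usr/local/condor && cd release_src/src && ./dcondor_imake && make release
```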

Trial Installation

  Unpack trial.tar.gz with
	gzip -d < trial.tar.gz | tar xvf -
  This creates a directory called trial, which contains a small
  program that can be compiled using dcondor_compile.
  To compile the program, run
	cd trial/
	make
  Then you can submit the job with
	dcondor_submit loop.cmd
  There are several command files that you can look through. They
  are well documented, so you should have no trouble writing
  a command file of your own based on them.

TaskMgmtView Installation

  Unpack view.tar.gz with
	gzip -d < view.tar.gz | tar xvf -
  This creates a directory called TaskMgmtView, which contains the Java
  source code for the task management monitor program TaskMgmtView.
  You may want to look over the Config.java file and set
  PATH so that it can invoke the Condor programs in your release
  installation (e.g., PATH can be set to /usr/local/condor/bin).
  To compile the program, download a recent JDK (JDK 1.1 or
  later) from the Sun website, install it, and run
	cd TaskMgmtView/
	javac *.java
  Then you can run it with
	java TaskMgmtView [ -d DOMAIN_NAME ]
  DOMAIN_NAME has the form eng.ohio-state.edu and should
  be set to the domain of your Condor machines. The current version
  of TaskMgmtView supports only a single-domain Condor pool.
  We recommend running TaskMgmtView on a machine that is not
  in the Condor pool, since it may take up a significant
  amount of CPU time. Running it on a high-end PC usually yields
  excellent results. When TaskMgmtView starts up, you will need
  to change the monitor port number from 60002 to 60006 and set the
  monitor machine name to any machine with Condor daemons up
  and running. Note that TaskMgmtView has not been fully
  tested in this release and may have some bugs.

Useful Condor programs (for more information, please consult the manual):

  1. dcondor_status: gives a one-line summary for every machine active
     in the pool.
  2. dcondor_q: checks the local job queue.
  3. dcondor_globalq: checks the global job queues.
  4. dcondor_history: checks the local job history (completed jobs).
  5. dcondor_globalh: checks the global job history. Note that this
     implementation of globalh reads history files directly
     from ~condor/MACHINE_NAME/spool/history. Thus it can be used
     only when NFS is properly installed and the Condor environment
     is correctly configured following the instructions above.
  6. dcondor_summary: gives a brief summary of Condor jobs.
  7. dcondor_reconfig_schedd: sends the RECONFIG command to schedd so
     that it re-reads the dcondor_config file. Useful when you have
     finished modifying the configuration file and want the changes
     to take effect immediately.
  8. dcondor_on and dcondor_off: turn Condor on and off.
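
After dcondor_on, the process check from step 9 of the binary installation can be scripted. This sketch is not part of the distribution; missing_daemons is a hypothetical helper that reads a process listing on standard input and prints each expected daemon name that does not appear in it.

```shell
#!/bin/sh
# Report which of the expected Condor daemons are absent from a process
# listing read on standard input (for example, the output of "ps -elf").
missing_daemons() {
    listing=$(cat)
    for daemon in dcondor_master dcondor_collector dcondor_startd \
                  dcondor_schedd dcondor_monitor; do
        # Print the daemon name only when no line of the listing mentions it.
        printf '%s\n' "$listing" | grep -q -- "$daemon" || echo "$daemon"
    done
}

# Usage on a pool machine:
#   ps -elf | missing_daemons
```

No output means that all five daemons listed in step 9 were found.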