# Condor

## Really useful links

* https://batchdocs.web.cern.ch/index.html
* https://batchdocs.web.cern.ch/local/quick.html
* https://batchdocs.web.cern.ch/local/submit.html
* https://opensciencegrid.org/docs/compute-element/submit-htcondor-ce/
* https://htcondor.readthedocs.io/en/latest/users-manual/index.html

## How to submit a simple job

Imagine you have an executable `foo.exe`.

Create a submit file `condor_submit_file.sh` (no need for it to be executable) containing

```
executable = ./foo.exe
should_transfer_files = yes
universe = vanilla
output = simple.out
error = simple.err
log = simple.log
# shortest job duration
+JobFlavour = "espresso"
queue
```

Then run

```
condor_submit condor_submit_file.sh
```

The above will produce log/out/err files called simple.log, simple.out and simple.err. The log file is accessible from the moment the job is launched and is updated while the job runs. The out/err files only become available at the end of the job.

You can check the status of your job (idle/hold/run/done) using

```
condor_q
```

(Sometimes the scheduler cannot be reached by condor_q, but that does not mean the job is not running or waiting to be run.)

Condor jobs launched at CERN with `universe = vanilla` can access files stored on both AFS and EOS, so there is no need to transfer them to the server.

## Passing arguments

Add the line

```
arguments = 1 2 3 myfile.root
```

## Submitting several jobs

The following will execute foo.exe 150 times, passing mydata.0.root to mydata.149.root as argument:

```
executable = ./foo.exe
arguments = mydata.$(ProcId).root
output = simple.$(ClusterId).$(ProcId).out
error = simple.$(ClusterId).$(ProcId).err
# group all log files in one
log = simple.$(ClusterId).log
queue 150
```

Sometimes it is easier to create the submit file with a program, because the arguments can vary a lot (see the generator sketch after the following example). Example: process a file with different values of a pT cut.

```
## common part to all jobs
executable = ./foo.exe
output = simple.$(ProcId).out
error = simple.$(ProcId).err
log = simple.$(ClusterId).log

# part specific to each job
# will create 3 different jobs
arguments = file.root 40
queue
arguments = file.root 60
queue
arguments = file.root 100
queue
```
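When the argument values come from a loop or a computation, a small script that writes the submit file does the bookkeeping for you. A minimal sketch in bash, reproducing the three pT-cut jobs above (the file name and cut values are illustrative):

```
#!/bin/bash
# generate_submit.sh: write a submit file with one job per pT cut, then submit it
SUBMIT_FILE=condor_submit_file.sh

# common part; the quoted 'EOF' keeps $(ProcId)/$(ClusterId) literal for condor
cat > ${SUBMIT_FILE} <<'EOF'
executable = ./foo.exe
output = simple.$(ProcId).out
error = simple.$(ProcId).err
log = simple.$(ClusterId).log
+JobFlavour = "espresso"
EOF

# one arguments/queue pair per cut value
for cut in 40 60 100 ; do
    echo "arguments = file.root ${cut}" >> ${SUBMIT_FILE}
    echo "queue" >> ${SUBMIT_FILE}
done

condor_submit ${SUBMIT_FILE}
```

## Job duration

* espresso = 20 minutes
* microcentury = 1 hour
* longlunch = 2 hours
* workday = 8 hours
* tomorrow = 1 day
* testmatch = 3 days
* nextweek = 1 week

Line to add:

```
+JobFlavour = "longlunch"
```

## Request memory/CPUs

* https://batchdocs.web.cern.ch/local/submit.html

At CERN a default job gets 2 GB of memory and 20 GB of disk space (2 GB of memory per core).

Request more memory (which implicitly reserves more cores):

```
request_memory = 4GB
```

This will reserve 2 cores, since memory is granted in units of 2 GB per core.

Request more disk space:

```
request_disk = 40GB
```

## Requirements on the job host

```
requirements = (OpSysAndVer =?= "CentOS7" && Arch =?= "X86_64")
```

## Concrete example

Imagine you want to execute a bash file `run_foo_prog.sh` which will run your program, and let's say that program requires ROOT 6.18 to be set up. If the program `foo.exe` finishes properly it prints RUN_SUCCESSFULL, otherwise that string is not printed.

First make sure `run_foo_prog.sh` is executable: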
```
chmod +x run_foo_prog.sh
```

Let's assume you want to pass some arguments to the executable `foo.exe`, for example a filename and a number. The command to launch the job is then

```
condor_submit MyArgs="file.root 4" condor_submit_file.sh
```

This time the program `foo.exe` needs to be transferred to the remote host. (Note that paths in `condor_submit_file.sh` are relative to the submission directory, unless an absolute path is used and unless initialdir is specified, see later.)

condor_submit_file.sh:

```
executable = ./run_foo_prog.sh
transfer_input_files = ./foo.exe
should_transfer_files = yes
Arguments = $(MyArgs)
universe = vanilla
output = simple.out
error = simple.err
log = simple.log
+JobFlavour = "espresso"
queue
```

run_foo_prog.sh:

```
#!/bin/bash
# First steps to do before anything else
# ===================================
# Capture all arguments; do not use $@ here, see
# https://stackoverflow.com/questions/3811345/how-to-pass-all-arguments-passed-to-my-bash-script-to-a-function-of-mine/3816747
# https://stackoverflow.com/questions/12314451/accessing-bash-command-line-args-vs
Args="$*"

# set up ATLAS and ROOT
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
source /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.18.04/x86_64-centos7-gcc48-opt/bin/thisroot.sh

# to avoid complaints on compute nodes
export HOME=${PWD}

echo "PWD = $PWD"
echo "PATH = $PATH"
echo "LD_LIBRARY_PATH = $LD_LIBRARY_PATH"
echo "ROOTSYS = $ROOTSYS"

COMMAND_LINE="./foo.exe $Args"
echo "Executing command line"
echo "$COMMAND_LINE"

# execute the command and also direct its output to the _condor_stdout file,
# which will become the output file at the end of the job
output_str_commandline=$(eval $COMMAND_LINE | tee -a _condor_stdout)

# test whether the program ended successfully
grep -q "RUN_SUCCESSFULL" <<< "${output_str_commandline}"
bool_succeeded=$?

echo
if [ "${bool_succeeded}" -eq 0 ] ; then
    echo "========================="
    echo "SUCCESS_EXECUTION"
    echo "========================="
    echo ""
    # return 0 = success
    exit 0
else
    echo "========================="
    echo "FAILURE_EXECUTION"
    echo "========================="
    echo ""
    # return 1 = failure
    exit 1
fi
```
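Before submitting, it can be worth running the wrapper once locally; a quick check under the same assumptions as above (`foo.exe` in the current directory, `file.root 4` as illustrative arguments):

```
./run_foo_prog.sh file.root 4
echo "exit code = $?"   # 0 if RUN_SUCCESSFULL was printed, 1 otherwise
```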
## Retry failed jobs (super useful!)

* https://batchdocs.web.cern.ch/workarounds/job-retry.html

```
# Send the job to held state on failure
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the jobs every 10 minutes, up to a maximum of 10 retries
periodic_release = (NumJobStarts < 10 && ((CurrentTime - EnteredCurrentStatus) > 600))
```

## Transfer input/output files/directory

Transfer the setup.sh script and the build directory (NB: no trailing slash for build, otherwise it would transfer what is inside the build directory and not the directory itself):

```
# note: no trailing slash for build, otherwise it would transfer what is inside
# the build directory and not the directory itself
transfer_input_files = setup.sh,build

# will transfer the directory results_dir and myfile.txt from the server to the submission directory
# note: no trailing slash for results_dir, otherwise it would transfer what is inside
# results_dir and not the directory itself
transfer_output_files = results_dir,myfile.txt
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
```

The transfer_output_remaps mechanism (beware: only valid for files, not directories):

* https://manpages.debian.org/stretch/htcondor/condor_submit.1.en.html

How to "remap" output directories to be different from the submission directory? Use initialdir:

* https://www-auth.cs.wisc.edu/lists/htcondor-users/2020-September/msg00039.shtml

Note that condor submission from EOS is not allowed:

* https://batchdocs.web.cern.ch/troubleshooting/eos.html#no-eos-submission-allowed

How to remap the output directory to EOS? Use a trick I found: set `initialdir = /./eos/path_you_want`.

Just make sure the log files are not on EOS, otherwise it will not work (i.e. when submitting with `Log = path_you_want`, path_you_want must not be on EOS). E.g. go to AFS and you can launch

```
executable = $ENV(PWD)/foo.exe
should_transfer_files = YES
transfer_output_files = results
# trick for condor
initialdir = /./eos/user/b/bouquet/
+JobFlavour = "espresso"

#job1
Log = $ENV(PWD)/foo_a.log
Output = $ENV(PWD)/foo_a.out
Error = $ENV(PWD)/foo_a.error
Arguments = a
queue

#job2
Log = $ENV(PWD)/foo_b.log
Output = $ENV(PWD)/foo_b.out
Error = $ENV(PWD)/foo_b.error
Arguments = b
queue
```

## Working with big files

* https://batchdocs.web.cern.ch/tutorial/exercise11.html → "transfer_input_files and transfer_output_files (In fact, the output is limited to 1GB)"

Use xrdcp instead of the built-in transfer mechanism; see the sketch below.
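Big outputs can be copied to EOS directly from the job script. A minimal sketch, assuming the CERN EOS user instance `eosuser.cern.ch`; the file name and destination path are illustrative:

```
# at the end of the job script, after foo.exe has produced big_output.root
# (file name and destination path are illustrative; adapt to your EOS area)
xrdcp -f big_output.root root://eosuser.cern.ch//eos/user/b/bouquet/big_output.root
```

## Define environment variables for the condor job

Imagine you want to define environment variables on the host, for example your EOS path or the absolute path the job was submitted from:

```
environment = "ABSPATH_SUBMITTER=$ENV(PWD) EOSPATH=/eos/user/b/bouquet/"
```

## List all properties of jobs

```
condor_q -l
```

## Remove jobs

Remove all the jobs you launched:

```
condor_rm -all
```

Remove specific jobs based on their cluster ids, e.g. 5001 and 5002:

```
condor_rm 5001 5002
```

## Connect to a job to see if it is running successfully

Only works if the job is in the run state:

```
condor_ssh_to_job 5001.0
```

## Use Proxy

* https://batchdocs.web.cern.ch/tutorial/exercise2e_proxy.html

Before submitting the job, set up the following variables:

```
# set up a proxy valid for 96h
echo "Setting voms-proxy"
voms-proxy-init -voms atlas -valid 96:00
voms-proxy-info -all
export PROXYFILENAME=x509up_u$(id -u)
export PROXYFILEPATH=$HOME/private/$PROXYFILENAME
echo "Copying $PROXYFILENAME to $HOME/private/"
cp /tmp/$PROXYFILENAME $PROXYFILEPATH
```

In the submit file add

```
transfer_input_files = $ENV(PROXYFILEPATH)
environment = "X509_USER_PROXY=$ENV(PROXYFILENAME)"
```

And in your bash script that will be executed on the remote host add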
echo "X509_USER_PROXY = $X509_USER_PROXY" voms-proxy-info -all voms-proxy-info -all -file $X509_USER_PROXY ````