Troubleshooting and Notes on SMSCG Middleware and Applications

Author(s): Sergio, Placi
Reviewer: to be specified
Last modified: 7.12.2011 (by Placi)
ToDo(s): living document....

Problem list:

LRMS queue inactive

  1. In arc.conf, check that the paths to your LRMS are correct. For example, if you run PBS and its binaries are located in /usr/local/bin, you should have entries like:
    [common]
    ....
    pbs_bin_path="/usr/local/bin"
    pbs_log_path="/var/spool/pbs/server_logs"
    maui_bin_path="/usr/local/sbin/maui"
  2. On the front-end, check whether your pool accounts have permission to submit jobs to the queue. You can do this by logging into the front-end and running:
    sudo su - <pool_account>             # become the pool account
    qstat -Qf                            # this should show whether the environment is set up correctly
    echo "echo `/bin/hostname`" | qsub   # this submits a job to the batch system (if you have the permissions)
    If you were not able to submit the job, set up the permissions accordingly. For PBS this might look as follows:

    root# qmgr
    Qmgr: set queue <your_queue_name> acl_groups="pool_account_group1, pool_account_group2, ..."
  3. other suggestions (?)..
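
The path check from step 1 can be sketched as a small shell helper. This is only an illustration: the function names arc_conf_value and check_pbs_bin_path are made up for this example, and it assumes that arc.conf uses simple key="value" lines as shown above.

```shell
# Print the value of KEY ($2) from an arc.conf-style file ($1).
# Assumes lines of the form key="value"; only the first match is printed.
arc_conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$1" | head -n 1
}

# Report whether the directory configured as pbs_bin_path in file $1
# contains an executable qstat.
check_pbs_bin_path() {
    dir=$(arc_conf_value "$1" pbs_bin_path)
    if [ -n "$dir" ] && [ -x "$dir/qstat" ]; then
        echo "OK: $dir/qstat"
    else
        echo "MISSING: qstat not found under '$dir'"
        return 1
    fi
}
```

For example, check_pbs_bin_path /etc/arc.conf (adjust the path to wherever your arc.conf lives) prints an OK line if qstat is found where arc.conf says it should be.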

Submitted jobs failed at downloader process

When jobs submitted to a given site systematically fail with the following error message:

"Failed to run downloader (pre-processing)"

 

The reason could be the following:

The grid-manager has been launched without a proper LD_LIBRARY_PATH containing references to additional required lib folders.
For example, when LFC is installed in a dedicated folder ( /opt/lfc ), grid-manager needs to be launched with /opt/lfc/lib[64] within its LD_LIBRARY_PATH. Normally the grid-manager init script takes the LFC_LOCATION variable into account, so a site should check whether the variable is set properly and whether it is correctly exported to the grid-manager.
One way of inspecting the LD_LIBRARY_PATH ( together with the rest of the environment ) of a running grid-manager is the following:

# ps -aef | grep grid-manager | grep -v grep | awk '{print $2}' | xargs  --replace cat /proc/{}/environ |  tr "\000" "\n"
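
If only one variable is of interest, the same idea can be narrowed down. The sketch below is an assumption-laden variant of the command above: env_from_environ and show_grid_manager_ld_path are hypothetical helper names, and it assumes pgrep is available on the front-end.

```shell
# Print the value of variable NAME ($2) from a NUL-separated environ file ($1),
# e.g. /proc/<pid>/environ. Prints nothing if the variable is not set.
env_from_environ() {
    tr '\000' '\n' < "$1" | sed -n "s/^$2=//p"
}

# Show LD_LIBRARY_PATH for every process whose command line matches grid-manager.
# Reading /proc/<pid>/environ of another user's process requires root.
show_grid_manager_ld_path() {
    for pid in $(pgrep -f grid-manager); do
        echo "$pid: $(env_from_environ "/proc/$pid/environ" LD_LIBRARY_PATH)"
    done
}
```

An empty value after the PID would suggest that grid-manager was started without LD_LIBRARY_PATH in its environment.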

 

Stopping/Killing a 'PREPARING' job

Apparently there is no simple procedure for stopping/killing a 'PREPARING' job.
In this case the bottleneck is the downloader process (grid-manager spawns one downloader per job).
For the time being we can apply the following procedure:

  1. Identify the long-pending downloader processes
  2. Associate each downloader process with the corresponding grid job
  3. Identify job.<>.* in controldir
  4. Identify the job's sessiondir
    $ ps -aef | grep -i down

    smscg002 16672  8068  0 10:54 ?        00:00:00 /opt/nordugrid/libexec/downloader -U 1091 -f -n 10000 -c -i 3000 1659213232516901609638041 /var/log/nordugrid/jobstatus /share/apps/nordugrid/grid/1659213232516901609638041

    The 5th field (here 10:54) shows when the process started.
    Here '1659213232516901609638041' is the numeric part of the job ID,
    so in $controldir you should look for
    job.1659213232516901609638041.*

    while '/share/apps/nordugrid/grid/1659213232516901609638041' is the job's sessiondir.
       
  5. Kill the downloader process associated with the job
    kill [-9] 16672
    This kills the corresponding downloader process.
    Whether to add -9 depends on how aggressive you want to be
    (use it only if a regular kill does not work).
  6. Manually clean up the sessiondir

    rm -rf /share/apps/nordugrid/grid/1659213232516901609638041*

    Note: this path is just the example taken from the ps output above
     
  7. If necessary, manually clean up the job's files in controldir
    (though grid-manager should clean up controldir automatically, as specified in arc.conf)

    rm -rf $controldir/job.1659213232516901609638041.*
    Note: avoid this operation unless strictly necessary (e.g. to free critical disk space);
    grid-manager and other processes rely on these files for their regular operation.
    Also note that grid-manager automatically cleans up $controldir according to the defaultttl setting in arc.conf:

    [grid-manager]
    defaultttl="259200 345600" 

    # defaultttl [ttl [ttr]] - ttl is the time in seconds for how long a session
    # directory will survive after job execution has finished. If not specified
    # the default is 1 week. ttr is how long information about a job will be kept.
    # If not specified, the ttr default is one month.
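
Steps 1 to 4 of the procedure above can be partly automated by parsing the ps output. The following is a minimal sketch, not an official ARC tool: list_downloaders is a hypothetical name, and it assumes the downloader command line has the layout seen in the example (job ID third from last, sessiondir last).

```shell
# Read `ps -aef` output on stdin; for each downloader process print:
#   PID  START  JOBID  SESSIONDIR
# Assumes the argument layout of the example above: the last three arguments
# are the job id, a directory in between, and the session directory.
list_downloaders() {
    awk '$8 ~ /\/downloader$/ { print $2, $5, $(NF-2), $NF }'
}
```

It would be used as ps -aef | list_downloaders. The PID in the first column is what step 5 kills, and the job ID in the third column is the part to look for in job.<>.* under $controldir.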

        

Some jobs seem not to change state automatically

For example, these three jobs have been in "running" (INLRMS:R) state for 3 days, despite being marked as 8-hour jobs in the xRSL:

$ ngstat gsiftp://ce.lhep.unibe.ch:2811/jobs/104281319594868328983692 gsiftp://ce.lhep.unibe.ch:2811/jobs/128031319594414964103921 gsiftp://ce.lhep.unibe.ch:2811/jobs/908413195943892122102926
Job gsiftp://ce.lhep.unibe.ch:2811/jobs/104281319594868328983692 Job Name: CodemlApplication Status: INLRMS:R
Job gsiftp://ce.lhep.unibe.ch:2811/jobs/128031319594414964103921 Job Name: CodemlApplication Status: INLRMS:R
Job gsiftp://ce.lhep.unibe.ch:2811/jobs/908413195943892122102926 Job Name: CodemlApplication Status: INLRMS:R

 

If I cancel them, they move to "KILLING" state and stick there (they have been "KILLING" since yesterday evening)

$ ngkill gsiftp://ce.lhep.unibe.ch:2811/jobs/104281319594868328983692 gsiftp://ce.lhep.unibe.ch:2811/jobs/128031319594414964103921 gsiftp://ce.lhep.unibe.ch:2811/jobs/908413195943892122102926
Jobs processed: 3, killed: 3, deleted: 3

 

$ ngstat gsiftp://ce.lhep.unibe.ch:2811/jobs/104281319594868328983692 gsiftp://ce.lhep.unibe.ch:2811/jobs/128031319594414964103921 gsiftp://ce.lhep.unibe.ch:2811/jobs/908413195943892122102926
Job gsiftp://ce.lhep.unibe.ch:2811/jobs/104281319594868328983692 Job Name: CodemlApplication Status: KILLING Error: User requested to cancel the job
Job gsiftp://ce.lhep.unibe.ch:2811/jobs/128031319594414964103921 Job Name: CodemlApplication Status: KILLING Error: User requested to cancel the job
Job gsiftp://ce.lhep.unibe.ch:2811/jobs/908413195943892122102926 Job Name: CodemlApplication Status: KILLING Error: User requested to cancel the job
command : (see description above)
SMSCG_UI :
problem_started : since Oct 26, 2011
Grid_Job_Id : gsiftp://ce.lhep.unibe.ch:2811/jobs/104281319594868328983692

 

A check of the system indicated that the node on which these 3 jobs were running happened to crash. When this happens, gridengine keeps reporting the latest known status about the jobs, which in this case was:

662153 1.00500 CodemlAppl smscg004 dr 10/26/2011 04:01:04 all.q@compute-1-1.local 1
662167 1.00500 CodemlAppl smscg004 dr 10/26/2011 04:01:19 all.q@compute-1-1.local 1
662220 1.00500 CodemlAppl smscg004 dt 10/26/2011 04:18:49 all.q@compute-1-1.local 1

 

This happens when an exec node crashes: the sge_execd daemon can then no longer report updated information on the running jobs back to sge_qmaster. To handle this, we introduced the following configuration parameters for the qmaster (using qconf -mconf):

qmaster_params ENABLE_ENFORCE_MASTER_LIMIT=true, ENABLE_FORCED_QDEL_IF_UNKNOWN=true

 

ENABLE_ENFORCE_MASTER_LIMIT
If this parameter is set, the s_rt and h_rt limits of a running job are tested and enforced by the ge_qmaster(8) when the ge_execd(8) on which the job runs is in unknown state. After the s_rt or h_rt limit of a job has expired, the master daemon waits an additional time defined by DURATION_OFFSET (default 60s). If the execution daemon still cannot be contacted when this additional time has elapsed, the master daemon forces the deletion of the job.

ENABLE_FORCED_QDEL_IF_UNKNOWN
If this parameter is set then a deletion request for a job is automatically interpreted as a forced deletion request (see -f of qdel(1)) if the host, where the job is running is in unknown state.
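
To verify that both parameters are actually active, the global configuration can be inspected. The helper below is a sketch only: has_qmaster_param is a hypothetical name, and it assumes the qmaster_params entry is printed on a single line, as in the example above.

```shell
# Read Grid Engine global configuration text on stdin and succeed if the
# qmaster_params line sets parameter $1 to true (or 1), case-insensitively.
has_qmaster_param() {
    grep -i '^qmaster_params' | grep -Eqi "$1=(true|1)"
}
```

A possible use is qconf -sconf | has_qmaster_param ENABLE_FORCED_QDEL_IF_UNKNOWN && echo enabled.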

The exact behavior of the qmaster can be observed in the qmaster logfile (/opt/gridengine/default/common/reporting):

1320407696:job_log:1320407696:deleted:77992:0:NONE:r:scheduler:compute-0-1.local:0:1024:1320407306:x.sh:sergio:sergio::defaultdepartment:sge:job deleted by forced deletion request
1320407696:job_log:1320407696:finished:77992:0:NONE:r:master:ocikbpra.local:0:1024:1320407306:x.sh:sergio:sergio::defaultdepartment:sge:job waits for schedds deletion
1320407706:job_log:1320407706:deleted:77992:0:NONE:T:scheduler:ocikbpra.local:0:1024:1320407306:x.sh:sergio:sergio::defaultdepartment:sge:job deleted by schedd
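
To follow a single job through this file, the colon-separated job_log records can be filtered by job number. A minimal sketch, assuming each record is on one line and the field positions are as in the records above (event in field 4, job number in field 5, message last); job_log_events is a hypothetical helper name.

```shell
# Read a Grid Engine reporting file on stdin and print, for job number $1,
# the timestamp, event and message of each job_log record.
job_log_events() {
    awk -F: -v job="$1" '$2 == "job_log" && $5 == job { print $1, $4, $NF }'
}
```

For the job above this would be run as job_log_events 77992 < /opt/gridengine/default/common/reporting.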

 

This option has to be complemented with a more tolerant DURATION_OFFSET, set with qconf -msconf (in our case 259200 seconds, i.e. 3 days):

params DURATION_OFFSET=259200
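
Since time values in this document (DURATION_OFFSET here, defaultttl in arc.conf) are written in seconds, a quick conversion helps to double-check them. The two helpers below are purely illustrative names using plain shell arithmetic.

```shell
# Convert whole days ($1) to seconds.
days_to_secs() {
    echo $(( $1 * 86400 ))
}

# Convert seconds ($1) to whole days (integer division, remainder dropped).
secs_to_days() {
    echo $(( $1 / 86400 ))
}
```

For example, secs_to_days 259200 prints 3 and secs_to_days 345600 prints 4, while days_to_secs 2 prints 172800.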