Hanging batch job

Mike Polley

Hanging batch job
There is a batch job that's been idling for over 24 hours. All efforts
to cancel the job or thread have been futile. Omegamon indicated that the
address space has been swapped out and the status is LW (long wait).
This job is also locking up other resources. What are some of the likely
causes preventing this job from being canceled? Also, what needs to be done
to make this job go away? TIA for your help.



Paul Packham

Re: Hanging batch job
(in response to Mike Polley)
Mike,

Is the job processing in DB2? If so, it may be doing a massive rollback if
the process was not committing. Also, issuing multiple cancels against a batch
job running against DB2 is not a good idea, as it causes the rollback to be
actioned a number of times.

Regards, Paul





Phil Grainger

Re: Hanging batch job
(in response to Paul Packham)
Hi Mike,

A LongWait indicates that the address space is waiting for something that
MVS doesn't expect to complete any time soon (waiting on an operator
message, waiting for an archive log tape, or waiting for a dataset recall,
for example).

Also, you say that "all efforts to cancel it" have been futile - which
implies you have issued more (possibly many more) than one cancel command.
This is not good.

If the job had done a lot of uncommitted updates before you issued the FIRST
cancel, then it would have had to back all those updates out. If you then
issue a SECOND cancel, it will have to undo all the backout work it had done
so far to get it back to the point it could start the real backout again. If
you then issue a THIRD cancel ......(you see the problem starting to
emerge??).

Worst case scenario, if the job had done 'n' updates, then every cancel you
issue adds at most another 2n backouts to the list of things to do.
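
To put rough numbers on that worst case, here is a minimal Python sketch. It assumes one particular reading of the estimate above: the first cancel triggers up to n backouts, and each further cancel adds at most another 2n. The function name and the model itself are illustrative assumptions, not anything DB2 or Omegamon reports.

```python
def worst_case_backouts(updates: int, cancels: int) -> int:
    """Upper bound on backout operations after `cancels` cancel commands.

    Assumed model (one reading of the estimate in the post): the first
    cancel triggers up to `updates` backouts, and every further cancel
    adds at most 2 * `updates` more.
    """
    if updates < 0 or cancels < 0:
        raise ValueError("updates and cancels must be non-negative")
    if cancels == 0:
        return 0  # no cancel issued, so no backout has been requested
    return updates + 2 * updates * (cancels - 1)


if __name__ == "__main__":
    # Hypothetical job with one million uncommitted updates.
    for cancels in range(1, 5):
        print(cancels, worst_case_backouts(1_000_000, cancels))
```

Under this model the bound grows linearly with the number of cancels, which is why repeating the cancel only makes the wait longer.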

As we're nearly at the weekend, you could try a -STOP DB2 MODE(FORCE) {the
sledgehammer} or /CANCEL IRLM {the piledriver} to get rid of it or an
Omegamon KILL command.

Mind you, when you restart DB2, be prepared for a long wait whilst DB2 sorts
out the data integrity!

Hope this helps

One other thing, you should be able to see from Omegamon how many updates
and commits it has done. If the evidence points to something other than a
long backout chain, then feel free to ignore me!

Phil Grainger
Computer Associates
Product Manager, DB2
Tel: +44 (0)161 928 9334
Fax: +44 (0)161 941 3775
Mobile: +44 (0)7970 125 752
[login to unmask email]







michael bell

Re: Hanging batch job
(in response to Phil Grainger)
Some things to check:
1. Did the job abend before you noticed it? If it is waiting on DB2 for
rollback (my first guess), then you have to check for DB2 messages in the
CNTL and DBM1 address spaces for things like waiting for an archive log to be
mounted or waiting for a lock. IBM support will have a full list; you need
to talk to them and have your MVS people handy.
2. Does the job issue updates? I have never seen a read-only job hang like
this, but you might get lucky. If there were updates, the backout is going
to need lots of archive logs after 24 hours.

If you can't resolve the hang normally, the only cure for this that I know
of is to either recycle DB2 or IPL. If you use an MVS FORCE command, it will
usually take DB2 down with it. The normal rules apply: an MVS FORCE command can
easily result in an unusable MVS system, so use caution. Usually recycling DB2
is sufficient, but you may need multiple stops and starts of DB2 and possibly a
customized BSDS to control the backout processing DB2 is going to want to
do. Most people want to review this with IBM support before starting down
this path, since you are practicing on production.

There are some additional complexities if you are using data sharing since
the member might be handling castout for other jobs active on other members.
This means you have to consider the entire data sharing group as being at
risk.

Mike Bell
HLS Technologies





Jeremiah Eden

Re: Hanging batch job
(in response to michael bell)
I'll byte.
Have you checked for outstanding replies on the console (including tape)?
Any Reserves on Enqueues?
Does it have any DDF connection?
Does it use stored procedures?
Are you using data sharing?
Have you displayed Lock and Utility information?
Are you willing to Force the job or recycle DB2 (at your convenience)?










Mark Ediger

Re: Hanging batch job
(in response to Jeremiah Eden)
Mike,
First check Omegamon for MVS to see if it is waiting on a system ENQ.
If so, resolve that problem. If not, and your program just issues plain
SQL statements against DB2, you will probably have to force your DB2
subsystem down. If, however, your program has called a stored procedure
or UDF, then try cancelling the WLM address space where the
function/procedure is running.
Good luck
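
The order of checks in this reply can be written down as a small decision function. This is only a sketch in Python: the function name, the two boolean inputs, and the returned strings are hypothetical labels for the steps described above, and nothing here talks to DB2, MVS, or Omegamon.

```python
def next_step(waiting_on_system_enq: bool, uses_stored_proc_or_udf: bool) -> str:
    """Return the suggested next action, following the order in the reply.

    Both arguments are answers an operator would supply by hand after
    checking the monitor; they are assumptions of this sketch.
    """
    if waiting_on_system_enq:
        # A system ENQ wait is checked first and resolved on its own terms.
        return "resolve the system ENQ contention"
    if uses_stored_proc_or_udf:
        # The work runs in a WLM-managed address space, so try cancelling that.
        return "cancel the WLM address space running the procedure/function"
    # Plain SQL from the batch program: the reply expects a forced stop.
    return "plan to force the DB2 subsystem down"


if __name__ == "__main__":
    print(next_step(waiting_on_system_enq=False, uses_stored_proc_or_udf=True))
```

The last branch is the drastic one, so it is deliberately the fallback rather than the first thing tried.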

