Sun's Hudson deployment : Forked process failing to exec

Also see the public write up about this in here

On rator.sfbay, several processes are forked but failing to exec.

hudson@rator ~ $ ps -ef | grep java
noaccess   832     1   0   Jun 03 ?         285:37 /usr/java/bin/java -server -Xmx128m -XX:+BackgroundCompilation -XX:PermSize=32m
    mode 24104     1   0   Jul 01 pts/2       0:00 /usr/dist/share/java,v1.5.0_14/5.x-i86pc/bin/java -Djdk.home=/usr/dist/share/ja
  hudson 20906   536   0   Jul 02 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson  3907   536   0   Jul 02 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson   536     7   0   Jul 01 ?         393:55 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson  7398   536   0   Jul 02 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
      hg 21255 21086   0   Jun 18 ?           6:48 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=28801 -ja
  hudson   671   536   0   Jul 01 ?           0:57 /files/hudson/tools/java1.6/bin/java -Xdebug -Xrunjdwp:transport=dt_socket,serv
  hudson 15109  9416   0 16:39:50 pts/9       0:00 grep java

See the child process of 536. This is the thread that's waiting for the completion:

"Executor #2 for maddie" Id=81 RUNNABLE (in native)
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.(UNIXProcess.java:53)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	at hudson.Proc$LocalProc.(Proc.java:104)
	at hudson.Proc$LocalProc.(Proc.java:82)
	at hudson.Proc$LocalProc.(Proc.java:74)
	at hudson.Util.createSymlink(Util.java:725)
	at hudson.model.AbstractBuild$AbstractRunner.createLastSuccessfulLink(AbstractBuild.java:247)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:241)
	at hudson.model.Run.run(Run.java:778)
	at hudson.model.Build.run(Build.java:85)
	at hudson.model.ResourceController.execute(ResourceController.java:70)
	at hudson.model.Executor.run(Executor.java:85)

When I killed those processes somehow the parent process (536) was killed too, and thereby bringing down the entire Hudson. This might be related to the shortage of swap space.

2008/07/08

This just happened again on rator.sfbay. This time I used pstack to get the native stack trace:

hudson@rator ~ $ ps -ef | grep java
noaccess   832     1   0   Jun 03 ?         300:56 /usr/java/bin/java -server -Xmx128m -XX:+BackgroundCompilation -XX:PermSize=32m
  hudson  6101 15216   0   Jul 03 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson 15216     7   0   Jul 03 ?         557:02 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
    mode 24104     1   0   Jul 01 ?           0:00 /usr/dist/share/java,v1.5.0_14/5.x-i86pc/bin/java -Djdk.home=/usr/dist/share/ja
  hudson 10467 15216   0   Jul 07 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson  2565 15216   0 02:16:33 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson 27123 15216   0   Jul 03 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
      hg 21255 21086   0   Jun 18 ?           8:57 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=28801 -ja
  hudson 15707 15216   0   Jul 03 ?           3:04 /files/hudson/tools/java1.6/bin/java -Xdebug -Xrunjdwp:transport=dt_socket,serv
  hudson 10936 15216   0   Jul 05 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson 16947 15216   0   Jul 05 ?           0:00 java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=5589 -Xmx
  hudson 11346 10308   0 16:58:05 pts/15      0:00 grep java
hudson@rator ~ $ pstack 16947
16947:  java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=
 feeaf52b lwp_park (0, 0, 0)
 feea7ffe mutex_lock_internal (bfbad04, 0, 1) + 1bc
 feea843d mutex_lock_impl (bfbad04, 0) + 12d
 feea8492 mutex_lock (bfbad04) + 10
 91edd1b9 soft_delete_session (bfbad00, 1, 1) + a9
 91edce89 soft_delete_all_sessions (1, 91f0d4bc, 91f24378, 91f24000, 8843160c, feea8492) + 49
 91ed79e0 finalize_common (1, 0) + 6c
 91ed7b67 softtoken_fini (feffa7d8, 91f60958, 91f0d4bc, 88431674, fefd314b, 8de50538) + 37
 91f0d4e1 _fini    (8de50538, 92ff0de8, feffa7d8, 8de50550, 8de50538, 92ff0de8) + 25
 fefd314b call_fini (feffa2dc, 8de50538) + f3
 fefdc08a remove_hdl (91f60cb0, 92ff0de8, 0) + 7c2
 fefd7154 dlclose_core (91f60cb0, 92ff0de8, feffa2dc) + a0
 fefd71a1 dlclose_intn (91f60cb0, 92ff0de8) + 25
 fefd720d dlclose_check (91f60cb0, 92ff0de8) + 2d
 fefd7273 dlclose  (91f60cb0) + 3b
 91f838f0 pkcs11_slottable_delete (91f966d8, 91f96000, 0, fef496c0, fede0500, 884317d4) + b4
 91f7fa3c finalize_common (0, fede0500, 91f96000, 884317e0, 884317e0, 91f7f78f) + dc
 91f7fa89 pkcs11_fini (fede0080, 884317fc, fee47282, 8d137200, fef45000, 88431820) + 41
 91f7f78f pkcs11_fork_child (8d137200, fef45000, 88431820, 8d137200, 88431820, 88431820) + 27
 fee47282 _postfork_child_handler (b, 93eac24, fef45000, 99a72e80, fef49720, fef49700) + 32
 feea167d forkx    (0) + 11d
 feea172e fork     (c45d800, f42d369c, f42d36a0, c45d800, 999c2878, 99a72cd8) + 1a
 fe287edf Java_java_lang_UNIXProcess_forkAndExec (c45d8ec, 88431940, 8843193c, 88431938, 1, 88431930) + 367
 fb20ebd2 ???????? (0, fb20b399, 99a732e0, 99a732d0, 99a732c0, 0)
 fb202de9 ???????? (0, 0, c, 99a72f48, 1, 99a72ee0)
 fbee9bf0 ???????? ()
hudson@rator ~ $ pstack 27123
27123:  java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=
 feeaf52b lwp_park (0, 0, 0)
 feea7ffe mutex_lock_internal (97e6094, 0, 1) + 1bc
 feea843d mutex_lock_impl (97e6094, 0) + 12d
 feea8492 mutex_lock (97e6094) + 10
 91edd1b9 soft_delete_session (97e6090, 1, 1) + a9
 91edce89 soft_delete_all_sessions (1, 91f0d4bc, 91f24378, 91f24000, 8c19afbc, feea8492) + 49
 91ed79e0 finalize_common (1, 0) + 6c
 91ed7b67 softtoken_fini (feffa7d8, 91f60958, 91f0d4bc, 8c19b024, fefd314b, 8de50538) + 37
 91f0d4e1 _fini    (8de50538, 92ff0de8, feffa7d8, 8c19b024, fefd314b, 8de50538) + 25
 fefd314b call_fini () + f3
hudson@rator ~ $ pstack 2565 10467 6101
2565:   java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=
 feeaf52b lwp_park (0, 0, 0)
 feea7ffe mutex_lock_internal (cb6cb9c, 0, 1) + 1bc
 feea843d mutex_lock_impl (cb6cb9c, 0) + 12d
 feea8492 mutex_lock (cb6cb9c) + 10
 91edd1b9 soft_delete_session (cb6cb98, 1, 1) + a9
 91edce89 soft_delete_all_sessions (1, 91f0d4bc, 91f24378, 91f24000, 902334cc, feea8492) + 49
 91ed79e0 finalize_common (1, 0) + 6c
 91ed7b67 softtoken_fini (feffa7d8, 91f60958, 91f0d4bc, 90233534, fefd314b, 8de50538) + 37
 91f0d4e1 _fini    (8de50538, 92ff0de8, feffa7d8, 8de50550, 8de50538, 92ff0de8) + 25
 fefd314b call_fini (feffa2dc, 8de50538) + f3
 fefdc08a remove_hdl (91f60cb0, 92ff0de8, 0) + 7c2
 fefd7154 dlclose_core (91f60cb0, 92ff0de8, feffa2dc) + a0
 fefd71a1 dlclose_intn (91f60cb0, 92ff0de8) + 25
 fefd720d dlclose_check (91f60cb0, 92ff0de8) + 2d
 fefd7273 dlclose  (91f60cb0) + 3b
 91f838f0 pkcs11_slottable_delete (91f966d8, 91f96000, 0, fef496c0, fede0500, 90233694) + b4
 91f7fa3c finalize_common (0, fede0500, 91f96000, 902336a0, 902336a0, 91f7f78f) + dc
 91f7fa89 pkcs11_fini (fede0080, 902336bc, fee47282, 9134b200, fef45000, 902336e0) + 41
 91f7f78f pkcs11_fork_child (9134b200, fef45000, 902336e0, 9134b200, 902336e0, 902336e0) + 27
 fee47282 _postfork_child_handler (0, b1cde73, fef45000, f30bcaa3, fef49720, fef49700) + 32
 feea167d forkx    (0) + 11d
 feea172e fork     (9133c00, f42d3664, f42d3668, 9133c00, a200aa38, 90233730) + 1a
 fe287edf Java_java_lang_UNIXProcess_forkAndExec (9133cec, 90233800, 902337fc, 902337f8, 2, 902337f0) + 367
 fb20ebd2 ???????? (0, fb20b399, 93027718, 93027708, 930276f8, 1)
 fb202de9 ???????? (1, 93027660, 1, 93027458, 2, 930273f0)
 fc91de98 ???????? ()
10467:  java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=
 feeaf52b lwp_park (0, 0, 0)
 feea7ffe mutex_lock_internal (d7ed02c, 0, 1) + 1bc
 feea843d mutex_lock_impl (d7ed02c, 0) + 12d
 feea8492 mutex_lock (d7ed02c) + 10
 91edd1b9 soft_delete_session (d7ed028, 1, 1) + a9
 91edce89 soft_delete_all_sessions (1, 91f0d4bc, 91f24378, 91f24000, 9168b8fc, feea8492) + 49
 91ed79e0 finalize_common (1, 0) + 6c
 91ed7b67 softtoken_fini (feffa7d8, 91f60958, 91f0d4bc, 9168b964, fefd314b, 8de50538) + 37
 91f0d4e1 _fini    (8de50538, 92ff0de8, feffa7d8, 8de50550, 8de50538, 92ff0de8) + 25
 fefd314b call_fini (feffa2dc, 8de50538) + f3
 fefdc08a remove_hdl (91f60cb0, 92ff0de8, 0) + 7c2
 fefd7154 dlclose_core (91f60cb0, 92ff0de8, feffa2dc) + a0
 fefd71a1 dlclose_intn (91f60cb0, 92ff0de8) + 25
 fefd720d dlclose_check (91f60cb0, 92ff0de8) + 2d
 fefd7273 dlclose  (91f60cb0) + 3b
 91f838f0 pkcs11_slottable_delete (91f966d8, 91f96000, 0, fef496c0, fede0500, 9168bac4) + b4
 91f7fa3c finalize_common (0, fede0500, 91f96000, 9168bad0, 9168bad0, 91f7f78f) + dc
 91f7fa89 pkcs11_fini (fede0080, 9168baec, fee47282, fe45ba00, fef45000, 9168bb10) + 41
 91f7f78f pkcs11_fork_child (fe45ba00, fef45000, 9168bb10, fe45ba00, 9168bb10, 9168bb10) + 27
 fee47282 _postfork_child_handler (1, d6ba709, fef45000, 2, fef49720, fef49700) + 32
 feea167d forkx    (0) + 11d
 feea172e fork     (8da3c00, f42d3664, f42d3668, 8da3c00, a200c160, 9168bb60) + 1a
 fe287edf Java_java_lang_UNIXProcess_forkAndExec (8da3cec, 9168bc2c, 9168bc28, 9168bc24, 2, 0) + 367
 fb20ebd2 ???????? (0, fb20b399, 9658b978, 9658b968, 9658b958, 0)
 fb202de9 ???????? (0, 0, 0, 0, 2, 9658b7c0)
 fb202f0d ???????? (9658b690, 2, 0, 9658b898, c9, 9658b7c0)
 fb202da7 ???????? (9658aba8, 0, 0, f48c9b20, 9658b4b0, 9658b468)
 fb202da7 ???????? (0, 0, 9658b410, 9dab8bb0, 9168bd50, f30f4210)
 fb202da7 ???????? (9658b410, 9dab8bb0, 8da3cec, 0, 9658b410, 9658b428)
 fd058d70 ???????? (0, 0, 8da38d8, 0, 8c3e048, 0)
 00000000 ???????? (fe888b40, fe768ef0, fe7637d4, fe888b34, fe888a8c, fe888b1c)
 00000000 ???????? (fffffe18, 3badd9be, f468fe16, e78b7254, fe1dbbcc, 0)
 00000000 ???????? ()
6101:   java -Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=
 feeaf52b lwp_park (0, 0, 0)
 feea7ffe mutex_lock_internal (8f809a4, 0, 1) + 1bc
 feea843d mutex_lock_impl (8f809a4, 0) + 12d
 feea8492 mutex_lock (8f809a4) + 10
 91edd1b9 soft_delete_session (8f809a0, 1, 1) + a9
 91edce89 soft_delete_all_sessions (1, 91f0d4bc, 91f24378, 91f24000, 8ce1930c, feea8492) + 49
 91ed79e0 finalize_common (1, 0) + 6c
 91ed7b67 softtoken_fini (feffa7d8, 91f60958, 91f0d4bc, 8ce19374, fefd314b, 8de50538) + 37
 91f0d4e1 _fini    (8de50538, 92ff0de8, feffa7d8, 8de50550, 8de50538, 92ff0de8) + 25
 fefd314b call_fini (feffa2dc, 8de50538) + f3
 fefdc08a remove_hdl (91f60cb0, 92ff0de8, 0) + 7c2
 fefd7154 dlclose_core (91f60cb0, 92ff0de8, feffa2dc) + a0
 fefd71a1 dlclose_intn (91f60cb0, 92ff0de8) + 25
 fefd720d dlclose_check (91f60cb0, 92ff0de8) + 2d
 fefd7273 dlclose  (91f60cb0) + 3b
 91f838f0 pkcs11_slottable_delete (91f966d8, 91f96000, 0, fef496c0, fede0500, 8ce194d4) + b4
 91f7fa3c finalize_common (0, fede0500, 91f96000, 8ce194e0, 8ce194e0, 91f7f78f) + dc
 91f7fa89 pkcs11_fini (fede0080, 8ce194fc, fee47282, 8d111a00, fef45000, 8ce19520) + 41
 91f7f78f pkcs11_fork_child (8d111a00, fef45000, 8ce19520, 8d111a00, 8ce19520, 8ce19520) + 27
 fee47282 _postfork_child_handler (0, c40195b, fef45000, 93018488, fef49720, fef49700) + 32
 feea167d forkx    (0) + 11d
 feea172e fork     (a400400, f42d369c, f42d36a0, a400400, 9b880310, 8ce1958c) + 1a
 fe287edf Java_java_lang_UNIXProcess_forkAndExec (a4004ec, 8ce19640, 8ce1963c, 8ce19638, 2, 8ce19630) + 367
 fb20ebd2 ???????? (0, fb20b399, 930191f8, 930191e8, 930191d8, 1)
 fb202de9 ???????? (1, 93019188, 1, 93018fc8, 2, 93018f58)
 fbee9bf0 ???????? ()

The source code for these code is available on OpenSolaris site: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/pkcs11/pkcs11_softtoken/common/softGeneral.c#finalize_common
I found a posting on forum that looks very related. More hits in here

Various suggestions

Apply patch 118918? There's an on-going escalation?

A workaround? Apply patch 127111 -> this was obsolted by 127127.

turns out this was reported in Hudson list some time ago.

Fix

Rajiv and I attempted to apply some of the mentioned patches, but they failed to pkgadd, citing version inconsistencies. Perhaps those patches are not for the Solaris running on rator?

I then applied the suggested workaround in jre/lib/security/java.security file and restart Hudson on 2008/07/09.