Jenkins Infrastructure : wiki.jenkins-ci.org

Where are things?

Stock Confluence is installed at /srv/wiki under its own user account 'wiki'.

To bounce the server

ssh lettuce.jenkins-ci.org

then

# Restart Confluence
sudo docker restart confluence
# Restart Confluence cache
sudo docker restart confluence-cache
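
The two restarts above could be wrapped in a small helper. This is only a sketch, assuming it is sourced on lettuce by a user with sudo rights. The function name bounce and the DOCKER override variable are hypothetical and not part of the existing setup; the container names are the ones used in the commands above.

```shell
#!/bin/sh
# Sketch: restart the Confluence containers in the order listed above.
# DOCKER is a hypothetical override so the helper can be dry-run,
# e.g. DOCKER="echo docker" prints the commands instead of running them.
DOCKER="${DOCKER:-sudo docker}"

bounce() {
    for name in confluence confluence-cache; do
        # Stop at the first failure so a broken restart is noticed.
        $DOCKER restart "$name" || return 1
    done
}
```

After sourcing the file, run bounce to restart both containers. Setting DOCKER="echo docker" first shows what would be executed without touching the containers.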

TODOs

Incident Records

A ticket has been filed with Atlassian support for the recent outages, so let's keep records of when and how Confluence failed (newer ones first).

See "How to do a post-mortem analysis" for what data to collect before relaunching a new instance.

March 16th 18:49 PT

Upgraded the JDK to 6u24, since the investigation in CSP-58700 seems to indicate that there have been 7 JDK crashes while JIT-ing the exact same method. This KB article appears spot on.

March 16th afternoon PT

With the help of OSUOSL, the VM now has 2.5GB heap. I've modified the JVM parameters to "-Xmx768m -XX:MaxPermSize=256m". Previously they were 512m and 192m respectively.
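
A rough budget for the new settings can be worked out as follows. This is a sketch only: it assumes the 2.5GB figure is the memory available to the whole VM, and it ignores thread stacks, the code cache and other native memory. The variable names are illustrative.

```shell
#!/bin/sh
# Sketch: rough memory budget for the new JVM settings.
# Assumption: the 2.5GB above is the memory available to the whole VM.
total_mb=2560      # 2.5GB expressed in MB (assumed total)
xmx_mb=768         # -Xmx768m
perm_mb=256        # -XX:MaxPermSize=256m

jvm_max_mb=$((xmx_mb + perm_mb))      # ceiling for heap plus permgen
left_mb=$((total_mb - jvm_max_mb))    # what remains for everything else

echo "heap + permgen ceiling: ${jvm_max_mb} MB, remainder: ${left_mb} MB"
```

Under that assumption the heap and permgen together can grow to 1024 MB, leaving about 1536 MB for the rest of the JVM and the OS.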

March 16th (1st time)

JVM crash on an out-of-memory error (full report):

# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 536870928 bytes for Chunk::new. Out of swap space?
#
#  Internal Error (allocation.cpp:215), pid=19777, tid=140363900471040
#  Error: Chunk::new
#
# JRE version: 6.0_22-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode linux-amd64 )
# An error report file with more information is saved as:
# /srv/wiki/confluence-3.4.7-std/logs/hs_err_pid19777.log

It appears that the JVM crashed when it was trying to reallocate the oldgen from 300MB-ish to 500MB-ish because the kernel didn't have enough swap space to underwrite the new allocation.
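
As a sanity check on that reading, the size of the failed request can be converted to MiB. This is just arithmetic on the number in the crash report; the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: express the failed allocation from the crash report in MiB.
requested=536870928            # bytes, from the OutOfMemoryError line above
mib=$((1024 * 1024))

whole=$((requested / mib))     # whole MiB in the request
rest=$((requested % mib))      # leftover bytes beyond the whole MiB

echo "requested ${whole} MiB + ${rest} bytes"
```

The request works out to 512 MiB plus 16 bytes, in line with the "500MB-ish" figure above.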

March 16th (2nd time)

Unresponsive JVM. "jmap -heap" reported that Eden, the old generation and the permanent generation were all essentially full. Presumably the JVM went into excessive GC mode, although I couldn't confirm it.

Attaching to process ID 1423, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 17.1-b03

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 469762048 (448.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 205520896 (196.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 118554624 (113.0625MB)
   used     = 118554624 (113.0625MB)
   free     = 0 (0.0MB)
   100.0% used
From Space:
   capacity = 13762560 (13.125MB)
   used     = 0 (0.0MB)
   free     = 13762560 (13.125MB)
   0.0% used
To Space:
   capacity = 19005440 (18.125MB)
   used     = 0 (0.0MB)
   free     = 19005440 (18.125MB)
   0.0% used
PS Old Generation
   capacity = 313196544 (298.6875MB)
   used     = 313196480 (298.68743896484375MB)
   free     = 64 (6.103515625E-5MB)
   99.99997956554718% used
PS Perm Generation
   capacity = 132775936 (126.625MB)
   used     = 132711288 (126.56334686279297MB)
   free     = 64648 (0.06165313720703125MB)
   99.95131045432811% used
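
To spot the full spaces in a report like the one above without reading every line, the output can be filtered. This is a minimal sketch, assuming the "jmap -heap" layout shown here (a name line followed by a "% used" line). The function name full_spaces and the LIMIT variable are hypothetical.

```shell
#!/bin/sh
# Sketch: list the heap spaces in `jmap -heap` output that are nearly full.
# Reads the report on stdin; LIMIT (hypothetical, in percent) defaults to 99.
full_spaces() {
    awk -v limit="${LIMIT:-99}" '
        # A line ending in ":" or starting with "PS " names the space that follows.
        /:$/ || /^PS / { name = $0; sub(/:$/, "", name) }
        # A "% used" line reports usage for the most recently named space.
        /% used$/ {
            pct = $1
            sub(/%/, "", pct)
            if (pct + 0 >= limit) print name ": " $1 " used"
        }
    '
}
```

For example, piping the report for process 1423 through it (jmap -heap 1423 | full_spaces) would list only the spaces at or above the limit.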

March 16th (3rd time)

Andrew restarted it. No details.

Sep 23 2015

20150903 Wiki Outage

Attachments:

hs_err_pid19777.log (text/x-log)