I've just built a single cell, two node three cluster IBM BPM Advanced 8.5.5 environment, against a remote DB2 ESE 10.1.0.3 server.
So I was a little startled when, after starting the Deployment Environment, the Service Integration Bus (SIBus) failed to start properly.
This is what I saw in one of my Cluster Member logs: -
[21/11/14 13:17:03:719 GMT] 00000073 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSIS1593I: The messaging engine, ME_UUID=E997A9EFA09498FC, INC_UUID=6DC2A53AD19710D7, has failed to gain an initial lock on the data store.
[21/11/14 13:17:03:719 GMT] 00000073 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSIS1538I: The messaging engine, ME_UUID=E997A9EFA09498FC, INC_UUID=6DC2A53AD19710D7, is attempting to obtain an exclusive lock on the data store.
[21/11/14 13:17:03:719 GMT] 00000073 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSIS1538I: The messaging engine, ME_UUID=E997A9EFA09498FC, INC_UUID=6DC2A53AD19710D7, is attempting to obtain an exclusive lock on the data store.
This was a clean build, so the Messaging Engine database should have been OK.
The tables were definitely there: -
SIB000 DB2USER1 T 2014-11-21-13.43.55.547439
SIB001 DB2USER1 T 2014-11-21-13.43.55.682333
SIB002 DB2USER1 T 2014-11-21-13.43.55.819494
SIBCLASSMAP DB2USER1 T 2014-11-21-13.43.55.334938
SIBKEYS DB2USER1 T 2014-11-21-13.43.55.947883
SIBLISTING DB2USER1 T 2014-11-21-13.43.55.420531
SIBOWNER DB2USER1 T 2014-11-21-13.43.55.151963
SIBOWNERO DB2USER1 T 2014-11-21-13.43.55.081007
SIBXACTS DB2USER1 T 2014-11-21-13.43.56.039355
and yet .... they were ALL empty :-(
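One way to confirm that the tables are empty is to count the rows in each one. This is only a minimal sketch: `sib_count_sql` is a hypothetical helper name of my own, the table names are the nine listed above, and the schema defaults to db2user1 as in this environment.

```shell
# Hypothetical helper: print one COUNT(*) query per SIB table.
# The table list is the nine tables shown above; the schema is the
# first argument and defaults to db2user1.
sib_count_sql() {
    schema="${1:-db2user1}"
    for t in sib000 sib001 sib002 sibclassmap sibkeys \
             siblisting sibowner sibownero sibxacts; do
        printf 'select count(*) from %s.%s\n' "$schema" "$t"
    done
}
```

Assuming an existing connection to the messaging engine database, the output could be saved to a file (say `sib_count_sql > counts.sql`) and run with `db2 -vf counts.sql`, which treats each line as a separate statement.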
As this is MY own environment, I called the ball and dropped the SIB tables: -
db2 drop table db2user1.sib000
db2 drop table db2user1.sib001
db2 drop table db2user1.sib002
db2 drop table db2user1.sibclassmap
db2 drop table db2user1.sibkeys
db2 drop table db2user1.siblisting
db2 drop table db2user1.sibowner
db2 drop table db2user1.sibownero
db2 drop table db2user1.sibxacts
and restarted the MECluster.
This time around, the tables were nicely populated e.g.
db2 "select id from db2user1.sib000"
...
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
272
269 record(s) selected.
...
and the SIBus came up nicely, with JVM1 reporting: -
[21/11/14 13:43:58:431 GMT] 0000006a SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Started.
and JVM2 reports: -
[21/11/14 13:47:23:859 GMT] 00000065 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Joined.
In other words, the Bus Member on node 1 is active, with the Bus Member on node 2 standing by to take over.
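To tell at a glance which member is active, the latest CWSID0016I message in each member's log can be pulled out. The following is a rough sketch rather than anything official: `me_state` is a hypothetical function name, and it assumes the message text has the same shape as the log lines quoted above.

```shell
# Hypothetical helper: print the most recent messaging engine state
# (for example Started or Joined) recorded in the given log file.
# It relies on the "is in state <State>." wording of CWSID0016I.
me_state() {
    grep 'CWSID0016I' "$1" | tail -n 1 |
        sed -n 's/.* is in state \([A-Za-z]*\)\..*/\1/p'
}
```

Run against the SystemOut.log of each MECluster member, the active member should report Started and the standby member Joined.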
When I stop MEClusterMember1 on node 1, I see this from node 2: -
[21/11/14 13:51:53:684 GMT] 00000097 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Started.
which again is as expected.
And, as a final acid test, when I restart MEClusterMember1, I see this: -
[21/11/14 13:55:33:043 GMT] 00000062 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Joined.
and stop MEClusterMember2, I see this: -
[21/11/14 13:57:33:123 GMT] 0000008f SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Started.
both messages coming from node 1.
This shows that, once I dropped and recreated the SIB tables, the bus comes up nicely, and failover works both ways - node 1 to node 2 and node 2 to node 1.
This ties up with the IBM BPM pattern, known as 1-of-n, where only one ME / Bus Member can be active at any one time, regardless of the number of nodes in the cell / members in the cluster.
Which is nice.
So what went wrong? I do not know, but I now know how to resolve it and, more importantly, what to watch for.
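As a starting point for watching for the problem, a log can be scanned for the lock failure message seen in the failed start. Again this is just a sketch under my own assumptions: `check_me_lock` is a hypothetical function name, and it treats any CWSIS1593I message as a warning sign, which may or may not suit other topologies.

```shell
# Hypothetical helper: warn if a log shows the data store lock failure.
# CWSIS1593I = failed to gain an initial lock on the data store
# CWSIS1538I = attempting to obtain an exclusive lock (counted for context)
check_me_lock() {
    log="$1"
    failed=$(grep -c 'CWSIS1593I' "$log" || true)
    attempts=$(grep -c 'CWSIS1538I' "$log" || true)
    if [ "$failed" -gt 0 ]; then
        echo "WARNING: $failed lock failure(s), $attempts lock attempt(s) in $log"
        return 1
    fi
    echo "OK: no CWSIS1593I in $log"
}
```

Because the function returns a non-zero status on a warning, it can be called from a scheduled job or a post-start check against each cluster member's SystemOut.log.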
Some background reading: -