Monday, March 28, 2016

5.000.000 Messages per Day Handled by Oracle Service Bus 11g

Hello everyone! Let me share a story about how to build an enterprise service bus (ESB), which connects together a number of legacy application and a new-born centralized information system, and how Oracle Service Bus (OSB) can help handle up to 5.000.000 messages in a day ensuring guaranteed delivery.



Challenge


Every data change from a one application (a source) must be synchronized with another application (a destination) in the near real-time mode, batch processing is not a solution. So, the changes need to be captured and free to publish on a service bus. The following pattern works fine: when a user changes data in the source application, a trigger in an application database creates a record in an event table. Let's name a record in the event table as an "event". An event is specified by an id; a changed entity type; an operation, such as to create, update or delete; a state, such as new, in a process, successful processed, unsuccessful processed. Actual data can be retrieved from the database using an id and a changed entity type because there is a database view for each type of entities, the bus has a concern in. The OSB instance polls the events using the database adapter and selects a corresponding record from a database view. The content of the retrieved record is transformed to the canonical data model of the service bus, from zero till five messages are generated for one database record. The messages are put in a JMS-queue, and the corresponding records in the database marked with the "in process" flag.

In order to send messages to a destination application, there is the OSB proxy-service gets messages from the JMS-queue, transforms it from the canonical model to a destination format, and calls a web-service provided by the destination application. The proxy-service waits for a service response, gets a status of handling from this one, and put the status into a statuses queue. The bus gets the status from the queue and updates a corresponding record in the event table of the source application database.

The process is shown on the bellow UML diagram:


From one point of view, the events are read from databases by 120 records per query, so for 5.000.000 events, it is about 41.600 transactions per day. From another point of view, one retrieved from a database record generates up to 5 messages, let me get 2 in average. Each message is put into a destination application, so there are 10.000.000 transactions. A processing status for every message must be set after the completion, it brings into account another 5.000.000 transactions. Total score is about 180 transactions per second. Each transaction is an 2PC (XA) one because two resources (database - queue or queue - queue) are involved.


Put me in production!


Just after the bus had put in production, nothing worked fine: the bus handled about 2-3 events per second, it is about 200.000 events per day. Looked like a significant performance leak. Other negative symptoms were overloaded database connection pools and XA-transaction timeouts (each data retrieval transaction took more than three minutes).

As was shown by some analysis, the problem was in the integration layer of a source application: there were a few million records in the event table, which were generated before connection the source application to the ESB, and there was slow performance SQL with some source database views.

For the first problem solving, another event state was taken into account: ready-to-work. The events are transmitted from the new state to the ready-to-work one by small bunches, just about 50.000 events. The service bus retrieves ready-to-work events only.

SQL tuning is a solution for the second problem. The tuning is a specific engineering discipline and there are some interesting tips and approaches. It is another story, let me note only, the average time is taken for retrieving an event became about 300 - 400 msec after optimisation.

Tuning Oracle Service Bus


Multiple activation instances

At the start, the solution run on a physical server and the domain consisted of two WebLogic instances. There were two retrievals from a database, corresponding. In the case when reading a record from the database takes about 300 ms, plus polling itself takes some time, plus XML transformation, plus some 2PC commit management and JMS overhead, well, 120 records will be handled for 40-60 seconds. If we skip an amount of time for processing events through the bus and a destination application, the bus throughput will be about 2 * 120 / 60 - 4 messages per second or 340.000 messages per day. So sad :-(

Parallel message processing is a solution. There is the activationInstances parameter for Oracle Service Bus DB Adapter. The parameter specifies the number of adapter instances. The parameter is set as a property of the proxy-server built by the DB Adapter. The experience has shown that eight is the optimal value. The throughput became about 8 * 2 * 120/60 - 32 messages per second or 2.700.000 messages per day.


Leveraging RouterRuntimeCache

By default, OSB will not compile a pipeline until a request message for a given service is received. Once it has been compiled, the pipeline is cached in memory for re-use. There are two parts of the cache: static and dynamic. The static part is not available for the garbage collector, while the dynamic one is. The size of the static cache is specified by the RouterRuntimeCache parameter, the default size limit of the parameter is 100 entries (or pipelines). The developed bus contains about 50 adapters, so the default value isn't enough. The situation can be changed by adding the following JVM parameter:

-Dcom.bea.wli.sb.pipeline.RouterRuntimeCache.size=3000

The above line should be written as a part of the EXTRA_JAVA_PROPERTIES environment variable definition in the setDomainEnv.sh/cmd file.

Tuning JVM

When all available tunings were applied, the garbage collector monitor showed only a thin working memory range, which contained about a few hundreds megabytes. Once the heap had become full, garbage collector used to reclaim the 800 - 1000 MB and the heap gradually filled again. For decreasing garbage collector rate, the heap amount was increased, and the parallel old generation collector was turn on. The following JVM tuning flags were activated:


-Xms16384m -Xmx16384m -Xmn6144m -XX:+AggressiveOpts -Xnoclassgc -XX:ReservedCodeCacheSize=128m -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+UseParallelOldGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:servers/osb_server1/logs/gc.log


As a result, a full garbage collector cycle takes 1.5 - 2 seconds and the execution interval is about 6 - 8 hours.

Loading data to a destination


Data is loaded into a destination application after getting it from a JMS-queue. There are a number of threads for reading the data from the queue. The number is specified as a configuration parameter of the corresponding Work Manager. There is its own Work Manager for each proxy-service, which gets messages from the queue.


The work manager was configured for 24 threads, so every WebLogic instance loads the data to the destination application using 24 threads, or using 48 threads from the overall domain point of view. The dedicated integration layer of the destination application was organized. The layer is deployed on 12 WebLogic instances, so many instances are used because the application is able to run only on 32-bits JVMs, and the heap size is limited by two GBs.

Once the ESB had been started, it turned out that the destination database overloaded a storage subsystem. The solution was to create some necessary indexes in the destination application database.

Scalability


After successful tuning, the service bus achieved the throughput goal. But in some time, the storage subsystem became looked as a bottleneck, so the disk usage was at 90%, it led a hot situation. The second physical server was added to the bus cluster. As a result, an OSB domain, which consisted of four WebLogic instances for JMS servers and four instances of OSB and an Administration Server, was created. Each pair of the JMS servers as well as of the OSB ones were put on a dedicated physical server. A Node Manager is in use for each machine.


Every OSB instance was changed in the following way: only four activationInstances for retrieving data from a database left there (so, 16 threads handle data retrieving from a source database in total), as well as only 16 thread for loading data into the destination system configured on an instance (64 threads are in total).

Hardware


The bus was deployed on two physical servers, each one has the following configuration:

- Intel(R) Xeon(R) E5649 2.53GHz CPU, 12 physical cores or 24 if Hyper-Threading is turned on.
- 128 GB RAM.

Conclusions


What do we conclude from the story? Firstly, the tuning and the scalability of an enterprise application integration solution is an iterative process: changes applied to one component can show a problem in another component. When the second problem has been solved, a new problem heaves in sight, and so on, until throughput goal has been achieved. Secondly, a bottleneck usually is in applications connected to the bus and, only after every performance gap is eliminated, the bus itself could be tuned. From another point of view, while there are performance gaps in the bus, the problems related to connected applications are hidden, this consideration is a reason why the tuning is an iterative process. Thirdly, a proprietary service bus proves itself as a high-scalable solution, the throughput was increased in 2x just by adding a second physical server.

Please, let me know if the story helps you to create an integration solution. Your questions are welcome here.

Do you design or develop an EAI solution? It can be interesting for you:


Would you like to give a 'Like'? Please follow me on Twitter!