Process large amounts of Elasticsearch data using TIBCO ActiveMatrix BusinessWorks 5

Sometimes you want to run a longitudinal study of patterns in your Elasticsearch data, which means analyzing the entire event stream matching your criteria. The scroll API provides a mechanism for asking Elasticsearch for every entry matching a query and then retrieving the results in chunks that together represent the entire set of matching records.

The following is an excerpt from the Elastic webpage that explains the API:

While a search request returns a single “page” of results, the Elasticsearch scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.
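The cursor-like loop behind that description can be sketched in Python. With the official elasticsearch-py client, the first response would come from `es.search(..., scroll="2m")` and subsequent pages from `es.scroll(scroll_id=..., scroll="2m")`; to keep this example self-contained and runnable without a cluster, the two network calls are injected as plain functions and backed by an in-memory stand-in. The names `scroll_all`, `make_fake_backend`, and the page size are illustrative assumptions, not part of any official API.

```python
def scroll_all(initial_search, continue_scroll):
    """Yield every hit for a query by following the scroll cursor.

    initial_search() returns the first response dict (with "_scroll_id"
    and "hits"); continue_scroll(scroll_id) returns the next page.
    The loop ends when a page comes back with no hits.
    """
    response = initial_search()
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            break
        yield from hits
        response = continue_scroll(response["_scroll_id"])

def make_fake_backend(docs, page_size):
    """In-memory stand-in for Elasticsearch, purely for demonstration."""
    offset = {"value": 0}
    def search():
        start = offset["value"]
        offset["value"] += page_size
        return {"_scroll_id": "demo-scroll-id",
                "hits": {"hits": docs[start:start + page_size]}}
    return search

documents = [{"_id": str(i), "_source": {"n": i}} for i in range(10)]
search = make_fake_backend(documents, page_size=4)
all_hits = list(scroll_all(search, lambda scroll_id: search()))
print(len(all_hits))  # 10
```

Against a real cluster you would swap the two lambdas for the client calls and remember to clear the scroll context when done.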

On my GitHub page you can find my TIBCO ActiveMatrix BusinessWorks 5 ProcessDefinition (see the bottom of this post for an image of the definition), including the necessary XSDs. Note that I used the REST and JSON plugin to make working with JSON in BW easier. In my process, I simply log the results of the scrolled searches to a file (using the standard “Write To Log” activity).

In a production setting, you would likely want BusinessWorks to pipe the data to a stream-processing application built with something like Kafka Streams or Apache Spark. As a basis, I used patterns found in the client helpers for the Python and Java programming languages.
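That hand-off can be sketched as decoupling the scroll loop from the downstream system via a producer interface. `Producer` below is a stand-in for something like kafka-python's `KafkaProducer` (whose real call shape is `producer.send(topic, value=...)`); it just collects records so the sketch runs without a broker. The topic name and `pipe_batches` helper are assumptions for illustration.

```python
class Producer:
    """Stand-in for a real message producer (e.g. KafkaProducer)."""
    def __init__(self):
        self.sent = []
    def send(self, topic, value):
        # A real producer would serialize and publish; we just record.
        self.sent.append((topic, value))

def pipe_batches(batches, producer, topic="es-events"):
    """Forward every scrolled hit's _source to the producer; return the count."""
    count = 0
    for batch in batches:
        for hit in batch:
            producer.send(topic, value=hit["_source"])
            count += 1
    return count

batches = [[{"_source": {"n": 1}}, {"_source": {"n": 2}}],
           [{"_source": {"n": 3}}]]
producer = Producer()
print(pipe_batches(batches, producer))  # 3
```

The same loop works whether the batches come from the in-memory stand-in or from real scroll responses, which is the point of keeping the producer behind a small interface.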

If you’re solely interested in reindexing documents from one index to another, take a look at elasticdump at hub.docker.com. If you are new to Docker, this article from Rubén Middeljans will help you get started.
BusinessWorks ProcessDefinition:
Credit for the coding girl photo: http://www.pstune.com/