Java EE 7 Batch Processing and World of Warcraft – Part 1

This was one of my sessions at the last JavaOne. This post is going to expand on the subject and look into a real application using the Batch JSR-352 API. This application integrates with the MMORPG World of Warcraft.

Since JSR-352 is a new specification in the Java EE world, I think many people don’t know how to use it properly. It may also be a challenge to identify the use cases to which this specification applies. Hopefully this example can help you better understand those use cases.

Abstract

World of Warcraft is a game played by more than 8 million players worldwide. The service is offered by region: United States (US), Europe (EU), China and Korea. Each region has a set of servers, called Realms, that you connect to in order to play the game. For this example, we are only looking into the US and EU regions.

One of the most interesting features of the game is that it allows you to buy and sell in-game goods, called Items, using an Auction House. Each Realm has two Auction Houses. On average, each Realm trades around 70,000 Items. Let’s crunch some numbers:

512 Realms (US and EU)

70 K Items per Realm

More than 35 M Items overall (512 × 70,000 ≈ 35.8 million)

The Data

Another cool thing about World of Warcraft is that the developers provide a REST API to access most of the in-game information, including the Auction House data. Check the complete API here.

The Auction House data is obtained in two steps. First, we query the corresponding Auction House Realm REST endpoint to get a reference to a JSON file. Next, we access that URL and download the file with all the Auction House Item information. Here is an example:

The Application

Our objective here is to build an application that downloads the Auction House data, processes it and extracts metrics. These metrics are going to build a history of how Item prices evolve over time. Who knows? Maybe with this information we can predict price fluctuations and buy or sell Items at the best times.

The Setup

Jobs

The main work is going to be performed by Batch JSR-352 Jobs. A Job is an entity that encapsulates an entire batch process. A Job is wired together via the Job Specification Language (JSL), an XML format. With JSR-352, a Job is simply a container for steps. It combines multiple steps that logically belong together in a flow.

The Code

Back-end – Java EE 7 with Java 8

Most of the code is going to be in the back-end. We need Batch JSR-352, but we are also going to use several other Java EE technologies, such as JPA, JAX-RS, CDI and JSON-P.

Since the Prepare Job only initializes application resources for the processing, I’m skipping it and diving into the most interesting parts.

Files Job

The Files Job is an implementation of AbstractBatchlet. A Batchlet is the simplest processing style available in the Batch specification. It’s a task-oriented step where the task is invoked once, executes, and returns an exit status. This type is most useful for performing a variety of tasks that are not item-oriented, such as executing a command or doing a file transfer. In this case, our Batchlet is going to iterate over every Realm, make a REST request to each one and retrieve a URL for the file containing the data that we want to process. Here is the code:

A cool thing about this is the use of Java 8. With parallelStream(), invoking multiple REST requests at once is easy as pie! You can really notice the difference. If you want to try it out, just run the sample, replace parallelStream() with stream() and compare. On my machine, using parallelStream() makes the task execute around 5 or 6 times faster.
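
To make the comparison concrete without the real REST calls, here is a minimal, self-contained sketch. It is not the code from the sample: `fetchFileUrl` is a hypothetical stand-in that just sleeps to simulate a slow request, and every realm name except grim-batol is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ParallelFetchSketch {

    // Hypothetical stand-in for a slow REST request: sleeps, then returns a made-up file name.
    static String fetchFileUrl(String realm) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return realm + "/auctions.json";
    }

    // One request after the other, on the calling thread.
    static List<String> fetchSequential(List<String> realms) {
        return realms.stream()
                .map(ParallelFetchSketch::fetchFileUrl)
                .collect(Collectors.toList());
    }

    // Requests spread over the common fork/join pool; the result keeps the input order.
    static List<String> fetchParallel(List<String> realms) {
        return realms.parallelStream()
                .map(ParallelFetchSketch::fetchFileUrl)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> realms = Arrays.asList("grim-batol", "realm-b", "realm-c", "realm-d");

        long start = System.nanoTime();
        fetchSequential(realms);
        long sequentialMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        fetchParallel(realms);
        long parallelMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("stream(): " + sequentialMs + " ms, parallelStream(): " + parallelMs + " ms");
    }
}
```

How much faster the parallel version runs depends on the number of cores available to the common pool, so treat any number you measure as specific to your machine.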

Update
Usually, I would not use this approach. I’ve done it because part of the logic involves invoking slow REST requests, and parallel streams really shine here. Doing this using batch partitions is possible, but hard to implement. We also need to poll the servers for new data every time, so it’s not terrible if we skip a file or two. Keep in mind that if you don’t want to miss a single record, a Chunk processing style is more suitable. Thank you to Simon Martinelli for bringing this to my attention.

Since the Realms of US and EU require different REST endpoints, they are perfect candidates for partitioning. Partitioning means that the task is going to run in multiple threads, one thread per partition. In this case we have two partitions.
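
The batch runtime handles the threading for you, but the idea of "one thread per partition" can be illustrated in plain Java. The sketch below is only an analogy built on an ExecutorService, not how any particular JSR-352 implementation works internally. `loadFilesFor` and the region values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionSketch {

    // Hypothetical per-partition task; the real Batchlet would call the region's REST endpoint.
    static String loadFilesFor(String region) {
        return "COMPLETED:" + region;
    }

    // Runs the same task once per region, each on its own thread, and collects the results in order.
    static List<String> runPartitions(List<String> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String region : regions) {
                Callable<String> task = () -> loadFilesFor(region);
                futures.add(pool.submit(task));
            }
            List<String> statuses = new ArrayList<>();
            for (Future<String> future : futures) {
                statuses.add(future.get()); // waits for that partition to finish
            }
            return statuses;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runPartitions(Arrays.asList("us", "eu")));
    }
}
```

In the real job you do not write any of this: you declare the partitions in the Job XML and the runtime runs the same Batchlet once per partition with different properties.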

To complete the job definition we need to provide a Job XML file. This needs to be placed in the META-INF/batch-jobs directory. Here is the files-job.xml for this job:

In the files-job.xml we need to define our Batchlet in the batchlet element. For the partitions, just define the partition element and assign different properties to each partition in the plan. These properties can then be late bound into the LoadAuctionFilesBatchlet with the expressions #{partitionPlan['region']} and #{partitionPlan['target']}. This is a very simple expression binding mechanism and only works for simple properties and Strings.

Process Job

Now we want to process the Realm Auction Data file. Using the information from the previous job, we can now download the file and do something with the data. The JSON file has the following structure:

item-auctions-sample.json

{
    "realm": {
        "name": "Grim Batol",
        "slug": "grim-batol"
    },
    "alliance": {
        "auctions": [
            {
                "auc": 279573567,           // Auction Id
                "item": 22792,              // Item for sale Id
                "owner": "Miljanko",        // Seller Name
                "ownerRealm": "GrimBatol",  // Realm
                "bid": 3800000,             // Bid Value
                "buyout": 4000000,          // Buyout Value
                "quantity": 20,             // Numbers of items in the Auction
                "timeLeft": "LONG",         // Time left for the Auction
                "rand": 0,
                "seed": 1069994368
            },
            {
                "auc": 278907544,
                "item": 40195,
                "owner": "Mongobank",
                "ownerRealm": "GrimBatol",
                "bid": 38000,
                "buyout": 40000,
                "quantity": 1,
                "timeLeft": "VERY_LONG",
                "rand": 0,
                "seed": 1978036736
            }
        ]
    },
    "horde": {
        "auctions": [
            {
                "auc": 278268046,
                "item": 4306,
                "owner": "Thuglifer",
                "ownerRealm": "GrimBatol",
                "bid": 570000,
                "buyout": 600000,
                "quantity": 20,
                "timeLeft": "VERY_LONG",
                "rand": 0,
                "seed": 1757531904
            },
            {
                "auc": 278698948,
                "item": 4340,
                "owner": "Celticpala",
                "ownerRealm": "Aggra(Português)",
                "bid": 1000000,
                "buyout": 1000000,
                "quantity": 10,
                "timeLeft": "LONG",
                "rand": 0,
                "seed": 0
            }
        ]
    }
}

The file has a list of the Auctions from the Realm it was downloaded from. In each record we can check the item for sale, prices, seller and time left until the end of the auction. Auctions are also aggregated by Auction House type: Alliance and Horde.

For the process-job we want to read the JSON file, transform the data and save it to a database. This can be achieved by Chunk Processing. A Chunk is an ETL (Extract – Transform – Load) style of processing which is suitable for handling large amounts of data. A Chunk reads the data one item at a time, and creates chunks that will be written out, within a transaction. One item is read in from an ItemReader, handed to an ItemProcessor, and aggregated. Once the number of items read equals the commit interval, the entire chunk is written out via the ItemWriter, and then the transaction is committed.
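
The read, process, write cycle is easier to picture as a loop. The following is a plain-Java simulation of it, not the batch runtime itself: an Iterator plays the role of the ItemReader, a Function plays the ItemProcessor, and adding a chunk to a list stands in for the ItemWriter call plus the transaction commit. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class ChunkLoopSketch {

    // Reads one item at a time, processes it, and "commits" a chunk every itemCount items.
    static <I, O> List<List<O>> run(Iterator<I> reader, Function<I, O> processor, int itemCount) {
        List<List<O>> committedChunks = new ArrayList<>();
        List<O> buffer = new ArrayList<>();

        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next()));
            if (buffer.size() == itemCount) {
                // Stand-in for the write of the whole chunk followed by the commit.
                committedChunks.add(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        // The last chunk may be smaller than the commit interval.
        if (!buffer.isEmpty()) {
            committedChunks.add(new ArrayList<>(buffer));
        }
        return committedChunks;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            items.add(i);
        }
        List<List<Integer>> chunks = run(items.iterator(), n -> n * 2, 100);
        for (List<Integer> chunk : chunks) {
            System.out.println("committed chunk of " + chunk.size() + " items");
        }
    }
}
```

With 250 items and a commit interval of 100, the loop commits two full chunks and one final chunk of 50.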

ItemReader

The real files are so big that loading them entirely into memory may cause you to run out of it. Instead, we use the JSON-P API to parse the data in a streaming way.

To open a JSON parser stream we call Json.createParser and pass it a reference to an InputStream. To read elements we just need to call the hasNext() and next() methods. next() returns a JsonParser.Event that allows us to check the position of the parser in the stream. Elements are read and returned in the readItem() method from the Batch API ItemReader. When no more elements are available to read, return null to finish the processing. Note that we also implement the open and close methods from ItemReader. These are used to initialize and clean up resources. They only execute once.
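
The open, readItem, close life cycle can be shown without any Java EE dependency. The class below is a stand-in, not the reader from the sample: it streams lines of text instead of JSON events, and its `open()` takes no argument, whereas the real ItemReader receives a checkpoint. What it keeps is the contract: acquire resources once, hand out one item per call, and return null when the input is exhausted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineItemReaderSketch {

    private final Reader source;
    private BufferedReader lines;

    LineItemReaderSketch(Reader source) {
        this.source = source;
    }

    // Called once, before the first read: set up the stream.
    void open() {
        lines = new BufferedReader(source);
    }

    // Called repeatedly: returns the next item, or null when there is nothing left to read.
    String readItem() {
        try {
            return lines.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called once, after the last read: release the stream.
    void close() {
        try {
            if (lines != null) {
                lines.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LineItemReaderSketch reader = new LineItemReaderSketch(new StringReader("first\nsecond"));
        reader.open();
        String item;
        while ((item = reader.readItem()) != null) {
            System.out.println("read: " + item);
        }
        reader.close();
    }
}
```

Only one line is held in memory at a time, which is the same reason the sample parses the JSON as a stream of events.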

ItemProcessor

The ItemProcessor is optional. It’s used to transform the data that was read. In this case we need to add additional information to the Auction.

The entire process with a file of 70 K records takes around 20 seconds on my machine. I did notice something very interesting. Before this code, I was using an injected EJB that called a method with the persist operation. This was taking 30 seconds in total, so injecting the EntityManager and performing the persist directly saved me a third of the processing time. I can only speculate that the delay is due to a deeper call stack, with EJB interceptors in the middle. This was happening in WildFly. I will investigate this further.

To define the chunk we need to add it to a process-job.xml file:

process-job.xml

<step id="processFile" next="moveFileToProcessed">
    <chunk item-count="100">
        <reader ref="auctionDataItemReader"/>
        <processor ref="auctionDataItemProcessor"/>
        <writer ref="auctionDataItemWriter"/>
    </chunk>
</step>

In the item-count property we define how many elements fit into each chunk of processing. This means that the transaction is committed every 100 elements. This is useful to keep the transaction size low and to checkpoint the data. If we need to stop and then restart the operation, we can do it without having to process every item again. We have to code that logic ourselves. This is not included in the sample, but I will do it in the future.
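
As a rough idea of what that restart logic could look like, here is a stand-alone sketch. It is not part of the sample and does not implement the real interface. It only mirrors two methods of the Batch API ItemReader, open(Serializable checkpoint) and checkpointInfo(), over an in-memory list, with the read position used as the checkpoint.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

class RestartableReaderSketch {

    private final List<String> items;
    private int position;

    RestartableReaderSketch(List<String> items) {
        this.items = items;
    }

    // checkpoint is null on a fresh start, or the last saved position on a restart.
    void open(Serializable checkpoint) {
        position = (checkpoint == null) ? 0 : (Integer) checkpoint;
    }

    // Returns the next item, or null when all items have been read.
    String readItem() {
        return position < items.size() ? items.get(position++) : null;
    }

    // What gets saved at each checkpoint: how many items have been read so far.
    Serializable checkpointInfo() {
        return Integer.valueOf(position);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e");

        RestartableReaderSketch firstRun = new RestartableReaderSketch(data);
        firstRun.open(null);
        firstRun.readItem();
        firstRun.readItem();
        firstRun.readItem();
        Serializable saved = firstRun.checkpointInfo();

        // Simulated restart: a new reader resumes from the saved position.
        RestartableReaderSketch secondRun = new RestartableReaderSketch(data);
        secondRun.open(saved);
        System.out.println("resumed at: " + secondRun.readItem());
    }
}
```

For a streamed file the checkpoint would have to be something the reader can seek back to, such as a count of records already consumed, which is an assumption about how one might adapt this idea rather than something the sample does.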

Running

To run a job we need to get a reference to a JobOperator. The JobOperator provides an interface to manage all aspects of job processing, including operational commands, such as start, restart, and stop, as well as job repository related commands, such as retrieval of job and step executions.

To run the previous files-job.xml Job we execute:

Execute Job

JobOperator jobOperator = BatchRuntime.getJobOperator();
jobOperator.start("files-job", new Properties());

Note that we pass the name of the job XML file, without the extension, to the JobOperator.

Next Steps

We still need to aggregate the data to extract metrics and display them on a web page. This post is already long, so I will describe the following steps in a future post. Anyway, the code for that part is already in the GitHub repo. Check the Resources section.

Resources

You can clone a full working copy from my GitHub repository and deploy it to WildFly. You can find deployment instructions there.

Indeed, that was my first implementation. I ended up with a different approach for a few reasons:
– The REST requests are very slow. To parallelize the requests in a Chunk I could partition by zone (only 2 partitions). To make it faster, I would have to partition by Id or some other key. I just felt that it was hard to find a partition plan that achieved the same performance as parallelStream, but I didn’t test it, so this is just perception.
– Currently the code skips the creation of the file record if an error occurs, so the job does not fail. That is not terrible, since you need to poll the servers for new data every time.

Maybe I’m bending the concepts behind batch a bit, but I see no problem in using parallelStream in this case.

Actually, I’m not doing it that way. I think it’s overcomplicating the problem. I’m just transforming the JSON data into a regular database table and extracting the metrics with database functions: SUM, COUNT, AVG and so on.

I love this article, which gives concrete insights into Java Batch. I’m very interested in data volumes. As noted by Roberto, Java Batch is young. I’d like to know how a Java Batch application behaves when facing huge data sets, how load balancing is managed, etc.
So, anyway, thanks for this very accessible discussion.

At the moment there is no load balancing available. It’s not part of the specification, but maybe some vendors can provide some support for it in the future. As far as I know, none of the vendors have that kind of support yet. I guess the best bet is to do this manually, using some kind of JMS cluster and sending the messages to start the jobs across the cluster. Of course, you would need to handle all the logic to check if the job is still running and so on, but it shouldn’t be hard.

I have tried to read data from one table of a database, process it and save the processed data in another table. The problem is that the reading process becomes an infinite loop, although the database has only 100 records. This is the code that I have been using:

If you want to read items from a database, I would advise using plain JDBC and a cursor. With JPA you really can’t control the batch items that you are processing in the chunk. I can’t see your job.xml, but let’s say you have an item count of 100 and there are 1000 records in the database. It seems that your query will grab all of the records regardless of the item count. Check my ProcessedAuctionsReader in the Wow Auctions project.

Just by having a quick look into your code, nothing seems to indicate an infinite loop. Do you have the code available somewhere so I can run it?