Data Warehousing using EVE Market Data Relay (EMDR)

Knowing the price and volume of items moving in the Eve market at a specific point in time is a very powerful piece of information that can help further your space empire.

Historically eve-central was the repository of market data but with new advances in cache scraping, data transport methods, and large archival storage methods have brought us exciting new capabilities.

Original implementations of market aggregation sites had a user click the Market Export button in the Eve client to produce text files in their local Documents\EVE\logs\Marketlogs folder and then use an application to transfer the exported files to a database. This method was tedious, lacked region and item coverage, and was prone to people editing the data before it was sent off.

A renascence was generated when developers began to explore the cache in the Eve client. The client cache is a location that serves as temporary storage for information that you are working with in the client; it is volatile and changes all the time rather than holding static art, formulas, or universe data as seen in the client .stuff files.

EMDR is a service that takes market orders from the cache of your local client and sends that immediately upon viewing it in the client to a relay service that people can subscribe to. If you click on an item in the market window, that piece of data is immediately sent off to many people that can receive it. The transfer is quick, the data is not tampered with, and it can easily be relayed to many interested parties. This is pure data, a statisticians dream!

Working with Eve-Central

What follows are notes from my partner, Raath, who has been withing with the EMDR method to improve our industrial operation.

When I first began adding price dependencies to the DRK Industry Tracker, it was back in the day before we had the EMDR. Eve-central was the bespoke out of game price lookup service and with the aid of their API I pulled prices from there.

At the time I don’t think it had the option to do a full pull of prices on every item in game so I devised a system where the tracker would cache prices locally and update them when they were older than an hour on a need to know mechanic so only the relevant data was requested.

I did this so that I didn’t swamp eve-central with hundreds of requests every hour for information that 95% of which would never be used. It was a system that worked well except for the occasional lag caused by the prices updating. As I said before, it only updated when prices were older than an hour as the users requested them. But when I started to think of releasing the tracker publicly, I needed something a little more reliable.

Transition to EMDR

At this point in time the EMDR was a fully developed solution so I started to look into how I could begin using this as my price basis and integrating data into our industrial workflow. Not really knowing the volume of information that would soon be assaulting our little virtual machine, I make a few mistakes.

Flood of Data

The term for the EMDR feed that people have adopted is the “fire hose” as it is literally a flood of overwhelming data being constantly sent at you with no regard for your ability to process it. As clients all across the world are clicking around the market window, updates are being sent to you in near real-time. The function name of “engage_fire_hose” only seemed natural.

Statistics

Our Industry site currently consumes 8 gig of data, serves 6-700 MB and calls around 150,000 API requests per day with the vast majority of our incoming data coming from the EMDR updates. In January we received 179 GB and sent 17 GB for an average of 5.8 GB/day of incoming data.

Processing Challenges

With the large amount of incoming price data, we needed to be able to efficiently process and store it while keeping the server responsive.

The first method I devised was a system where I stored transactionIDs in a hash table. This design soon showed its weakness as we hit a memory limit when the hash table started to fill up with millions of transactionIDs.

Additionally my attempts to keep load on MySQL server down were also in vain as the information was coming in so quickly that my consumer was having trouble keeping up. We soon had problems with locked rows and inserts failing so I had to completely revise my entire approach.

What resulted was a consumer process that guzzles around 8 GB of data per day and spits all the information to CSV files. There is no logical processing done in the consumer now, it just munches and spits data to file. Another cron job that runs every minute scans the temp file directory for CSVs, parses them all into one single large CSV, and then performs a LOAD DATA IN FILE into MySQL, a method which I’ve found is not only lightening fast but also keeps the load down.

Not all of the data is useful as the majority of it is duplicates. When the server has free time, we run a cleanup process in the form of flag checking remove expired records so that the data we display is as up to date.

Future Plans

All this work for clean price data. We have plans to further expand some of the market data functionality in projects to start using the cached market history to show the historical price or items and build costs compared to live data.

What kind of aggregations are you using for this dataset? You’re probably interested in min/max/25/50/75th percentile buy/sell by item by region by day? Except you probably aren’t actually interested in by region and might be interested in which hours see a spike in traffic?