Storing Stock Candle Data Efficiently

When I acquire stock candles, I need to know the storage footprint, database architecture, and how to deal with multi-threaded concurrency of the databases.

How large could a single database grow?

Each candle has OHLCV data plus a unix timestamp in seconds (Note: I chose this timestamp structure to aid in cleaning the candles without the need for a slow Date interpreter. Statistical languages like R and trading platforms like NinjaTrader or MetaTrader natively use a human-readable date-time strings). The napkin math for SQLite then becomes:

Timestamp

Open

High

Low

Close

Volume

INT 4 bytes

REAL 8 bytes

REAL 8 bytes

REAL 8 bytes

REAL 8 bytes

INT 4 bytes

4B + (8B * 4) + 4B = 40 bytes/candle

When I built my first prototype database I encoded the candle time period in the table name (e.g. OneDay, OneMinute). This saved space until I tried to clean the candles. It turns out that the broker will send incomplete period candles. A start and end timestamp is needed.

A start and end timestamp is needed, so the updated structure looks like this:

Start

End

Open

High

Low

Close

Volume

INT 4 bytes

INT 4 bytes

REAL 8 bytes

REAL 8 bytes

REAL 8 bytes

REAL 8 bytes

INT 4 bytes

4B + 4B + (8B * 4) + 4B = 44 bytes/candle

Expected day-candle space per day?

I established earlier that there are about 48,753 quotable symbols, but let’s round that to 50,000 symbols. How much storage will one day of candles cost? Let’s assume there is at least one trade per day for each symbol and nothing is halted or delisted today.

50,000 day candles * 44 bytes = 2,148 KB ≈ 2.1 MB per day

Worst case minute-candle space per day?

It is more challenging to estimate the space required for minute candles because not all symbols move every minute. A thinly-traded warrant might only have a handful of trades in the 390 minutes of trading (6.5 hours * 60 minutes = 390 minutes) whereas FB may have over 600 candles or more due to after-hours trading.

When in the blink of an eye Facebook’s stock crashed on June 25, 2018 from an all-time high of US$218 instantly down to US$173.5, it traded after hours for a full 4 hours past end of trading. I was watching the Time and Sales. That’s an extra 240 minute-candles of data past “worst case”.

Facebook after-hours stock crash on July 25, 2018

Let’s calculate the worst case in-hours candle data space and then an average case:

The actual number fluctuates wildly day after day, but PINX stocks don’t move as much as AAPL, so for a hopeful estimate let’s use 84 Mb/day as our storage requirement (82 MB + 2 MB).

Some time goes by…

Measured candle space per day

It’s good to start with estimates to get ballpark estimates. Now let’s actually calculate the space growth and see how close I came to the optimistic estimate. Using a 10-trading-day period from Monday, July 9th, 2018 at midnight until Friday, July 20th, 2018 at 23:59 Toronto time, a helper program calculated this:

Storage cost per year?

Canada trades on some U.S. holidays, and the U.S. trades on some Canadian holidays, but let’s use the 252 trading days a year. The napkin math then becomes:

252 days * 291 MB/day = 73,332 MB/year ≈ 72 GB per year!

Assuming equal turnover of symbols (which is not the case2), we would need to be prepared to store about 72 GB of candle data per year over and above storing all the historic data we want.

Monolithic or distributed databases?

Let’s make an assumption that you have one year of historic data, or 72 GB of clean OHLCV candle data already. Where exactly does that data reside? A local MySQL database? Would SQLite work? How about Amazon AWS’s RDS solution?

It we are just accumulating candles each day for storage, then it requires 33 hours per week. If we want to accumulate and analyze the candles, then 88 hours per week are needed:

With a 72 GB general SSD and 293 MB/day inbound (free), the costs are:

DB Instance

33 hours/week

88 hours/week

db.t2.medium (2 CPUs, 2 GB)

US$15.64/mo

US$31.76/mo

db.t2.xlarge (4 CPUs, 16 GB)

US$109.07/mo

Say you have 2 years of historic data, or 144 GB:

DB Instance

33 hours/week

88 hours/week

db.t2.medium (2 CPUs, 2 GB)

US$23.92/mo

US$40.04/mo

db.t2.xlarge (4 CPUs, 16 GB)

US$117.35/mo

Say you have 5 years of historic data, or 360 GB:

DB Instance

33 hours/week

88 hours/week

db.t2.medium (2 CPUs, 2 GB)

US$71.76/mo

US$100.73/mo

db.t2.xlarge (4 CPUs, 16 GB)

US$165.19/mo

This doesn’t include the hardware running the TA algorithms or where your intermediary data (for indicators) resides. With this approach, you are most likely losing money on trades, and on database usage fees. Let’s reconsider using AWS or any cloud-hosted solution.

Distributed SQLite databases?

There are advantages with using SQLite databases over monolithic databases: speed, transportability, and scalability. In my setup, the combined candles DB for AAPL is under 10 MB. If I need to perform heavy analysis on tech alpha-dog Apple, I can load the entire candle DB into a SQLite in-memory store. If I need to see how that other NASDAQ securities move in relation and/or response to Apple, then the other 3300 or so of those securities can be loaded into distributed memory – it’s ballpark 30 GB of RAM to hold it all. Small, individual databases can be replicated across clusters on file systems such as GFS, GlusterFS, and Hadoop. This scales well on inexpensive hardware too. I just wouldn’t know where to begin with a monolithic 360 GB MySQL database entrenched in a standalone file system while keeping cost down.

Conclusion: Having a single 360 GB MySQL database doesn’t make sense. Just optimizing, backing that up, creating replicas, or getting into parallel processing with concurrency raises all kinds of challenges. It’s also expensive if hosted in a cloud environment. The only way the TA algorithms are going to be performant is if they run in parallel and on clusters. The way I’ve architected my databases is that each symbol has its own SQLite DB that has a table for minute candles and a table for day candles. In another post I will explain that we don’t need to scan each database repeatedly anyway, but we can store deltas locally and on clusters.

“One trend that needs to be highlighted is the number of listed of companies in the US: over the past 20-years from 1996 to 2016 they have been falling by half from approx 7,322 to 3,671” Source: WorldBank.org ↩