The Securities and Exchange Commission is now collecting 400 gigabytes of market data daily, including information on every trade across numerous exchanges, as part of what SEC chairman Elisse Walter on Tuesday called an "unprecedented" push to use data to better understand how the market works.

The big data effort, which centers on a system known as the Market Information Data Analytics System (Midas), is being used not only to understand market trends and rapidly emerging modes of trading such as high-frequency trading, but also, according to the SEC, to inform future policy-making.

Midas, which is costing the SEC $2.5 million a year, captures data such as time, price, trade type and order number on every order posted on national stock exchanges, every cancellation and modification, and every trade execution, including some off-exchange trades. Combined, that adds up to billions of records a day.
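As a rough illustration of what can be computed from records like these, the sketch below tallies a per-symbol cancellation rate from a handful of made-up events. The Event layout, the field names, the microsecond timestamp and the sample data are all assumptions for illustration; the article does not describe the actual Midas schema or the Tradeworx API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical order-book event; not the real Midas record layout."""
    time_us: int    # timestamp in microseconds (assumed unit)
    symbol: str     # ticker symbol (assumed field)
    kind: str       # "new", "cancel", "modify" or "execute"
    order_id: int   # order number
    price: float


def cancellation_rates(events):
    """Return {symbol: cancellations / new orders} for symbols with new orders."""
    new_orders = Counter()
    cancels = Counter()
    for ev in events:
        if ev.kind == "new":
            new_orders[ev.symbol] += 1
        elif ev.kind == "cancel":
            cancels[ev.symbol] += 1
    # A Counter returns 0 for symbols that never saw a cancellation.
    return {sym: cancels[sym] / n for sym, n in new_orders.items()}


# Made-up sample: two ABC orders (one cancelled), one XYZ order (executed).
sample = [
    Event(1, "ABC", "new", 1, 10.00),
    Event(2, "ABC", "new", 2, 10.01),
    Event(3, "ABC", "cancel", 1, 10.00),
    Event(4, "XYZ", "new", 3, 25.50),
    Event(5, "XYZ", "execute", 3, 25.50),
]
rates = cancellation_rates(sample)
print(rates)
```

At Midas scale, the same tally would have to run over billions of records a day rather than an in-memory list, which is where the storage and parallel-processing choices come in.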

Although the system has been live for only a few months, the data in some cases goes back as far as four or five years. The uncompressed archive of Midas data would amount to about 1 petabyte, although the archive has been compressed to just over 100 terabytes.

According to Walter and the top official overseeing Midas, Gregg Berman, who was recently named associate director of the Office of Analytics and Research in the SEC's Division of Trading and Markets, Midas has potentially wide application for the SEC.

"The downpour of data generated by the markets every hour will lead to better regulation and better investor protection," Walter said in a speech Tuesday, adding that Midas will "dramatically improve our understanding of the way today's markets function."

Much of the initial public information about Midas has centered on its ability to monitor and analyze high-frequency trading. "It will give us dramatically better insight into the function of a market that moves many millions of dollars in millionths of a second," Walter said in her speech. "It will be like the first time scientists used high-speed photography and strobe lighting to see how a hummingbird's wings actually move."

In more tangible terms, such information will, for example, help the SEC to rapidly analyze the causes of so-called flash crashes, in which the market drops significantly in a brief time period, and will facilitate the study of the need for and possible impact of potential regulations like requiring high-frequency traders to hold quotes for minimum time periods.

Much of the big data push stems from the May 6, 2010, "flash crash" in which the Dow Jones Industrial Average dropped about 600 points within five minutes before later recovering those losses. High-frequency trading would later shoulder much of the blame. The SEC investigation, led by Berman, took months and a lot of custom software development, but Midas could greatly accelerate such analysis.

"We realized we needed this data all the time, and not just in an emergency, to understand detailed patterns, monitor the markets, and inform us on how orders flow, interactions with trades, rates of cancellation, volatility, and other things that are really central to understanding market structure," Berman, a former hedge fund manager and Princeton-trained nuclear physics PhD, said in a recent interview with InformationWeek.

Despite the initial emphasis on high-frequency trading, however, Berman says that Midas has a "much wider" focus. Midas could prove invaluable in the SEC's efforts to ensure more data-informed policy-making across the board. "As you can imagine, we hear lots of diverse opinions," Berman said. "Sometimes you get a nice 20-page treatise on why X is the worst thing in the world, followed by a nice 20-page treatise on why X is the best thing in the world. Our own independent analysis can contribute to the debate directly," he said.

Midas won't be able to fill in all of the current holes in the SEC's vision. For example, the SEC won't be able to see the identities of entities involved in trades, and Midas doesn't look at futures trades or trades executed outside the system in what are known as "dark pools." However, even without this data, Midas might serve as the necessary foundation for doing much wider analyses in the future.

Midas is a hosted system, so the SEC can focus on data analysis, not on keeping the lights on. The system is hosted by the SEC's vendor, Tradeworx, which is both a high-frequency trading technology vendor and a trading firm itself. Tradeworx uses the cloud, specifically Amazon Web Services, to help power Midas. Services used include Amazon S3 storage, Amazon EC2 infrastructure-as-a-service, and Amazon Elastic Block Store. "The larger your data sets become, the more difficult it becomes to analyze," said Tradeworx CEO Manoj Narang. "This is why Amazon makes an enormous amount of sense. You can fire up 100 servers arbitrarily to analyze data in parallel."

Midas' underlying analytics platform was homegrown by Tradeworx and engineered for fast data transfer and rapid data analytics. "There are a lot of optimized calculations, a lot of work on parallelization of calculations," said Narang.
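The general idea of splitting a large data set into partitions, analyzing each independently and then combining the results can be sketched in a few lines. This is a generic illustration, not the Tradeworx platform: local threads stand in for the servers Narang describes, and the partition layout, the summarize and analyze_in_parallel functions, and the sample trades are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(partition):
    """Per-partition work: count trades and sum the shares traded."""
    return len(partition), sum(shares for _price, shares in partition)


def analyze_in_parallel(partitions, workers=4):
    """Summarize each partition on its own worker, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, partitions))
    total_trades = sum(n for n, _ in partials)
    total_shares = sum(s for _, s in partials)
    return total_trades, total_shares


# Made-up data: each partition is a list of (price, shares) trades,
# for example one partition per trading day or per symbol.
partitions = [
    [(10.00, 100), (10.01, 200)],
    [(10.02, 300)],
    [(9.99, 50), (10.00, 150), (10.03, 200)],
]
totals = analyze_in_parallel(partitions)
print(totals)
```

Because each partition is summarized without reference to the others, adding workers (or, in a cloud setting, servers) shortens the wall-clock time without changing the result.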

To the data analyst, Midas isn't fronted by a fancy Web app. Rather, it's accessed through a Unix shell, and the interface requires knowledge of scripting. Berman rattles off the tools of the trade: C++, Perl, Python. Tradeworx has also thrown in its own API and a series of commands to help the SEC manipulate data. "It's sort of like Legos," Berman says. "If you know how to work with Legos, you can build amazing things."

Although Midas isn't graphics-heavy, there are rich visualizations to help analysts dig deeper into the data, and Narang said that future updates to the system will focus on rolling out more visual tools so that users don't have to understand Unix to use Midas.

The capability in Midas isn't necessarily novel. Other companies such as Nanex have said that they can offer similar functionality. However, it's the first time that the SEC has had such power at its fingertips, which means that the agency needs to make sure it has the right people in place to make use of the data. Now that the system is live, Berman is on the lookout for that talent. "We brought a few people in, but we need C++ programmers, algorithmic high-frequency trading people, people who worked on trading desks working on models," he said. "We're looking for people with a level of technical and market expertise that's somewhat unprecedented."
