According to my current understanding, there is a clear difference between data mining and mathematical modeling.

Data mining methods treat systems (e.g., financial markets) as a "black box". The focus is on the observed variables (e.g., stock prices). The methods do not try to explain the observed phenomena by proposing underlying mechanisms that cause the phenomena (i.e., what happens in the black box). Instead, the methods try to find some features, patterns, or regularities in the data in order to predict future behavior.

Mathematical modeling, in contrast, tries to propose a model for what happens inside the black box.

Which approach dominates in quantitative finance? Do people rely on increasingly sophisticated data mining techniques, or do they try to construct better and better mathematical models?

6 Answers

I would offer that the distinctions are i) the pure statistical approach, ii) the equilibrium-based approach, and iii) the empirical approach.

The statistical approach includes data mining. Its techniques originate in statistics and machine learning. In its extreme, no a priori theoretical structure is imposed on asset returns. Factor structure might be identified through Principal Components, for example. The goal here is to maximize predictive accuracy at the expense of intuition and explanatory power. This approach increasingly dominates at very high frequencies in modeling market microstructure, market making algorithms, volatility modeling, etc. However, even in high-frequency trading one can impose a factor model based on depth of order book, liquidity, factor characteristics (momentum, correlation with S&P), etc. Therefore my guess is that hybrid models (a factor model plus a statistical model to pick up signal in the residuals) are dominant in the HFT space.
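To make the statistical end of this spectrum concrete, here is a minimal sketch of extracting factor structure from a returns matrix with PCA. The data are simulated and all dimensions are arbitrary assumptions; this is an illustration of the technique, not anyone's production model.

```python
# Illustrative sketch only: identifying factor structure in returns via PCA.
# The returns matrix here is simulated, not real data.
import numpy as np

rng = np.random.default_rng(0)

# Simulate daily returns for 50 assets driven by 3 latent factors plus noise.
n_days, n_assets, n_factors = 1000, 50, 3
factors = rng.normal(scale=0.01, size=(n_days, n_factors))
loadings = rng.normal(size=(n_factors, n_assets))
returns = factors @ loadings + rng.normal(scale=0.005, size=(n_days, n_assets))

# PCA on the covariance matrix of demeaned returns.
X = returns - returns.mean(axis=0)
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The share of variance explained by the leading components suggests how many
# statistical "factors" are present, with no economic structure imposed.
explained = eigvals / eigvals.sum()
print(explained[:5])
```

Note that the components found this way are purely statistical constructs; giving them an economic interpretation is exactly where the other two approaches come in.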

The equilibrium approach is best characterized by CAPM or the Fama-French models that originate in academic finance. Here you have a theory (such as Arbitrage Pricing Theory, consumption-based theories, Black-Scholes, etc.) that imposes structure on the returns you are modeling. Many well-regarded quant shops use extensions of an equilibrium model (adding momentum, liquidity, own-volatility, etc.) or other factors to generate expected returns. In the mid-to-late 1990s many academics identified "anomalies" and left to start their own funds; these are now among the biggest players. Cliff Asness et al. played a big role in developing this approach at Goldman Sachs Asset Management and later at Applied Quantitative Research (AQR). Lakonishok et al. started LSV. Andrew Lo has been involved with Simplex. Also, there are funds that use BARRA or Axioma models to make explicit factor bets. Many of the funds in this camp may agree that factor structure exists in the market but disagree on its source. Some would argue that the premiums exist because of behavioral bias as opposed to compensation for systematic risk. Nonetheless I would group these sub-camps under this banner, since their approach to estimating risk is very similar (although they disagree on how to interpret factor premia and whether one can exploit them). If you measure quant funds by AUM, I would argue that this school has the highest share. Indeed the so-called quant meltdown of Aug '07 supports this view, since it implies that many firms were trading on the same factors (value and momentum in particular).
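For reference, the structure these models impose can be summarized by a generic linear factor pricing relation, of which CAPM is the one-factor case and Fama-French a three-factor case (the notation below is standard, not from the original post):

```latex
% Generic linear factor model for expected excess returns.
% CAPM is the K = 1 (market-only) case; Fama-French adds size and value,
% and practitioners extend it with momentum, liquidity, etc.
\[
  \mathbb{E}[r_i] - r_f \;=\; \sum_{k=1}^{K} \beta_{ik}\,\lambda_k
\]
```

Here the \(\beta_{ik}\) are the asset's factor exposures (typically estimated from a time-series regression of excess returns on the factors) and the \(\lambda_k\) are the factor risk premia.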

The third approach is the empirical approach. Members of this group apply a framework to analyze returns and usually hail from a statistical physics, computer science, or perhaps bioinformatics background. This is where you have a hypothesis (based on corporate finance theory or on observation of market history) and you may test the hypothesis out-of-sample or in another market. I would place Capital Fund Management, Nassim Taleb's Empirica, and Victor Niederhoffer as exemplars of this category. I would argue that you can include fundamental analysts and managers such as Warren Buffett, Ken Fisher, and Peter Lynch in this category as well. They have a model for evaluating the returns of stocks (i.e. margin of safety, strong positioning, etc.) grounded in history, although they do not formalize the model in the language of statistics.

UPDATE: Here is a link to "Challenges in Quantitative Equity Management" by Fabozzi. It does a great job discussing what quant methods are used, and covers market share and growth rates. This may answer your question at a granular level (see page 60, for example).

Things started in the late 1980s and continued through the 1990s with analytical approaches, particularly to derivative pricing (as in "hey, let's create yet another exotic option we can sell to the buy side"). The risk modelling "fashion" of the 1990s (when regulated entities such as banks needed to beef up reporting) carried on with this, as did credit modelling as that market grew by leaps and bounds.

At the same time there have always been empirical approaches to arbitrage, as with, for example, the PDT group at Morgan Stanley and other early adopters. And with increases in computing power, as well as increased data availability and electronic exchange access, there are fewer barriers to entry on this front and hence more players competing for what may be a fixed pool of profits.

As to your question about which approach dominates: hard to say, I guess it ebbs and flows somewhat. "Classic" quant investing got whiplashed in 2007 and 2008, which led to increased interest in higher frequency approaches. That space may now be saturated; time will tell. As for the analytical side: it appears over-harvested too, with few recent advances.

I don't think either approach answers the question of profitability. Most algo systems are more sophisticated than this. I would extend your list to include adaptive algorithms, statistical models, and knowing something that others overlook.

Is your question more about approaches taken on the buy side vs. sell side? If so, you may want to read Attilio Meucci's paper, P vs. Q, on this topic. He breaks down the dichotomy as derivatives pricing (the "Q" world), which uses a lot of very sophisticated modeling involving Ito calculus and PDEs, and portfolio management (the "P" world), which makes use of a lot of statistics and large scale estimation. Both sides, however, may do some sophisticated mathematical modeling, but only the buy side truly "mines" the data.
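As a rough illustration of the split (my notation, not Meucci's): the Q world prices a payoff by discounting its expectation under a risk-neutral measure, while the P world estimates the real-world distribution of returns from data.

```latex
% "Q" world: risk-neutral pricing of a payoff H_T by no-arbitrage arguments
% (constant rate r assumed here purely for simplicity).
\[
  V_0 \;=\; \mathbb{E}^{\mathbb{Q}}\!\left[ e^{-rT} H_T \right]
\]
% "P" world: estimating the real-world (physical) distribution of returns
% from historical data, e.g. its mean and covariance for portfolio choice.
\[
  \hat{\mu} \;=\; \mathbb{E}^{\mathbb{P}}[r], \qquad
  \hat{\Sigma} \;=\; \mathrm{Cov}^{\mathbb{P}}(r)
\]
```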

If your question is more about what is done on the buy side, it would depend on what you mean by "dominates." The pure data mining approaches tend not to work well at all beyond very short horizons. As @Quant-Guy has already mentioned, these approaches are most often found at market making and high frequency trading (HFT) firms. Many studies have already found that well over half the trading in U.S. equities (all trading, not just quant) is done by these firms, so there is no question that these firms dominate in terms of $ traded. If anything, the flash crash revealed that these firms are already too large. On the other hand, most of these kinds of strategies are extremely capacity constrained, whereas the modeling approaches tend to work well at longer horizons as well. Consequently, the modelers tend to dominate in terms of AUM. Like the flash crash for HFT firms, the Quant Quake of August 2007 revealed that the modeling types were also too large. Now that each category of quant finance has had its own mini-crisis, it is anybody's guess which will dominate in the future (perhaps neither/both).

If you are seriously interested in a taxonomy of quant strategies, and particularly quant equity, you should check out Morgan Stanley's presentation on this topic. They break down all quant funds as follows (some firms may do more than one of these):

- Equity
  - Equity Market Neutral
  - Technical
  - Event-Driven
  - Hedge Fund Replication
- Futures & Forwards
  - CTAs
  - Short-Term Traders
  - Systematic Macro
  - HF Replication
- Options
  - Volatility Arbitrage
- Credit
  - Correlation, basis trading, long/short
- Hybrid Asset Strategies
  - HF Replication

The MS presentation then compares EMN and Technical equity strategies in depth.

I don't see the difference between the 'statistical' and 'empirical' approaches.

Statistical, data mining, and machine learning approaches, which mostly fall under the same umbrella, rely on inductive inference. The analytic approach, on the other hand, relies on some prior axioms which we assume to be true by definition; beyond this step, the theory is constructed in a deductive manner, i.e. as the implications of a particular set of axioms about human behavior. Equilibrium approaches derive from economic theory, or financial economics. CAPM is linked to assumptions about utility and about the hetero-/homogeneity of agents and their expectations. To what degree any of these assumptions are 'true' (or not) is something the testing of the theories has to answer (and, to a certain extent, a matter of what would be 'reasonable' to expect of human behavior).

The 'law of one price' is something we can subscribe to more easily than, say, the claim that the alphas of a CAPM regression of stock returns on market returns are jointly zero (in a statistical sense), the latter resting on market risk being a particular case of consumption covariance risk, with all consumption risk proxied for by the market.
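For reference, the regression and the joint restriction being discussed here are the following (standard notation, not from the original post):

```latex
% Time-series CAPM regression for asset i and the joint restriction
% on the intercepts that the model implies.
\[
  r_{i,t} - r_{f,t} \;=\; \alpha_i + \beta_i \,(r_{m,t} - r_{f,t}) + \varepsilon_{i,t},
\qquad
  H_0:\ \alpha_1 = \alpha_2 = \cdots = \alpha_N = 0
\]
```

The joint restriction is typically assessed with a Gibbons-Ross-Shanken style test across the N assets.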

In this sense, the applicability of analytic approaches is limited to operating within the confines of the axioms which lie beneath the theory, or to expecting violations of those axioms in particular cases (which perhaps over time will extend that theory into a newer one).

However, as more complex/realistic theories are devised, there is also the concern of whether the theory itself was formed after peeking at the data, i.e. devising theories to explain persistent patterns or anomalies which an earlier theory could not 'explain' away. In this context, the Fama-French model is not a theory: it spotted an empirical regularity which was not explained by CAPM, but it is not a theory in the deductive sense. CAPM, by contrast, is a theory, albeit one which does not accurately explain (much less predict) all aspects of real markets.

The application of analytic approaches to derivatives pricing is readily understood because the law of no arbitrage is easier to swallow as a 'law' than, say, equilibrium arguments about the market. The 'purest' analytic approach is the one which tells me to buy oranges in market A at price 'p1' and sell them in market B at price 'p2', where p2 > p1. Short of this, we are always estimating some parameters even to use what is otherwise a deductive theory.

Also, I suspect that the statistical and analytic approaches are not always mutually exclusive. As an example, if we accept that CAPM more or less holds, the series constructed from the residuals could then be subjected to torture by machine learning methods to extract a signal, if any. In this case, the lines are blurred.
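A minimal sketch of that hybrid idea, with simulated data and an arbitrary choice of learner (a random forest) and features (lagged residuals), just to make the pipeline concrete:

```python
# Hybrid sketch: impose CAPM structure first, then let a statistical learner
# look for any signal left in the residuals. Data are simulated; the learner
# and features are arbitrary choices for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000

# Simulated market excess returns and one asset's excess returns.
mkt = rng.normal(0.0003, 0.01, size=n)
asset = 1.2 * mkt + rng.normal(0.0, 0.005, size=n)

# Step 1: CAPM regression (the "analytic" structure) -> residual series.
beta_hat = np.cov(asset, mkt)[0, 1] / np.var(mkt)
residuals = asset - beta_hat * mkt

# Step 2: "data mining" on the residuals: try to predict the next residual
# from the last 5 residuals, a purely statistical exercise.
lags = 5
X = np.column_stack([residuals[i:n - lags + i] for i in range(lags)])
y = residuals[lags:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:-250], y[:-250])              # train on the early sample
score = model.score(X[-250:], y[-250:])    # out-of-sample R^2

# With white-noise residuals (as simulated here) there is no signal to find,
# so the out-of-sample R^2 should hover around or below zero.
print(f"beta_hat = {beta_hat:.2f}, out-of-sample R^2 = {score:.3f}")
```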

If we think of this like a Bayesian, then the 'analytic' or theoretical arguments help us form our prior, and the data at hand are 'blended' with that prior to arrive at the posterior.
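In symbols, with \(\theta\) standing for whatever the model parameters are:

```latex
% Bayesian updating: the theory-driven prior is combined with the data
% through the likelihood to produce the posterior.
\[
  p(\theta \mid \text{data}) \;\propto\; p(\text{data} \mid \theta)\, p(\theta)
\]
```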

At one extreme, 'pure' statistical/data mining approaches function without any prior structure imposed on the market dynamics and rely solely on the detection of empirical regularities (though even using a trader's inputs in a data mining approach would constitute becoming somewhat 'analytic', as we expect the trader to have some sort of prior story to justify his inputs). At the other extreme, a purely analytic example would be exploitation of the LOP as in the oranges example, i.e. the inference holds by force of logic.

As for the question of which approach is dominant: I suspect, as Dirk pointed out, that these things ebb and flow. Moreover, 'reasonable' assumptions giving some 'weak' structure to a data mining approach are probably more profitable, whenever we can posit some structure, than imposing no structure at all.

As best I can tell, the primary difference between traditional approaches (be they classical or Bayesian) and the newer "predictive analytics" is that the traditional approaches make a few explicit (and testable) assumptions, and then give you quantified estimates of your potential errors. The newer methods trade that away: in exchange for a chance at more sophisticated (and more accurate) prediction, you get a "black box" where you can't know much about the error bars in any given application of the method.