Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.

finance

January 19, 2015

Bruno Rodrigues teaches a class on applied econometrics at the University of Strasbourg, with a focus on implementing econometric concepts in the R language. Since many of the students don't have any previous programming background, he's put together a tutorial on the basics of applied econometrics with R. The first two chapters serve as a general-purpose beginners' introduction to R, while chapter 3 explores basic applied econometrics with R (primarily data summaries and linear models). A fourth chapter to come promises a focus on reproducible research, so check back for updates to this free document at the link below. (And if you're looking for a more advanced tutorial on econometrics with R, check out Econometrics in R by Grant Farnsworth.)

December 09, 2014

About 22 months ago I had the privilege of introducing Quandl to the world on this blog. At that time Quandl had about 2 million datasets and a few hundred users. (And we thought that was fabulous.) Now, at the end of 2014, we have some 12 million datasets on the site and tens of thousands of registered users. On most days we serve about 1 million API requests.

One thing that has not changed however, is the simplicity with which R users can access Quandl. Joseph’s post last year, and Ilya’s this year both demonstrated the ease of connecting to Quandl via R.

Adoption of Quandl in the R community was perhaps the biggest factor in our early success. Thus it is fitting that I am back guest-blogging here at this moment in time because we are actually at the dawn of a new chapter at Quandl: We’re adding commercial data to the site. We are going to make hundreds of commercial databases from domain experts like Zacks, ORATS, OptionWorks, Corre Group, MP Maritime, DelphX, Benzinga and many others available via the same simple API.

What makes this new foray interesting is that we won't be playing by the rules that the incumbent oligarchy of data distributors have established. Their decades-old model has not served consumers well: it keeps data prices artificially high, it cripples innovation, and it is antithetical to modern patterns of data consumption and usage. In fact, the business models around commercial data predate the internet itself. They can and should be disrupted. So we're going to give that a go.

Our plan is nothing less than democratizing supply and demand of commercial data. Anyone will be able to buy data on Quandl. There will be no compulsory bundling, forcing you to pay for extra services you don’t need; no lock-in to expensive long-term contracts; no opaque pricing; no usage monitoring or consumption limits; no artificial scarcity or degradation. Users will be able to buy just the datasets they need, a la carte, as and when they need them. They will get their data delivered precisely the way they want, with generous free previews, minimal usage restrictions and all the advantages of the Quandl platform. And of course, the data itself will be of the highest quality; professional grade data manufactured by the best curators in the world.

We will also democratize the supply of data. Anyone, from existing data vendors and primary data producers to individuals and entrepreneurs, will have equal access to the Quandl platform and the unmet demand of the Quandl user base. We want to create a situation where anyone capable of curating and maintaining a database can monetize their work. In time, we hope that competition among vendors will force prices to their economic minimum. This is the best possible way to deliver the lowest possible prices to our users.

At the same time this democratization should empower capable curators to realize the full value of their skills: If someone can build and maintain a database that commands $25 a month from 1000 people, then Quandl can be the vehicle that transforms that person from skilled analyst to successful data vendor.

If you were to characterize what we are doing as a marketplace for data you would be absolutely correct. We are convinced that fair and open competition will do great things, both for data consumers who are, frankly, being gouged, and for existing and aspirational data vendors who are disempowered. Open and fair competition is a panacea for both ills: it effects lower prices, wider distribution, better data quality, better documentation and better customer service.

Our foray into commercial data has already started with 6 pilot vendors. They range from entrepreneurially-minded analysts who are building databases to rival what the incumbents currently sell for exorbitant fees, to long-established data vendors progressive enough to embrace Quandl’s modern paradigm. We have no less than 25 vendors coming online in Q1 2015.

So, Quandl in 2015 should very quickly become everything an analyst needs: A free and unlimited API, dozens of package connections including to R, 12 million (and growing) free and open datasets, and access to commercial data from the best companies in the world at ever decreasing prices. Wish us luck!

Public pension funds invest contributions from governments and public sector workers in an effort to ensure that they can pay all promised benefits when due. State and local government pension funds in the United States currently have more than $3 trillion invested, more than $2 trillion of which is in equity-like investments. For example, NYC, has over $158 billion invested. Governments usually act as a backstop: if pension fund investment returns do better than expected, governments will be able to contribute less, but if investment returns fall short they will have to contribute more. When that happens, politicians must raise taxes or cut spending programs. These risks often are not well understood or widely discussed. (For a discussion of many of the most significant issues, see Strengthening the Security of Public Sector Defined Benefit Plans.)

We are building stochastic simulation models in R to help quantify the investment risks and their potential consequences. We are modeling the finances of specific pension plans, taking into account all of the main flows such as current and expected benefit payouts to workers, contributions from governments and from workers, and investment returns, and how they affect liabilities and investible assets. The models will take into account the changing demographics of the workforce and retiree populations. We are modeling investment returns stochastically, examining different return scenarios and different economic environments, as well as different governmental contribution policies. We will use these models to evaluate the risks currently being taken and to help provide policy advice to governments, pension funds, and others. (For a full description of our approach, see Modeling and Disclosing Public Pension Fund Risk, and Consequences for Pension Funding Security)

We have chosen R because:

It is extremely flexible, allowing us to do data collection, data management, exploratory data analysis, and other essential non-modeling tasks.

Manipulating matrices is easy.

It has sophisticated tools for modeling investment returns and for analyzing and presenting results of simulations. And it has great tools for visualizing results.

The work can be completely open and reproducible, which is essential to the success of this project.

All programming languages have weaknesses. R’s great flexibility means that it is easy to write ill-organized programs that are hard to understand and debug. And poorly written programs that do not take advantage of R’s strengths can be extremely slow. We believe we can compensate for these weaknesses by making our programs modular, using a consistent programming style with appropriate documentation, and by using R features smartly and speed-testing where appropriate.

R analysts and programmers interested in learning about the opportunity to work on this project should examine the programmer/analyst position description and related materials at the Rockefeller Institute’s web site.

August 12, 2014

Correlations between holdings in a portfolio are of course a key component in financial risk management. Borrowing a tool common in fields such as bioinformatics and genetics, we will look at how to use heat maps in R for visualizing correlations among financial returns, and examine behavior in both a stable and down market.

While base R contains its own heatmap(.) function, the reader will likely find the heatmap.2(.) function in the R package gplots to be a bit more user friendly. A very nicely written companion article entitled A short tutorial for decent heat maps in R (Sebastian Raschka, 2013), which covers more details and features, is available on the web; we will also refer to it in the discussion below.

We will present the topic in the form of an example.

Sample Data

As in previous articles, we will make use of R packages Quandl and xts to acquire and manage our market data. Here, in a simple example, we will use returns from the following global equity indices over the period 1998-01-05 to the present, and then examine correlations between them:

As noted above, heatmap.2(.) is the function in the gplots package that we will use. For convenience, we’ll wrap this function inside our own generate_heat_map(.) function, as we will call this parameterization several times to compare market conditions.

As for the parameterization, the comments should be self-explanatory, but we’re keeping things simple by eliminating the dendogram, and leaving out the trace lines inside the heat map and density plot inside the color legend. Note also the setting Rowv = FALSE, this ensures the ordering of the rows and columns remains consistent from plot to plot. We’re also just using the default color settings; for customized colors, see the Raschka tutorial linked above.

As expected, we trivially have correlations of 100% down the main diagonal. Note that, as shown in the color key, the darker the color, the lower the correlation. By design, using the parameters of the heatmap.2(.) function, we set the title with the main = title parameter setting, and the correlations shown in black by using the notecol="black" setting.

Next, let’s look at a period of relative calm in the markets, namely the year 2004:

Note that in this case, at a glance of the darker colors in each of the cells, we can see that we have even lower correlations than those from our entire data set. This may of course be verified by comparing the numerical values.

Finally, let’s look at the opposite extreme, during the upheaval of the financial crisis in 2008-2009:

generate_heat_map(corr3,"Correlations of World Market Returns, Oct 2008 - May 2009")

This yields the following heat map:

Note that in this case, again just at first glance, we can tell the correlations have increased compared to 2004, by the colors changing from dark to light nearly across the board. While there are some correlations that do not increase all that much, such as the SP500/Nikkei and the Russell 2000/Kospi values, there are others across international and capitalization categories that jump quite significantly, such as the SP500/Hang Seng correlation going from about 21% to 41%, and that of the Russell 2000/DAX moving from 43% to over 57%. So, in other words, portfolio diversification can take a hit in down markets.

Conclusion

In this example, we only looked at seven market indices, but for a closer look at how correlations were affected during 2008-09 -- and how heat maps among a greater number of market sectors compared -- this article, entitledDiversification is Broken, is a recommended and interesting read.

July 24, 2014

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, R Marries NetLogo: Introduction to the RNetLogo Package in the Journal of Statistical Software, academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in Nature for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. in the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages on the Housing Crisis. (Note that when asked about how one would calibrate such a model Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a defacto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent JASS paper Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: a Cookbook Using NetLogo and R by Thiele and his collaborators describe the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's JSS paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

The general idea is that turtles represent the frontier of the fire run through a grid of randomly placed trees. Not shown in the above snippet is the logic that shows that the entire model is controlled by a single parameter representing the density of the trees.

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programatically run multiple simulations and capture the results for analysis in R. The following two lines of code runs the Fire model twenty times for each value of density between 55 and 65, the region surrounding the pahse transition.

d <-seq(55,65,1)# vector of densities to examine
res <- rep.sim(d,20)# Run the simulation

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.

July 01, 2014

Last time in part 1 of this topic, we used the xts and lubridate packages to interpolate a zero rate for every date over the span of 30 years of market yield curve data. In this article, we will look at how we can implement the two essential functions of a term structure: the forward interest rate, and the forward discount factor.

Definitions and NotationWe will apply a mix of notation adopted in the lecture notes Interest Rate Models: Introduction, pp 3-4, from the New York University Courant Institute (2005), along with chapter 1 of the book Interest Rate Models — Theory and Practice (2nd edition, Brigo and Mercurio, 2006). A presentation by Damiano Brigo from 2007, which covers some of the essential background found in the book, is available here, from the Columbia University website.

First, t ≧ 0 and T ≧ 0 represent time values in years.

P(t, T) represents the forward discount factor at time t ≦ T, where T ≦ 30 years (in our case), as seen at time = 0 (ie, our anchor date). In other words, again in US Dollar parlance, this means the value at time t of one dollar to be received at time T, based on continuously compounded interest. Note then that, trivially, we must have P(T, T) = 1.

R(t, T) represents the continuously compounded forward interest rate, as seen at time = 0, paid over the period [t, T]. This is also sometimes written as F(0; t, T) to indicate that this is the forward rate as seen at the anchor date (time = 0), but to keep the notation lighter, we will use R(t, T) as is done in the NYU notes.

We then have the following relationships between P(t, T) and R(t, T), based on the properties of continuously compounded interest:

P(t, T) = exp(-R(t, T)・(T - t)) (A)

R(t, T) = -log(P(t, T)) / (T - t) (B)

Finally, the interpolated the market yield curve we constructed last time allows us to find the value of R(0, T) for any T ≦ 30. Then, since by properties of the exponential function we have

P(t, T) = P(0, T) / P(0, t) (C)

we can determine any discount factor P(t, T) for 0 ≦ t ≦ T ≦ 30, and therefore any R(t, T), as seen at time = 0.

Converting from Dates to Year FractionsBy now, one might be wondering -- when we constructed our interpolated market yield curve, we used actual dates, but here, we’re talking about time in units of years -- what’s up with that? The answer is that we need to convert from dates to year fractions. While this may seem like a rather trivial proposition -- for example, why not just divide the number of days between the start date and maturity date by 365.25 -- it turns out that, with financial instruments such as bonds, options, and futures, in practice we need to be much more careful. Each of these comes with a specified day count convention, and if not followed properly, it can result in the loss of millions for a trading desk.

This is one commonly used convention and is very simple to calculate; however, for certain bond calculations, it can become much more complicated, as leap years are considered, as well as local holidays in the country in which the bond is traded, plus more esoteric conditions that may be imposed. To get an idea, look up day count conventions used for government bonds in various countries.

In the book by Brigo and Mercurio noted above, the authors in fact replace the “T - t” expression with a function (tau) τ(t, T), which represents the difference in time based upon the day count convention in effect.

For the remainder of this article, we will implement to the “T - t” above as a day count function, as demonstrated in the example to follow.

Implementation in RWe will first revisit the example from our previous article on interpolation of market zero rates, and then use this to demonstrate the implementation of term structure functions to calculate forward discount factors and forward interest rates.

a) The setup from part 1

Let’s first go back to the example from part 1 and construct our interpolated 30-year market yield curve, using cubic spline interpolation. Both the xts and lubridate packages need to be loaded. The code is republished here for convenience:

This gives us a reasonably smooth curve, preserving the monotonicity of our data points:

c) Implement functions for discount factors and forward rates

We will now implement these functions, utilizing equations (A), (B), and (C) above. We will also take advantage of the functional programming feature in R, by incorporating the Actual / 365 Fixed day count as a functional argument, as an example. One could of course implement any other day count convention as a function of two lubridate dates, and pass it in as an argument.

First, let’s implement the Actual / 365 Fixed day count as a function:

# Simple example of a day count function: Actual / 365 Fixed# date1 and date2 are assumed to be lubridate dates, so that we can# easily carry out the subtraction of two dates.dayCountFcn_Act365F <- function(date1, date2){ yearFraction <- as.numeric((date2 - date1)/365) return(yearFraction) }

We have shown how one can implement a term structure of interest rates utilizing tools available in the R packages lubridate and xts. We have, however, limited the example to interpolation within the 30 year range of given market data without discussing extrapolation in cases where forward rates are needed beyond the endpoint. This case does arise in risk management for longer term financial instruments such as variable annuity and life insurance products, for example. One simple-minded -- but sometimes used -- method is to fix the zero rate that is given at the endpoint for all dates beyond that point. A more sophisticated approach is to use the financial cubic spline method as described in the paper by Adams (2001), cited in part 1 of the current discussion. However, xts unfortunately does not provide this interpolation method for us out of the box. Writing our own implementation might make for an interesting topic for discussion down the road -- something to keep in mind. For now, however, we have a working term structure implementation in R that we can use to demonstrate derivatives pricing and risk management models in upcoming articles.

June 17, 2014

In this post, I will demonstrate how to obtain, stitch together, and clean data for backtesting using futures data from Quandl. Quandl was previously introduced in the Revolutions Blog. Functions I will be using can be found in my IK Trading package available on my github page.

With backtesting, it’s often times easy to get data for equities and ETFs. However, ETFs are fairly recent financial instruments, making it difficult to conduct long-running backtests (most of the ETFs in inception before 2003 are equity ETFs), and with equities, they are all correlated in some way, shape, or form to their respective index (S&P 500, Russell, etc.), and their correlations generally go to 1 right as you want to be diversified.

An excellent source of diversification is the futures markets, which contain contracts on instruments ranging as far and wide as metals, forex, energies, and more. Unfortunately, futures are not continuous in nature, and data for futures are harder to find.

Thanks to Quandl, however, there is some freely available futures data. The link can be found here.

The way Quandl structures its futures is that it uses two separate time series: the first is the front month, which is the contract nearest expiry, and the second is the back month, which is the next contract. Quandl’s rolling algorithm can be found here.

In short, Quandl rolls in a very simple manner; however, it is also incorrect, for all practical purposes. The reason being is that no practical trader holds a contract to expiry. Instead, they roll said contracts sometime before the expiry of the front month, based on some metric.

This algorithm uses the open interest cross to roll from front to back month and then lags that by a day (since open interest is observed at the end of trading days), and then “rolls” back when the front month open interest overtakes back month open interest (in reality, this is the back month contract becoming the new front month contract). Furthermore, the algorithm does absolutely no adjusting to contract prices. That is, if the front month is more expensive than the back month, a long position would lose the roll premium and a short position would gain it. This is in order to prevent the introduction of a dominating trend bias. The reason that the open interest is chosen is displayed in the following graph:

This is the graph of the open interest of the front month of oil in 2000 (black time series), and the open interest of the back month contract in red. They cross under and over in repeatable fashion, making a good choice on when to roll the contract.Let’s look at the code:

The arguments to the function are a stem code, a start date, end date, and two print arguments (for debugging purposes). The stem code takes the form of CHRIS/<<EXCHANGE>>_<<CONTRACT STEM>>, such as “CHRIS/CME_CL” for oil.

This code simply fetches both futures contracts from Quandl and combines them into one xts. Although Quandl takes a type argument, I have programmed this function specifically for xts types of objects, since I will use xts-dependent functionality later.Let's move along.

#impute bad back month open-interest prints -- #if it is truly a low quantity, it won't make a #difference in the computation.both$OI[both$OI==-1] <- both$lagOI[both$OI==-1]both$BI[both$BI==-1] <- both$lagBI[both$BI==-1]

This is the first instance of countermeasures in the function taken to counteract messy data. This imputes any open interest NAs with the value -1, and then imputing the first NA after a non NA day with the previous day's open interest. Usually, days on which open interest is not available are days after which the contract is lightly traded, so the values that will be imputed in cases during which the contract was not traded will be negligible. However, imputing an NA value with a zero during the midst of heavy trading has the potential to display the wrong contract as the one with the higher open interest.

#since we have to observe OI cross, we roll next day#any time we're not on the back contract, we're on the front contractboth$tracker[both$OIdiff > 0] <- 1both$tracker <- na.locf(both$tracker)

This code sets up the system for keeping track of which contract is in use. When the difference in open interest crosses under zero, that's the formal open interest cross, and we roll a day later. On the other hand, when the open interest difference crosses back over zero, that isn't a cross. That is the back month contract becoming the front month contract. For instance, assume that you rolled to the June contract in the third week of May. Quandl would display the June contract as the back contract in May, but come June, that June contract is now the front contract instead. So therefore, there is no lag on the computation in the second instance.

Using the previous tracker variable, the code is then able to compile the relevant data for the futures contract. That is, front contract when the front contract is more heavily traded, and vice versa.

This code uses xts-dependent functionality with the rbind call. In this instance, there are two separate streams: the front month stream, and the back month stream. Through the use of xts functionality, it's possible to merge the two streams indexed by time.

Next, the code imputes all NA open values with the close (settle) from the previous trading day. In the case that opens are the only missing field, I opted for this over removing the observation entirely. Next, any observation with a missing open, high, low, or close value gets removed. This is simply my personal preference, rather than attempting to take some form of liberty with imputing data to the highs, lows, and closes based on the previous day, or some other pattern thereof.

If verbose is enabled, the function will print the actual data removed.

ATR <- ATR(HLC=HLC(relevant))

#Technically somewhat cheating, but could be stated in terms of #lag 2, 1,and 0.#A spike is defined as a data point on Close that's more than #5 ATRs away from both the preceding and following day.spikes <- which(abs((relevant$Close-lag(relevant$Close))/ATR$atr) > 5 & abs((relevant$Close-lag(relevant$Close, -1))/ATR$atr) > 5)if(verbose) { message(paste(instrument, "had", length(spikes),"spike days removed from data.")) print(relevant[spikes,])}

Finally, some countermeasures against spiky types of data. I define a spike as a price move in the closing price which is 5 ATRs (in this case, n=14) away in either direction from both the previous and next day. Spikes are removed. After this, the code is complete.

To put this into perspective visually, here is a plot of the 30-day Federal Funds rate (CHRIS/CME_FF), from 2008, demonstrating all the improvements my process makes to Quandl’s raw data in comparison to the front month continuous (current) contract.

The raw, front-month data is displayed in black (the long lines are missing data from quandl, displayed as zeroes, but modified in scale for the sake of the plot). The results of the algorithm are presented in blue.

At the very beginning, it’s apparent that the more intelligent rolling algorithm adapts to what would be the new contract prices sooner. Secondly, all of those long bars on which Quandl had missing data have been removed so as not to interfere with calculations. Lastly, at the very end, that downward “spike” in prices has also been dealt with, making for what appears to be a significantly more correct pricing series.

To summarize, here's what the code does:

1) Downloads the two data streams2) Keeps track of the proper contract at all time periods3) Imputes or removes bad data, bad data being defined as incomplete observations or spikes in the data.

The result is an xts object practically identical to one downloaded with more common to find data, such as equities or ETFs, which allows for a greater array of diversification in terms of the instruments on which to backtest trading strategies, such as with the quantstrat package.

The results of such backtests can be found on my blog, and my two R packages (this functionality will be available in my IKTrading package) can be found on my Github page.

June 10, 2014

Last time, we used the discretization of a Brownian Motion process with a Monte Carlo method to simulate the returns of a single security, with the (rather strong) assumption of a fixed drift term and fixed volatility. We will return to this topic in a future article, as it relates to basic option pricing methods, which we will then expand upon.

For more advanced derivatives pricing methods, however, as well as an important topic in its own right, we will talk about implementing a term structure of interest rates using R. This will be broken up into two parts: 1) working with dates and interpolation (the subject of today’s article), and 2) calculating forward interest rates and discount factors (the topic of our next article), using the results presented below.

Working with Dates in R

The standard date objects in base R, to be honest, are not the most user-friendly when it comes to basic date calculations such as adding days, months, or years. For example, just to add, say, five years to a given date, we would need to do the following:

So, you’re probably asking yourself, “wouldn’t it be great if we could just add the years like this?”:

endDate <- startDate + years(5) # ?

Well, the good news is that we can, by using the lubridate package. In addition, instantiating a date is also easier, simply by indicating the date format (eg ymd(.) for year-mo-day) as the function. Below are some examples:

Remark: Note that “UTC” is appended to the end of each date, which indicates the Coordinated Universal Time time zone (the default). While it is not an issue in these examples, it will be important to specify a particular time zone when we set up our interpolated yield curve, as we shall see shortly.

Interpolation with Dates in R

When interpolating values in a time series in R, we revisit with our old friend, the xts package, which provides both linear and cubic spline interpolation. We will demonstrate this with a somewhat realistic example.

But while the talks may be illuminating, the real take-aways from the conference are the R packages. These tools embody the work of the thought leaders in the field of computational finance and are the means for anyone sufficiently motivated to understand this cutting edge work. By my count, 20 of the 44 tutorials and talks given at the conference were based on a particular R package. Some of the packages listed in the following table are well-established and others are work-in-progress sitting out on R-Forge or GitHub, providing opportunities for the interested to get involved.

Many of these packages / projects also have supplementary material that is worth chasing down. Be sure to take a look at Alexios Ghalanos recent post that provides an accessible introduction to his stellar keynote address.

Many thanks to the organizers of the conference who, once again, did a superb job, and to the many professionals attending who graciously attempted to explain their ideas to a dilletante. My impression was that most of the attendies thoroughly enjoyed themselves and that the general sentiment was expressed by the last slide of Stephen Rush's presentation:

May 08, 2014

R/Finance 2014 is just about a week away. Over the past four or five years this has become my favorite conference. It is small (300 people this year), exceptionally well-run, and always offers an eclectic mix of theoretical mathematics, efficient, practical computing, industry best practices and trading “street smarts”. This clip of Blair Hull delivering a keynote speech at R/Finance 2012 is an example of the latter. It ought to resonate with anyone who has followed some of the hype surrounding Michael Lewis recent book Flash Boys.

In any event, I thought it would be a good time to look at the relationship between R and Finance and to highlight some resources that are available to students, quants and data scientists looking to do computational finance with R.

First off, consider what computational finance has done for R. From the point of view of the development and growth of the R language, I think it is pretty clear that computational finance has played the role of the ultimate “Killer App” for R. This high stakes, competitive environment where a theoretical edge or a marginal computational advantage can mean big rewards has led to R package development in several areas including time series, optimization, portfolio analysis, risk management, high performance computing and big data. Additionally, challenges and crisis in the financial markets have helped accelerate R’s growth into big data. In this podcast, Michael Kane talks about the analysis of the 2010 Flash Crash he did with Casey King and Richard Holowczak and describes using R with large financial datasets.

Conversely, I think that it is also clear that R has done quite a bit to further computational finance. R’s ability to facilitate rapid data analysis and visualization, its great number of available functions and algorithms and the ease with which it can interface to new data sources and other computing environments has made it a flexible tool that evolves and adapts at a pace that matches developments in the financial industry. The list of packages in the Finance Task View on CRAN indicates the symbiotic relationship between the development of R and the needs of those working in computational finance. On the one hand, there are over 70 packages under the headings Finance and Risk Management that were presumably developed to directly respond to a problem in computational finance. But, the task view also mentions that packages in the Econometrics, Multivariate, Optimization, Robust, SocialSciences and TimeSeries task views may also be useful to anyone working in computational finance. (The High Performance Computing and Machine Learning task views should probably also be mentioned.) The point is that while a good bit of R is useful to problems in computational finance, R has greatly benefited from the contributions of the computational finance community.

If you are just getting started with R and computational finance have a look at John Nolan’s R as a Tool in Computational Finance. Other resources for R and computational finance that you may find helpful are::

Package VignettesSeveral of the Finance related packages have very informative vignettes or associated websites. For example have a look at those for the packages portfolio, rugarch, rquantlib (check out the cool rotating distributions), PerformanceAnalytics, and MarkowitzR.