I've only recently begun exploring and learning R (especially since Dirk recommended RStudio and a lot of people here speak highly of R). I'm rather C(++)-oriented, so it got me thinking: what are the limitations of R, particularly in terms of performance?

I'm trying to weigh the C++/Python/R alternatives for research and I'm considering if getting to know R well enough is worth the time investment.

Available packages look quite promising, but there are some issues on my mind that hold me back for the time being:

How efficient is R when it comes to importing big datasets? And first of all, what counts as big in R development? I used to process a couple hundred CSV files in C++ (around 0.5M values, I suppose) and I remember it being merely acceptable. What can I expect from R here? Judging by Jeff's spectacular results, I assume that with a proper long-term solution (not CSV) I should even be able to switch to tick processing without hindrance. But what about ad-hoc data wrangling? Is the difference in performance (compared to more low-level implementations) that visible? Or is it just an urban legend?

What are the options for GUI development? Let's say I would like to go further than research-oriented analysis, like developing full-blown UIs for investment analytics/trading etc. From what I found mentioned here and on StackOverflow, with proper bindings I am free to use Python's frameworks here, and even chain further into Qt if such a need arises. But deploying such a beast must be a real nuisance. How do you cope with it?

In general I see that R's flexibility allows me to mix and match it with a plethora of other languages (either way round: using low-level extensions in R, or embedding/invoking R in projects written in another language). That seems nice, but does it make sense (I mean thinking about it from the start/concept phase, not extending preexisting solutions)? Or is it better to stick with one-and-only language (insert whatever you like/have experience with)?

So to sum up: In what quant finance applications is R a (really) bad choice (or at least can be)?

11 Answers

R can be pretty slow, and it's very memory-hungry. My data set is only 8 GB or so, I have a machine with 96 GB of RAM, and I'm still always wrestling with R's memory management. Many of the model estimation functions capture a link to their environment, which means you can end up keeping a reference to every subset of the data you've worked with. SAS was much better at dealing with large-ish data sets, but R is much nicer to work with. (This is in the context of mortgage prepayment and default modeling.)
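
A minimal sketch of that capture issue (names and sizes are illustrative): a fitted lm object keeps both a copy of its model frame and a reference to the formula's environment, so deleting the source data doesn't free the memory you might expect.

    # Illustrative only: lm() stores the model frame inside the fit,
    # so the data survives rm() of the original data frame.
    big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
    fit <- lm(y ~ x, data = big)
    rm(big); gc()
    print(object.size(fit), units = "Mb")  # still large: fit$model holds a copy
    # One mitigation: don't store the model frame or design matrix at all
    fit_light <- lm(y ~ x, data = data.frame(x = rnorm(1e6), y = rnorm(1e6)),
                    model = FALSE, x = FALSE, y = FALSE)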

Importing the data sets is pretty easy and fast enough, in my experience. It's the ballooning memory requirements for actually processing that data that's the problem.

Anything that isn't easily vectorizable seems like it would be a problem. P&L backtesting for a strategy that depends on the current portfolio state seems hard. If you're looking at the residual P&L from hedging a fixed-income portfolio, with full risk metrics, that's going to be hard.
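
For a feel of why, here is a hypothetical state-dependent backtest: each day's position depends on the equity path so far, so the loop cannot be collapsed into a single vectorized expression and runs at interpreted-R speed.

    # Hypothetical example: position sizing depends on running drawdown,
    # so each iteration needs the previous one -- no vectorization possible.
    set.seed(1)
    rets   <- rnorm(252, 0, 0.01)
    equity <- numeric(252); equity[1] <- 1
    for (t in 2:252) {
      dd  <- 1 - equity[t - 1] / max(equity[1:(t - 1)])  # drawdown so far
      pos <- if (dd > 0.05) 0 else 1                     # de-risk past 5% drawdown
      equity[t] <- equity[t - 1] * (1 + pos * rets[t])
    }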

I doubt many people would want to write a term structure model or a Monte Carlo engine in R.

Even with all that, though, R is a very useful tool to have in your toolbox. But it's not exactly a computational powerhouse.

Thanks for bringing this up. I hadn't thought that memory management could be such an issue. And great example selection -- just what falls into my field of interest. ;-) At least I shall know what not to expect from R. But as you say, it's probably nice to at least get acquainted with it, so as to know when it may come in handy. Forgive a question blatantly arising from my personal interests and my lack of knowledge in the matter at the same time, but how appropriate is R for machine learning applications, and particularly Bayesian network inference?
– Karol Piczak, Mar 15 '11 at 22:23

Well, Bayesian statistics is the prime example of mixing C++ (for speed) with R (for ease of analysis); see the Bayesian Stats Task View. Also, packages like ff and bigmemory deal with larger-than-memory data; see the HPC Task View. These issues can be addressed quite well in a hybrid manner, as I indicated in my answer above.
– Dirk Eddelbuettel, Mar 16 '11 at 14:42


I have spent some time over the last few weeks inferring the parameters of (calibrating) a stochastic-volatility model, using classical Bayesian inference. I used C# interfaced with R -- no memory problems, even with a parallelized implementation.
– Beer4All, Aug 5 '11 at 11:04


SAS appears faster because its data step encourages you to process data one line at a time, whereas R encourages loading everything into memory -- however, that isn't the only way to do it. My (biological) datasets are 300 GB+ each, and I have no particular problem processing them with a Perl -> R pipeline with intermediate data placed in SQLite databases, never loading more than 1 GB into memory.
– user1481, Oct 5 '11 at 19:21
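
For what it's worth, a minimal sketch of that chunked style in pure R, assuming the DBI/RSQLite packages and a hypothetical `measurements` table (the commenter's actual pipeline was Perl -> SQLite -> R):

    # Stream rows in fixed-size chunks so memory use stays bounded.
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "intermediate.db")
    res <- dbSendQuery(con, "SELECT value FROM measurements")
    total <- 0; n <- 0
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 1e6)      # at most ~1M rows in memory at once
      total <- total + sum(chunk$value)
      n     <- n + nrow(chunk)
    }
    dbClearResult(res); dbDisconnect(con)
    total / n                             # running mean without a full load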

Yeah, I suppose so. But that part -- "at a possible cost in terms of time to code" -- is exactly my problem. I do like C++, but at times it's really hard to get something up and running quickly. So I'm looking for a tool that would allow me to crash-test my concepts more rapidly and see if they are worth implementing in, as you call it, industrial strength at all. But I wouldn't like to shoot myself in the foot either and get locked into something that's appropriate for neither production use nor preliminary research.
– Karol Piczak, Mar 15 '11 at 22:36

I am not an R advocate, but I can attest that R is very, very good at data analysis. It is essentially a LISP-like functional language, domesticated enough to make you productive in one afternoon. It is unbeatable at getting data into your system, analyzing it, and producing high-quality output, be it LaTeX reports or charts. I have used several languages (from SAS to Python), and none comes close to R's productivity for advanced data analysis. It has an unparalleled suite of packages, with great redundancy: there are 4 packages on the Kalman filter, and almost 10 on various flavors of regularized regression alone. Very often the packages accompany papers that are just being published, giving you access to the latest techniques. It's not a problem to consume datasets of 100M rows or more, given sufficient memory. Those complaining about memory management in R should try MATLAB.
Sure, it's slow, but consider this:

Linear algebra is as fast as C++, and interfacing with LAPACK is a whole lot easier in R than in C++ (see the sketch just after this list);

there are APIs to specific DBs, key-value stores, and ODBC;

many packages are optimized for speed and written in C or Fortran;

for 99% of applications it is fast enough;

for the remaining 1%, where you are bottlenecked by computation or data management, you can speed things up in C, C++, Fortran, or Java.
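
To make the first point concrete, a small sketch: each of these one-liners dispatches straight to the BLAS/LAPACK routines R is linked against, so they run at compiled-code speed.

    # These all call optimized BLAS/LAPACK under the hood.
    set.seed(42)
    A <- matrix(rnorm(1000 * 1000), 1000, 1000)
    b <- rnorm(1000)
    x   <- solve(A, b)                  # LAPACK linear solve (dgesv)
    XtX <- crossprod(A)                 # BLAS, faster than t(A) %*% A
    ev  <- eigen(XtX, symmetric = TRUE) # LAPACK symmetric eigensolver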

I would maintain that well-coded R can be faster and more robust than poor C++ code. R can be used in production, although with some care, and not as the main language. It is definitely not suited for doing GUIs.

I thought the plug for Python was a bit off-topic, but I'll say that Python is without doubt among the most versatile and easy languages (I mean it as a compliment), and Cython is a great asset. Still, I believe languages, like people, should be judged based on what they're best at, not on what they're good enough at. I'd assert that R is best at data analysis, and that its syntax is slightly better than Python's for this purpose. It'll be a while until Python has the domain-specific packages and visualization packages of R, and most importantly the people behind them. But I'll agree on one very important point: most quant hedge funds do relatively elementary data analysis, and Python+NumPy+pandas is a sensible choice as a single language.

The greatest weakness and greatest strength of R is that it is not a statically typed language. Tasks that are easy in statically typed languages, such as refactoring, compile-time checks, and unit testing, can therefore be more difficult in R.

On the other hand, one can rapidly prototype in the R language. R is an interpreted language -- it will dynamically convert types. R is also an excellent tool for visualization and analysis (the ggplot2 library). There is also a wonderful community of R developers who are creating new solutions to problems all the time.

The R Inferno is an essential read before you develop production code with R.

As a disclaimer, I'm a noted advocate of using Python to build production systems for quant finance (old talk but: http://python.mirocommunity.org/video/1531/pycon-2010-python...). I've been very successful at doing it and largely as the result of my example many other quant shops have chosen the Python route to excellent results. The pandas Python library (http://pandas.sourceforge.net) is an open-source outgrowth of my proprietary work.

I see another fan of my work has already posted here =)

My question is: why program in C++? I don't think anyone will dispute that it's an insanely low-productivity language relative to Python or R. But Python and R are slow for iterative, procedural code. The near-panacea for Pythonistas is to use Cython (http://cython.org) to develop C-speed code that takes maybe only 1.5-2x longer to write than plain Python (to get all the type declarations right, etc.). You can also directly call methods in C/C++ libraries using Cython, so it really is the best of both worlds in my experience.

I think in general that hybrid systems are best avoided if at all possible, since debugging across "the bridge" is a thorny problem. You typically end up with more code than you planned in the higher-productivity language (e.g. R). I like Python because Python is good at all the things that R is not good at. Yes, Python's statistics libraries are very weak (though we're making progress in http://statsmodels.sourceforge.net) compared with CRAN, but in quant finance it turns out that 90% of the modeling and data analysis that you actually do isn't that statistically sophisticated. It's largely a relational data manipulation and time series processing problem (which pandas takes care of in spades -- it has much better integrated data-alignment features than just about anything in R, too).

Python is also excellent for building GUIs. I've used wxPython and PyQt and found that I could hack together a GUI in an afternoon that would have taken a week or more to do in Java or C++.

Python/pandas has a great mix of ease of development and the ability to handle and manipulate large datasets. Python will also be much more flexible for low-level integration than R, but the ability to easily build a GUI and manipulate datasets in a single language is not to be underestimated.
– rhaskett, Apr 9 '14 at 22:58

Getting something up and running quickly -- i.e., data manipulation and exploration -- is exactly what R is adept at, and there is a plethora of packages to help you. Flexibility and speed (of research) are R's primary strengths. I feel memory and computing power are less expensive than the thought cycles used to explore an idea.

If you're entering a production level arms race, obviously R is not the answer. However, I find R acceptable for production -- enough to plug it into an institutional order management system. As long as your investment strategy is based on predictive market analytics, I don't see a drastic need for speed.

Switching from C/C++ to R has increased my productivity and shrunk code line-counts for similar tasks by roughly an order of magnitude. To give a small flavor of why, consider one of the most common patterns of iteration over some collection, selection and action:

    declare iterator for collection
    for element in collection
        if (element meets condition)
            do something with element

In R this construct typically shrinks to:

    collection[condition] <- new_values

where new_values is a vector expression as well. And this is not limited to arrays/vectors.

In R, there's no need to declare iterator-variables and write loops to iterate over them because when acting on vectors (or higher dimensional data-structures) the loops are implied. Similarly, there's no need to use if because subsetting via [ ] implies a condition.
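
A concrete instance of that pattern -- winsorizing a return series with no loop and no if (values are made up):

    rets <- rnorm(1e6, 0, 0.05)       # a million simulated returns
    rets[rets >  0.10] <-  0.10       # cap the upside outliers...
    rets[rets < -0.10] <- -0.10       # ...and the downside ones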

Add to this the fact that off-by-one bugs, out-of-bounds accesses, and null dereferences are no longer an issue; that hashes (using $ list-member references) are compact and part of the language; that visualizing data by turning it into a chart (see my avatar for an example of visualizing a 3-dimensional continuous-value table) is trivial -- and you can start seeing the tremendous productivity jump.

Add to that the ~4,000 packages on CRAN covering state-of-the-art statistics, data mining, and machine learning, which mean you often don't need to write code at all -- just use what's out there. Many of these packages (where it matters) are already written in a compiled language (C or Fortran), so efficiency is largely taken care of.

Are 1 million values an issue in R?

The OP question mentions working on data-sets of 0.5 million values and whether this may be an issue with R. Let's run a quick check:
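
Something like the following three-liner (representative, not the exact original):

    x <- rnorm(1e6)                    # one million values
    summary(x)                         # five-number summary plus mean
    quantile(x, c(0.01, 0.5, 0.99))    # a few quantiles for good measure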

The above 3 lines of R code complete instantaneously on my desktop. Memory consumption of all of R loaded with the session shows 120 MB virtual and 37 MB resident, so the answer is that data sets of ~1M elements shouldn't generally be an issue. I've used 1B-item data sets on big-memory 64-bit machines. Reading data from a database (using RODBC, for example) or from a flat file (CSV, TSV, text) via read.table/read.csv or similar is trivial as well.

Inherent inefficiencies in R & possible remedies

Having said that, it is easy to write inefficient R code, both in terms of memory and speed. The most common cause I've seen is building a data-frame iteratively (adding one column at a time, using cbind() or similar), because the full copy made on every call turns an $O(n)$ process into an $O(n^2)$ one. Similarly, passing big data structures as function arguments, like a full data-frame when you only want to pass one or a few of the columns, has its undesired (pass-by-value copy) cost. If the latter is your #1 slowness cause, look at the data.table package, which allows modifying data by reference using :=, and learn about the <<- operator.
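
A small sketch of that quadratic-growth pitfall, with a preallocated alternative (sizes are illustrative):

    n <- 2000
    slow <- matrix(numeric(0), nrow = 100, ncol = 0)
    system.time(for (i in 1:n) slow <- cbind(slow, rnorm(100)))  # copies all prior columns each pass: O(n^2)

    fast <- matrix(NA_real_, nrow = 100, ncol = n)               # preallocate once
    system.time(for (i in 1:n) fast[, i] <- rnorm(100))          # fills in place: O(n)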

Reading Patrick Burns' "The R Inferno" (124 pages, available for free as a PDF) is an excellent time investment, as mentioned by Quant Guy above, if you're serious about learning R and avoiding the pitfalls.

Also seconding Dirk's comment: by using Rcpp it is possible to avoid the above pitfalls where it matters and write the performance-critical loop code in C/C++ where needed.
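
A minimal Rcpp sketch along those lines (the function name and logic are made up for illustration): the hot loop lives in C++, but it is called like any R function.

    library(Rcpp)
    cppFunction('
    double maxDrawdown(NumericVector returns) {
      double equity = 1.0, peak = 1.0, maxdd = 0.0;
      for (int i = 0; i < returns.size(); ++i) {
        equity *= 1.0 + returns[i];
        if (equity > peak) peak = equity;
        double dd = 1.0 - equity / peak;
        if (dd > maxdd) maxdd = dd;
      }
      return maxdd;
    }')
    maxDrawdown(rnorm(1e6, 0, 0.01))   # compiled-speed loop, R-level convenience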

Bottom line:

Yes, time spent learning R is well worth it, but there's no one hammer to fit all needs. Use the tool that's best for the job.

Python is hot on the heels of R as an exploratory data analysis solution for finance, and it's a heck of a lot more fun to write code in (imho:). Plus, Python tends to play well with others within a larger software ecosystem.

Who is hiring R programmers, and does R look like a language of choice in fields with a strong future (finance, medical, automotive)?

I've been developing in C/C++ for almost 17 years now and it has easily kept me gainfully employed. I started out writing small DOS applications, did a fair amount of CRUD GUI Windows development, and I'm currently in the embedded field doing work on ARM hardware.

For me, I see a very bright future in embedded. ARM chips are now multi-core and this hardware, in my opinion, is showing up in everything from automobiles to TV sets to medical equipment.

I recently did an interview for Bloomberg (what a joke) and the whole interview was about testing my knowledge of C++.

I don't know much about R, but what I can say, looking back at my 17 years in this field and at working for some very large (Fortune 5) and very small companies, is that the languages I see most often are C/C++, J2EE, and C#. And under these I see utility-type languages in use such as Perl; mostly Perl, in fact.

My compass check would be to do a Dice or Monster search and see who is looking for R programmers.

Ease of adding GUI-like features and interactivity: Python > C++ >> R... Are you aware of Shiny and rCharts, among others, for R? Also, I would add Exploration (Visualization): ? > ? > ?
– Daniel Krizian, Jun 14 '14 at 16:29


One word for R with data interactivity and ease of adding GUIs: Shiny. Very fast application development, very powerful for building ideas. Those inequalities should be flipped.
– FXQuantTrader, Jul 8 '14 at 7:16

I'll echo the previous commenters - with the advent of Shiny, R is now the most convenient language for interactive data visualization. Can your language do all these things in a couple dozen lines of very readable code?
– Paul, Jul 26 at 13:25
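
To put the Shiny comments in perspective, a minimal hypothetical app -- an interactive price-path viewer -- really is about a dozen lines:

    library(shiny)
    ui <- fluidPage(
      sliderInput("n", "Days of history", min = 30, max = 500, value = 250),
      plotOutput("prices")
    )
    server <- function(input, output) {
      output$prices <- renderPlot({
        px <- cumprod(1 + rnorm(input$n, 0, 0.01))  # simulated price path
        plot(px, type = "l", xlab = "Day", ylab = "Price")
      })
    }
    shinyApp(ui, server)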