I can see the pros and cons. The pro is that it’s easier to line things up and make sure the constraints match, indexes match, and that every variable gets a prior. Michael objected that it broke the nice abstraction of the Stan language. But what does that mean? The discussion persisted offline in a dev list thread, and Michael and Sean’s comments reminded me of where the original design came from, so I decided to write 1000 words in reply.

Before the language

Stan started as just C++ code; Matt was working on what would eventually become the NUTS algorithm and I was working on building a simple version of HMC in C++, based on the discussion in David MacKay’s textbook on information theory (all the algorithms in that book are very easy to read, which is one of the main reasons I like it so much). I wrote up a multivariate normal with gradients calculated by hand and was gobsmacked at how well it worked compared to Gibbs or isotropic random-walk Metropolis (we all know why now after Michael’s very clear papers with geometric explanations).

Autodiff to the rescue

It was clear from the algorithm that from the HMC perspective, all you need is a differentiable log density function. That was the initial driving force—how did we get a differentiable log density? I begged for help on Andrew’s blog [this blog, that is] and John Salvatier suggested automatic differentiation. I still remember in the middle of an early meeting Andrew saying “Is that finite differences?” and my looking it up on the web on my iPad. We then got really excited about it, as reflected in this early post by Andrew on autodiff.

Once I got over my amazement at how well both HMC and autodiff worked, I started in with a very rudimentary autodiff that only supported built-in language arithmetic and unconstrained variables. The original idea for Stan is that we’d write C++ code for models and then autodiff them.

Rooted in BUGS

BUGS gives you a language that can be translated to log densities. In thinking about how BUGS models translated to an HMC context, I realized that we just needed to translate the model to a log density. Then we could use autodiff to get the gradients.

If you look at how BUGS models translate into log densities, you can do it bit-by-bit by translating each sampling statements (like y ~ dnorm(0, 1e-4) as a contribution to log density (here ). I hadn’t really thought at all beyond graphical models at this point. I just wanted a way to think about BUGS models because that’s how I learned how to think about Bayesian stats.

Hey, we could turn that into a language

I’m a computer scientist by training, so my immediate thought is that we could compile BUGS to C++. We wouldn’t even have to define our own domain-specific language.

Strong static typing

What I never liked about BUGS models is that it was a guessing game to figure out what the types of variables were (oh, this one shows up on the left of a Poisson, so it must be an integer; oh, there’s an index, so it must be an array). I’ve always believed in strong static typing for languages (enough so that I wrote two books on the subject back when I worked on programming language theory, compilers, and grammar). By that, I mean just that every variable gets a type defined when the program is written and those types don’t change.

I thought we’d get a ton of pushback on this choice from BUGS users. Surprisingly to me, the reaction’s largely been positive from users who find it helps make the models more readable.

The main complaint has instead been that there are three 1D array structures (real[], vector, and row_vector — the reason for that is so that the parser can infer the result type of linear algebra operations like multiplying a vectory by a row vector or vice-versa).

Not like typing in other probabilistic programming languages

I only wanted the user to have to think about shape. Variables should behave the same way everywhere, just be declared in different places. As Michael pointed out, that’s very different in some of the other languages where the type of an object depends on its usage (data or parameter) and in BUGS where the language never specifies what the variable is going to be used for—that’s only figured out dynamically when the data is read in.

So what I needed to do was somehow define that log density function. Throughout MCMC, the data remains constant and the parameters vary. One confusing point is that while the parameters vary, their values are determined by initialization and then the Markov transitions in the MCMC algorithm. From a Stan language point of view, they’re fed in externally by the sampler (or optimizer). The services the model needed to provide were those of an object storing data as member variables and providing a method to compute the log density.

From that object-oriented perspective (which is exactly how it’s coded in C++, by the way), the data block variable declarations give you the member variable declarations, which are the same as the arguments to the construct. The entire model class is immutable—once you construct it with data, nothing in it changes and it’s stateless.

The variable declarations in the parameters block give you the arguments to the log density function and the model block gives you the body of the log density function. The return value, the log density that gets accumulated, is implicit.

That’s all I really did in the first version. It had very limited expressiveness and all variable transforms were done explicitly (like log transforming a scale parameter or logit-transforming a probability parameter).

Transforms to unconstrained space

So I started doing transforms manually and had to learn about Jacobians. When I first worked through BDA for real, I came into Andrew’s office and declared the table of distributions in the back were messed up. I plugged the inverse of the outcome into a gamma function and didn’t get the inverse gamma. Ditto log normal. What happened? Uh, Jacobians, Bob. Luckily, I have kind teammates. Nobody laughed at me.

At that point, it became pretty clear how to do the transforms to the unconstrained scale to simplify our sampling and optimization algorithms. I was very proud of myself for working out the simplex transform in such a way that zero transforms to a unfirom simplex. The rest of the coding was just banging my head against C++. I still spend some time doing this, but luckily C++ usually cracks before my head these days. That was definitely not the case in the old days; Daniel and I were so frustrated trying to get early versions of Stan to build that I felt alternatively like crying and pitching my computer out the window and going back to Java. I still think C++ is designed by sadists with no value for human sanity and only detail-freak masochists can tolerate the pain levels it produces.

The other blocks

Once we had data and parameters, it was easy to define derived quantities that depended only on data (transformed data) or on parameters (transformed parameters). My thinking about those came straight out of Andrew and Jennifer’s variable classification in their regression book. In fact, you can still see some of their terminology floating around the AST and grammar code under the hood.

Generated quantities then seemed the natural way to do posterior predictive checks in a way that could be computed once per iteration rather than once per leapfrog step and with double types rather than autodiff types.

Like a Marvel movie, turns out there was more. This sort of fell out as we realized we weren’t using the graphical structure anywhere. We could write models with all sorts of different structure without having to fit it into a graphical modeling form. We can write whatever we want in the model block as long as it’s equal to the log posterior up to an additive constant that doesn’t depend on parameters.

This lets us do things like unfold conjugate structure using sufficient statistics allowing us to code the posterior more directly. We can also directly marginalize out discrete parameters and code complex likelihoods like HMMs using algorithms like the forward algorithm. Or implement Markov random field structure like in conditional random fields.

Users have coded iterative algorithms for densities and even written stepwise definite integrators directly in Stan code.

This all came pretty far along the development of version 1 (which we released August 2012). I started thinking it was just like BUGS, only it translated the sampling statements to log density increments rather than edges in a graphical model and it declared its parameters’ shapes and organized them into blocks.

I tend to think of Stan like other infrastructure projects—most people who use them don’t become developers of the infrastructure project. This is totally OK with us—it’s why we went with an open BSD license. Right now our Python interface is GPL-ed, so maybe that’ll relax soon, to make it easier to use the higher-level interfaces.

I don’t know much about python, but at least in R, it makes the decision that everything is a vector, basically.
Is it an int? Sure, it’s a number.
Is it a float? Sure, it’s a number.
Is it a double? Sure, it’s a number.

Moreover, they’re all numeric VECTORS, of length 1.

I’m not sure that’s a fantastic choice, but it makes statistical computation really easy, which I assume is the point.

So there’s at least that consistency to it.
Everything is a vector. Numbers are numerics. Characters are strings. Factors are factors. Logicals are logicals. All of them are in vectors ranging in length from 1 to whatever.

The insanity of that really only happens when R functions want to recycle arguments, but even that is at least useful at times, though can cause a lot of confusion when people don’t realize that.

It makes simple things easy if the interpreter guesses what you want correctly. But the big danger is that it can easily mask bugs. And you don’t get early warnings when things go wrong. And it’s very hard to write defensive code that checks user inputs.

The R type systems’ a bit more complicated than you make out. Integer numeric vectors and floating-point vectors are not quite the same, nor are matrices and vectors. And you get all kinds of cross-assignabilty among data frames, matrices, etc.

I think of them more as scripting languages and plotting languages than real programming languages.

I’d like them better as scripting languages if they had stronger typing. To me, programming without types is like trying to do science without keeping track of units. Sure, it saves some typing, but it’s actually harder, especially when things go wrong.

R’s much worse than Python (or better, I suppose, depending on how much you want your language to guess what you want to do) for exactly the reasons Stephen Martin outlines below as positives! I also wouldn’t shed a tear if R got rid of dynamic lexical scoping, either (not going to happen as it’d break the whole language to change course now). And Python could finally solve its 2.x and 3.x problems and somehow roll NumPy into the language with better matrix syntax. And maybe they could use a just-in-time compiler instead of their super super slow interpreters (last time I checked, R’s was about 10 times slower than Pythons, which as several orders of magnitude slower than C++ or Java or Fortran). Julia does some of these things, but it doesn’t enforce strong static typing.

When I moved from C to Python I hated it for exactly the same reasons you describe: It was impossible to decipher what any given variable was. But over time I decided I was the weird one and shut up and bit the bullet.

I think Python works great when the code you are dealing with is extremely well documented & functions etc. take arguments in an intuitive way and react predictably. But otherwise understanding code can be a nightmare.

To be clear, I’m not sure I would call them positives. Just that the oddities can be really useful at times as long as you /know/ about the oddities.

But yeah, I’ve said in the past that as a programming language, R is super weird and would have CS people screaming, but for statistical computations, it’s really easy. There aren’t many languages where you can do something like data[,c(‘var1Z’,’var2Z’,’var3Z’)] <- scale(data[,c('var1','var2','var3')]) in one line, without loops. But to programmers, that's terrible practice; automatic appending to a list? No declaration of data types or length? No error if dimensions mismatch [sometimes]?

And yeah, I know R does technically have more primitive data types, but the ones exposed to users typically are just masked in some way. It's all to make doing quick and dirty stats easy, not to have a normal, safe programming environment, for better or for worse. class(1) == class(1.0) == class(1/3), all of which are vectors of length 1, and not arrays.

Like, it's a bit crazy how R has, what, 4 different object systems?

But still, the goal was for statistical computations to be easy, and to be done interactively. It's mainly the package authors that have to suffer through the language oddities.

“programming without types is like trying to do science without keeping track of units”: this is a nice way of putting it. It amazes me how many colleagues prefer to save the effort of a few key strokes for the price of understanding the causes of many bugs or some pointless inefficiencies.

Also, true story, at one point the cache in my Firefox got somehow corrupted, and Bob’s avatar image was the image I would see on Slashdot in place of one of the “topic” icons. I don’t remember which one, maybe the world, or the stuffed shirt, whatever. It was always Bob there on slashdot… watching me.

You’re going to give me nightmares. I need a better image. But please, don’t make this project a cult of personality. It’s a big team effort. I just type faster and have fewer filters than the rest of the team (other than Andrew).

Is the AST documented someplace or exported and visualizable? I’d be curious about whether some of the optimizations (or model checking) described in the manual could be implemented against the AST before it all gets crosscompiled into C++. I haven’t had a chance to dive into the code to see how the grammar is implemented and was curious.

It’s documented in the AST code itself and through the unit tests. It pretty much just follows the BNF, which is in the manual.

As to the optimizations, that’s exactly what Sean said when he looked at all of our compound functions. For example, I just implemented a logistic regression function that’s about 3.5 times as fast as bernoulli_lpmf(y | x * beta) and will do the same for the other common GLMs. The question is, can we just notice this or y ~ bernoulli(x * beta) in the code and replace it with y ~ bernoulli_logit_regression(x, beta)?

In some cases, yes. In other cases, you’ll see loops in user code that would be much harder to disentangle into vectorized versions. For example, it’s common to see models translated from BUGS that get rendered in Stan as:

Oooh, are there plans for possible speedups to other linear models too? Not sure what else could be sped up about y ~ normal(X*B, sigma), but I think these types of optimizations could really help in cases like CFA, GRM, 2PL models and the like where there are several regressions.

Indeed. I’ve already worked out the gradients for basic linear regression and poisson regression with a log link. We just need to find the time to implement and test them all.

The question is really where to stop. And whether we can detect the patterns and just do something more efficient under the hood rather than continuing to introduce more and more specialized functions we expose to users.