Workshop on Probabilistic Programming in December

The machine learning community has started to realize the importance of expressive languages for specifying probabilistic models and for executing those models (i.e., performing "inference" by computing conditional distributions). A number of "probabilistic programming languages" have recently been proposed by machine learning/AI researchers: Pfeffer's IBAL; CHURCH, by Goodman, Mansinghka, Tenenbaum, and myself; and Winn's CSOFT.

Outside of the ML community, I know of a few languages, including Park, Pfenning, and Thrun's PTP, and Erwig and Kollmansberger's PFP/Haskell. In general, the work by ML researchers places more emphasis on the conditional execution of probabilistic programs (e.g., asking about the value of some variable given that the entire program takes on a particular value).

Of possible interest to the PL community are a number of interesting relationships between ideas in probability theory and programming languages (when used to specify so-called generative models). One of these is that purity/referential-transparency relaxes in the probabilistic setting to exchangeability (a property of a probability distribution being invariant to reordering). Other interesting theoretical connections relate to relaxed notions of halting (e.g., halting with probability one) and their effects on statistical inference (this is particularly relevant for so-called nonparametric distributions; e.g., see a recent workshop abstract of mine).

Along with my colleagues Vikash Mansinghka (MIT), John Winn (MSR Cambridge), David McAllester (TTI-Chicago), and Joshua Tenenbaum (MIT), I am organizing a workshop at the NIPS*2008 conference. While I've been reading LtU for some time, I've joined now to announce this workshop to the PL community, because we definitely need the PL community's help in solving some open problems.

Probabilistic inference algorithms are very complicated beasts, and writing universal inference algorithms that take "probabilistic programs" as specifications seems, at first blush, to call for program analysis, compilation, partial evaluation, and a host of other ideas from programming languages.

I wonder if you might be interested in this: a library for modeling and simulating (and proving things about) probabilistic (actually randomized; is that significantly different?) programs in the Coq proof assistant. There are examples of programs and estimation proofs at the end.

Ah yes, very nice. I had seen some past work by her, but not this. I'll have to pick through this. The question will be whether the program is capable of reasoning past loops that halt with probability 1 but not always. Thank you!

Yes, randomized and probabilistic line up, but "randomized programming languages" sounds wrong. Also, in machine learning and probabilistic AI, the probability models (described by these programs) are interpreted from a Bayesian perspective as representing degrees of belief.

Probabilistic programming languages should be very valuable in the future. They seem useful for real sensors and actuators. They enable robust flexibility and adaptability. Probability can easily be used to resist propagation of error or inconsistency, filtering out unintended meanings based on context. As we move to less precise HCI (e.g. gestures, voice, visual context, ubiquitous computing... instead of keyboard and mouse), I think probability will be very valuable in expressing human meaning to computers.

The normal issues with probabilistic programming seem to be:

- the state-space very easily explodes because we're tracking multiple 'probable' states, which makes it difficult to scale
- modular composition, especially in open systems (where parts of the model are hidden from other parts), can result in anomalies (i.e. where values are counted twice or possible cancellations are lost) due to losing track of where probability-values come from

I'm interested in ideas that might make probabilistic programming more scalable and modular. I'd especially like to see probabilistic models that can bridge services in open systems. But I'm a bit short of ideas on exactly how to achieve this.

One idea I do have is to expand and contract spatial probabilities along the temporal dimension. Even as we expand on probable futures, we narrow on probable histories. The result is a spatial-temporal probability zipper with controlled costs.

In a concrete sense, algorithms and services would consume 'probabilistic' values and data, which might represent alternative possibilities or interpretations. But, rather than just expanding a set of values into an even larger set of outputs, the algorithms can feed back information to say: "no, this interpretation is probably NOT what you meant" or "this particular possibility deserves a bit more weight", and may actually reduce the number of possibilities they consider (and thus control the number they output). This feedback rushes backwards through the model, adjusting weights and affecting other consumers of the same data (e.g. if we feed the same data to multiple different algorithms). Though, there is still a challenge of ensuring the feedback quickly stabilizes so we can make progress.

If I'm able to bound the number of possibilities considered at any given instant, then it may be that we only consider the ten or hundred most likely possibilities - which has a predictable space and CPU cost of roughly ten or a hundred times the deterministic case (high, but much nicer than exponential overheads). As possibilities are discarded, a few more might be considered. The benefits include more robust, adaptive, and intelligent models - i.e. models that can implicitly and compositionally take a lot of context into account.
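A rough sketch of what bounded tracking might look like in Python; the `extend`/`step` names and the random-walk example are illustrative inventions, not part of any existing system:

```python
import heapq

def extend(hypotheses, step, k=100):
    """Expand each weighted hypothesis via `step` (which yields
    (new_state, likelihood) pairs), merge duplicate states, then
    keep only the k most probable results so space stays O(k)."""
    merged = {}
    for state, weight in hypotheses:
        for new_state, likelihood in step(state):
            merged[new_state] = merged.get(new_state, 0.0) + weight * likelihood
    # Prune to the k heaviest hypotheses and renormalize.
    kept = heapq.nlargest(k, merged.items(), key=lambda sw: sw[1])
    total = sum(w for _, w in kept)
    return [(s, w / total) for s, w in kept]

# Toy usage: track a biased random walk while bounding belief to 3 states.
step = lambda x: [(x + 1, 0.7), (x - 1, 0.3)]
beliefs = [(0, 1.0)]
for _ in range(5):
    beliefs = extend(beliefs, step, k=3)
```

The renormalization after pruning is a lossy approximation, which is exactly the trade being proposed: bounded cost in exchange for discarding low-probability possibilities.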

The probability-zipper design seems like it might be a good fit for reactive or dataflow programming models, where the feedback can be modeled effectively. I'm working on reactive dataflow models at the moment, but have yet to build in any support for modular probabilistic programming. (I'm thinking I might later try with an EDSL or model constructed above my reactive model. After I've proven the non-probabilistic version, of course.)

These ideas are a little vague, and I'm pretty sure that when you try to make them precise you'll see that it won't work. You can't quickly get the 100 most likely possibilities for example, since even computing a single possibility (and not even requiring that it be the most likely -- just ANY possibility) can take exponential time in a probabilistic programming language. Having a way to do that efficiently would prove P=NP (since by conditioning on the output of a boolean formula you would be able to quickly find a satisfying assignment, hence solving SAT).

I'd say that by far the biggest hope for probabilistic programming is machine learning and statistical inference. Probabilistic programming languages are a great way to define a probabilistic model. For this to be generally useful, we probably need support for approaches other than MCMC, since MCMC is not practical for many models and data set sizes. ML/MAP estimation (or variational Bayes) is often more practical (or even the only practical option).

The exponential time in probabilistic programming arises from combinatorial explosions in the state space. I'm interested in supporting modular constraints that will prevent the explosion from ever happening.

Those 100 possibilities might not globally be the most likely; rather, they are estimated most likely based only on local knowledge in a modular system, perhaps represented as a probabilistic (or weighted) collection of values. There may be some feedback, indicating that some of those 100 possibilities are actually impossibilities (or just infeasible or very unlikely). From this feedback, someone upstream may choose to adjust their knowledge/state and, from their modified state, formulate new 'best possibilities'. At any given time, each algorithm should be propagating only a bounded number of possibilities and computing in a bounded space.

While I acknowledge my ideas are still 'vague', it is not clear to me that they won't work. I certainly don't believe they require a solution to P=NP.

I do believe that machine learning could play a very useful part in processing probabilistic information. I see the role for ML to be the nodes (or individual modules). A challenge is composing multiple ML systems, at which point the sort of bounded probabilistic network I'm describing becomes relevant. I see a valuable role for lots of smaller probabilistic models composed in a graph, and machine-learning in-the-small (e.g. for stability and pseudo-state).

All I'm saying is that if you want to preserve the semantics of probabilistic programming then giving a polynomial time algorithm for computing a single valid possibility is equivalent to proving P=NP. If you want to go for a different semantics that might of course not be the case, but since you have not said anything about how your new semantics would be we of course cannot say anything meaningful about whether it would be possible in that setting.

Perhaps I should elaborate a bit on why it's equivalent to proving P=NP. Suppose you have a box in the real world that computes a boolean formula f(b1,b2,...,bN) where b1..bN are booleans. For example f(a,b,c,d) = (a || b) && (!a || b || !c). Initially you are uncertain about the values of the booleans; let's say you assign each a 50% chance of being true and 50% of being false. Now your program gets an input saying that the output of the box was true, i.e. the value of f(b1,b2,...,bN) is equal to true. The question is "what are the 100 most likely values of b1,b2,...,bN?". Even finding a single *possible* b1,b2,...,bN is equivalent to solving SAT.
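To make the box example concrete, here is a brute-force Python sketch of "the probability that f is true" under independent fair coins; the enumeration is exponential in the number of inputs, which is exactly the point:

```python
from itertools import product

def prob_true(f, n):
    """P(f(b1..bn)=true) under independent fair coins, computed by
    exhaustive enumeration; exponential in n, which is the point."""
    sat = sum(1 for bs in product([False, True], repeat=n) if f(*bs))
    return sat / 2 ** n

# The formula from the comment above: f(a,b,c,d) = (a || b) && (!a || b || !c)
f = lambda a, b, c, d: (a or b) and ((not a) or b or (not c))
p = prob_true(f, 4)
assert p > 0  # a nonzero probability of true is exactly satisfiability
```

Any procedure that reports p faster than this enumeration, for arbitrary f, decides SAT; that is the reduction being claimed.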

So either your semantics cannot model this example or it does not avoid exponential complexity. And if it cannot model this example that's quite a problem since modeling a deterministic boolean function is about the simplest thing you can imagine. For more complicated systems it only gets worse.

I don't think exponential complexity is a very valid criticism of probabilistic programming, however, since (1) most practical things do not involve combinatorial explosion, and (2) when something does, you're out of luck in any case, and if the input sizes are small it's a good thing that you can express it. Just like the fact that you can write an exponential time algorithm in C is not a practically important criticism against C: for most things you do not need an exponential time algorithm, and if you do, then the ability to express it is a good thing.

BTW I meant that a probabilistic programming language would be useful for implementing the ML algorithms themselves. Here is a nice post showing some machine learning models expressed with probabilistic programming: http://zinkov.com/posts/2012-06-27-why-prob-programming-matters/ (though the programming style of expressing everything backwards is a bit curious so read the examples starting at the last line and reading backwards).

You seem to be assuming a language where you can ask what inputs will produce a given output. As I understand it, that sort of reversibility is common to logic programming, yet completely orthogonal to probabilistic programming. That is, "probabilistic logic" is an intersection of distinct concepts. Probabilistic PLs tend to fall in this intersection, i.e. the high level specification of models mentioned in the OP. But probabilistic programming can be used with other paradigms.

Logic programming is exponential for reasons orthogonal to probabilistic computation. You expressed a SAT question, not simply "modeling a deterministic boolean function". We don't need probability to explain how the best known general solutions to SAT questions are NP.

It is perfectly reasonable to model f without any ability to reverse it. If the inputs have probability 50% of being true, the result will have some % probability of truth. The real problem will be losing info about relations between probabilities on input.

Exponential cost is a valid issue for any algorithm with that property. But it becomes a problem for whole programming models when it happens implicitly or is not well localized, i.e. when it is outside effective programmer control. Logic programming and probabilistic programming both have that character, at least for many languages. C language does not: you can express exponential algorithms in C, but it will be explicit and syntactically local.

Probabilistic programming is just weighted logic programming. Conditioning on observations is at the very essence of probabilistic programming.

If the inputs have probability 50% of being true, the result will have some % probability of truth.

SAT is also reducible to computing the % of truth of the result: if the chance of truth of the result is greater than 0%, then the formula is satisfiable. You can even get the b1,...bN out of it by asking a series of questions: the % truth of f(...) && b1, the % truth of f(...) && !b1, f(...) && !b1 && b2, etc. More generally, if you can ask for the probability distribution of the result, you can also reverse functions.

Your argument seems to assume:

- that probabilities are precisely computed (as opposed to being themselves subject to probabilistic models)
- that computed probabilities are directly observable and comparable within the model, i.e. so you can compare to 0 (as opposed to being propagated implicitly and inaccessible to programmatic expression)

You can even get the b1,...bN out of it by asking a series of questions [...]

This technique you describe is completely orthogonal to whether the language of expression is probabilistic. Also, searching combinations of inputs to a function does not qualify as reversing the function, and does not generalize beyond a countable domain.

Weighted logic programming is just one example (albeit, a popular one) of probabilistic programming. Probability based methods have been applied to almost every paradigm.

The first two points are back into not well defined territory, so there's nothing we can conclusively say about that. I'm not sure if a model where those assumptions do not hold is still useful, since if you can't observe probabilities and you can't condition on observations either then what can you do?

This technique you describe is completely orthogonal to whether the language of expression is probabilistic. Also, searching combinations of inputs to a function does not qualify as reversing the function, and does not generalize beyond a countable domain.

It is not an exponential search on combinations, it's a linear "search". For example say we have f(a,b,c). If the probability of f = true is greater than 0 then f is satisfiable. Now we compute the probability of f(a,b,c) && a. If this is greater than 0 then we know there is a solution with a=true. If this is equal to 0 then we know there is a solution with a=false. So suppose there is a solution with a=false. Then the next question we ask is f(a,b,c) && !a && b, and now we know whether there is a solution with b=true or b=false. So in a linear number of tries we can discover a solution (a,b,c). If each try is polynomial time then reversal is also polynomial time. So being able to ask for the probability of a result quickly is equivalent to being able to reverse arbitrary functions quickly.

Modeling with probabilities of probabilities is still well defined and explored. Indeed, it's what all statistics and probabilistic world modeling is about, since there is no known means to measure exact probabilities in open systems, and it is often too expensive to compute exact probabilities even relative to a large (yet incomplete) data set. Also, metaprogramming with probabilistic models will have a similar character even in closed systems where you assume probabilities can be exactly computed.

And the inability to explicitly apply conditional expressions to probabilities within a model doesn't mean observable and internal application behaviors do not depend conditionally on probabilities.

Now we compute the probability of f(a,b,c) && a.

There was a fallacy with your initial formulation. By some misguided reasoning, I earlier chose to ignore it and assume you misworded something.

Recognize that if `a` has only a 50% probability of being true, then so does `!a`. Setting a truth value to 50% probability doesn't actually reveal any information, and indeed can effectively remove a boolean from a simple conjunctive formula in simple propositional logic (since we can expand an `a` into `50%*a=True | 50%*a!=True` so long as we distribute the decision, and a!=True implies a=False). Now, if you tweaked the arguments a little, so we're using say `80%*a=True | 20%*a=False` then we can extract some useful information using the technique you describe. But there will still be some fraction of the 'true' probability that accounts for the formula being satisfiable for cases where a is negated.

The naive approach of computing whether the probability is zero would not help as much as you imagine, though you could compute variations in the probability of `f(a,b,c)` as you manipulate the probabilities of `a`. The only sure, simple way to know that `a` is needed is to set the probability of `a` being true to 0% (and hence the probability of `!a` to 100%, in a propositional logic).

But, hey, this is a probabilistic model. We generally don't demand absolute surety; high probabilistic confidence is the goal. If you just want confidence, then you'll have a few more techniques available... but computing inputs based on output will still involve search in the space of probabilities.

being able to ask for the probability of a result quickly is equivalent to being able to reverse arbitrary functions quickly

I was unable to extract concrete meaning from your first paragraph. Could you give an example of a (hypothetical) probabilistic programming language that allows "modeling with probabilities" yet does not suffer from exponential complexity?

And the inability to explicitly apply conditional expressions to probabilities within a model doesn't mean observable and internal application behaviors do not depend conditionally on probabilities.

ConditionING on an observation does not have anything to do with conditionALS. If you have some probability distribution P that quantifies your belief about the world, then conditioning on an observation O just filters out the points in the space that are inconsistent with O, thereby reducing your belief space to a new P'. In mathematical terms: P'(X) = P(X | O). This is a very useful thing to have in a probabilistic programming language, and as far as I know all probabilistic programming languages have this facility or an equivalent one.
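For a finite discrete distribution, this filtering-and-renormalizing reading of conditioning is easy to sketch directly (the die example here is just for illustration, not any particular language's API):

```python
def condition(dist, consistent):
    """P'(x) = P(x | O): keep only the points consistent with the
    observation O, then renormalize what remains."""
    kept = {x: p for x, p in dist.items() if consistent(x)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("observation has probability zero")
    return {x: p / total for x, p in kept.items()}

# Prior belief about a die roll; then observe "the roll was even".
prior = {x: 1 / 6 for x in range(1, 7)}
posterior = condition(prior, lambda x: x % 2 == 0)
# posterior assigns 1/3 each to 2, 4, and 6
```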

There was a fallacy with your initial formulation. By some misguided reasoning, I earlier chose to ignore it and assume you misworded something.

Okay, let's see where our understandings diverge.

Recognize that if `a` has only a 50% probability of being true, then so does `!a`.

So far we agree :)

Setting a truth value to 50% probability doesn't actually reveal any information, and indeed can effectively remove a boolean from a formula in simple propositional logic (since we can expand every `a` into `50%*a=True | 50%*a!=True`, and a!=True implies a=False).

This is not the case. The variable `a` can appear multiple times in the formula. Replacing each instance with a fresh 50% probability of being true and 50% probability of being false is not the same. For example consider the formula `a xor a`. This has 100% probability of being false regardless of the probability distribution over `a`. If we replaced each a with separate 50% false 50% true then we would conclude that there is a 50% probability of `a xor a` being true. But perhaps I have misunderstood what you mean by removing a variable. If so, please correct my understanding of what you meant.
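A tiny Python enumeration makes the `a xor a` point concrete, comparing the shared-variable reading against the (incorrect) independent-replacement reading:

```python
# Correct reading: `a` is one shared random variable, so `a xor a`
# is false no matter what distribution `a` has.
p_shared = sum(0.5 for a in [False, True] if a != a)

# Wrong reading: each occurrence of `a` replaced by a fresh,
# independent 50/50 coin; now "a xor a" comes out true half the time.
p_indep = sum(0.25 for a1 in [False, True] for a2 in [False, True] if a1 != a2)
```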

Now, if you tweaked the arguments a little, so we're using say `80%*a=True | 20%*a=False` then we can extract some useful information using the technique you describe.

Actually for that algorithm it does not matter at all which probabilities you assign to a being true and false, as long as it's not zero.

The naive approach of computing whether probability is zero would not help as much as you imagine, though you could compute variations in probability of `f(a,b,c)` as you manipulate probabilities of `a`.

That is another equivalent way to think about it. Set the probability of a=false to zero, then check if the probability of f(a,b,c)=true also becomes zero. If so, we know that there is no solution with a=true, and instead we will go for a=false. If the probability of f(a,b,c)=true is not zero, then there must be a solution with a=true, and we can set a=true. Continue this process with b and then with c. So you see that you *can* reverse arbitrary (computable) functions this way.

Or, described yet another way: suppose we want to find a solution to f(a,b,c)=true. First check whether the probability of f(true,b,c)=true is nonzero. If so, then we have a value a=true, and we continue the same process for b,c. Otherwise, check whether the probability of f(false,b,c)=true is nonzero. If so, then we have a value a=false, and we continue the same process for b,c. If both probabilities are zero, then there is no solution to f(a,b,c)=true.

The only sure, simple way to know that `a` is needed is to set the probability of `a` being true to 0% (and hence the probability of `!a` to 100%, in a propositional logic).

We don't need to know that `a` is needed. We just need to know whether there is a solution with `a=true` or `a=false` (or both, or neither). You can deduce this even without setting the probability of `a=true` to zero (or one). For example you could also set the probability of `a=true` to 40% and then to 60%, and see which results in a higher probability of f(a,b,c)=true. If the probability of f(a,b,c)=true increases when we increase the probability of a=true, then we know that there is a solution with a=true (in fact we'd know that there are MORE solutions with a=true than with a=false).

I was unable to extract concrete meaning from your first paragraph. Could you give an example?

Sure. Bayesian spam filters are probabilistic yet:

- cannot be trained on the full set of possible messages
- the training algorithms cannot even keep the full set of known messages in memory

So instead we use incremental algorithms that probabilistically converge on valid estimates of probabilities.

As another example, bloom filters enable an imprecise, yet probabilistic and formally justifiable, estimate of collision in a value space.
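The bloom-filter point can be made concrete with a minimal sketch (the size m, hash count k, and hashing scheme below are arbitrary choices): membership tests answer either "definitely not present" or "probably present", an imprecise but formally quantifiable estimate.

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: `in` answers 'definitely not' (False)
    or 'probably yes' (True)."""
    def __init__(self, m=1024, k=3):
        self.bits = bytearray(m)  # m one-bit slots
        self.m, self.k = m, k

    def _indexes(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for j in self._indexes(item):
            self.bits[j] = 1

    def __contains__(self, item):
        return all(self.bits[j] for j in self._indexes(item))
```

The false-positive rate is a function of m, k, and the number of items added, so the imprecision can be budgeted up front.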

In many cases, probabilistic computations will leverage probabilistic observations on larger values. E.g. to determine with some probability that a large array is sorted, you might sample it and determine whether the sample is sorted. In this case, your probability could be exact if you cared to expend time and memory to make it exact. But in probabilistic systems it is often sufficient to determine with high confidence that the array is sorted.
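A sketch of the array-sampling idea, assuming high confidence (rather than certainty) is acceptable; the sample count is an arbitrary parameter:

```python
import random

def probably_sorted(xs, samples=100):
    """Estimate whether xs is sorted by drawing a random sample of
    positions and checking that the sampled subsequence is sorted.
    A False answer is certain; True only means no violation was seen."""
    if len(xs) < 2:
        return True
    idx = sorted(random.randrange(len(xs)) for _ in range(samples))
    picked = [xs[i] for i in idx]
    return all(a <= b for a, b in zip(picked, picked[1:]))
```

Spending more samples buys higher confidence; spending len(xs) - 1 adjacent comparisons buys certainty. That sliding scale is the trade the paragraph describes.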

One way to control exponential complexity is to constrain expressiveness such that we can't ask problematic questions, such as those directly observing a probability. Values themselves can then become probabilistic models that are only partially (and probabilistically) observed. If the observation of one value can be directed based on observation of secondary values (e.g. from local knowledge or environment), and assuming side-effects are directed by partial observations on probabilistic models, we can still achieve useful behaviors in real systems.

The variable `a` can appear multiple times in the formula. Replacing each instance with a fresh 50% probability of being true and 50% probability of being false is not the same.

Indeed, and that is what I meant (and since clarified in an edit). This does not change the issue with your formulation.

ConditionING on an observation does not have anything to do with conditionALS.

That's what I thought you meant, initially. Then you complained about not being able to compare to zero% as implying there was no conditioning on probabilities, and it became clear to me that you aren't being very precise with your terminology.

In mathematical terms: P'(X) = P(X | O). This is a very useful thing to have in a probabilistic programming language, and as far as I know all probabilistic programming languages have this facility or an equivalent one.

Probabilistic programming models do need to perform conditioning (i.e. be based on it) but do not need to expose conditional probabilities directly to developers (which is an issue of expressiveness).

If we want scalable, securable, modular probabilistic systems, then we'll ultimately need to control propagation of information that traces relationships between probabilities. This is one of the challenges I earlier mentioned needs addressing.

Actually for that algorithm it does not matter at all which probabilities you assign to a being true and false, as long as it's not zero.

You're depending on cancellation of elements:

f(a=True,b,c)&&a=True || f(a=False,b,c)&&a=False

If we get some % probability of being true (and assuming we can directly examine the probability, which is not a feature I would recommend), we'll know that there are solutions for f where `a` is true. We won't know whether there are solutions for f where `a` is false.

I don't see how this helps or is 'quicker'. To locate a particular solution, you'll still need to select one of up to 8 possibilities, which means 8 search paths (though you're searching them hierarchically). Further, each individual expression might be more computationally expensive than a single iteration of a proper SAT solver.

you see that you *can* reverse arbitrary (computable) functions this way

No, I don't see this. I don't consider searching a function's domain for certain values in the range to be conceptually or practically the same as reversing a function. And it isn't clear to me that arbitrary computable functions can be addressed.

One way to control exponential complexity is to constrain expressiveness such that we can't ask problematic questions, such as those directly observing a probability. Values themselves can then become probabilistic models that are only partially (and probabilistically) observed. If the observation of one value can be directed based on observation of secondary values (e.g. from local knowledge or environment), and assuming side-effects are directed by partial observations on probabilistic models, we can still achieve useful behaviors in real systems.

I don't think further speculation from me in this direction is going to be useful, but if you have a concrete system that does not have exponential complexity yet still allows sufficient generality to deserve the term "probabilistic programming" I'd be glad to hear about it.

Then you complained about not being able to compare to zero% as implying there was no conditioning on probabilities, and it became clear to me that you aren't being very precise with your terminology.

Can you point to where I am not being precise? I don't see anywhere I made that implication. The phrase "conditioning on probabilities" does not type check (you generally don't condition on probabilities, you condition on observations -- unless of course you happen to observe a probability but that's a strange thing to directly observe), so I'm not sure what you mean here... What I did say is that being able to observe probabilities leads to exponential complexity, and being able to condition leads to exponential complexity, if you can at least model boolean functions.

If we get some % probability of being true (and assuming we can directly examine the probability, which is not a feature I would recommend), we'll know that there are solutions for f where `a` is true. We won't know whether there are solutions for f where `a` is false.

If we get some % probability of *what* being true? If you mean the formula you wrote, then that does not give any indication that there are solutions for which `a` is true, since it could well be that all of that % probability is coming from the second alternative, with a=false.

I don't see how this helps or is 'quicker'. To locate a particular solution, you'll still need to try up to 8 possibilities, and each individual expression will be more computationally expensive than a single iteration of a proper SAT solver.

No, you just need to try 3 possibilities. One for each variable. With each query you obtain the value of one variable in the solution. Perhaps it helps to write down the algorithm in actual pseudocode:

let f(b1,...,bN) be a function that takes N booleans and produces a boolean
let b1,...,bN be booleans, each with some nonzero probability of being true and some nonzero probability of being false (it doesn't matter how big that probability is)
let p be the probability of f(b1,...,bN)=true
if p=0 then fail with an error message saying that there is no solution
sol[n] will be the solution for variable n
for n = 1 to N:
    let p be the probability of f(sol[1],...,sol[n-1],true,b_{n+1},...,bN)=true
    if p>0: sol[n] = true
    else:   sol[n] = false
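Taking "the probability of ..." to mean a brute-force (and deliberately exponential) enumeration, the pseudocode above can be made runnable; `prob_true` and `reverse` are illustrative names, and the formula at the bottom is just a test case:

```python
from itertools import product

def prob_true(f, fixed, n_free):
    """P(f(fixed..., rest...)=true) with the remaining n_free inputs
    treated as independent fair coins; exhaustive enumeration."""
    sat = sum(1 for rest in product([False, True], repeat=n_free)
              if f(*fixed, *rest))
    return sat / 2 ** n_free

def reverse(f, n):
    """Recover a satisfying assignment with n probability queries,
    one per variable, following the pseudocode above."""
    if prob_true(f, (), n) == 0:
        raise ValueError("no solution")
    sol = []
    for i in range(n):
        # Is there a solution extending sol with the next input true?
        if prob_true(f, (*sol, True), n - i - 1) > 0:
            sol.append(True)
        else:
            sol.append(False)
    return sol

f = lambda a, b, c: (a or b) and ((not a) or (not c))
sol = reverse(f, 3)  # a satisfying assignment for f
```

Each of the n queries is itself exponential here; the argument in the thread is that if those queries were cheap, `reverse` would solve SAT in polynomial time.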

The phrase "conditioning on probabilities" does not type check (you generally don't condition on probabilities, you condition on observations -- unless of course you happen to observe a probability but that's a strange thing to directly observe)

Indeed it is strange to directly observe probabilities. How strange it is that your argument about reversing functions depends on it. I've certainly opposed that possibility for this whole discussion.

being able to observe probabilities leads to exponential complexity, and being able to condition leads to exponential complexity

The ability to observe probabilities certainly leads to exponential complexity only if those observations must be precise. The same goes for conditioning. If we allow a controlled level of imprecision - i.e. a loss of information and structure - then we can design systems that control computational resources as well.

The tricky part is ensuring that our expressions don't lose the important information, i.e. information and coupling that we plan to observe downstream. Addressing this is a problem of language design, and significantly one of controlling expressiveness to support modularity and composition.

No, you just need to try 3 possibilities. One for each variable.

True. Your search is hierarchical. And since you're expressing the search outside of the probabilistic language, you can presumably observe this hierarchy as a non-probabilistic computation over time, where you observe only three possibilities. (As opposed to a single probabilistic spatial-temporal structure with eight weighted leaves.) Where I got lost is when you called it quicker.

How strange it is that your argument about reversing functions depends on it.

The original proposal was to use conditioning on the value of f to do the same, but I later proposed this version because I (erroneously) believed that you wouldn't allow conditioning.

In any case probability is about quantifying one's belief about the world (in the case of bayesians). Observing your own internal probability distribution directly by making an external observation is certainly something strange. But what we're doing here in this algorithm is *calculating* what your own belief/probability distribution should be. That's a very different and reasonable and broadly accepted thing (no type errors here!), and doing this is also at the basis of the majority of probability and statistics. That's not to say that you would want to always compute exact probabilities in probabilistic programming for practical reasons, but in principle there is nothing wrong with computing probabilities.

The ability to observe probabilities certainly leads to exponential complexity only if those observations must be precise. The same goes for conditioning. If we allow a controlled level of imprecision - i.e. a loss of information and structure - then we can design systems that control computational resources as well.

I'm glad we agree :)

The tricky part is ensuring that our expressions don't lose the important information, i.e. the information and coupling that we plan to observe downstream. Addressing this is a problem of language design, and significantly one of controlling expressiveness to support modularity and composition.

Yes. I think this can be done very well in an existing probabilistic programming setting. We already very often make simplifying assumptions for efficiency. The way to do this is to write a simplified model in a probabilistic programming language, e.g. by modeling two variables as independent when in reality they might not be.

Instead, I pursue approaches that allow rich models at 'nodes' but provide some means to control information propagated between nodes - while still enabling enough probabilistic structure between nodes that the system as a whole can achieve many valuable features of probabilistic models (adaptability, flexibility, robustness, a semblance of cognition, etc.). Nodes allow modeling of heterogeneous access to information, and modeling constraints on information flow (latency, detail). The individual nodes might be 'exponential' in the size of their observations, but would generally be limited to processing observations of a finite size (at any given time) such that the overall computation effort is bounded.

Probabilistic programming hinders modularity (and scalability, and security) precisely because of entanglement between dataflows. But I think this can be addressed if we're careful about it. One thought is to ensure that entanglement on independent dataflows is of a static nature (i.e. observable through the type system), which severely constrains entanglement but might be sufficient to address many use cases.

I think I can clarify what David's been trying to talk about here. It's about making a language where probabilistic algorithms are the norm. This is different from making a language that passes around precise (and thus fully queryable) representations of probability distributions, but it's not completely unrelated. Instead, these algorithms will tend to pass around imprecise approximations to the actual probability distributions, much like numeric algorithms deal with imprecise numbers, so that the system's space requirements don't explode.

As an example, suppose I want to do a mashup of all Buy More locations, Buy More inventories, and weather forecast data, to figure out how much inventory of delicate LED sweaters will spontaneously combust due to static discharge. However, the weather forecasts aren't guaranteed to be 100% accurate, and the Buy More employees aren't universally great at keeping inventory up-to-date. Even the Buy More location data has an annoying tendency to be under constant revision and obscurity, almost as though there were some secret organization plotting to keep it unreliable on purpose.

Fortunately, thanks to a popular programming language, all these data sources are available in a format that makes it clear when and where and how much the data might be incorrect. When I write my code, I can (mostly) pretend the data is precise...

...and I get the imprecise result I'm looking for. I can go on to set up an automatic alert to trigger precisely when this loss of inventory has probably gotten too large, but at that point I will need to be more explicit about managing the probability distributions. At the least, I might call a function isProbablyTrue( x ).

There's a certain point that's come up a few times in this discussion, and it's one of the challenges for designing this language:

I'm interested in ideas that might make probabilistic programming more scalable and modular. I'd especially like to see probabilistic models that can bridge services in open systems. But I'm a bit short of ideas on exactly how to achieve this.

The variable `a` can appear multiple times in the formula. Replacing each instance with a fresh 50% probability of being true and 50% probability of being false is not the same.
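A tiny sketch (hypothetical, just to illustrate the point) of the formula `a && !a`, comparing a shared `a` against fresh per-occurrence replacements:

```python
import random

# Sketch (hypothetical): estimate P(formula) by sampling, comparing a shared
# `a` with fresh per-occurrence replacements, for the formula `a && !a`.
random.seed(0)

def shared():
    a = random.random() < 0.5
    return a and not a           # both occurrences agree: always False

def fresh():
    a1 = random.random() < 0.5   # first occurrence replaced by a fresh coin
    a2 = random.random() < 0.5   # second occurrence replaced independently
    return a1 and not a2         # true iff a1=True and a2=False: P = 0.25

N = 100_000
p_shared = sum(shared() for _ in range(N)) / N   # exactly 0.0
p_fresh = sum(fresh() for _ in range(N)) / N     # close to 0.25
```

The two estimates differ because the fresh version has silently dropped the correlation between the two occurrences of `a`.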

In an open system where `a` is a service used by several client services, and we write another service that depends on those, are there good after-the-fact ways to detect this hidden interaction (e.g. hidden Markov models)? How about good before-the-fact ways to publish it in the service contracts (e.g. provenance)?

If there are existing programming language efforts along these lines, it would be interesting to know how they deal with this diamond dependency problem. Even if they don't deal with it, it would be nice to have prior art to go off of.

Personally, I'm not even familiar with probabilistic algorithms themselves. Even if there isn't a programming language specialized for it, does any probabilistic algorithm research already use an approximate-distribution-passing technique like this?

I don't completely understand the first half of your post. The formula you wrote does not compute any probability, since, for example, it could easily be greater than 1. But maybe that's the point: it's not a precise probability but a heuristic guess at one? In that case I don't see how you can make a principled "probabilistic programming" paradigm out of it.

In an open system where `a` is a service used by several client services, and we write another service that depends on those, are there good after-the-fact ways to detect this hidden interaction (e.g. hidden Markov models)? How about good before-the-fact ways to publish it in the service contracts (e.g. provenance)?

The only correct way to deal with it is the same way you deal with it within a single program: keep these "entangled" variables in your probability distribution representation. So the client services would return a probability distribution over both their result and over `a`. For example, if service X computes `if a=true then "foo" else "bar"` then it would return a structure like [0.2*{a=true,result="foo"}, 0.8*{a=false,result="bar"}] if the probability distribution returned by the `a` service is [0.2*{a=true}, 0.8*{a=false}]. Then later services that combine multiple results can match up the correct parts. For example, if you have another service Y that does `if a=true then "baz" else "quux"`, then a service that computes the string concatenation X+Y would get [0.2*{a=true,"foobaz"}, 0.8*{a=false,"barquux"}]. Note that by retaining the relationship with `a` we make sure that the combination "fooquux" does not happen.
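A minimal sketch of that bookkeeping (hypothetical helper names; a distribution is a list of weight/environment pairs, and each service extends the environment while keeping the entangled `a` around):

```python
# Sketch with hypothetical helpers: a distribution is a list of
# (weight, environment) pairs, and each service extends the environment
# while keeping the entangled variable `a` in the representation.
def dist_a():
    return [(0.2, {"a": True}), (0.8, {"a": False})]

def service_x(dist):
    return [(w, {**env, "x": "foo" if env["a"] else "bar"}) for w, env in dist]

def service_y(dist):
    return [(w, {**env, "y": "baz" if env["a"] else "quux"}) for w, env in dist]

def concat_xy(dist):
    return [(w, env["x"] + env["y"]) for w, env in dist]

result = concat_xy(service_y(service_x(dist_a())))
# [(0.2, 'foobaz'), (0.8, 'barquux')] - "fooquux" can never appear
```

Because each weighted environment carries the value of `a`, the combination step matches up the consistent parts automatically.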

Personally, I'm not even familiar with probabilistic algorithms themselves. Even if there isn't a programming language specialized for it, does any probabilistic algorithm research already use an approximate-distribution-passing technique like this?

There are several probabilistic programming languages, e.g. BUGS, STAN, HANSEI, Church. Probability itself is about *precisely* quantifying *uncertainty*, not about approximation. Some of these implementations do not allow you to directly read off probabilities; they only allow you to draw samples from a probabilistic program. For example, if you sampled X+Y you could get "barquux", "barquux", "foobaz", "barquux". From a large number of samples you can approximate the probability distribution, by counting how many times each option appears and dividing by the total number of samples. Other implementations, like HANSEI, can compute the exact probability distribution.
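For instance, a sampling-based estimate of the X+Y distribution from the example above (a sketch, not any particular system's API):

```python
import random
from collections import Counter

# Sketch: approximate the X+Y distribution from the example above by
# sampling the whole program many times and counting outcomes.
random.seed(1)

def sample_xy():
    a = random.random() < 0.2        # the shared `a` service
    x = "foo" if a else "bar"
    y = "baz" if a else "quux"
    return x + y

N = 10_000
counts = Counter(sample_xy() for _ in range(N))
est = {outcome: n / N for outcome, n in counts.items()}
# est["foobaz"] is near 0.2, est["barquux"] near 0.8; "fooquux" never occurs
```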

Probabilistic programming isn't about computing probabilities any more than imperative programming is about computing imperatives or logic programming is about computing logics. Probabilistic programming is about using probability, deeply and pervasively, e.g. in modeling sensory observations and controlling effects, and for the decisions in between.

Indeed, we can have probabilistic programming where we never once have a probability-variable. And these models can still be principled, i.e. based deeply and formally on probabilistic principles.

The only correct way to deal with it: keep these "entangled" variables in your probability distribution representation

That's only necessary if you conflate 'correct' with 'precise'.

Given that our sensory data is imprecise at the very start, and our control of actuators is imprecise at the very end, I think that precision is not a most significant aspect of correctness. It is valuable, of course.

AFAICT, correctness is really about ensuring a system's behavior, which means protecting our abstractions. And that can be achieved by controlling our abstractions, so that we don't accidentally express something we can't protect.

Probability itself is about *precisely* quantifying *uncertainty*, not about approximation.

We don't need to observe probabilities, and approximations can be acceptable so long as we don't observe any inconsistency (or even if we do observe inconsistency, so long as we can robustly control it).

I certainly agree with this, I even gave an example, namely approaches based on sampling.

That's only necessary if you conflate 'correct' with 'precise'.

Correct in the sense that it satisfies the axioms of probability. Sure, a whole system's behavior can still be correct for some measure of correct even if a component doesn't precisely follow those axioms, for example some systems do fine if you ignore correlations between variables and therefore ignore these entanglement issues.

Depends on what you mean by probabilistic systems. I said probability. Probability is most certainly about quantifying uncertainty. Yes, some *systems* do fine with approximations (obviously). What I explained is that approximation (1) is orthogonal to probability (2). You can have approximate (1) algorithms in the sense that they don't (always) compute the exact or correct answer, like heuristic methods for approximating the traveling salesman problem, probabilistic algorithms like the one for primality testing based on Fermat's little theorem, etc. Probability (2) is a different thing, namely modeling uncertainty, and you can do that exactly or approximately. Unless I'm severely mistaken, what we were talking about in the thread above is (2). I think Ross tried to explain that you might be talking about (1) while I was talking about (2). Now that I read the thread again I'm not sure anymore. It certainly did start about probabilistic programming in the sense of Church, STAN, HANSEI, etc. But then there is this paragraph:

As another example, Bloom filters enable an imprecise, yet probabilistic and formally justifiable, estimate of collision in a value space.
In many cases, probabilistic computations will leverage probabilistic observations on larger values. E.g. to determine with some probability that a large array is sorted, you might sample it and determine whether the sample is sorted. In this case, your probability could be exact if you cared to expend time and memory to make it exact. But in probabilistic systems it is often sufficient to determine with high confidence that the array is sorted.
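The quoted array-sortedness check could be sketched like this (a hypothetical `probably_sorted`, not from any of the systems named above):

```python
import random

# Sketch (hypothetical helper): decide with high confidence whether a large
# array is sorted by checking randomly chosen adjacent pairs. One violation
# disproves sortedness; many passes only raise confidence, never certainty.
def probably_sorted(xs, trials=50, seed=42):
    rng = random.Random(seed)
    for _ in range(trials):
        i = rng.randrange(len(xs) - 1)
        if xs[i] > xs[i + 1]:
            return False
    return True

# A truly sorted array always passes; a heavily unsorted one is caught
# almost immediately (miss probability roughly 0.5**trials here).
sorted_ok = probably_sorted(list(range(1000)))
unsorted_caught = not probably_sorted([5, 4] * 500)
```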

I believe we can satisfy the axioms of probability without "keeping `entangled` variables in your probability distribution representation". If you can't directly observe probabilities, then a fair number of axioms can't be observed to be broken, and are therefore satisfied.

Precision and quantification is one valid approach to achieve correct models. Another valid approach is to control observation and expression. You seem to consider the former possibility to be somehow 'more' correct than the latter. I consider them equally correct, and instead differentiate them on other properties (expressiveness, scalability, etc.).

Probability is most certainly about quantifying uncertainty.

Hmm. I think that's true, and untrue, in about the same sense that geometry is about quantifying shapes.

Bloom filters are a probabilistic model for approximation. Relevantly, they are an example of probabilistic models and probabilistic observations. (In real systems, initial input will always be probabilistic AND approximate. So the distinction you're attempting to make isn't necessarily a useful one outside of fictional context.)

If you can't directly observe probabilities, then a fair number of axioms can't be observed to be broken, and are therefore satisfied.

The question then comes down to: what can you observe? At the bare minimum you probably want to support sampling from a distribution, and that way you can also indirectly observe probabilities given enough samples. But again, if you have some way to make things that are difficult to compute or otherwise problematic unobservable while still allowing you to do something useful, I'd love to hear about it.

So the distinction you're attempting to make isn't necessarily a useful one outside of fictional context.

You're conflating goal with method here. Classifying things like statistical inference differently than Bloom filters, randomized algorithms, etc. certainly is a useful distinction. In one case the goal is reasoning about uncertainty, and in the other you are exploiting (pseudo)randomness for a particular purpose (e.g. memory efficiency). Of course these are not two distinct classes, but rather two orthogonal axes. You can exploit randomness when doing approximate statistical inference, e.g. in MCMC algorithms. But you can also do statistical inference without randomness (e.g. when your model is simple enough to compute it exactly (aka conjugate priors), with variational Bayes, or with MAP estimation). Similarly, you can implement a set data structure without randomness, but you can do an approximate version with randomness (Bloom filters).
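For concreteness, a toy Bloom filter (a sketch, not production code) showing randomness exploited for memory efficiency rather than for reasoning about uncertainty:

```python
import hashlib

# Toy Bloom filter (sketch, not production code): a set membership test that
# exploits hashing for memory efficiency, at the price of a quantifiable
# false-positive probability. Added items are never falsely reported absent.
class Bloom:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.v = bits, hashes, 0

    def _indices(self, item):
        for k in range(self.hashes):
            h = hashlib.sha256(f"{k}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item):
        for i in self._indices(item):
            self.v |= 1 << i

    def __contains__(self, item):
        return all((self.v >> i) & 1 for i in self._indices(item))

b = Bloom()
for word in ["foo", "bar", "baz"]:
    b.add(word)
# "foo" in b is always True; a non-member may (rarely) false-positive
```

The false-positive rate is a designed-in, quantifiable property - which is exactly what makes this "probabilistic" in the method sense rather than the uncertainty-reasoning sense.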

Another important question is: in which context can you make those observations? Particularly relevant are how these contexts might introduce noise, latency, or external influence. If we only have open loops, then many of the potential errors that might be revealed with closed-loop feedback can be avoided (or paved over).

And yet another important question is: how many observations are you allowed to make? Unless the approximations are awful, it can take a lot of observations to demonstrate an error outside the expected error boundaries. What if you're only limited to, say, 10 samples?

In practice, we often won't need many samples. Whether we speak of display elements, human observers, actuators, etc. - there is a practical limit in how much we can effectively render, effect, see and hear. And compute, of course. And those practical limits are, in many cases, relatively small and static (depending on the domain) - perhaps suitable for control via a type system.

In the design I'm contemplating, it is not possible to sample a distribution from within the model, and resources (which necessarily exist outside the model) would be limited to relatively few samples. And feedback from resources back into the model allows potential noise due to concurrent influence by other agents and the world itself.

I agree. We can develop an ontology of design patterns and abstractions used in probabilistic systems, and we can differentiate probabilistic languages based on which they support effectively.

Nonetheless, in practice most uncertainty is due to approximations upstream - ultimately going all the way to the eye or camera (or other sensors), which capture a rather incomplete picture of the world. Downstream nodes in a probabilistic system are generally not able to distinguish uncertainty inherent to data from uncertainty that results from shortcuts and approximations in computation. Computation is ultimately a physical process, and the notions of uncertainty and approximation are deeply entangled in practice.

Concepts, abstractions, those sorts of things are not right or wrong, just varying degrees of useful. I can't say you're wrong to separate approximation from probability, to place them on separate axes. I can only question the utility of doing so, and of insisting others do so. When we invent 'toy problems', it may often seem that two concepts are orthogonal in some useful sense. I.e. when we draw a picture, it looks like:

But when we scale up, we might find that the axes aren't as orthogonal or useful as we initially thought, that the two notions are entangled in practice and at scale, perhaps due to various external causes.

I feel the separation you argue between probability and approximation is not very orthogonal, though it may appear to be for a set of small examples. I think the distinction is suspect in practice. But I also think there's something worth distinguishing on a slightly different axis, e.g. involving lossy vs. lossless subprograms.

I do feel randomness is easily and usefully distinguished from probabilistic systems. Indeed, I'd tend to go the opposite direction from random sampling and leverage probability as an opportunity for stable models (i.e. via stabilizing samples selected over time), which can result in robust and resilient state-like systems. (And it simultaneously reduces opportunity to observe errors in the probabilistic model.) Randomness also has its own entanglements - e.g. with state, space, and time.

And yet another important question is: how many observations are you allowed to make? Unless the approximations are awful, it can take a lot of observations to demonstrate an error outside the expected error boundaries. What if you're only limited to, say, 10 samples?

Doesn't matter, if you only had 1 sample then you could still observe anomalies, since any user of your system can simply replicate the whole system and take 1 sample from each. You could imagine a system that you can't replicate even in principle (like quantum money), but then we're well outside the realm of practical reality. In practice, if you don't track correlations between variables you have two options (1) admit that it doesn't follow the axioms of probability (2) somehow disallow any models that contain correlated variables in some observable way.

in practice most uncertainty is due to approximations upstream

For very general values of "approximation", yes, but there's still a distinction between *making* an approximation and *reasoning* about degrees of uncertainty. I agree that your pictures could apply to some other case, but not to this one. I gave plenty of real world examples in each quadrant. Here are some more:

Miller-Rabin primality testing, Monte Carlo integration, testing if an array is sorted by comparing a few elements, etc.

Note that the things in the left column are not single algorithms, but rather whole classes of methods. Depending on whether you are a true Bayesian or not, you'd move some methods from the middle left to the top left. Also, arguably, if you do conjugate priors, belief propagation or variable elimination with floating point arithmetic, it should move to the middle left, or we can further split up the approximate row to specify what kind of approximation is made. "Probabilistic programming" is completely about the left column (actually pretty much all statistical inference is some special case of probabilistic programming, so we could well give the left column the title "probabilistic programming"). My point is, the left column and the last two rows are on completely different axes.

That won't help if the system is deterministic, nor if the system is effectful. I'm surprised you thought it might. Are you confusing probabilistic samples with non-deterministic ones?

there's still a distinction between *making* an approximation and *reasoning* about degrees of uncertainty

Locally, it may seem that way. But upstream, your reasoning depends on approximate data. And downstream, you can only communicate or actuate approximations. And when forming and writing down your probabilistic model: that's approximate, too.

I gave plenty of real world examples in each quadrant. Here are some more

I see you provide techniques, a few of which involve closed-world calculations. I don't see examples of what I consider specific, real-world problem.

Problems are things like: face recognition, natural language processing, cooperative robotic behavior for mapping an enemy bunker, managing traffic lights for optimal efficiency, detecting suspicious behavior near a building, locating potholes in a video stream while driving. Placed in a specific context, those may even become real-world problems (instead of hypothetical problems).

It may seem that approximation vs. probability makes sense for specific techniques. That is because specific techniques apply to relatively small facets - incomplete sub-problems. You seem to use probabilistic programming to address specific sub-problems, not as a general purpose language. (Your use of an imperative language to search the `f(a,b,c)&&a` examples was quite telling.)

In the larger context, approximation and probability are closely aligned at every step in every process. I don't know of any way to separate them. I am not convinced that I should: such coupling is an opportunity.

That won't help if the system is deterministic, nor if the system is effectful. I'm surprised you thought it might. Are you confusing probabilistic samples with non-deterministic ones?

No I'm not. If the sampling is deterministic then it doesn't follow the axioms of probability in the first place. Again, if you believe that you can make a system where you can make useful observations that satisfy the laws of probability, yet you can't observe correlations (or some other thing that's problematic in a distributed context), show it.

Locally, it may seem that way. But upstream, your reasoning depends on approximate data. And downstream, you can only communicate or actuate approximations. And when forming and writing down your probabilistic model: that's approximate, too.

You are confusing different kinds of approximations. The fact that a probabilistic model is not fully accurate doesn't mean that it can't make exact predictions given the model. By analogy: I'm saying "you can simulate Newtonian mechanics exactly in some special cases, or approximately with algorithms like Verlet integration"; ergo, whether some algorithm is simulating Newtonian mechanics is on a different axis than whether the algorithm is using an approximation. You say: "but that is not an exact simulation! Newtonian mechanics is not exactly accurate since you need general relativity, so it is not useful to distinguish the concept `Newtonian mechanics` from the concept `approximation`". It's the last part of this reasoning that I disagree with: approximation is a concept and Newtonian mechanics is a concept, but they are not the same concept. Just like probability and approximation are not the same concept.
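To make the analogy concrete, here's a quick sketch (with illustrative constants, not anything from the discussion) of Verlet integration versus the exact closed-form solution for simple harmonic motion:

```python
import math

# Sketch with illustrative constants: simulate x'' = -x (simple harmonic
# motion) with position Verlet, and compare to the closed-form solution.
def verlet(x0, v0, dt, steps):
    accel = lambda x: -x                                 # omega = 1
    x_prev = x0 - v0 * dt + 0.5 * accel(x0) * dt ** 2    # bootstrap x(-dt)
    x = x0
    for _ in range(steps):
        # position Verlet: x_{n+1} = 2*x_n - x_{n-1} + a(x_n)*dt**2
        x, x_prev = 2 * x - x_prev + accel(x) * dt ** 2, x
    return x

dt, steps = 0.001, 1000
approx = verlet(1.0, 0.0, dt, steps)
exact = math.cos(dt * steps)    # exact solution x(t) = cos(t) at t = 1.0
error = abs(approx - exact)     # small (order dt**2) but nonzero
```

Whether the simulation is exact or Verlet-approximate, both are simulating the same Newtonian model - the two axes are independent.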

I see you provide techniques, a few of which involve closed-world calculations. I don't see examples of what I consider specific, real-world problem.

You don't need me to use Google to find examples, but suffice it to say that the techniques I mentioned cover 99% of statistical machine learning, including face recognition, natural language processing, etcetera. As an example, take latent Dirichlet allocation, which is often used for topic modeling (itself part of natural language processing). There are several algorithms for doing inference in that model, some doing variational Bayes, some using Gibbs sampling (a subclass of the class of MCMC algorithms), and there are algorithms based on a modified form of belief propagation. These algorithms are all approximate/Monte Carlo because exact inference is intractable for this model. But when we take other examples, e.g. naive Bayes, Bayesian linear regression, etc., we can do exact inference using conjugate priors.

I don't know of any way to separate them. [probability and approximation]

I do. See the table. It gives examples of both things that are approximations but are not about probability and things that are not approximations but are about probability.

I'm sure these answers will be skillfully taken out of context and misinterpreted, and used to run off on completely irrelevant tangents once more :( Perhaps it's my own fault for not being clear enough, but in this case it does seem futile.

If the sampling is deterministic then it doesn't follow the axioms of probability in the first place.

Which axiom of probability is violated? I believe this is a misconception on your part. A few counterpoints:

The axioms of probability apply to a probabilistic model, which can be considered apart from our sampling of it.

We can deterministically sample or query a probabilistic model with a PRNG or similar. There is no essential reason that samples on a probabilistic model should not be deterministic.

We can constrain clients (observers of the model) to a fixed number of samples or queries (or rate, for a temporal model).

We can constrain the expressiveness of the queries themselves, i.e. so we only return small, modular aspects of a model (rather than a distributed value).

Sampling is not essential to probabilistic modeling; rather, it's a useful technique to bind a probabilistic model to a non-probabilistic system (such as imperative code, or a discrete event system)

Samples don't actually need to be in the model, since the sampling system may itself be a probabilistic model (i.e. "there is a ~98% probability that this sample is in the model")

I believe it is possible to weaken probabilistic systems quite a bit from traditional probabilistic languages, and yet still remain probabilistic and useful.
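On the PRNG point above, a deterministic sampler can still be faithful to its model - a quick sketch:

```python
import random

# Sketch: a Bernoulli(0.3) model sampled deterministically via a seeded PRNG.
# Determinism is a property of the sampler, not a violation of the model;
# long-run frequencies still match the model's probabilities.
def bernoulli_samples(p, seed, n):
    rng = random.Random(seed)
    return [rng.random() < p for _ in range(n)]

obs1 = bernoulli_samples(0.3, seed=7, n=10_000)
obs2 = bernoulli_samples(0.3, seed=7, n=10_000)
freq = sum(obs1) / len(obs1)
# obs1 == obs2 exactly (same seed), yet freq remains close to 0.3
```

Two observers with the same seed see identical sample streams, so the axioms constrain the model being sampled, not the mechanism producing the samples.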

if you believe that you can make a system where you can make useful observations that satisfy the laws of probability, yet you can't observe correlations, show it

While "I'll believe it when I see it" is a fine jaded and skeptical attitude, it is also a consumerist attitude. Very well. If I get around to pursuing this in a few years (as I wish to for HCI reasons), I'll show something then.

Also, I never said the technology won't allow you to observe correlations. I'm saying the technology won't allow you to prove your observed correlations to be incorrect with respect to a probabilistic model due to the limitations on sample size and the probabilistic/approximate nature of the sampling itself. There is a difference.

I'm saying "you can simulate Newtonian mechanics exactly in some special cases, or approximately with algorithms like Verlet integration" ergo, whether some algorithm is simulating Newtonian mechanics is on a different axis than whether the algorithm is using an approximation.

I'd note: The measurements used in your simulation are approximate (or fictional), the world model used in your simulation is approximate (and fictional), and your Newtonian model is approximate, and ultimately the outputs of your simulation are only an approximate fragment of the model itself. The process as a whole is a long exercise in approximation.

Approximations are useful. They are even more useful if they allow us to understand how close to the truth we are likely to get, i.e. tracking accumulation of error and stability of potential slippery slopes.

Precision is useful in its own contexts, most commonly when modeling or reflecting upon the computation itself and interaction with services. This isn't an essential characteristic of computers and services, but is common to their popular languages and design.

You say that we can 'exactly' model Newtonian mechanics in some special cases. What you fail to say is that those special cases are entirely fictional, because otherwise you will not have 'exact' inputs. In practice, there's no value gained in precise modeling of Newtonian mechanics because we don't have precise measurements or world models in the first place. Might as well use inexact numbers and benefit from performance boosts - it's okay, so long as it doesn't significantly impact the computation relative to the measurement error.

the techniques I mentioned cover 99% of statistical machine learning

Not nearly. They don't cover data set acquisition and input. They don't cover data representation (a 'black art' of machine learning, according to one paper I read). They don't cover configuration and tuning of the learning model (state space, etc.). They don't cover integration of the model into a useful, real-world system.

I'm sure your techniques cover at least 99% of the statistical machine learning that they cover.

See the table.

One distinction I do find useful is problems (requirements, desiderata) vs. solutions (designs, strategies, techniques). Your table is nice, but it is about solutions. And it isn't even about full solutions, just about tiny facets of solutions. It provides no advice on separating the uncertainty and approximation naturally entangled in every real problem.

I wanted my "How many sweaters?" example to show that the prospective language design would support general-purpose use. For questions of all kinds, we often ask "Are you sure?" after we see the answer. If that follow-up answer is implicitly tracked as metadata, we get it mostly for free.

To tie some more things together, when I said "approximate-distribution-passing," when you said "From [...] samples you can approximate the probability distribution," and when David was talking about passing around collections of 100 samples, we were talking about the same things. I think the sample-generating systems you cite could be extremely helpful to look into, especially if they maintain intermediate sample collections or other approximate distributions as they go.

Is there a point to designing a programming language specifically to support machine learning? That means, rather than just use probability in programming, the language would help you suck in data to generate models that yield those probabilities?

A language for ML could be useful. But I suspect it would be more... profitable to hammer on Julia or Octave, improving the APIs and libraries for useful ML algorithms. Thing is, other than the great utility of matrices and probability, I'm not aware of any common principles for ML. It's still a field in early discovery, and it has been accelerating over the last five years.

I'm interested in applying exponential decay models to ML, and I'm doing so in my day job. I think that has potential to become another cornerstone. But I'm not there yet.