Sunday, July 30, 2017

In the last post we talked about early abstractions that allow you to design generic interfaces which can be polymorphic in the type parameter. Unless you abuse the type system of a permissive language like Scala, if you adhere to the principles of parametricity, this approach helps you implement abstractions that are reusable under various contexts. We saw this when we implemented the generic contract of the mapReduce function and its various specializations by supplying different concrete instances of the Monoid algebra.

In this post we will take a look at the other end of the spectrum in designing functional domain models. We will discuss evaluation semantics of model behaviors - the precise problem of when to commit to specific concrete evaluation semantics. Consider the following definition of a domain service module ..

Such definitions are quite common these days. We have a nice monadic definition going on which can be composed as well to implement larger behaviors out of smaller ones ..

def processPayments() = for {
p
Can we improve upon this design ?

Committing to the concrete early - the pitfalls ..

One of the defining aspects of reusable abstractions is the ability to run it under different context. This is one lesson that we learnt in the last post as well. Make the abstractions depend on the least powerful algebra. In this example our service functions return Either, which is a monad. But it's not necessarily the least powerful algebra in the context. Users may choose to use some other monad or may be even applicative to thread through the context of building larger behaviors. Why not keep the algebra unspecified at the service definition level and hope to have specializations in implementations or even in usages at the end of the world ? Here's what we can do ..

A top level service definition that keeps the algebra unspecified. Now if we want to implement a larger behavior with monadic composition, we can do this ..

// weaving the monad
def processPayments()(implicit me: Monad[M]) = for {
p
Note that we are using only the monadic bind in composing the larger behavior - hence the least powerful algebra that we can use is that of a Monad. And we express this exact constraint by publishing the requirements of the existence of an instance of a Monad for the type constructor M.

What about Implementation ?

Well, we could avoid the committment to a concrete algebra in the definition of the service. What about the implementation ? One of the core issues with the implementation is how you need to handle errors. This is an issue which often makes you commit to an implementation when you write the interpreter / implementation of a service contract. You may use Failure for a Try based implementation, or Left for an Either based implementation etc. Can we abstract over this behavior through a generic error handling strategy ? Some libraries like cats offers you abstractions like MonadError that helps you implement error reporting functionality using generic monadic APIs. Here's how we can do this ..

Note we needed a monad with error handling capabilities and we used MonadError for that. Note that we have kept the error type in MonadError as Throwable, which may seem a bit unethical in the context of pure functional programming. But it's also true that many libraries (especially Java ones) or underlying abstractions like Future or Try play well with exceptions. Anyway this is just a red herring though it has nothing to do with the current topic of discussion. The moot point is that you need to supply a MonadError that you have instances of.

.. and that's exactly what we will commit to. We are still dealing with a generic Monad even in the implementation without committing to any concreate instance.

End of the World!

The basic purpose why we wanted to delay committing to the concrete instance was to allow the users the flexibility to choose their own implementations. This is what we call the principle of delayed evaluation. Abstract early, evaluate late and decouple the concerns of building and the evaluation of the abstractions. We have already seen the 2 of these principles - we will see that our design so far will accommodate the third one as well, at least for some instances of M.

The user of our API has the flexibility to choose the monad as long as she supplies the MonadError[M, Throwable] instance. And we have many to choose from. Here's an example of the above service implementation in use that composes with another service in a monadic way and choosing the exact concrete instance of the Monad at the end of the world ..

import cats._
import cats.data._
import cats.implicits._
// monix task based computation
object MonixTaskModule {
import monix.eval.Task
import monix.cats._
val paymentInterpreter = new PaymentServiceInterpreter[Task]
val emailInterpreter = new EmailServiceInterpreter[Task]
for {
p
MonixTask is an abstraction that decouples the building of the abstraction from execution. So the Task that we get from building the composed behavior as in the above example can be executed in a deferred way depending of the requirements of the application. It can also be composed with other Tasks to build larger ones as well.

Vertical Composition - stacking abstractions

When you have not committed to an implementation early enough, all you have is an unspecified algebra. You can do fun stuff like stacking abstractions vertically. Suppose we want to implement auditability in some of our service methods. Here we consider a simple strategy of logging as a means to audit the behaviors. How can we take an existing implementation and plugin the audit function selectively ? The answer is we compose algebras .. here's an example that stacks the Writer monad with an already existing algebra to make the payments function auditable ..

We took the decision to abstract the return type in the form of a type constructor early on. But committed to the specific type only during the actual usage of the service implementation. Early abstraction and late committment to implementation make great composition possible and often in ways that may pleasantly surprise you later ..

Sunday, June 25, 2017

Let's talk genericity or generic abstractions. In the last post we talked about an abstraction Money, which, BTW was not generic. But we expressed some of the operations on Money in terms of a Money[Monoid], where Monoid is a generic algebraic structure. By algebraic we mean that a Monoid

is generic in types

offers operations that are completely generic on the types

all operations honor the algebraic laws of left and right identities and associativity

But when we design a domain model, what does this really buy us ? We already saw in the earlier post how law abiding abstractions save you from writing some unit tests just through generic verification of the laws using property based testing. That's just a couple of lines in any of the available libraries out there.

Besides reducing the burden of your unit tests, what does Money[Monoid] buy us in the bigger context of things ? Let's look at a simple operation that we defined in Money ..

add is a naive implementation though it's possibly the most frequent one that you will ever encounter in domain models around you. It picks up the Map elements and then adds the ones with the same key to come up with the new Money.

Why is this a naive implementation ?

First of all it deconstructs the implementation of Money, instead of using the algebraic properties that the implementation may have. Here we implement Money in terms of a Map, which itself forms a Monoid under the operations defined by Monoid[Map[K, V]]. Hence why don't we use the monoidal algebra of a Map to implement the operations of Money ?

|+| is a helper function that combines the 2 Maps in a monoidal manner. The concrete piece of code that you wrote in the naive implementation is now delegated to the implementation of the algebra of monoids for a Map in a completely generic way. The advantage is that you need (or possibly someone else has already done that for you) to write this implementation only once and use it in every place you use a Map. Reusability of polymorphic code is not via documentation but by actual code reuse.

valuation gives a standard implementation folding over the List that it gets. Now let's try to critique the implementation ..

1. The function does a foldLeft on the passed in collection payments. The collection only needs to have the ability to be folded over and List can do much more than that. We violate the principle of using the least powerful abstraction as part of the implementation. The function that implements the fold over the collection only needs to take a Foldable - that prevents misuse on part of a user feeling like a child in a toy store with something more grandiose than what she needs.

2. The implementation uses the add function of Money, which is nothing but a concrete wrapper over a monoidal operation. If we can replace this with something more generic then it will be a step forward towards a generic implementation of the whole function.

3. If we squint a bit, we can get some more light into the generic nature of all the components of this 2 line small implementation. zeroMoney is a zero of a Monoid, fold is a generic operation of a Foldable, add is a wrapper over a monoidal operation and creditsOnly is a mapping operation over every payment that the collection hands you over. In summary the implementation folds over a Foldable mapping each element using a function and uses the monoidal operation to collapse the fold.

The implementation is generic and the typesystem will ensure that the Money that we produce can only come from the list of payments that we pass. In the naive implementation there's always a chance that the user subverts the typesystem and can play malice by plugging in some additional Money as the output. If you look at the type signature of mapReduce, you will see that the only way we can get a B is by invoking the function f on an element of F[A]. Since the function is generic on types we cannot ever produce a B otherwise. Parametricity FTW.

mapReduce is completely generic on types - there's no specific implementation that asks it to add the payments passed to it. This abstraction over operations is provided by the Monoid[B]. And the abstraction over the form of collection is provided by Foldable[F]. It's now no surprise that we can pass in any concrete operation or structure that honors the contracts of mapReduce. Here's another example from the same model ..

We want to compute the maximum credit payment amount from a collection of payments. A different domain behavior needs to be modeled but we can think of it as belonging to the same form as valuation and implemented using the same structure as mapReduce, only passing a different instance of Monoid[Money]. No additional client code, no fiddling around with concrete data types, just matching the type contracts of a polymorphic function.

Looks like our investment on an early abstraction of mapReduce has started to pay off. The domain model remains clean with much of the domain logic being implemented in terms of the algebra that the likes of Foldables and Monoids offer. I discussed some of these topics at length in my book Functional and Reactive Domain Modeling. In the next instalment we will explore some more complex algebra as part of domain modeling ..

Sunday, June 18, 2017

In a domain model, when you have a domain element that forms an algebraic abstraction honoring certain laws, you can get rid of many of your explicitly written unit tests just by checking the laws. Of course you have to squint hard and discover the lawful abstraction that hides behind your concrete domain element.

Consider this simple abstraction for Money that keeps track of amounts in various currencies.

Needless to say we will have quite a number of unit tests that check for addition of Money, including the boundary cases of adding to zeroMoney.

It's not very hard to see that the type Money forms a Monoid under the add operation. Or to speak a bit loosely we can say that Money is a Monoid under the add operation.

A Monoid has laws that every instance needs to honor - associativity, left identity and right identity. And when your model element needs to honor the laws of algebra, it's always recommended to include the verification of the laws as part of your test suite. Besides validating the sanity of your abstractions, one side-effect of verifying laws is that you can get rid of many of your explicitly written unit tests for the operation that forms the Monoid. They will be automatically verified when verifying the laws of Monoid[Money].

strive to find abstractions in your domain model that are constrained by algebraic laws

check all laws as part of your test suite

you will find that you can get rid of quite a few explicitly written unit tests just by checking the laws of your abstraction

and of course use property based testing for unit tests

In case you want to take a look at the full code base, it's there on my Github repo. In the next post we will take the next step towards modeling with generic algebraic code using the Monoid pattern from this example. Code written in parametric form without depending on specialized concrete types can be more robust, easier to test and easier to reason about. I have also discussed this at length in my book Functional and Reactive Domain Modeling. I plan to supplement the materials covered there with more examples and code patterns ..

Saturday, June 13, 2015

Even though I got my copy of Prof. Eugenia Cheng's awesome How to Bake π a couple of weeks back, I started reading it only over this weekend. I am only on page 19 enjoying all the stuff regarding cookies that Prof. Cheng is using to explain abstraction. This is a beautiful piece of explanation and if you are a programmer you may get an extra mile out of the concepts that she explains here. Let's see if we can unravel a few of them ..

She starts with a real life situation such as:

If Grandma gives you five cookies and Grandpa gives you five cookies, how many cookies will you have ?

Let's model this as box of cookies that you get from your Grandma and Grandpa and you need to count them and find the total. Let's model this in Scala and we may have something like the following ..

case class CookieBox(count: Int)

and we can define a function that gives you a CookieBox containing the total number of cookies from the 2 boxes that we pass to the function ..

The trouble here is that cookies do not obey the rules of logic, so using math to study them doesn't quite work. .. We could impose an extra rule on the situation by adding "... and you're not allowed to eat the cookies". If you're not allowed to eat them, what's the point of them being cookies ?

This is profound indeed. When we are asked to count some stuff, it really doesn't matter if they are cookies or stones or pastries. The only property we need here is to be able to add together the 2 stuff that we are handed over. The fact that we have implemented howManyCookies in terms of CookieBox gives the little child the opportunity to cheat by using the eat function. More information is actually hurting us here, being concrete with data types is actually creating more avenues for incorrect implementation.

Prof. Cheng is succinct here when she explains ..

We could treat the cookies as just things rather than cookies. We lose some resemblance to reality, but we gain scope and with it efficiency. The point of numbers is that we can reason about "things" without having to change the reasoning depending on what "thing" we are thinking about.

Yes, she is talking about generalization, being polymorphic over what we count. We just need the ability to add 2 "things", be it cookies, monkeys or anchovies. In programming we model this with parametric polymorphism, and use a universal quantification over the set of types for which we implement the behavior.

def howMany[A](gm: A, gp: A) = //..

We have made the implementation parametric and got rid of the concrete data type CookieBox. But how do we add the capability to sum the 2 objects and get the result ? You got it right - we already have an abstraction that makes this algebra available to a generic data type. Monoids FTW .. and it doesn't get simpler than this ..

trait Monoid[T] {
def zero: T
def append(t1: T, t2: T): T
}

zero is the identity function and append is a binary associative function over 2 objects of the type. So given a monoid instance for our data type, we can model howMany in a completely generic way irrespective of whether A is a CookieBox or Monkey.

With the above implementation of howMany, the little child will not be able to cheat. By providing a simpler data type we have made the implementation more robust and reusable across multiple data types.

Next time someone wants me to explain parametricity, I will point them to Page 19 of How to Bake π.

Thursday, March 26, 2015

Some time back I blogged about the possibilities that probabilistic techniques and randomization bring on to the paradigm of stream computing. Architectures based on big data not only relate to high volume storage, but also on low latency velocities, and this is exactly where stream computing has a role to play. I discussed a few data structures like bloom filters, count min sketch and hyperloglog and algorithms like Locality Sensitive Hashing that use probabilistic techniques to reduce the search and storage space while processing huge volumes of data.

Of late, I have been studying some of the theories behind machine learning algorithms and how they can be used in conjunction with the petabytes of data that we generate everyday. And the same thing strikes here - there are algorithms that can model the most perfect classifier. But you need randomization and probabilistic techniques to make them scale, even at the expense of a small amount of inaccuracy creeping within your model. In most cases we will see that the small inaccuracy that comes within your algorithm because of probabilistic bounds can be compensated by the ability to process more data within the specified computational timeline. This is true even for some of the basic algorithms like matrix multiplication that form the core of machine learning models.

The contents of this post is nothing original or new. It's just to share some of my thoughts in learning the usage of approximation techniques in building machine learning classifiers.

Matrix Multiplication

Not only these specialized data structures or algorithms, randomization has been found to be quite effective for processing large data sets even for standard algorithms like matrix multiplication, polynomial identity verification or min cut identification from large graphs. In all such cases the best available algorithms have computational complexity which works well for a small data set but doesn't scale well enough with the volumes of data.

Consider a case where we are given 3 matrices, $A$, $B$ and $C$ and we need to verify if $AB = C$. The standard algorithm for matrix multiplication takes $\Theta(n^3)$ operations and there's also a sophisticated algorithm that works in $\Theta(n^{2.37})$ operations. Instead let's consider some randomization and choose a random vector $\bar{r} = (r_1, r_2, .. r_n) \in \{0, 1\}^n$. Now we can compute $AB\bar{r}$ by first computing $B\bar{r}$ and then $A(B\bar{r})$. And then we compute $C\bar{r}$. If we find $A(B\bar{r}) \neq C\bar{r}$, then $AB \neq C$. Otherwise we return $AB = C$. Instead of matrix-matrix multiplication our randomized algorithm uses matrix-vector multiplication, which can be done in $\Theta(n^2)$ operations the standard way.

Obviously a $\Theta(n^2)$ algorithm has a lower computational complexity than $\Theta(n^3)$ and scales better with larger data sets. Now the question is how accurate is this algorithm ? Is it guaranteed to give the correct answer every time we run it ? As with other probabilistic algorithms, there's a chance that our algorithm will return a wrong result. But as long as we can show that the chance is minimal and can be reduced by tuning some parameters, we should be fine.

It can be shown that if $AB \neq C$ and if $\bar{r}$ is chosen uniformly at random from $\{0, 1\}^n$ then $Pr(AB\bar{r} = C\bar{r}) <= 1/2$. But the trick is that we can run our randomized algorithm many times choosing $\bar{r}$ with replacement from $\{0, 1\}^n$. If for any of these trials we get $AB\bar{r} \neq C\bar{r}$, then we can conclude $AB \neq C$. And the probability that we get $AB\bar{r} = C\bar{r}$ for all $k$ trials despite $AB \neq C$ is $2^{-k}$. So for $100$ trials, the chance of error is $2^{-100}$, which we can see is really small. The detailed proof of this analysis can be found in the excellent book Probability and Computing by Michael Mitzenmacher & Eli Upfal.

Matrix multiplication is something that's used heavily especially in implementing machine learning classifiers. And if we can tolerate that little chance of error we get an algorithm with lower computational complexity that scales much better.

Stochastic Gradient Descent

Consider another use case from core machine learning classifier design. Gradient descent is a standard way to minimize the empirical risk for measuring training set performance. The empirical risk is given by the following equation:
$$E_n(f) = (1/n)\sum_i l(f_w(x_i),y_i)$$
where $l$ is the loss function that measures the cost of predicting $f_w(x_i)$ from $n$ training examples where the actual answer is $y$ and $f_w(x)$ is the function parameterized by the weight vector $w$. Each iteration of gradient descent updates the weights $w$ on the basis of the gradient of $E_n(f_w)$ according to the following iterative step:

$$w_{t+1} = w_t - \gamma (1/n) \sum_i \nabla_w Q(z_i, w_t)$$
where $\gamma$ is an adequately chosen gain. Note that a single update step for the parameter runs through all the training examples and this gets repeated for every update step that you do before convergence. Compare this with Stochastic Gradient Descent (SGD) where the update step is given by the following:

$$w_{t+1} = w_t - \gamma \nabla_w Q(z_t, w_t)$$
Note instead of running through all examples and compute the exact gradient, SGD computes the gradient based on one randomly picked example $z_t$. So, SGD does a noisy approximation to the true gradient. But since it does not have to process all the examples in every iteration it scales better with a large data set. In this paper on Large Scale Machine Learning With Stochastic Gradient Descent, Leon Bottou classifies the error in building the classifier into 3 components:

Approximation Error, which comes from the fact that the function $f$ that we choose is different from the optimal function $f^*$ and we approximate using a few examples

Estimation Error, which comes from the fact that we have a finite number of training examples and would have gone away with infinite number of them

Optimization Error, which comes from the fact that we are using an inferior algorithm to estimate the gradient

With normal gradient descent we will have low optimization error since we run through all the training examples in every iteration to compute the gradient, which is clearly superior to the algorithm of SGD that does a noisy approximation. But SGD will report a lower approximation and estimation error since we will be able to process a larger dataset within the stipulated computation time. So it's a tradeoff of that we make using SGD, but clearly we scale better with larger data sets.

Singular Value Decomposition

Singular Value Decomposition is a dimensionality reduction technique to unearth a smaller number of intrinsic concepts from a high dimensional matrix by removing unnecessary information. It does so by projecting the original matrix on to lower dimensions such that the reconstruction error is minimized. What this means is that given a matrix $A$ we decompose it into lower dimensional matrices by removing the lesser important information. And we do this in such a way that we can reconstruct a fairly close approximation to $A$ from those lower dimensional matrices. In theory SVD gives the best possible projection in terms of reconstruction error (optimal low rank approximation). But in practice it suffers from scalability problems with large data sets. It generates dense singular vectors even if the original matrix is a sparse one and hence is computationally inefficient, taking cubic time in the size of the data.

This can be addressed by another algorithm, the CUR algorithm which allows larger reconstruction error but lesser computation time. CUR decomposes the original matrix into ones of lesser dimensions but uses a randomized algorithm in selection of columns and rows based on their probability distribution. Now it can be shown that CUR reconstruction is just an additive term away from SVD reconstruction and it's a probabilistic bound subject to the condition that we select a specific range of columns and rows from $A$. The computational bound of CUR is of the order of the data set, which is much less than that of SVD (which as I mentioned earlier is cubic). This is yet another example where we apply randomization and probabilistic techniques to scale our algorithm better for larger data sets in exchange for a little amount of inaccuracy.

These are only a few instances of probabilistic bounds being applied to solve real world machine learning problems. There are a lots more. In fact I find that scalability of machine learning has a vey direct correlation with application of probabilistic techniques to the model. As I mentioned earlier the point of this post is to share some of my thoughts as I continue to learn techniques to scale up machine learning models. Feel free to share your ideas, thoughts and discussions in comments.

Wednesday, February 11, 2015

I have been doing quite a bit of domain modeling using functional programming mostly in Scala. And as it happens when you work on something for a long period of time you tend to identify more and more patterns that come up repeatedly within your implementations. You may ignore these as patterns the first time, get a feeling of mere coincidence the next time, but third time really gives you that aha! moment and you feel like documenting it as a design pattern. In course of my learnings I have started blogging on some of these patterns - you can find the earlier ones in the series in:

In this continuing series of functional patterns in domain modeling, I will go through yet another idiom which has been a quite common occurrence in my explorations across various domain models. You will find many of these patterns explained in details in my upcoming book on Functional and Reactive Domain Modeling, the early access edition of which is already published by Manning.

One of the things that I strive to achieve in implementing domain models is to use the type system to encode as much domain logic as possible. If you can use the type system effectively then you get the benefits of parametricity, which not only makes your code generic, concise and polymorphic, but also makes it self-testing. But that's another story which we can discuss in another post. In this post I will talk about a pattern that helps you design domain workflows compositionally, and also enables implementing domain invariants within the workflow, all done statically with little help from the type system.

As an example let's consider a loan processing system (simplified for illustration purposes) typically followed by banks issuing loans to customers. A typical simplified workflow looks like the following :-

The Domain Model

The details of each process is not important - we will focus on how we compose the sequence and ensure that the API verifies statically that the correct sequence is followed. Let's start with a domain model for the loan application - we will keep on enriching it as we traverse the workflow.

Note we have a bunch of attributes that are defined as optional and will be filled out later as the loan application traverses through the sequence of workflow. Also we have declared the class private and we will have a smart constructor to create an instance of the class.

Wiring the workflow with Kleisli

Here are the various domain behaviors modeling the stages of the workflow .. I will be using the scalaz library for the Kleisli implementation.

applyLoan is the smart constructor that creates the initial instance of LoanApplication. The other 2 functions approve and enrich perform the approval and enrichment steps of the workflow. Note both of them return an enriched version of the LoanApplication within a Kleisli, so that we can use the power of Kleisli composition and wire them together to model the workflow ..

When you have a sequence to model that takes an initial object and then applies a chain of functions, you can use plain function composition like h(g(f(x))) or using the point free notation, (h compose g compose f) or using the more readable order (f andThen g andThen h). But in the above case we need to have effects along with the composition - we are returning Option from each stage of the workflow. So here instead of plain composition we need effectful composition of functions and that's exactly what Kleisli offers. The andThen combinator in the above code snippet is actually a Kleisli composition aka function composition with effects.

So we have everything the workflow needs and clients use our API to construct workflows for processing loan applications. But one of the qualities of good API design is to design it in such a way that it becomes difficult for the client to use it in the wrong way. Consider what happens with the above design of the workflow if we invoke the sequence as enrich andThen approve. This violates the domain invariant that states that enrichment is a process that happens after the approval. Approval of the application generates some information which the enrichment process needs to use. But because our types align, the compiler will be perfectly happy to accept this semantically invalid composition to pass through. And we will have the error reported during run time in this case.

Remembering that we have a static type system at our disposal, can we do better ?

Phantom Types in the Mix

Let's throw in some more types and see if we can tag in some more information for the compiler to help us. Let's tag each state of the workflow with a separate type ..

trait Applied
trait Approved
trait Enriched

Finally make the main model LoanApplication parameterized on a type that indicates which state it is in. And we have some helpful type aliases ..

These types will have no role in modeling domain behaviors - they will just be used to dispatch to the correct state of the sequence that the domain invariants mandate. The workflow functions need to be modified slightly to take care of this ..

Note how we use the phantom types within the Kleisli and ensure statically that the sequence can flow only in one direction - that which is mandated by the domain invariant. So now an invocation of enrich andThen approve will result in a compilation error because the types don't match. So once again yay! for having the correct encoding of domain logic with proper types.

Thursday, January 01, 2015

I look forward to 2015 as the year when randomized algorithms, probabilistic techniques and data structures become more pervasive and mainstream. The primary driving factors for this will be more and more prevalence of big data and the necessity to process them in near real time using minimal (or constant) memory bandwidth. You are given data streams where possibly you will see every data only once in your lifetime and you need to churn out analytics from them in real time. You cannot afford to store all of them in a database on disk since it will incur an unrealistic performance penalty to serve queries in real time. And you cannot afford to store all information in memory even if you add RAM at your own will. You need to find clever ways to optimize your storage, employ algorithms and data structures that use sublinear space and yet deliver information in real time.

Many such data structures are already being used quite heavily for specialized processing of data streams ..

These data structures are becoming more and more useful as we prepare to embrace and process larger data sets with fairly strict online requirements. And it has started making a difference. Take for example Impala, the open source analytic database from Cloudera that works on top of Hadoop. Impala's NDV aggregate function (number of distinct values) uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space. This blog post has the details of the performance improvement that it offers in comparison to the standard distinct count. The immensely popular NoSQL store Redis also offers a HyperLogLog implementation that you can use to get an approximation on the cardinality of a set using randomization. Salvatore has the details here on the implementation of HyperLogLog algorithm in Redis.

The most important reason these algorithms and data structures are becoming popular is the increased focus on our "online" requirements. We are not only processing bigger and bigger data set, we need results faster too. We just cannot afford to push all analytics to the batch mode and expect results coming out after an overnight batch processing. Various architectural paradigms like the lambda architecture also target to address this niche area. But before investing on such complex architectures, often some neat data structures that use probabilistic techniques and randomization may offer a much lighter weight solution that you are looking for.

Consider processing the Twitter stream and generating analytics (of whatever form) online. This means that immediately after seeing one twitter feed you must be able to predict something and update your model at the same time. Which means you need to memorize the data that you see in the feed, apply it to update your model and yet cannot store the entire hose that you have seen so far. This is online learning and is the essence of techniques like stochastic gradient descent that help you do this - the model is capable of making up to date predictions after every data that you see. John Myles White has an excellent presentation on this topic.

Consider this other problem of detecting similarities between documents. When you are doing this on a Web scale you will have to deal with millions of documents to find the similar sets. There are techniques like minhash which enable you to compress documents into signature matrices. But even then the scale becomes too big to be processed and reported to the user in a meaningful amount of time. As an example (from Mining Massive Datasets), if you process 1 million document using signatures of length 250, you still have to use 1000 bytes per document - the total comes to 1 gigabyte which very well fits into the memory of a standard laptop. But when you check for similar pairs, you need to process (1,000,000 choose 2) or half a trillion pairs of documents which will take almost 6 days to compute all similarities on a laptop. Enter probabilistic techniques and locality sensitive hashing (LSH) algorithm fits this problem like a charm. Detecting similarity is a problem that arises in recommender systems with collaborative filtering and LSH can be used there as well. The basic idea of LSH as applied to similarity detection is to use hashing multiple number of times and identify candidate pairs that qualify for similarity checking. The idea is to reduce the search space using probabilistic techniques so that we can eliminate a class of candidates which have very low chance of being similar.

Here I have only scratched the surface of the areas where we apply randomization and probabilistic techniques to solve problems that are very real today. There are plentiful other areas in data mining, graph clustering, machine learning and big data processing where similar techniques are employed to reduce the curse of dimensionality and provide practical solution at scale. 2014 has already seen a big surge in terms of popularizing these techniques. I expect 2015 to be bigger and more mainstream in terms of their usage.

Personally I have been exploring data stream algorithms a lot and have prepared a collection of some useful references. Feel free to share in case you find it useful. I hope to do something more meaningful with stream processing data structures and online learning in 2015. Have a very happy and joyous new year ..