Style 30 – Map Reduce

Constraints

Input data is divided in chunks, similar to what an inverse multiplexer does to input signals.

A map function applies a given worker function to each chunk of data, potentially in parallel.

A reduce function takes the results of the many worker functions and recombines them into a coherent output.

Map reduce should be a familiar paradigm to most developers nowadays. In today’s style we’ll use a very simple map-reduce flow to solver the term frequency problem.

As you can see from the above diagram, we’ll take an input text and split it into X no. of chunks which can be processed in parallel. To do this, we’ll first add a partition function below.

Each chunk of the original input text can then be split and filtered (using the list of stop words) independently.

Two things to note about the splitWords function below:

a) it returns an array of (word, count) tuples

b) it doesn’t perform any local aggregation, so if the same word appear twice in its chunk then it’ll appear as (word, 1) twice in the returned array

Once all the chunks has been processed, we can reduce over them with the countWords function below.

Here, I have taken an imperative approach for efficiency sake, and because that’s what Crista had done in her Python solution. Alternatively, you could have written something along the lines of the following.

which is much more concise and functional in style, but about 25% slower (~100ms) than the imperative approach.

Next, we’ll add a sort function that’ll sort the array of (word, count) values in descending order by the count (i.e. the snd element in the tuple).

And finally, we wire everything together using pipes.

Job done!

or is it?

So far, I have followed Crista’s approach in her Python solution, and the solution meets the constraints of this style. But, we haven’t taken advantage of the potential for parallelism here.

Version 2 – Parallel Processing

To make the processing of the chunks happen in parallel, let’s modify the splitWords function to make it async.

Notice that not much has changed from our non-async version! Except for the minor optimization we added to aggregate the words in the chunk instead of returning duplicated (word, 1) tuples.

The next thing we need to do, is to wrap the original pipeline in an async workflow.

Here, once the partitions have been created, we’ll process them in parallel and wait for all the parallel processing to complete using Async.Parallel before proceeding with the rest of the pipeline as before.

You can find the source code for this exercise here (v1) and here (v2 – async).

Style 29 – Dataspaces

Constraints

Existence of one or more data spaces where concurrent units store and retrieve data.

No direct data exchanges between the concurrent units, other than via the data spaces.

To get started, we’ll define our dataspaces – one to store the words we need to process, and one to store the partial frequencies from each concurrent unit processing the words (we’ll see what this means soon).

Next, we’ll define the processWords function that will be executedconcurrently.

Each concurrent unit will poll the wordSpace dataspace for words to process and create a word frequencies dictionary for the words that it has processed. Upon exhausting all the available words, each concurrent unit will save the locally aggregated word frequencies into the freqSpace dataspace.

Next, we’ll read the text from Pride & Prejudice and add the words into our wordSpace dataspace for processing.

In Crista’s solution, she kicked off 5 concurrent threads to process the words and waited for all of them to finish before merging the partial results in the freqSpace dataspace. I’m not sure if this fork-join approach is a necessary part of this style, but it seems a reasonable choice here.

To follow the same approach, we can use F#’s Async.Parallel method.

Here, I chose to use Async.RunSynchronously to synchronously wait for the parallel tasks to finish (this is the same approach Crista took in her solution). Alternatively, you can make the wait happen asynchronously by capturing the result of Async.Parallel instead (see Version 2below).

The next step is pretty straight forward. Iterate through the partial results in the freqSpace dataspace and aggregate them into a single word frequencies dictionary, then return the word frequencies as a sorted array.

Finally, take the top 25 results from the sorted array and display them on screen.

Version 2 – Async all the way

If you didn’t like the synchronous waiting in the fork-join approach above, here’s a modified version of the solution that is async all the way.

So first, we’ll capture the parallel processing of words (and subsequently ignoring the results) as an Async<unit>. Notice that at this point we haven’t done any work yet, we merely captured the asynchronous computation that we will perform (which is one of the key differences between async in C# and F#).

Inside another async { } block, we can action the parallel processing, asynchronously wait for its completion (i.e. do! processAllWords) and then merge the partial results in the freqSpace dataspace as before.

Finally, we’ll kick off the entire train of asynchronous computations that we have composed together with Async.Start.

And voila, now everything runs asynchronously end-to-end

You can find the source code for this exercise here (v1) and here (v2 – async all the way).

Style 28 – Actors

Constraints

The larger problem is decomposed into ‘things’ that make sense for the problem domain.

Each ‘thing’ has a queue meant for other things to place messages in it.

Each ‘thing’ is a capsule of data that exposes only its ability to receive messages via the queue.

Each ‘thing’ has its own thread of execution independent of the others.

I have been looking forward to doing this style every since the Letterbox style months ago So without any delay, let’s get started!

F# has built-in support for the Actor Model via the MailboxProcessor type, and we’ll use it to represent each of the ‘things’ in our solution here.

First, we’ll create an actor to load and store the words that we need to process. Since MailboxProcessor is generic, we can define what messages it can receive with a DataStorageMessage union type that allows two kinds of messages:

LoadWords – which contains the path to the local file to load words from

NextWord – once loaded, this message fetches the next available word

In the snippet below, you can see that we have a recursive receive loop which asynchronously waits for message to arrive in its mailbox and handles them accordingly. One thing to note here is that, if a LoadWords message is received before all the loaded words are processed the current behaviour is to override the remaining words list. This can give you unexpected results, the current design leaves the responsibility of ensuring this doesn’t happen with higher level abstractions (see the controller actor below).

Next, we’ll add another actor to manage the stop words. Similar to the dataStorageManager, we’ll first define the messages that can be sent to our actor:

LoadStopWords – which contains the path to the local file to load stop words from

IsStopWord – which passes in a word and expects a boolean as reply

As we did in the dataStorageManager, whenever the stopWordsManager actor receives a LoadStopWords message it’ll replace the current list of stop words. All future words passed in via the IsStopWord message will be checked against the updated list of stop words.

Next, we’ll add an actor to track the word frequencies. This actor can accept three messages:

Add – adds another word to the current word frequencies

TopN – fetches the current top N words with their corresponding frequencies

Reset – resets the count

Again, the code snippet below should be pretty straight forward, although there is actually a frailty here due to the use of Seq.take. One thing that often annoys me about Seq.take is that, unlike Enumerable.Take, it throws when there is insufficient number of elements in the sequence! In this case, if the calling agent ask for TopN too early or with a large N it’s possible to kill our actor (which is also something that we’re not handling here). It’s fine given the context of this exercise, but these are things that you need to consider when writing production-ready code.

Lastly, we’ll add an actor to orchestrate the control flow of our program, that will accept a simple Run message. Here, I’ve added an AsyncReplyChannel<unit> to the Run message so that the caller has a deterministic way to know when the program has completed.

When the controller receives a Run message, it’ll initialize the other actors (which happens concurrently due to the asynchronous nature of messaging) and then process all the words from Pride and Prejudice by recursively fetch words from the dataStorageManager until there’s no more. One thing to note is that, because a MailboxProcessor process messages one-at-a-time, so even if the controller receives multiple Run messages at the same time it’ll still process them one at a time and we don’t even have to use locks!

To run our program, we’ll kick things off by sending a Run message to the controller. I opted to run this synchronously, but you could just easily ignore the reply and run the program asynchronously with Async.Ignore and Async.Start.

I’m a massive fan of Erlang and the Actor Model, they’re highly related since Erlang implements the Actor Model but shouldn’t be mixed up – e.g. code hot swapping and supervision trees are features of Erlang and Erlang OTP, and are not prescribed as part of the Actor Model (which is a theoretical model for describing computation).

If you haven’t already, please go ahead and watch this Channel9recording of a conversation between Carl Hewitt and Erik Meijer on the Actor Model – what it is, and what it isn’t. I also did a write up to summarise the key points.

Style 27 – Lazy Rivers

Constraints

Data comes to functions in streams, rather than as a complete whole all at at once.

Functions are filters / transformers from one kind of data stream to another.

Given the constraint that data is to come in as streams, the easiest way to model that in F# is using sequences.

First, let’s add a function to read the text from an input file as a seq<char>.

Then, we’ll add another function to transform this sequence of characters into a sequence of words.

Next, we’ll filter this sequence to remove all the stop words and return the non-stop words as another seq<string>.

So far everything’s pretty straightforward, but things get a bit tricky from here on.

To count and sort the non-stop words, we can return the running count as a sequence after each word, but that’s terribly inefficient.

Instead, we can batch the input stream into groups of 5000 words. We’ll update the word frequencies with each batch and produce a new sorted array of word counts for the output sequence.

In the snippet above, focus on the Seq.scan section (ps. Seq.scan is similar to Seq.fold, except it returns all the intermediate values for the accumulator, not just the final accumulated value).

Here, given the current word frequencies map and a batch of words, we’ll return a new map (remember, a F# Map is immutable) whose counts have been updated by the latest batch.

Because Seq.scan returns results for all intermediate steps including the initial value (the empty map in this case), we have to follow up with Seq.skip 1 to exclude the empty map from our output.

Finally, to string everything together, we’ll print the top 25 words for each of the outputs from countAndSort. It takes quite a few iterations, but you’ll see the result slowly emerging in the process.

Unlike the other styles, we’ll get a few sets of outputs – one for each batch of words processed.

Style 26 – Spreadsheet

Constraints

The problem is modeled like a spreadsheet, with columns of data and formulas.

Some data depends on other data according to formulas. When data changes, the dependent data also changes automatically.

Since we’re modelling the problem like a spreadsheet, let’s first define a few type alias to help us establish our domain language.

a column is referenced by a string – e.g. column “A”, “B”, “C”, and so on

a column can have explicit values, or a formula

a formula references one or more other columns and uses their data to calculate the display value for the column

We can model a Column with a union type, where a column is:

a) given explicit values; or

b) given a formula, and its display value is calculated from that formula.

A Spreadsheet is a collection of columns, whose value can be retrieved or updated using a ColumnRef, e.g.

let spreadsheet = Spreadsheet()

spreadsheet.[“A”] <- Choice1Of2 [| “hello”; “world” |]

let columnA = spreadsheet.[“A”] // [| “hello”; “world” |]

Note that our custom indexer accepts a Choice<DisplayValue, Formula> in its setter instead of a Column. I made this decision because Column.Formula needs both the Formula and its calculated DisplayValue, and here it shouldn’t be the caller’s responsibility to exercise the formula and calculate its DisplayValue.

After a column’s value is updated, we’ll recalculate the DisplayValue of all the columns in the Spreadsheet. This is a naive approach, but sufficient for the task at hand.

During this step, we’ll need to recursively evaluate the DisplayValue of the columns, sometimes a column might be evaluated multiple times if it’s referenced by several formulae. Again, there’s room for optimization here by caching previously calculated values.

Next, we’ll start inputting data into our Spreadsheet.

To make it easier to work with the Choice<DisplayValue, Formula> type, let’s add two helper functions.

We’ll leave column A and B blank for now.

Column C will use the data from Column A (all words) and Column B (stop words) to calculate all the non-stop words.

Column D will contain all the distinct words from Column C (non-stop words).

Column E references both Column C (non-stop words) and Column D (distinct non-stop words), and counts the frequency for each of these words.

An important detail to note here is that, positionally, each count in Column E is aligned with the corresponding word in Column D.

Column F is where we’ll store the output of the program.

Because Column D (unique non-stop words) and Column E (counts) are positionally aligned, so we can use Array.zip to combine the two columns to calculate a sorted array of word frequencies.

Now that all the columns are set up, we can go back and input the text from Pride and Prejudice and the list of stop words into Column A and B respectively.

Doing so will trigger the DisplayValue of all the other columns to be recalculated.

Finally, we’ll take the top 25 rows from Column F and print them to complete our program.

I hope you’ve enjoyed today’s style, apologies for the lack of updates the last two weeks, I have been busy working out some stuff on a personal front. I’ll resume the two posts per week routine, which means we should finish this series in just over a month!

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.