I was posed an interesting question earlier in the week which I thought would be of interest sharing.

Excel’s graphing functions allow you to plot a moving average as a trendline. It’s handy if you want to get an overall picture of what’s been going on with your data recently, e.g. your volume of sales over the last rolling 6 week period, but what if you want to get the data on your worksheet, to output for further analysis?

Fixed Period

The standard rolling average simply takes the mean of the last n data points where n is the period of the moving average. Say your desired period is 6; on a column of data in A2:A16, against the 6th data point you would input the formula:

=AVERAGE(A2:A6)

and copy down for the rest of the data set.

Variable Period

All well and good. You could plot this data and see that it marries up with the moving average trendline.

But what if you want to have a variable period. You don’t want to have to re-write the formula each time. So let’s add a cell with the period value, and have the formula compute everything automatically.

First things first, we need to have the formula display the average, only if the number of cells down into the list is greater or equal to the period. i.e. if the period is 2, we show from the 2nd value down. If the period is 5 we show from the 5th value down, etc. We can imagine this as <This Row>+1 >= <Period> + <First Row>. Here, I’ve placed the period value in C1:

In Excel, we can write the <has the period passed> formula as:

=ROW(A2)+1>=$C$1+ROW($A$2)

The period value and the first row are fixed, so they’re locked with dollar signs.

The next step is to calculate the target “From” and “To” cells between which to derive the average. The “To” cell is easy, it’s always going to be <This Row>, the “From” cell is simply (<This Row> – <Period>) +1, or for the first cell:

=ROW(A2)-$C$1+1

All we do next is use the INDIRECT function to convert our values into a range which can be averaged:

=INDIRECT(“A”&ROW(A2)-$C$1+1&”:A”&ROW(A2)))

And then, we only apply that formula if the “After period” check returns true:

Cumulative to Rolling

You’ll notice that the above formulas leave a gap at the beginning of the data while the period value is reached so that n values are always being averaged. For most situations, this is what you’ll want, it’s more accurate and applies the same “smoothing factor” to all averages (as an aside, the rolling average is a form of low pass filter on your data). Another type of moving average is the “Cumulative Average” where the average is taken from the first data entry to the current value along the entire series. As the period grows with every new data point. the smoothing factor increases and the moving average tends towards the average of the entire data set. A cumulative average can be calculated in the following formula by setting the period value to the number of data points in your set.

One way of filling the initial gap in the traditional moving average is to build a cumulative average for the first few data points until the desired value is reached, then switch over to a simple moving average. Effectively, the period value grows to a maximum determined by you. The downside is that as the smoothing is not equal across the range, higher frequency “noise” will show at the beginning of the averages – you will need to decide if this is acceptable.

As this time, we’ll be displaying averages along the whole data set, there’s no need for an IF statement to check if we should display a value, there will be one in all cases. What we need to calculate is the starting cell, the “From” cell. This needs to snap to the first cell, until we can roll along with the length of the rolling period. In general, we can write this as <First Cell> + Max(0, <Data point number> – <Period>)

Where <Data point number> refers to the index value of this data point, so first data point = 1, second data point = 2 etc. In Excel, we can write this as:

=ROW(A$4)+MAX(0,ROW(1:1)-$D$1)

Again, the “To” cell is simply <This Row>

So we can again use the INDIRECT function to build the range over which we wish to compile the average:

=AVERAGE(INDIRECT(“A”&ROW(A$4)+MAX(0,ROW(1:1)-$D$1)&”:A”&ROW(A4)))

Central Rolling

Another type of rolling average often used is the central rolling average. The other types above have all averaged the data up to a certain point. In contrast, the central rolling average takes into consideration a number of values ahead as well as behind the central point. This has the effect of decreasing the lag of a significant jump in the data.

In terms of building this on the sheet, it’s important to note that the period of this type of rolling average has to be an odd number; there is the central value, plus x values ahead and x values behind. Thus it’s possible to express the period either in terms of the number of values on each side, x, or as the whole number, which equates to 2x + 1. For consistency with the previous methods, I’ll talk about the period being the whole range of values centred around the mid-point.

Again, to ensure no gaps at either end, it’s possible to ramp up to the period value in terms of entries used and correspondingly “ramp down” at the end. The caveat being as before that the same smoothing value is therefore not achieved across the whole data set.

Here, the start cell is obtained in a very similar way as the above. The difference being because the number of cells referenced before the central cell is less than half the period. In fact it’s the “(Period +1) divided by 2, rounded down”
There are two obvious methods of rounding down with Excel functions, here I’m using TRUNC as all values will be positive (see the Excel help for a good description of choosing an appropriate rounding function)

=ROW(A$4)+MAX(0,ROW(1:1)-TRUNC(($D$1+1)/2))

The To cell is built in a similar way, but capping the result to the final cell:

Just been looking at the RANK and RANK.EQ functions and noticed a horrible discrepancy between what the help file states and what Excel lets you do.

The general form of the Rank functions (RANK, RANG.AVG and RANK.EQ) is

=RANK.EQ(number to look up, ref to look it up in, search order[optional])

For the second parameter, ref, the help documentation states:

Ref Required. An array of, or a reference to, a list of numbers.

All well and good, says I can use an array. However, trying this proves otherwise. The following formula won’t even be accepted in XL2010, it doesn’t throw a #VALUE or #N/A error:

=RANK(2,{1,2,3,4,5})

Weirdly enough, you can use an array for the first parameter. If A1:A5 contain 5, 4, 3, 2 and 1, the following:

=SUM(RANK({1,2},A1:A5))

when array entered, returns 9 as you would expect, the sum of 4 and 5.

This also means that any function returning an array isn’t permitted. =RANK(2,ROW(1:5)) refuses to be accepted.

That’s not to say that you can’t use any function as the ref parameter; if it returns a reference, no problem. It’s perfectly fine to do this:

=RANK.EQ(2,CHOOSE(2,A1:A10,B1:B10,C1:C10))

This all makes the RANK functions fairly useless when building array formulae. There are workarounds using other nested functions, but it would have been really nice to have this working as specified in the manual.

Be sure to read the previous entry to get up to speed before progressing with this post.

Overcoming the limitation of the INDEX / MATCH combination returning the same value twice if the frequency value crops up multiple times requires us to cheat a little and modify the value so that it becomes different to any other. We’re fortunate in this case that the output of the FREQUENCY function will always be a whole number. A single value can’t occur five-and-a-half times in a particular range. This means we’ve got all the space after the decimal point to play with and still keep the important part of the data, the whole number, intact.

Below is the original data with the result of the FREQUENCY function next to it in Column B. The adjacent column is the result of the same function, but with the ROW number of the cell attached after the decimal point:

We do this with a formula of =(FREQUENCY(A2:A29,A2:A29)&”.”&ROW(A2:A29))+0
We use the ampersand operator to concatenate a decimal point plus the row number. As the decimal point is added as text, the output will be of type text. The addition of zero at the end forces Excel to convert back to a number.

Take a look at D2: the value of 2 has a frequency of 6 and this data is in row 2, hence the final result of “6.2”

However, also look at cell D20. The data entry for 6 has a frequency of 6 and is in row 20. This gives a result of “6.20” – a big problem considering that 6.2 and 6.20 aren’t going to be treated as different numbers when it comes to comparison; they’re both simply 6.2. Leaving the values as this will result in the same problem as before since MATCH isn’t going to differentiate between them. So how do we force a difference between 6.2, 6.20 and 6.200?

We can get round this little conundrum by also concatenating a fixed number of zeros and another digit before adding the zero to force the type conversion back to a number. Cunning eh?
So by changing the formula to:
=(FREQUENCY(A2:A29,A2:A29)&”.”&ROW(A2:A29)&”00000001″)+0

…we get the following output:

Now compare D2 and D20. These values are:
6.200000001 and
6.2000000001

…which are obviously different. The length of the added string has been carefully judged. The worst case scenario (In XL2007 and later) would be a clash between rows 1 and 1 million. This number ensures that this wouldn’t cause a problem.

Right? Well, not quite. Try this and you’ll see it results in an error.

The issue with FREQUENCY()

One intrinsic property of the FREQUENCY function is that the returned array is 1 row bigger than the range specified in the bins input (the second value passed in the function.) This extra value is the count of items not fitting into any of the bins specified. In examples such as this, where we’re analysing a range against itself, this final value will always turn out to be zero. This catches many people out when using FREQUENCY inside other array formulae as the extra row is often not or mis-handled throwing #N/A errors across the entire output. All the arrays must be of the same size when running them against each other in Array formulae.

So we need to beef up the other, now too short, array that’s causing the issue. Since FREQUENCY(A2:A29,A2:A29) will return an array 1 bigger than the inputs, it’s the subsequent ROW function that needs to be upped by 1 to measure up.

Final Final Formula!

The FREQUENCY and ROW issue crops up twice in the formula, so tweaking both of these results in a final formula of:

Caveats

If there are shared frequency values, these will be unsorted in the final output, and you won’t necessarily know that there is a tie-break situation

If the number of tied results takes the count above whatever your n value is, you won’t see them all.

Both of these issues can be mitigated somewhat by performing a reverse lookup to see what the frequency value is for each of the results. If it’s the same for all the last few values, it’s worth doing a little more digging. This can be achieved by the following formula:

=INDEX(FREQUENCY(A$2:A$29,A$2:A$29),MATCH(M2,A$2:A$29,0)) [CTRL+SHIFT+Enter] and copy down for all the results.

If you want to have a play around with the file I used to make this entry, you can download it here.

I came across a problem a short while ago of how to determine the top 5 most common numbers (though in fact the data could be anything) in a range. Excel handily provides the “MODE()” function for determining the most common entry, but no more than that. Newer versions of Excel feature the “MODE.MULT” function, but this only returns multiple values that share the top spot, not the successors. Unfortunately, you can’t use the syntax of MODE(<range>,n) as you can for say LARGE()

I wanted to share this one as it provides a good example of the general process used to build a complex array formula.

The exact method for this process had eluded me for some time – I was able to come up with a basic version within a few minutes, but working round a major flaw took some thinking.

Here’s some example data we’re going to work from:

I’ve chosen the exact order of the numbers above to trigger some of the “gotcha”s that caused me the difficulty. The first column contains the 28 data points we’re analysing. The 2nd column is the result of the FREQUENCY function entered as an array running the input range as both the input and the bins (=FREQUENCY(A2:A29,A2:A29).)
This returns an array of the count of each entry on and only on the first time any particular value crops up. Any subsequent values return a zero. As an aside, this is the behaviour exploited for the original ‘count unique’ formula mentioned in an earlier post.

So far so good. Onto building the formula.

The way I, and many others go about working out a complex array formula is to break it into the constituent parts, and then piece them all together at the end bit by bit. Take a look at the Frequency column in the data above; you’ll see that in this data set, there are 6 twos, 4 threes, 2 fours, 1 five, 7 nines and 6 sixes. A total of 6 discrete original data values, and we want to know the top 5. A cursory glance over this small data set tells you that this is all of them apart from that single five. Five is indeed the loneliest num…. never mind.

So for the next stage of the build, we want to know the 5 highest frequency entries. We run a LARGE function over the frequency values to find these.

This shows us as the array result of a formula that the 5 highest frequencies are the nine, a couple of sixes, a four and a two, which agrees with the cursory glance from before. Perfect.

The next step here is to use these results and then perform a lookup to find out the original data values that caused them. A nested INDEX/MATCH combination does the job nicely:

=INDEX(A2:A29,MATCH(G2:G6,B2:B29,0))

The final stage is to conglomerate all the helper columns into one final formula so that it refers only to the original data. Working right to left from the helper columns, replace anything referencing one of the helper build columns (B, G and H in this example) to the equivalent to its left.
So here, the INDEX/MATCH combination has two references that aren’t the original data in Column A, G2:G6 and B2:B29. Replace these with the formula from these ranges.

G2:G6 refers to our LARGE formula: =LARGE(B2:B29,ROW(1:5))
and B2:B29 refers to the FREQUENCY formula: =FREQUENCY(A2:A29,A2:A29)

These substitutions result in a formula of:
=INDEX(A2:A29,MATCH(LARGE(B2:B29,ROW(1:5)),FREQUENCY(A2:A29,A2:A29),0))

This substitution has resulted in one other reference to a helper column, B2:B29. Run through and replace this with its respective formula:
=INDEX(A2:A29,MATCH(LARGE(FREQUENCY(A2:A29,A2:A29),ROW(1:5)),FREQUENCY(A2:A29,A2:A29),0))

And there we have it, the process of building a complex array formula! The secrets are:

Use helper columns to get you to a place where you’ve got the intended result

Working backwards (right to left) across the helper columns, substitute in relevant sub-formula until you are only referencing the original data

Eagle eyed viewers may have noticed that this formula doesn’t quite give the final result we’re after. Where there are two values with the same frequency, the MATCH function picks up the first found original value for both (2 in this case). Stay tuned for the next post where I’ll talk about how to get round this limitation.

In building formulae, there comes a time when you need to reference a series of numbers in an array.

A common way of doing this is to use the ROW() function. At its most basic, this function returns the row number for an address or range provided. For example:

=ROW(B6)

will return “6” representing that the row number of B6 is of course 6. The related function, =COLUMN(B6) would have returned “2” representing that B is the second column.

If we instead array-enter the formula with CTRL-SHIFT-ENTER and use a range for the input of the function, the returned array will contain all the rows referenced within the range.

As you can see, extending past the maximum results in an #N/A error.

Effectively, we can use this to create an array of all the integers between two values. e.g. ROW(3:9) would return an array of {3,4,5,6,7,8,9}. For a modified series, e.g. a series of square numbers, we simply operate on the returned array. For example, to calculate the sum of all the squares of 1 to 100 (12 + 22 + 32 + 42 + 52 …etc)

=SUM(POWER(ROW(1:100),2)) <array entered > does the job and returns 338350

Similarly, Archimedes’ convergent series ( 1/4 + 1/16 + 1/64 + 1/256…) can be shown (for the first 10th terms) with the following formula: =SUM(4^-(ROW(1:10))) <CTRL-SHIFT-ENTER>

Variably sized arrays

This is all very well when we know the size of the array to create, but what if we don’t? What if we want to change the size of the array based on, say, the number of items in a list.

We can do this by combining the ROW() function from above with the ROWS() and INDIRECT() functions as follows:

=ROW(INDIRECT(“1:”&ROWS(<Range>)))

This builds an array of the row numbers from 1 to the number of rows contained within <Range> using the INDIRECT function to convert a text string to a fully reference-able range. For instance, if the range is B2:B12:

This process has many uses when creating adaptable formulae. A simple example though would be to create a sorted array of a set of numbers using the SMALL() function by taking the created array as the list of index positions.

i.e. to sort the range A2:A12:

=SMALL(A2:A12,ROW(INDIRECT(“1:”&ROWS(A2:A12)))

While this may come in handy in places where you can’t filter in place, for example on a dashboard sheet, there are doubtless many other ways of achieving the same goal. Note that INDIRECT is a volatile function and as such every time you make a change on the sheet, this will trigger the recalculate flag. The knowledge is far more useful when it comes to building other formulae which I’ll delve into in later posts.

If you deal with large amounts of data in Excel on a regular basis, chances are that at some point you’ve had to deal with managing duplicate entries within it. My last post dealt with counting up how many discrete entries you had in your list — now we’ll talk about identifying which entries have identical siblings elsewhere.

Excel helpfully has a method for filtering out duplicates: Data – Filter – Advanced Filter, and tick the Unique Records Only option. In Excel 2010, this is now fairly nifty and you can filter records in place and just keep 1 of each value, much improved on the Excel 2003 facility. 2010 also has a dedicated “Remove Duplicates” option which allows for validation of duplication across multiple columns.

This might be all you need, and if so fine, but if you want to know what’s a duplicate and what isn’t either just for reference or to know what’s going to be deleted, there are varying methods you can use to achieve this.

Method 1 – simple and quick

Ideal if: you are free to re-arrange the order of your data and you can create a helper column

In Excel 2003, this was my preferred method for identification and removal of duplicates. Create a column somewhere alongside the column you’re using to check for duplicates, like so:

Next, apply autofilter to the table (shortcut ALT+D,F,F) and arrange by the column you want to use as the basis for the duplicates. Here, I’m doing that on the first column.

I’ve also typed a formula into the first entry of the helper column and copied it down. That formula is “=B2=B1” where B2 is the first entry in the list and B1 is the header row. Provided your header row isn’t the same as the first entry, this will always return FALSE, we’re simply checking to see if the entry is the same as the one above it:

Any value of “TRUE” means the entry is a duplicate. You can set the autofilter on the helper column to just display the TRUE values and delete them if need be. With practice, you can accomplish these steps and have your duplicates removed in about 15 seconds. Simply delete the helper column after you’re done.

Method 2 – without rearranging

If you can’t get away with re-arranging the data, or you want to be able to perform other re-arrangements while still keeping the duplicate checking, you’ll need another method. This method can either be performed on a helper column, or on the conditional formatting to highlight those entries that are duplicates.

The example formula to use in this case is:

=MATCH(<Cell>,<Column>,0)<>ROW(<Cell>)

e.g. =MATCH(B2,B:B,0)<>ROW(B2)

This is then copied down to all the cells in your helper column:

This formula performs a match of your data entry against the whole column and compares it to the

Note that this (and the following) formula require that there is no additional data in the column above the range you’re checking and that any header text is not repeated in the data.

Method 3 – Slight alternative

Another goal you might want to achieve is to identify those entries which have a duplicate entry somewhere further down the list.

For this, we can take a similar approach to the above and say “If the count of this entry is greater than 1 AND this is the first match, return TRUE”:

=AND(COUNTIF(<Column>,<Cell>)>1,MATCH(<Cell>,<Column>,0)=ROW(<Cell>))

e.g. =AND(COUNTIF(B:B,B2)>1,MATCH(B2,B:B,0)=ROW(B2))

This was actually the formula that led me to the faster “count distinct” method from my last post. As this and all the above formulae return TRUE/FALSE values, they can be used for conditional formatting as well. In the image below, the above formula is in both the helper column and on the conditional formatting on column B:

So here is a visual alert to the fact that “This entry is duplicated further down your list”. You could even combine this with Method 2, one to point out the originals, one to point out the duplicates.

Welcome. Today, a post on boosting performance with counting distinct values in a range.

The official Microsoft help file gives a number of methods for determining the number of distinct (sometimes referred to as unique, but more on that in a sec) cell values in a particular range. These vary, according to whether you want to count:

Just numbers

Numbers and text

Including Blank Cells… etc

The method I see in most frequent usage is the one I call the “Sum if frequency match” array formula. This follows the pattern of:=Sum(If(Frequency(Match(<Range>,<Range>,0),Match(<Range>,<Range>,0))>0,1,0))
Entered with Ctrl+Shift+Enter to force calculation as an array formula.

This will result in the total number of distinct1 values, numbers, text, results of other formulae2 and blank cells for the <range> provided.

For instance, if you have a range encompassing the following data:

A

AB

ABC

1

AB

ABC

A

The formula will return a value of 4, for each of the distinct values: A, AB, ABC and 1.

Explanation

The formula works because of the following:

Matching a range with itself within an array formula returns an array of the position of the first match for each value. In the above example, this would return an array of {1,2,3,4,2,3,1}. The second time that the values, AB, ABC and A come round, they result in 2,3 and 1, the positions that they first occurred.

Performing a FREQUENCY() on this set of data against itself (incidentally not strictly the original way Frequency() was meant to be used) returns an array for the number of times the first occurrence appears in the match sets. Again, for the above data, this returns {2,2,2,1,0,0,0} – that is, there are two “A”s, two “AB”s, two “ABC”s, one “1” and then all the remaining values have already appeared before.

The remaining 2 outside functions simply count how many times in the above that there are numbers greater than zero. You can clearly see why the final result is 4.

This is all well and good… until you want to know the same result for a dataset with thousand and thousands of values. This method then tends to take a performance hit and get quite slow. This is especially noticeable if you’re running many of these type formulae on 1 sheet.

The Alternative

Thankfully, I found a way of performing it much faster. I was aware that simple MATCH functions are pretty quick – it seems to be the FREQUENCY function holding everything back. To that end, with 1 caveat, the following function calculates the same result in a much shorter time frame:

=SUM((MATCH(A1:A100,A:A,0)=ROW(A1:A100))*1)
Also array entered with Ctrl+Shift+Enter

Here, we perform a variation of matching a range with itself, and match it with the entire row. Thus any matches will return the row number of the first match. Then we check, if that result is equal to the row number then it’s indeed a first match and can be counted as distinct. If it’s a subsequent match, this check will return a FALSE value and won’t be counted. The “*1” simply changes the TRUE/FALSE values into 1s and 0s to be summed.

Speed Improvements

I’ve found that this method typically runs in under half the time of the original Microsoft helpfile suggested method, typically around 40% of the original time and the formula is also pretty easy to remember. The speed benefits no doubt derive from the fact that it runs just one MATCH(), a comparison, a simple multiplication and a SUM.

Caveat and Workaround

Because you’re comparing a range with the entire column that it’s in, there can be a problem if the range you want to check isn’t at the top of your sheet. If any of the terms in your range also appear in the same column above the range, they won’t get counted; the first match will be less than even the first entry of the range and so the check will always return false. This normally shouldn’t be a problem, but it’s something to keep in mind. It’s also worth pointing out that this means your header name should not appear in your data list.

If this is a problem, we need to make a slight modification to the formula so that you’re comparing with the relative row numbers of the range, rather than the entire column. This can be done as so:=SUM(((MATCH(<Range>,<Range>,0)=ROW(<Range>)-MIN(ROW(<Range>)-1)))*1)

You lose the ease of memorisation, but retain the speed increase.

Points

1 Many people use the term “Unique” for this, but this could be confused with asking “How many times are there values that appear just once (are unique) in a range?” (working out how to do just that will be the topic of a future post). Distinct is a less ambiguous term that users of SQL will be all too familiar.

2 Be careful if your range includes formulae that output numbers as text. Numbers formatted as text and numbers stored as numbers are not equivalent and will be counted as two distincts. This may or may not be the result you’re after. A similar warning goes for numbers formatted as dates or anything else. If one cell contains the number ‘40706’ and another contains the date “12-Jun-2011”, these will count as 1 distinct value as both are stored as ‘40706’