A handful of posts ago I looked at some FREQUENCY() formulae which could give a count of distinct or unique values in a given column. The formulae became quite complicated and slow – particularly when dealing with mixed data types – so, in this post, I’m going to have a go at creating some VBA User Defined Functions to do the same thing. A UDF is simply a custom function (written in any language) which is called from a Range or, in other words, used in a worksheet formula.

Before I continue, let me clarify the difference between what I call a ‘distinct’ count and a ‘unique’ count. Suppose we have a list as follows:

a, a, b, b, c, d, e, e, f

From this list, I would say there are 6 distinct values: a, b, c, d, e and f. I use the word ‘distinct’ to mean ‘different’ – the same as it is used in the SQL world .

From this list, I would say there are 3 unique values: c, d and f. The unique values are the ones which appear exactly once in the list.

The two VBA approaches I’m going to look at are:

A Collection object

A Dictionary object

As a starting point I’m going to create some basic UDFs which give a distinct count using each of these approaches. The distinct FREQUENCY() formula I’m going to compare those UDFs against is:

This formula was broken down and explained in detail in my other post, so please have a read through that if you’re not sure how it works. I’ll design the basic UDFs in a somewhat contrived fashion so that they give the same results as this formula and then I’ll do some benchmark testing to see which one is faster to calculate.

Download Example Workbook

Basic Distinct Count Using A Collection Object

Public Function COUNTDISTINCTcol(ByRef rngToCheck As Range) As Variant
Dim colDistinct As Collection
Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long
On Error GoTo ErrorHandler
varValues = rngToCheck.Value
'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then
Set colDistinct = New Collection
For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)
varValue = varValues(lngRow, lngCol)
'ignore blank cells and throw error
'if cell contains an error value
If LenB(varValue) > 0 Then
'if the item already exists then an error will
'be thrown which we want to ignore
On Error Resume Next
colDistinct.Add vbNullString, CStr(varValue)
On Error GoTo ErrorHandler
End If
Next lngCol
Next lngRow
lngCount = colDistinct.Count
Else
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If
COUNTDISTINCTcol = lngCount
Exit Function
ErrorHandler:
COUNTDISTINCTcol = CVErr(xlErrValue)
End Function

In summary, each item in the Collection must have a unique key and that unique key must be a String. If the code tries to create a duplicate key then an error is raised. The error is ignored due to the On Error Resume Next directive. A feature of the VBA Collection object is that the key is case-insensitive.

The LenB() function is used to check if the cell is blank; if the cell contains an error value then an error will be raised at this point and the UDF will return #VALUE!. I’ll rant about the confusion in Excel between empty and blank cells in a future blog post (::grin::) but when I say blank I mean the cell either contains:

absolutely nothing (it is empty)

a zero length string

only a cell prefix character (usually ')

Basic Distinct Count Using A Dictionary Object

I’ve actually written two Dictionary object UDFs for benchmarking because there are a couple of ways the Dictionary object can be created and maintained. Firstly, a new Dictionary object could be created every time the UDF is called or, secondly, a Dictionary object could be created on the first call and then cleared (but not destroyed) by using the Dictionary.RemoveAll() method on subsequent calls.

The Dictionary class is part of the Microsoft Scripting Runtime library (scrrun.dll) and since I’ve used early binding in my code, a reference needs to be added to it in the VBA IDE under Tools > References.

And here’s the UDF which retains the existing Dictionary object after the first call and subsequently clears it. The Static keyword is used instead of Dim so that the Dictionary object reference is retained between function calls:

Public Function COUNTDISTINCTdicStatic( _
ByRef rngToCheck As Range) As Variant
Static dicDistinct As Scripting.Dictionary
Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long
Dim strValue As String
On Error GoTo ErrorHandler
varValues = rngToCheck.Value
'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then
If dicDistinct Is Nothing Then
Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = TextCompare
Else
dicDistinct.RemoveAll
End If
For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)
varValue = varValues(lngRow, lngCol)
'ignore blank cells and throw error
'if cell contains an error value
If LenB(varValue) > 0 Then
'cast everything to a string
'so dictionary is not type sensitive
strValue = CStr(varValue)
If Not dicDistinct.Exists(strValue) Then
dicDistinct.Add strValue, vbNullString
End If
End If
Next lngCol
Next lngRow
lngCount = dicDistinct.Count
Else
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If
COUNTDISTINCTdicStatic = lngCount
Exit Function
ErrorHandler:
COUNTDISTINCTdicStatic = CVErr(xlErrValue)
End Function

Checking The UDFs

First let’s verify that the UDFs all return the same result as the FREQUENCY() formula.

As you can see, all four formulae are type-insensitive (they don’t distinguish between text, numbers, logicals etc) and case-insensitive (for example, they don’t distinguish between "A" and "a" ).

Calculation Benchmarking

The two main factors when benchmarking these UDFs are the size of the precedent range and the cardinality (the uniqueness) of the values, so I tested them against a variety of range sizes using the two extreme scenarios: all the values being different and all the values being the same. Here are the results:

The FREQUENCY() formula performs the best when all the values are the same but becomes extremely slow when all the values are different; in fact, I gave up timing it beyond 10,000 cells.

There’s no material difference between the two Dictionary object UDFs which surprised me. I expected the static one to be faster until the number of items in the Dictionary from the previous call became very large, thus slowing down the RemoveAll() method.

All of the formulae are pretty fast when all the values are the same.

The Collection object UDF would generally seem to be the best overall choice.

However, the testing so far has been a little bit unfair on the Dictionary object UDFs. This is because I went out of my way to make all the UDFs return the same results and, in so doing, I converted all the cell values to String types to make the Dictionary UDFs type-insensitive and used a Dictionary.CompareMode of TextCompare to make them case-insensitive. These both hit the performance. If I make the Dictionary UDFs type-sensitive (no conversion to String types) and case-sensitive (BinaryCompare) then the benchmarking looks a little different:

Now the Dictionary UDFs are the fastest in both scenarios. Given the performance tests and the features of each UDF, my general choice would be to go with the Dictionary object. If you specifically want a type-insensitive and case-insensitive UDF then go with the Collection object.

Enhanced Distinct Count – Dictionary Object

I’ve enhanced the basic, static Dictionary UDF below to give it a bit more punch. One of the changes I’ve made is to reduce the precedent range if a whole column reference has been passed in. To do that I’ve used the Worksheet.UsedRange property which Charles Williams recently blogged about: it can become extremely slow if there are a lot of cells containing data or formatting on the worksheet (it’s not the size of the used range that matters) because it internally reads from the Cell table. Have a read through his blog post – you might decide that you would prefer to remove it or use an alternative approach.

Public Function COUNTDISTINCT( _
ByRef rngToCheck As Range, _
Optional ByVal blnCaseSensitive As Boolean = True _
) As Variant
Static dicDistinct As Object
Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long
On Error GoTo ErrorHandler
'minimise the precedent range in case of full column references
'for more information on performance of Worksheet.UsedRange
'see Charles Williams' blog:
'http://fastexcel.wordpress.com/2012/12/02/writing-efficient-udfs-part-11-full-column-references-in-udfs-used-range-is-slow/
Set rngToCheck = Intersect(rngToCheck.Worksheet.UsedRange, rngToCheck)
If Not rngToCheck Is Nothing Then
'assign cell value(s) into memory so they
'are faster to work with
varValues = rngToCheck.Value
'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then
If dicDistinct Is Nothing Then
Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = BinaryCompare
Else
dicDistinct.RemoveAll
End If
For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)
varValue = varValues(lngRow, lngCol)
'ignore error values
If Not IsError(varValue) Then
'ignore blank cells
'including formulae which return ""
If LenB(varValue) > 0 Then
'if we have a string then let's allow for case sensitivity
If VarType(varValue) = vbString Then
If Not blnCaseSensitive Then
varValue = UCase(varValue)
End If
End If
If Not dicDistinct.Exists(varValue) Then
dicDistinct.Add varValue, vbNullString
End If
End If
End If
Next lngCol
Next lngRow
lngCount = dicDistinct.Count
Else
'ignore if cell contains an error or is blank
If Not IsError(varValues) Then
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If
End If
End If
COUNTDISTINCT = lngCount
Exit Function
ErrorHandler:
COUNTDISTINCT = CVErr(xlErrValue)
End Function

Note the following points:

It will count number, text and logical data types but it ignores error values such as #N/A and DIV/0!.

It ignores empty (and blank) cells.

It is case-sensitive by default.

It differentiates between data types. For example, these two formulae would be considered to be different: =TRUE() ="True" , as would ="1" and =1.

A reference to the Microsoft Scripting Runtime library is required. If you do not include the reference then the Dictionary object needs to be declared as an Object type and BinaryCompare set as a constant equal to 0.

It can work on data across multiple columns.

Here are a couple of examples to validate the UDF’s results.

Enhanced Unique Count – Dictionary Object

The code to get a unique count is very similar.

Public Function COUNTUNIQUE( _
ByRef rngToCheck As Range, _
Optional ByVal blnCaseSensitive As Boolean = True _
) As Variant
Static dicDistinct As Object
Dim varValues As Variant, varValue As Variant, varItems As Variant
Dim lngCount As Long, lngItem As Long
Dim lngRow As Long, lngCol As Long
On Error GoTo ErrorHandler
'minimise the precedent range in case of full column references
'for more information on performance of Worksheet.UsedRange
'see Charles Williams' blog:
'http://fastexcel.wordpress.com/2012/12/02/writing-efficient-udfs-part-11-full-column-references-in-udfs-used-range-is-slow/
Set rngToCheck = Intersect(rngToCheck.Worksheet.UsedRange, rngToCheck)
If Not rngToCheck Is Nothing Then
'assign cell value(s) into memory so they
'are faster to work with
varValues = rngToCheck.Value
'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then
If dicDistinct Is Nothing Then
Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = BinaryCompare
Else
dicDistinct.RemoveAll
End If
For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)
varValue = varValues(lngRow, lngCol)
'ignore error values
If Not IsError(varValue) Then
'ignore blank cells
'including formulae which return ""
If LenB(varValue) > 0 Then
'if we have a string then let's allow for case sensitivity
If VarType(varValue) = vbString Then
If Not blnCaseSensitive Then
varValue = UCase(varValue)
End If
End If
'if it already exists then keep a counter on
'how many times it occurs
If dicDistinct.Exists(varValue) Then
dicDistinct.Item(varValue) = dicDistinct.Item(varValue) + 1
Else
'else add it with an occurence of 1
dicDistinct.Add varValue, 1
End If
End If
End If
Next lngCol
Next lngRow
'we're only interested in values which appeared exactly once
varItems = dicDistinct.Items
For lngItem = LBound(varItems, 1) To UBound(varItems, 1)
If varItems(lngItem) = 1 Then
lngCount = lngCount + 1
End If
Next lngItem
Else
'ignore if cell contains an error or is blank
If Not IsError(varValues) Then
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If
End If
End If
COUNTUNIQUE = lngCount
Exit Function
ErrorHandler:
COUNTUNIQUE = CVErr(xlErrValue)
End Function

Wrapping It Up

Using a well-written UDF is a good solution to this problem. The complications apparent in the FREQUENCY() formula are hidden away in the UDF’s code which allows for a simple formula. The calculation performance of the Dictionary UDF is superior to the FREQUENCY() formula, particularly when there are a lot of different values in the precedent range. The flexibility afforded by using code also allows for simple adjustments to enhance the features of the UDF, such as making it case-sensitive.

Are there other good ways to get a distinct (or unique) count? Yes, there are. Two good options would be to use the advanced filter or a pivot table to get a distinct list and then use a simple formula to count the items in that list. I’ve got a blog post planned for automating the advanced filter using VBA, so I’ll cover that in more detail then.

6 Responses to Count Distinct Or Unique Values – VBA UDF

Hi Colin. I did a post at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/ recently that did some testing of deduping using dictionaries vs other approaches that may be of interest. Among other things, it turns out that if you have lots of unique items, if you sort them first before you add them to the dictionary, things run considerably faster.