Sorting the the LookIn list and using Binary Search to compare the items in the LookFor list

Using linear search of the LookIn list for each item in the LookFor list

Create a Collection containing the LookIn list and check it for each item in the LookFor list

Create a Dictionary containing the LookIn list and check it for each item in the LookIn list

Use an already sorted LookIn List and Binary Search

Look for partial matches using InStr

Overview of the IsInList 2 UDF

The IsInList2 UDF is written as an array function returning an array of True/False. Its designed to be called as a multi-cell array function entered in the column next to the LookFor list, so that you can Filter for False to find all the items in the LookFor list that don’t exits in the LookIn list.

For simplicity the UDF is written assuming that both lists are Ranges with at least 2 items: so the first task is to get the values from the Ranges into variant arrays.
(I have ignored the optimisation of using MATCH directly on the Range).

Then the output array is created as the smaller of the calling cells and the the LookFor list

Then if its an exact match the data is either Sorted, added to a Collection or to a Dictionary

Then the function iterates through the LookFor list using the appropriate method subroutine and stores the result in the output array

The QuickSort Sub

Here is the code for a QuickSort of a variant containing an array. The Quicksort would be substantially faster if it was strongly typed and handled a vector rather than a 1-column matrix.

The Binary Search Function

For simplicity this function is also written to handle a variant containing an array, and would be faster if more strongly typed.
Its not designed as a UDF: its called from IsInList2.

Function BSearchMatch(LookupValue As Variant, LookupArray As Variant) As Boolean
'
' use binary search to find if lookupvalue exists in lookuparray
'
Dim low As Long
Dim high As Long
Dim middle As Long
Dim varMiddle As Variant
Dim jRow As Long
'
jRow = 1
BSearchMatch = False
low = 0
high = UBound(LookupArray)
Do While high - low > 1
middle = (low + high) \ 2
varMiddle = LookupArray(middle, 1)
If varMiddle >= LookupValue Then
high = middle
Else
low = middle
End If
Loop
'
If (low > 0 And high <= UBound(LookupArray)) Then
If LookupArray(high, 1) > LookupValue Then
jRow = low
Else
jRow = high
End If
End If
If LookupValue = LookupArray(jRow, 1) Then BSearchMatch = True
End Function

The Exact Linear Search Function

LMatchExactV is not designed as a UDF: its called from IsInList2.

Function LMatchExactV(LookupValue As Variant, LookupArray As Variant) As Boolean
'
' use linear search to find if LookupValue exists in LookupArray
' LookupArray must be a a 2-dimensional variant array of N rows and 1 column
'
Dim j As Long
'
LMatchExactV = False
For j = 1 To UBound(LookupArray)
If LookupValue = LookupArray(j, 1) Then
LMatchExactV = True
Exit For
End If
Next j
End Function

The Partial Match Linear Search Function

This function use InStr to check for partial match. It is not designed as a UDF: its called from IsInList2.

Function LMatchInV(LookupValue As Variant, LookupArray As Variant) As Boolean
'
' use linear search and Instr to find if LookupValue is within any value in LookupArray
' LookupArray must be a a 2-dimensional variant array of N rows and 1 column
'
Dim j As Long
Dim strLook As String
'
LMatchInV = False
strLook = CStr(LookupValue)
For j = 1 To UBound(LookupArray)
If InStr(LookupArray(j, 1), strLook) > 0 Then
LMatchInV = True
Exit For
End If
Next j
End Function

The IsInList2 Array Function

Because the function uses the Dictionary Object from VBScript you need to add a reference to the Microsoft Scripting Runtime.

The Function takes 2 optional parameters which control the method used:

Performance Comparison

The performance tests are all done using ranges containing a character followed by 5 digit random integer numbers.

The timings are all in Milliseconds.
LookFor and LookIn give the number of rows.

Linear Search timings on smaller numbers of rows are slightly pessimistic because of a high % of miss-matches
BSearchOnly gives timings for Binary Search on already sorted data.
Partial gives timings for Linear Search with partial matching using InStr.

Because Excel VBA array formulas cannot return more than 64K rows you cannot use the function directly for more than 64K rows of LookFor without using more than one array formula.

LookFor

LookIn

Linear

QSortBsearch

Collection

Dict

BsearchOnly

Partial

2

2

0.29

0.29

0.29

0.46

0.29

0.29

50

50

4.42

4.14

4.05

4.06

3.91

4.43

200

200

12.2

5.59

5.08

4.40

4.49

14.2

500

500

55.5

8.9

7.3

4.9

5.66

68.0

2000

2000

824

27.0

19.9

8.03

12.2

995

50000

50000

–

800

515

252

279

–

50000

200000

–

2583

1290

1044

350

–

50000

1048756

–

14204

6720

23650

–

–

Overall the sequence from fastest method to slowest method is:

Dictionary

Binary Search Only (but the data has to already be sorted)

Collection

Quick Sort plus Binary Search

Linear Search

Partial Linear Search

Conclusion

The best method is to use the VBScript Dictionary object, unless you have more than 500K rows after which the Collection object starts to win.

The coding for Dictionary is very simple and it has an Exists method which Collection lacks.

Below about 2000 rows you probably won’t notice the difference between any of the methods except for Linear search.

So now I just have to try the C++ XLL version … I reckon that will beat Dictionary! (Yes, looks like XLL wins: 1923 millisecs for 50000 in 1048756)

I can now give myself a pat on the back, as I correctly picked the collection class for my 1million rows project! I must have the coding instincts of some sort of binary spiderman, I was biten by a radio active byte…

Well truth be told, I would have liked to use as Dictionary, but I wanted to use as little “none native” as possible… although in the same app I’ll have to install an .OXC, so I’m not sure what sense that makes!

Yes the only function written as a UDF is IsInList2 and there is no error checking at all: its all designed as a speed test rather than production code!!! So IsInList2 only works when fed Multi-cell ranges, and there is no checking whether the functions are given ranges or arrays or vectors or scalars etc etc. And the BSearch routine assumes a binary case-sensitive Quicksort rather than an Excel Sort, and it probably won’t work with Unicode etc.

Okay, my apologies…I didn’t understand until just now that BSearchMatch is NOT one of the tests, but instead is a function CALLED by one of the tests. Sorry for all the comments. Feel free to delete them.

Hi,
Interesting stuff. Yesterday I ran into your blog while I was looking what the Internet provides us with concerning the comparison of 2 lists by means of VBA Dictionary Scripting.
Hereunder is what I myself wrote down yesterday, but without making use of VBS.
It might be slightly off topic because I do not understand e.g. the meaning of an expression like “ranges containing a character followed by 5 digit random integer numbers”. I took 6 digit integers between 100000 and 999999. I placed them in columns 1 and 2 of an Excel 2007 spreadsheet and ran my script. I measured time from the moment on that all the numbers were read into the memory and ready for processing. I added some small particles of script to print the numbers that the 2 lists have in common, but treated them so far as comment.
1 Sub CompareLongListsByIndex()
2 Dim StartTime As Double
3 Dim arrLISTS As Variant
4 Dim A() As Byte, B() As Byte
5 Dim C() As Byte, D() As Long
6 Dim i As Long, r As Long, rr As Long
7 Dim min As Long, max As Long
8 Dim t As Variant
9 min = 100000
10 max = 999999
11 r = 0
12 arrLISTS = Sheets(“SORT”).Cells(1, 1).CurrentRegion
13 StartTime = Timer
14 r = UBound(arrLISTS, 1)
15 ReDim A(min To max)
16 ReDim B(min To max)
17 For i = min To max
18 A(i) = 0
19 B(i) = 0
20 Next
21 For i = 1 To r
22 A(arrLISTS(i, 1)) = 1
23 B(arrLISTS(i, 2)) = 1
24 Next
25 Set arrLISTS = Nothing
26 ReDim C(min To max)
27 ‘ReDim D(0 To r)
28 For i = min To max
29 C(i) = A(i) * B(i)
30 j = j + C(i)
31 ‘D(j) = D(j) + i * C(i)
32 ‘Debug.Print j ‘; D(j)
33 Next
34 t = Format((Timer – StartTime) * 1000, “00.000000000000”) & ” ms”
35 Debug.Print t; j
36 End Sub

Of course it is no problem if you remove my post because of offtopicness.
If it is on topic, I might post an improved script that is faster for the larger lists and does not have the limitations mentioned above.
Regards,

Hi Stephan,
Thats an interesting approach.
To compare with the code I posted could you change it so that:
– the numbers in the lists are prefixed by a non-numeric character
– an array of True/False is returned to the sheet containing the data (true for every item in the first list that appears in the second list, else false)
– the 2 lists do not have to be the same length
The time to beat is 2.2 seconds to compare a list containing 1 million items with another list containing 1 million items and return the corresponding 1 million true/falses.

Now I understand, that is to say, not exactly, but nevertheless how. It seems that the time spent on this operation is mainly dependant on the time consumed by printing the array with the results. Yet another question: the numbers with the prefixes may be split up in non-numerical data and numerical data to process and then to be joint again to present the result? If necessary I can give each number a different prefix as to show I am not deceiving. I will give it a try, because it is possible.

The code above takes about 3 times as much time to process the data as the code below does. I used VBA’s Timer to that, starting on line 24 of both code sets. The difference might be caused not only by the amount of data to be processed, but mainly, I suppose, by the double loop printed on lines 33 through 40 in the code above. Just one loop is possible at the cost of the use of an IfThenElse-construct, which on the average I consider to be slowlier than conditionally handling data by Boolean-constructs. The code above is almost the same as the so called Count Sort method. Its limits are soon reached in case the actual numbers have many digits or when you have to compare two lists that contain real text. The advantage is that the length of the LookFor list has hardly any consequences for the speed of data handling. In case the limits are surpassed, there are 2 main approaches if one wants to stick to the indexing method: split up both lists by their last digit: you get 10 pairs of lists to be treated as one pair is (this makes of course n sense for text) process the data according to the Radix Sort method, using the ASCII-code; as far as sorting data is concerned, there is hardly any limit to the length of the lists or the number of characters used for the individual items; this method is very attractive when you have to sort e.g. a very wide range of values, say 10^6 numbers of uneven length in terms of digits contained by an array starting at 1 end ending at 10^100; this can be done on almost any personal computer that has VBA or something like that, although it is recommendable to import your data to and export your data from a spreadsheet but preferably from and to e.g. Notepad.

Now I understand, that is to say, not exactly, but nevertheless how. It seems that the time spent on this operation is mainly dependant on the time consumed by printing the array with the results. Yet another question: the numbers with the prefixes may be split up in non-numerical data and numerical data to process and then to be joint again to present the result? If necessary I can give each number a different prefix as to show I am not deceiving. I will give it a try, because it is possible.

Thanks for MicroTimer. Took me from confusion to clarity in 10 lines or less. Just loop 10000 times and divide? Ever hear of a random walk ? Lots of bogus variability in times in consequence. And why wait 10 min to get what MicroTimer tells you instantly.?