Using ColdFusion Structures To Remove Duplicate List Values

The other week, I posted a solution for Ray Camden's Friday Puzzler for Santa's list. It involved using the ColdFusion structure (struct) as a utility for cleaning lists. It worked quite nicely, so I thought I would quickly talk about it here. The beauty of the ColdFusion struct is that it is an object that maps keys to values. This mapping forces the struct to have only unique keys, as there would be no way to map two duplicate keys to two different values. We can leverage this to clean a list of duplicate values.

To demonstrate this, let's build a list that has duplicate values:

<!--- Create a group of values that contains duplicates. --->

<cfsavecontent variable="strListData">

Ghost Busters

What About Bob

Broken Flowers

What About Bob

Lost In Translation

Stripes

Scrooge

What About Bob

Lost In Translation

</cfsavecontent>

<!--- Create a list based on the list data. --->

<cfset lstMovies = strListData.Trim().ReplaceAll(

"[\r\n\t]+",

","

) />

This is a bit of a roundabout way to create a list, but I am doing it for readability (so you can really see the list). If we output the list above:

<!--- Break list on to different lines. --->

#lstMovies.ReplaceAll( ",", "<br />" )#

... we get:

Ghost Busters
What About Bob
Broken Flowers
What About Bob
Lost In Translation
Stripes
Scrooge
What About Bob
Lost In Translation

Now, we can loop over the list and add its items to a ColdFusion struct. Remember, due to the mapping nature of the struct, duplicate keys will overwrite each other, thereby removing the duplicates:

<!--- Create a structure to hold the movie titles. --->

<cfset objMovies = StructNew() />

<!--- Loop over the list to add mappings to the struct. --->

<cfloop index="strMovie" list="#lstMovies#" delimiters=",">

<!---

Store key/value pair. In this case, we don't

really care about the value (hence storing ""

as value). We only care about the key.

--->

<cfset objMovies[ strMovie ] = "" />

</cfloop>

<!--- Convert the now unique key values into a list. --->

<cfset lstMovies = StructKeyList( objMovies ) />

Now, outputting that list:

<!--- Break list on to different lines. --->

#lstMovies.ReplaceAll( ",", "<br />" )#

... we get:

Lost In Translation
What About Bob
Stripes
Scrooge
Ghost Busters
Broken Flowers

Notice that this time, the duplicates have been removed. Notice also that the order of the list has changed. That usually won't matter; if you need duplicates removed from a list, you probably shouldn't also be relying on its ordering.

I am not sure how this compares, performance-wise, to the loop/check strategy that constantly tests for duplicates. I almost want to say this would be faster, as there really is no logic involved: no CFIF statements, no conditionals. All we do is set values and then join them together. It's that easy.

One final note: this is not case sensitive. The structure, like most of ColdFusion, treats keys of mixed case as the same value. If you want to do this type of algorithm with case sensitivity, you can create a Java 2 Hashtable and use it in a similar way. I won't go into that here, though, as it is not something most of us really need to do (and if we did, perhaps we shouldn't be using lists in the first place?).
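For the curious, here is a minimal sketch of a case-sensitive variant. The note above suggests a Java Hashtable; I am swapping in java.util.LinkedHashSet, which is arguably a closer fit since we only care about keys (and it happens to preserve insertion order). The lstMovies variable is the list from the example above, and ArrayToList() is assumed to accept the Java array returned by ToArray(); on older engines you may need to loop it manually.

```cfml
<!--- Create a case-sensitive Java set. LinkedHashSet also preserves insertion order. --->
<cfset objSet = CreateObject( "java", "java.util.LinkedHashSet" ).Init() />

<!--- Add each list item; the set silently ignores exact (case-sensitive) duplicates. --->
<cfloop index="strMovie" list="#lstMovies#" delimiters=",">
	<cfset objSet.Add( strMovie ) />
</cfloop>

<!--- Join the unique values back into a list. --->
<cfset lstUnique = ArrayToList( objSet.ToArray() ) />
```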

Thanks! I've been looking for a way to speed up the creation of a unique 5-character ID list that I create for email tracking in one of our products. The best way I could find to keep the list unique was a string-contains check against the list, but your struct alternative is MUCH faster. I'm still checking my code, but here are some figures:

For 100,000 IDs:

- List/String: not going to run it now; I tried it in the past and it's around 20 minutes.
- Struct: 50,266 ms (WOW!)

I have it flushing counts every 100 ID attempts, and it pauses around 36,800 and 74,900 every time, so there is some type of conversion going on around those sizes. I noticed similar pauses occurring when using strings.

The algorithm is close to linear, instead of growing in time with the size of the list. I don't remember enough of that Big O notation stuff to explain it further.

Regular expressions are totally awesome, but they require a lot of parsing and processing. And the larger the string, the longer it takes, in what seems like a non-linear fashion... that's not from official testing, just what I have noticed.

Glad you like it. One thing to note as well: you can use StructKeyArray() if you need to clean an array rather than a list. This will simply return the struct keys as a one-dimensional array.
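A quick sketch of that array flavor, assuming a hypothetical arrValues array containing duplicate simple values:

```cfml
<!--- Map each array element to a struct key to collapse duplicates. --->
<cfset objUnique = StructNew() />

<cfloop index="intI" from="1" to="#ArrayLen( arrValues )#">
	<cfset objUnique[ arrValues[ intI ] ] = "" />
</cfloop>

<!--- Pull the now-unique keys back out as a one-dimensional array. --->
<cfset arrUnique = StructKeyArray( objUnique ) />
```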

The regular expression idea interests me. I'm curious about the efficiency of checking for duplicates across a list in a single regular expression search. It's not as efficient as structs for deduping, but if you need to find dupes, it may be faster than looping over every item in the list.

First you would need to sort the list so duplicates become adjacent, then run a regular expression such as "(^|,)([^,]+),\2(,|$)".

Just remembered, though, that REFind only returns the first match, so finding the next match would require some looping. Hmmm.

Regular expressions are totally awesome... but you want to avoid parsing a string too many times. I would guess that any kind of looping in conjunction with regex is going to be slower than using arrays and structs.

I'm not actually advocating a loop through each piece of data, but rather an REFind on the entire list string, using the delimiter in the regular expression to detect dupes. The benefit of a regular expression in this instance over an array loop is that you only loop as many times as there are matches, and you are able to start where you left off.

For example, with the regular expression, if there are no duplicates, you would do a single REFind, which would find nothing, and you would be done. If it found one, you would start another REFind from where you found the first dupe to look for the next.
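A rough sketch of that idea, assuming the pattern "(^|,)([^,]+),\2(,|$)" and that the list is sorted first so duplicates are adjacent; the variable names here are hypothetical, and this only detects duplicates rather than removing them:

```cfml
<!--- Sort the list so duplicate values become adjacent. --->
<cfset lstSorted = ListSort( lstMovies, "textnocase" ) />
<cfset lstDupes = "" />

<!--- Find the first duplicate pair (returnsubexpressions = true). --->
<cfset intStart = 1 />
<cfset objMatch = REFindNoCase( "(^|,)([^,]+),\2(,|$)", lstSorted, intStart, true ) />

<!--- Keep searching, restarting each time after the previous match. --->
<cfloop condition="objMatch.pos[ 1 ] GT 0">

	<!--- Group 2 (index 3 in pos/len) holds the duplicated item. --->
	<cfset strDupe = Mid( lstSorted, objMatch.pos[ 3 ], objMatch.len[ 3 ] ) />
	<cfset lstDupes = ListAppend( lstDupes, strDupe ) />

	<cfset intStart = (objMatch.pos[ 3 ] + objMatch.len[ 3 ]) />
	<cfset objMatch = REFindNoCase( "(^|,)([^,]+),\2(,|$)", lstSorted, intStart, true ) />

</cfloop>
```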

A loop over each item in the list or array, on the other hand, would require comparing every item in the list to every other item. That quickly loses efficiency, because the processing grows quadratically, which was the need for the struct workaround, of course.

Anyway, it's something to think about, and something I might play with when I have time.

Though the RE method may help us find duplicates, whether by telling us they exist or identifying which values are duplicated, the original purpose of this post was to REMOVE duplicates, not to find them.

For its purpose, this works well enough in comparison to the alternatives out there, which serve ONLY this purpose and nothing extra.

As a thought: should you find your data in a query object (say you pulled a list from your database and you only want the unique entries for a particular column where the database does not impose this constraint), a query-of-queries (QofQ) may work and scale well. The method here is linear, O(n) for those who know the notation; it is possible the QofQ method compares favorably on larger datasets, though any dedupe still has to examine every row.
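A sketch of that QofQ approach, assuming a hypothetical query qMovies with a title column:

```cfml
<!--- Use a query-of-queries to let the SQL engine dedupe the column. --->
<cfquery name="qUnique" dbtype="query">
	SELECT DISTINCT
		title
	FROM
		qMovies
</cfquery>

<!--- Convert the unique column back into a list if needed. --->
<cfset lstUnique = ValueList( qUnique.title ) />
```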

This is a great approach! Thanks for laying out the logic so clearly and cleanly. I remember relying on a UDF from cflib.org that also removed duplicates. It would loop through each list item and add it to a second list only if it hadn't been seen before; as the second list gets larger, the searching gets slower. The method shown here would definitely be faster, as others have already pointed out. Shoot, Adobe might as well take the code here and make a built-in function out of it. I can dream, can't I?

I would like to order the results based on the number of occurrences: if a result or particular ID shows up more than once, it would have a higher position in the array or list, and the ones that occur less often would follow behind.

I don't think that's what Jody was exactly looking for but at least it's close.

If there's an array with duplicates, I believe Jody wants them ordered from most to least occurrences? The easiest way is likely:

1. Create the "unique key" structure defined above, with each value = 0.
2. Loop over the array, incrementing the counter each time a value is seen.
3. Iterate the new structure, creating again ANOTHER structure where the keys are the numerical counters stored in struct one. When you have two keys with the same counter, make the value a list or an array of the appropriate text keys.

My solution was based on an earlier comment where Rich showed some code on how to create the unique-key struct and maintain a count of how many times each duplicate was seen. Either way, it looks like there are a couple of ways to get the same result.
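A minimal sketch of the counting idea, assuming a hypothetical arrIDs array; StructSort() then orders the keys by their numeric counts, most frequent first:

```cfml
<!--- Count how many times each value appears. --->
<cfset objCounts = StructNew() />

<cfloop index="intI" from="1" to="#ArrayLen( arrIDs )#">

	<cfset strKey = arrIDs[ intI ] />

	<!--- Initialize the counter the first time a value is seen. --->
	<cfif NOT StructKeyExists( objCounts, strKey )>
		<cfset objCounts[ strKey ] = 0 />
	</cfif>

	<cfset objCounts[ strKey ] = (objCounts[ strKey ] + 1) />

</cfloop>

<!--- Sort the keys by their numeric counts, most frequent first. --->
<cfset arrOrdered = StructSort( objCounts, "numeric", "desc" ) />
```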

<!---
	Create the list for the results. This list contains only the
	positions in the array, so we can then loop over the list for
	the appropriate results.
--->
<cfset finalResults_List = "" />

<!---
	Now that the scrubbed URLs are converted to a list, we can
	loop over them and compare them. The loop will append each
	match to the end of the list, which means the first one found
	ends up at the start.
--->
<cfloop from="1" to="#ArrayLen( results_scrub_url )#" index="curRes">

	<!---
		This loop allows us to loop over the remaining results to
		see if we have a match. If we have a match, we append the
		result's position to the list.
	--->
	<cfloop from="#(curRes + 1)#" to="#ArrayLen( results_scrub_url )#" index="checkRes">

		<cfif results_scrub_url[ curRes ] EQ results_scrub_url[ checkRes ]>
			<cfset finalResults_List = ListAppend( finalResults_List, checkRes ) />
		</cfif>

	</cfloop>

</cfloop>

No good :( It gave me a "500 internal server error" on CF8 when playing this trick with a list consisting of one element [37077015]. In all other cases it worked like a charm. I think I'll have to try the Java method instead :(

Hmm, I am not sure why this would error out with only one list item. I can't see anything that would cause this to require two or more list items. What was the exception that was actually raised?

@Paul,

Using the hash set in Java seems like a really cool approach. I believe that, under the hood, ColdFusion's struct actually extends a Map class (FastHashMap, if I remember correctly). It looks like the Set class tree is the more appropriate approach. Thanks for the link!

I believe the reason for the exception was not the list length but its items. 37077015 is a big numeric value, and making a structure key from it might cause some problems. The exception was a 500 internal server error followed by many encoded characters like "‹ ÅZ{sâ8 ÿ;SµßAÇ" :(
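If a large numeric value as a struct key really is the culprit, one possible workaround is to prefix each key with a letter so it is always treated as a plain string. This is only a sketch; the "x" prefix and the lstIDs variable are hypothetical:

```cfml
<cfset objIDs = StructNew() />

<cfloop index="strID" list="#lstIDs#" delimiters=",">
	<!--- Prefix with a letter so the key is stored as a plain string. --->
	<cfset objIDs[ "x" & strID ] = "" />
</cfloop>

<!--- Strip the prefix off each key when converting back to a list. --->
<cfset lstUnique = "" />
<cfloop item="strKey" collection="#objIDs#">
	<cfset lstUnique = ListAppend( lstUnique, RemoveChars( strKey, 1, 1 ) ) />
</cfloop>
```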

Hi Ben. I ran across this post trying to find a solution to my problem of cleaning duplicates from a 2D array. I can't convert the array to a list because it is not 1D. Do you have any suggestions on how to do this for 2D arrays? Thanks.

No, the widths are not consistent for each row. Right now I am not doing anything with the duplicate values; they're just sitting there being duplicates (lol). The problem is that I don't know how to get rid of them in the array. The array is getting duplicate values because I have two different queries populating it, each of which may return the same value, thus the duplicates. I think I am just going to have to figure out a way to combine the queries, or maybe a way to make it a 1D array, get rid of the dups, and then go get the other data I need using the array values.
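One possible approach, sketched under the assumption that each row of a hypothetical arr2D array is itself a one-dimensional array of simple values: serialize each row into a single string and use that as the struct key, which handles rows of varying widths.

```cfml
<cfset objSeen = StructNew() />
<cfset arrUnique = ArrayNew( 1 ) />

<cfloop index="intRow" from="1" to="#ArrayLen( arr2D )#">

	<!---
		Serialize the row into a single key. Chr( 31 ) is an
		arbitrary delimiter assumed not to appear in the data.
	--->
	<cfset strKey = ArrayToList( arr2D[ intRow ], Chr( 31 ) ) />

	<!--- Keep only the first occurrence of each row. --->
	<cfif NOT StructKeyExists( objSeen, strKey )>
		<cfset objSeen[ strKey ] = "" />
		<cfset ArrayAppend( arrUnique, arr2D[ intRow ] ) />
	</cfif>

</cfloop>
```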