def itemset_1_frequent(df, min_supp):
    sum_column = df.sum()
    result = []
    for index, count in sum_column.items():
        if count / df["0"].count() > min_supp:
            result.append(set([index]))
    return result

itemsets = {}  # dict of lists of sets
itemsets[1] = itemset_1_frequent(df, MIN_SUPP)
print(itemsets)

def apriori_gen(itemsets):
    # simple approach, not the one in the paper
    # it works but does not seem as optimized as the paper's approach
    items = set()
    for x_itemset in itemsets:
        for item in x_itemset:
            items.add(item)
    C_k = set()
    for x_itemset in itemsets:
        for item in items:
            if item not in x_itemset:
                C_k.add(frozenset([*x_itemset, item]))
    # prune: remove candidates that have an infrequent (k-1)-subset
    elt_to_remove = set()
    for x_itemset in C_k:
        for elt in x_itemset:
            subset = set([*x_itemset])
            subset.remove(elt)
            if subset not in itemsets:
                elt_to_remove.add(x_itemset)
    for remove in elt_to_remove:
        C_k.remove(remove)
    return C_k

apriori_gen([{'1','2','3'}, {'1','2','4'}, {'1','3','4'}, {'1','3','5'}, {'2','3','4'}])

Out[5]:

{frozenset({'1', '2', '3', '4'})}

In aPriori, there are three ways of establishing that an itemset is infrequent. It could be because:

it is not generated (in the first part of apriori_gen)

it is pruned (second part of apriori_gen, based on the fact that if the itemset {beer} is infrequent, the itemset {beer, milk} is infrequent too; a sketch of this prune follows the list)

it does not have minimum support (we count occurrences in the data)
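For instance, here is a minimal sketch of that prune logic; survives_prune and the toy itemsets are mine, not part of the notebook's code:

from itertools import combinations

# a candidate of size k survives the prune only if every (k-1)-subset
# of it is already a known frequent itemset
def survives_prune(candidate, frequent_prev):
    return all(frozenset(sub) in frequent_prev
               for sub in combinations(candidate, len(candidate) - 1))

frequent_2 = {frozenset({'beer', 'chips'}), frozenset({'chips', 'milk'})}
print(survives_prune({'beer', 'chips', 'milk'}, frequent_2))
# False: {'beer', 'milk'} is not frequent, so {beer, chips, milk} is pruned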

With the approach here, the number of generated itemsets is greater than with the one in the paper, leading to a drop in performance.
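To make that difference concrete, here is a small sketch of mine, reusing the example itemsets from the apriori_gen call above, that counts the candidates each strategy generates before pruning:

from itertools import chain

L3 = [{'1','2','3'}, {'1','2','4'}, {'1','3','4'}, {'1','3','5'}, {'2','3','4'}]

# naive generation: extend every itemset with every item seen anywhere
items = set(chain.from_iterable(L3))
naive = {frozenset(s | {i}) for s in L3 for i in items - s}

# paper-style join: merge itemsets whose first k-1 sorted elements match
sorted_L3 = [sorted(s) for s in L3]
joined = {frozenset(a + [b[-1]]) for a in sorted_L3 for b in sorted_L3
          if a[:-1] == b[:-1] and a[-1] < b[-1]}

print(len(naive), len(joined))  # 5 vs 2 candidates before the prune step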

def aPriori():
    itemsets = {}  # dict of lists of sets
    itemsets[1] = itemset_1_frequent(df, MIN_SUPP)
    last_itemset = itemsets[1]
    itemset_size = 1
    while last_itemset:
        itemset_size += 1
        candidates = apriori_gen(last_itemset)
        candidates_count = {}
        for index, row in df.iterrows():
            # print(row)
            local_candidates = subset(candidates, row)
            for local_candidate in local_candidates:
                candidates_count[id(local_candidate)] = candidates_count.get(id(local_candidate), 0) + 1
        # check minimum support (use .get so a candidate never seen in any row does not raise a KeyError)
        for candidate in candidates:
            if candidates_count.get(id(candidate), 0) / df["0"].count() > MIN_SUPP:
                itemsets.setdefault(itemset_size, []).append(candidate)
        last_itemset = itemsets.get(itemset_size, 0)
    return itemsets

print(aPriori())

# let's print the time it takes with this approach
start_time = time.time()
aPriori()
print("APriori took {} seconds".format(time.time() - start_time))

In 1994, database accesses were costly. Today, a dataset of this size fits in RAM. So instead of taking each row and extracting subsets, we can take each itemset and count its number of occurrences. Theoretically, the complexities are the same.

In [10]:

def aPrioriRAM():
    itemsets = {}  # dict of lists of sets
    itemsets[1] = itemset_1_frequent(df, MIN_SUPP)
    last_itemset = itemsets[1]
    itemset_size = 1
    row_list = []  # each row becomes a frozenset of the items it contains
    for index, row in df.iterrows():
        elements = []
        for header, value in row.iteritems():
            if value == 1:
                elements.append(header)
        row_list.append(frozenset(elements))
    while last_itemset:
        itemset_size += 1
        candidates = apriori_gen(last_itemset)
        for candidate in candidates:
            count = 0
            for row in row_list:
                if candidate.issubset(row):
                    count = count + 1
            if count / df["0"].count() > MIN_SUPP:
                itemsets.setdefault(itemset_size, []).append(candidate)
        last_itemset = itemsets.get(itemset_size, 0)
    return itemsets

start_time = time.time()
print(aPrioriRAM())
print("APrioriRAM took {} seconds".format(time.time() - start_time))

It is better! In fact, this version uses Python sets, and set operations are more performant in Python (https://stackoverflow.com/questions/27674289/the-complextiy-of-python-issubset) than what I did previously. Indeed, it is faster to check whether item names are in another set (O(1), depending on collisions in the hash table) than to do it on a pandas Series (O(n)) like I did previously with this line:

if not row[elt]
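To illustrate the gap, here is a rough micro-benchmark sketch; the 1000-column toy row and its labels are made up for the illustration, not taken from my dataset:

import timeit
import pandas as pd

# a toy row with 1000 item columns, all set to 1
row_series = pd.Series(1, index=[str(i) for i in range(1000)])
row_set = frozenset(row_series.index)

# membership test on a plain set vs a label lookup on the pandas Series
print(timeit.timeit(lambda: '500' in row_set, number=100000))
print(timeit.timeit(lambda: row_series['500'] == 1, number=100000))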

In [11]:

# we redefine apriori_gen in order to be closer to the paper
# this function needs to keep the order of elements in sets!
# we could try an OrderedSet, but we don't have that by default in Python
# we will just convert the itemsets to sorted lists
def apriori_gen(itemsets):
    sorted_itemsets = []
    for itemset in itemsets:
        sorted_itemsets.append(sorted(list(itemset)))
    C_k = set()
    for itemset1 in sorted_itemsets:
        for itemset2 in sorted_itemsets:
            if itemset2[:-1] == itemset1[:-1] and itemset2[-1] != itemset1[-1]:
                C_k.add(frozenset(itemset1 + [itemset2[-1]]))
    # prune: remove candidates that have an infrequent (k-1)-subset
    elt_to_remove = set()
    for x_itemset in C_k:
        for elt in x_itemset:
            subset = set([*x_itemset])
            subset.remove(elt)
            if subset not in itemsets:
                elt_to_remove.add(x_itemset)
    for remove in elt_to_remove:
        C_k.remove(remove)
    return C_k

apriori_gen([{'1','2','3'}, {'1','2','4'}, {'1','3','4'}, {'1','3','5'}, {'2','3','4'}])

Out[11]:

{frozenset({'1', '2', '3', '4'})}

In [12]:

# Now we check if this new approach is better
start_time = time.time()
print(aPrioriRAM())
print("APrioriRAM took {} seconds".format(time.time() - start_time))

As you can see, my two implementations of apriori_gen produce the same result. Theoretically, the second should be better, because it generates fewer candidates in the first part. In practice, I don't have the database optimizations needed to make an efficient SQL query like in the paper, so my implementation is not optimal (a self-join in O(n^2)...).
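One way to cheapen that self-join, sketched below under my own assumptions (this is not how the paper does it), is to group the sorted itemsets by their (k-1)-prefix so that pairs are only formed within each bucket; the prune step is omitted for brevity:

from collections import defaultdict

def apriori_gen_grouped(itemsets):
    # bucket each sorted itemset by its first k-1 elements
    buckets = defaultdict(list)
    for itemset in itemsets:
        ordered = sorted(itemset)
        buckets[tuple(ordered[:-1])].append(ordered)
    C_k = set()
    # only itemsets sharing a prefix can join, so we pair within buckets
    for prefix, group in buckets.items():
        lasts = sorted(o[-1] for o in group)
        for i, a in enumerate(lasts):
            for b in lasts[i + 1:]:
                C_k.add(frozenset(prefix + (a, b)))
    return C_k  # prune step omitted for brevity

print(apriori_gen_grouped([{'1','2','3'}, {'1','2','4'}, {'1','3','4'},
                           {'1','3','5'}, {'2','3','4'}]))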