In two days, this Kaggle competition will end. I took part in it because it
was the kind of competition I enjoy: the problem is offered as-is, as you
would find it in a real-world environment, meaning that building the dataset,
doing the feature engineering and making all the associated decisions are part
of the fun.

However, I didn't want to use machine learning. The competition forums are
already full of people training neural networks, so I tried another approach:
I just wanted to see how the data was structured, and then see if there was
some extremely low-hanging fruit I could pick to obtain something better than
the benchmark.

All the code I wrote can be found here. My current score on the public
leaderboard is 0.3598, which is not bad, all things considered. No models were
trained during this low-cost experiment. The README.md file in that folder
includes some notes I took during development; they may make sense.

Data was offered as a set of CSV files, so the first thing I did was to load
them into several SQLite tables, so I could run queries without reading the
whole thing into memory (I run everything on a 2007 laptop. Low-cost, I said.)
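A minimal sketch of that loading step (the file and table names are assumed to
mirror the competition CSVs; the actual script is in the repository):

import sqlite3
import pandas as pd

conn = sqlite3.connect('data/instacart.db')

# Stream each CSV into its own table in chunks, so the full file never
# needs to fit in memory. (Rerunning this would duplicate rows; it is
# meant as a one-off.)
for name in ('orders', 'order_products__prior',
             'order_products__train', 'products'):
    for chunk in pd.read_csv('data/%s.csv' % name, chunksize=100000):
        chunk.to_sql(name, conn, if_exists='append', index=False)

conn.close()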

What is being asked of us? To predict whether a certain user is going to reorder
a certain product. For this, we are given a history of orders per user (some
users have more, some fewer), the distance in time between them in relative
terms (we know an order happened 3 days after the previous one, but we don't
know whether it was November at the time) and some other information I didn't
use.
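Since only relative gaps are given, the closest thing to an absolute time axis
is a per-user cumulative sum of those gaps, which tells us how many days before
a user's latest order each past order happened. A sketch, using the orders
columns from the competition data (order_number, days_since_prior_order):

import sqlite3
import pandas as pd

conn = sqlite3.connect('data/instacart.db')
orders = pd.read_sql_query(
    "SELECT user_id, order_id, order_number, days_since_prior_order "
    "FROM orders ORDER BY user_id, order_number", conn)

# Days elapsed since each user's first order (the first gap is NULL).
orders['days_from_first'] = (orders['days_since_prior_order']
                             .fillna(0)
                             .groupby(orders['user_id'])
                             .cumsum())

# Flip it into "days before the user's latest order", which is what any
# recency- or season-based weight has to be computed from.
orders['days_ago'] = (orders.groupby('user_id')['days_from_first']
                      .transform('max') - orders['days_from_first'])

conn.close()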

We can do something a bit better: let's compute, per user, how long the average
order is, and predict a reorder of their most common products. This already
requires a bit of Python (perhaps everything could be done in SQL, but using
Python for some of the computations made the whole thing much more convenient
to write.) Still, the code is quite short:

import sqlite3
import csv
import gzip
from collections import defaultdict

if __name__ == '__main__':
    conn = sqlite3.connect('data/instacart.db')
    c = conn.cursor()

    # Obtain the average number of sales per customer
    q = """
    SELECT user_id, AVG(n_items) AS avg_items
    FROM (
        SELECT o.user_id AS user_id, o.order_id, COUNT(*) AS n_items
        FROM order_products__prior opp
        JOIN orders AS o ON o.order_id = opp.order_id
        GROUP BY o.user_id, o.order_id
    ) b
    GROUP BY user_id
    """

    # This does the same (?; re-check) and is much faster
    q = """
    SELECT user_id,
           MIN(n_items) AS min_items,
           AVG(n_items) AS avg_items,
           MAX(n_items) AS max_items
    FROM orders o
    INNER JOIN (SELECT order_id, COUNT(*) AS n_items
                FROM order_products__prior
                GROUP BY order_id) avg
            ON avg.order_id = o.order_id
    GROUP BY user_id
    """

    print("Getting order stats...")
    c.execute(q)

    print("Assigning to dictionary...")
    order_stats = dict()
    for row in c:
        order_stats[row[0]] = (row[1], row[2], row[3])

    # For every customer, sort the bought items in descending popularity
    q = """
    SELECT o.user_id AS user_id, opp.product_id AS product_id, COUNT(*) AS n
    -- improve here, probably
    FROM order_products__prior opp
    JOIN orders o ON o.order_id = opp.order_id
    GROUP BY o.user_id, opp.product_id
    ORDER BY o.user_id, n DESC
    """

    print("Getting product frequency...")
    c.execute(q)

    print("Assigning next order per user...")
    next_order = defaultdict(list)
    for row in c:
        if len(next_order[row[0]]) < order_stats[row[0]][1]:  # the average
            next_order[row[0]].append(row[1])

    # Now let's just assign orders
    print("Generating CSV file...")
    q = "SELECT order_id, user_id FROM orders WHERE eval_set = 'test'"
    c.execute(q)
    result = [['order_id', 'products']]
    for row in c:
        result.append([row[0], " ".join(str(x) for x in next_order[row[1]])])

    # Write the compressed CSV file
    with gzip.open('/tmp/submission.csv.gz', 'wt', newline='') as f:
        csvwriter = csv.writer(f, delimiter=',', quotechar='"')
        for row in result:
            csvwriter.writerow(row)

You can see I was playing around a bit with the query at the beginning, trying
to do some optimizations that made the whole thing run a lot faster. This yields
0.3282527 on the public leaderboard, which is not bad at all.
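If you want to see why one version is faster, SQLite will tell you. The snippet
below also creates indexes on the join keys, which these queries benefit from;
the exact indexes in my repository may differ, so treat these as an assumption:

import sqlite3

conn = sqlite3.connect('data/instacart.db')
c = conn.cursor()

# Hypothetical indexes on the join keys; without something like them,
# every join degenerates into repeated scans of the large
# order_products__prior table.
c.execute("CREATE INDEX IF NOT EXISTS idx_opp_order "
          "ON order_products__prior(order_id)")
c.execute("CREATE INDEX IF NOT EXISTS idx_orders_order ON orders(order_id)")

# EXPLAIN QUERY PLAN reports whether a query uses an index or falls
# back to a full table scan.
for row in c.execute("EXPLAIN QUERY PLAN "
                     "SELECT order_id, COUNT(*) AS n_items "
                     "FROM order_products__prior GROUP BY order_id"):
    print(row)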

Without moving too far from this approach, how can we improve the result? We
can weight each order to vary its importance in the prediction of the next one.
For instance, if something was just ordered, we could decrease its weight in
the next one, but we'd have to study the periodicity for every user, and I
wanted to keep this as simple as possible. There's a generic rule we could try:
seasonal products have seasonal weights. Given that Instacart offers a lot of
fruits and vegetables, I'm just going to assume that everything is seasonal:
an order from last week carries a high weight, an order from six months ago
carries almost no weight, and the weight climbs back up for orders from exactly
one year ago (the maximum time difference in the dataset), where it reaches its
maximum again.
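The shape just described is a cosine over the days elapsed: maximal now,
minimal at half a year, maximal again at a full year. A sketch (my actual
constants and scaling may have differed):

import math

def seasonal_weight(days_ago, period=365.0):
    # 1.0 for a very recent order, ~0.0 at half a year,
    # back to 1.0 at exactly one year.
    return (1.0 + math.cos(2.0 * math.pi * days_ago / period)) / 2.0

for d in (0, 91, 182, 274, 365):
    print(d, round(seasonal_weight(d), 3))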

This is done without much difficulty here using a cosine; the weights are
written into a temporary table so I can easily join against it with a query not
very different from the one used in the first attempt. Several variations of
this approach gave me the 0.3598 public score I ended up with.
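To make the join concrete, here is roughly what the weighted version of the
popularity query looks like, assuming a temporary table
order_weights(order_id, weight) filled with the cosine weights. The table and
column names here are my guess; the real query is in the repository:

import sqlite3

conn = sqlite3.connect('data/instacart.db')
c = conn.cursor()

# SUM(weight) replaces COUNT(*): recent and one-year-old orders count
# almost fully, half-year-old orders barely count.
q = """
SELECT o.user_id AS user_id,
       opp.product_id AS product_id,
       SUM(w.weight) AS n
FROM order_products__prior opp
JOIN orders o ON o.order_id = opp.order_id
JOIN order_weights w ON w.order_id = opp.order_id
GROUP BY o.user_id, opp.product_id
ORDER BY o.user_id, n DESC
"""
c.execute(q)
# Filling next_order per user then proceeds exactly as before.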