Recurrent Neural Network Based Subreddit Recommender System

Introduction

As part of a project course in my second semester, we were tasked with building a system of our choosing that incorporated or showcased any of the Computational Intelligence techniques we learned about in class. For our project, we decided to investigate the application of Recurrent Neural Networks to the task of building a subreddit recommender system for Reddit users. In this post, I outline some of the implementation details of the final system. A minimal webapp for the final model can be interacted with here. The final research paper for the project can be found here, and my collaborators on the project are Barbara Garza and Suren Oganesian. The GitHub repo for the project can be found here, with this Jupyter notebook being here.

Model Hypothesis

The goal of the project is to use the sequence prediction power of RNNs to predict subreddits that may interest a user, based on their comment history. The hypothesis of the recommender model is that, given an ordered sequence of user subreddit interactions, patterns will emerge that favour the discovery of particular new subreddits given that historical interaction sequence. Intuitively speaking, as users interact with the Reddit ecosystem, they discover new subreddits of interest, but these new discoveries are influenced by the communities they have previously been interacting with. We can then train a model to recognize these emergent subreddit discoveries based on users' historical subreddit discovery patterns. When the model is presented with a new sequence of user interactions, it "remembers" other users that historically had similar interaction habits and recommends their subreddits that the current user has yet to discover.

This sequential view of user interaction and subreddit discovery is similar in structure to other problems being solved with Recurrent Neural Networks, such as Character Level Language Modelling and Automatic Authorship Detection. Given the successes on these similarly structured problems, we decided to explore RNN models for the subreddit recommender system.

The Data

Data is the secret sauce in any machine learning system. Reddit provides a convenient API for scraping its public-facing data, and the Python package PRAW is a popular and well-documented wrapper that we used in this project. With the aim of developing sequences of user subreddit interactions, all we need for our raw data is a list of 3-tuples of the form [username, subreddit, utc timestamp]. The following script provides a helper function to collect and store random user comment data from Reddit's streaming 'all' comments. Note that PRAW authentication config data needs to be stored in a file named 'secrets.ini' with:
[reddit]
api_key: key
client_id: id
client_api_key: client key
redirect_url: redir url
user_agent: subreddit-recommender by /u/upcmaici v 0.0.1
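
As a quick check that a file in this format parses as expected, here is a minimal sketch using Python's standard configparser module. The helper name load_reddit_config and the inline EXAMPLE_INI string are illustrative assumptions, not part of the project; in practice the parser would read the file from disk with parser.read('secrets.ini').

```python
import configparser

# Stand-in for the contents of the config file (placeholder values only).
EXAMPLE_INI = """\
[reddit]
api_key: key
client_id: id
client_api_key: client key
redirect_url: redir url
user_agent: subreddit-recommender by /u/upcmaici v 0.0.1
"""


def load_reddit_config(text):
    """Parse the [reddit] section and return its settings as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["reddit"])


settings = load_reddit_config(EXAMPLE_INI)
print(settings["client_id"])
print(settings["user_agent"])
```

configparser accepts both "key: value" and "key = value", so the colon-separated layout above needs no special handling.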

Data Munging

We need to parse the raw data into a structure consumable by a supervised learning algorithm like an RNN. First we build a model vocabulary and a distribution of subreddit popularity from the collected raw data. We use this to build the training dataset: the subreddit interaction sequence for each user, ordered and then split into chunks representing different periods of Reddit interaction and discovery. From each chunk, we randomly remove a single subreddit from the interaction as the "discovered" subreddit and use it as the training label for that interaction sequence. This formulation brings with it a hyperparameter that will require tuning, namely the sequence size of each chunk of user interaction periods.

The proposed model uses the distribution of subreddits in the dataset to weight the random selection of a subreddit as the sequence label, giving a higher probability of selection to rarer subreddits. This smooths the distribution of training labels across the model's vocabulary of subreddits. Also, each user's interaction sequence has been compressed to only represent the sequence of non-repeating subreddits. This eliminates the repetitive structure of users constantly commenting in a single subreddit while still capturing the user's habits in the Reddit ecosystem more generally, allowing the model to distinguish broader patterns from the compressed sequences.
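
To make the three steps concrete (compress repeats, weight by inverse frequency, pull out a label), here is a small self-contained sketch on made-up data. It is only an illustration under simplifying assumptions: the counts come from a single toy history rather than the whole dataset, and the standard library's random.choices stands in for whatever sampler the real pipeline uses.

```python
import random
from collections import Counter
from itertools import groupby

# Toy comment history for one user, already sorted by timestamp (made-up data).
history = ["python", "python", "askscience", "python", "python",
           "gifs", "gifs", "tensorflow"]

# 1. Compress runs of consecutive repeats into a single interaction.
compressed = [sub for sub, _ in groupby(history)]

# 2. Inverse-frequency weights: rarer subreddits get a larger share.
counts = Counter(history)
total = sum(counts.values())
inverse = {sub: total / cnt for sub, cnt in counts.items()}
norm = sum(inverse.values())
weights = {sub: w / norm for sub, w in inverse.items()}

# 3. Draw one subreddit as the "discovered" label and drop it from the input.
rng = random.Random(0)
label = rng.choices(compressed, weights=[weights[s] for s in compressed], k=1)[0]
inputs = [s for s in compressed if s != label]

print(compressed)
print(label, inputs)
```

Here "python" accounts for half of the raw comments, so it receives the smallest weight, while the subreddits seen only once receive the largest.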

def chunks(l, n):
    n = max(1, n)
    return (l[i:i + n] for i in range(0, len(l), n))


def normalize(lst):
    s = sum(lst)
    normed = [itm / s for itm in lst]
    # pad last value with whatever difference is needed to make the sum exactly 1
    normed[-1] = (normed[-1] + (1 - sum(normed)))
    return normed

"""This routine develops the model's vocabulary. vocab_probs is also built, representing the inverse probability
of encountering a particular subreddit in the given dataset, which is then used to bias the selection of rarer
subreddits as labels to smooth the distribution of training labels across all subreddits in the vocabulary."""
import pandas as pd

df = pd.DataFrame(raw_data, columns=['user', 'subreddit', 'utc_stamp'])
train_data = None  # free up train_data memory
vocab_counts = df["subreddit"].value_counts()
tmp_vocab = list(vocab_counts.keys())
total_counts = sum(vocab_counts.values)
inv_prob = [total_counts / vocab_counts[sub] for sub in tmp_vocab]
vocab = ["Unseen-Sub"] + tmp_vocab  # build placeholder, Unseen-Sub, for all subs not in vocab
tmp_vocab_probs = normalize(inv_prob)
# force probs to sum to 1 by adding the difference to the "Unseen-Sub" probability
vocab_probs = [1 - sum(tmp_vocab_probs)] + tmp_vocab_probs
print("Vocab size = " + str(len(vocab)))

Vocab size = 3546

import numpy as np

sequence_chunk_size = 15


def remove_repeating_subs(raw_data):
    cache_data = {}
    prev_usr = None
    past_sub = None
    usr_sub_seq = []
    for comment_data in raw_data:
        current_usr = comment_data[0]
        if current_usr != prev_usr:
            # New user found in sorted comment data, begin sequence extraction for new user
            if prev_usr is not None and prev_usr not in cache_data:
                # dump sequences to cache for previous user if not in cache
                cache_data[prev_usr] = usr_sub_seq
            usr_sub_seq = [comment_data[1]]  # initialize user sub sequence list with first sub for current user
            past_sub = comment_data[1]
        else:
            # if still iterating through the same user, add new sub to sequence if not a repeat
            if comment_data[1] != past_sub:
                # Check that next sub comment is not a repeat of the last interacted with sub,
                # filtering out repeated interactions
                usr_sub_seq.append(comment_data[1])
                past_sub = comment_data[1]
        prev_usr = current_usr  # update previous user to being the current one before looping to next comment
    # dump the final user's sequence, which the loop above never writes to the cache
    if prev_usr is not None and prev_usr not in cache_data:
        cache_data[prev_usr] = usr_sub_seq
    return cache_data


def build_training_sequences(usr_data):
    train_seqs = []
    # split user sub sequences into provided chunks of size sequence_chunk_size
    for usr, usr_sub_seq in usr_data.items():
        comment_chunks = chunks(usr_sub_seq, sequence_chunk_size)
        for chnk in comment_chunks:
            # for each chunk, filter out potential labels to select as training label, filter by the top subs filter list
            filtered_subs = [vocab.index(sub) for sub in chnk]
            if filtered_subs:
                # randomly select the label from filtered subs, using the vocab probability distribution to smooth out
                # representation of subreddit labels
                filter_probs = normalize([vocab_probs[sub_indx] for sub_indx in filtered_subs])
                label = np.random.choice(filtered_subs, 1, p=filter_probs)[0]
                # build sequence by ensuring user's sub exists in model's vocabulary and filtering out the selected
                # label for this subreddit sequence
                chnk_seq = [vocab.index(sub) for sub in chnk if sub in vocab and vocab.index(sub) != label]
                train_seqs.append([chnk_seq, label, len(chnk_seq)])
    return train_seqs

We transform the munged data into a pandas dataframe for easier manipulation. Note that the subreddits have been integer encoded, indexed by their order in the vocabulary.
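
The integer encoding can be sketched in a few lines. The helpers below (build_index, encode, pad_post) are hypothetical names for illustration. They use a dictionary lookup in place of repeated vocab.index calls, which scan the list on every lookup, and the padding helper mimics right-padding to a fixed chunk length with index 0, the "Unseen-Sub" placeholder.

```python
def build_index(vocab):
    """Map each subreddit name to its position in the vocabulary."""
    return {sub: i for i, sub in enumerate(vocab)}


def encode(seq, index, unseen=0):
    """Integer-encode a sequence; names missing from the vocabulary map to `unseen`."""
    return [index.get(sub, unseen) for sub in seq]


def pad_post(seq, maxlen, value=0):
    """Truncate or right-pad an encoded sequence to exactly `maxlen` entries."""
    return (seq + [value] * maxlen)[:maxlen]


vocab = ["Unseen-Sub", "python", "askscience", "gifs"]  # toy vocabulary
index = build_index(vocab)
encoded = encode(["gifs", "python", "not-in-vocab"], index)
print(encoded)               # [3, 1, 0]
print(pad_post(encoded, 5))  # [3, 1, 0, 0, 0]
```

With a vocabulary of a few thousand subreddits the speed difference is modest, but the dictionary keeps encoding cost independent of vocabulary size.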

Tensorflow Model Architecture

Originally, we built the model directly on top of TensorFlow, using the fantastic tutorials from R2RT as reference. However, building and managing various neural network architectures with TensorFlow can be cumbersome, and higher-level wrapper packages exist to abstract away some of the more tedious variable and graph definition steps required for TensorFlow models. We chose the tflearn Python package, whose API is similar to sklearn, which the team had more experience with. With tflearn, it's rather easy to plug and play with different layers, and we experimented with LSTM, GRU and multi-layered bidirectional RNN architectures.
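
To show the shape of the computation these architectures share, here is a rough NumPy sketch of a forward pass: an embedding lookup, a recurrent layer run over the sequence, and a softmax over the vocabulary. It is not the project's tflearn model. It substitutes a plain tanh recurrent cell for the LSTM and GRU cells, uses random untrained weights, and all sizes and parameter names are assumptions chosen for the illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def rnn_forward(seq_batch, params):
    """Embedding -> tanh recurrent layer -> softmax over the vocabulary."""
    emb = params["embedding"][seq_batch]  # (batch, time, embed_dim)
    h = np.zeros((seq_batch.shape[0], params["W_hh"].shape[0]))
    for t in range(seq_batch.shape[1]):
        # update the hidden state from the current input and the previous state
        h = np.tanh(emb[:, t] @ params["W_xh"] + h @ params["W_hh"] + params["b_h"])
    return softmax(h @ params["W_hy"] + params["b_y"])  # (batch, vocab_size)


# Hypothetical sizes, chosen only for the illustration.
vocab_size, embed_dim, hidden, seq_len = 50, 8, 16, 15
rng = np.random.default_rng(0)
params = {
    "embedding": rng.normal(scale=0.1, size=(vocab_size, embed_dim)),
    "W_xh": rng.normal(scale=0.1, size=(embed_dim, hidden)),
    "W_hh": rng.normal(scale=0.1, size=(hidden, hidden)),
    "b_h": np.zeros(hidden),
    "W_hy": rng.normal(scale=0.1, size=(hidden, vocab_size)),
    "b_y": np.zeros(vocab_size),
}
batch = rng.integers(0, vocab_size, size=(4, seq_len))  # 4 padded sequences
probs = rnn_forward(batch, params)
print(probs.shape)
```

Each row of probs is a probability distribution over the vocabulary, which is the quantity the recommender later reads off for each chunk of a user's history.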

Model Training

We split the data into train/test sets and begin training. Here we use the default training parameters, but the model can be tuned for epochs, internal units, dropout, learning rate and other hyperparameters of the chosen RNN structure.
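
A minimal sketch of such a split, using only the standard library. The function name, the 80/20 ratio and the seed are assumptions for illustration, and the toy rows follow the [sequence, label, length] layout produced by build_training_sequences.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split off the last `test_fraction` as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    split = len(shuffled) - n_test
    return shuffled[:split], shuffled[split:]


# Toy rows in the [sequence, label, length] layout.
rows = [[[1, 2, 3], 4, 3], [[5, 6], 7, 2], [[8], 9, 1], [[2, 4, 6], 1, 3], [[3, 3], 5, 2]]
train, test = train_test_split(rows)
print(len(train), len(test))  # 4 1
```

Splitting at the chunk level can place chunks from the same user in both sets; whether that matters depends on what the evaluation is meant to measure.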

It can be difficult to tell how well the model is performing simply by staring at the flipping numbers above, but TensorFlow provides a visualization tool called TensorBoard, and tflearn has different prebuilt dashboards which can be selected using the tensorboard_verbose option of the DNN model wrapper.

Visualizing the Model

As part of the model, a high-dimensional embedding space is learnt that represents the subreddits in the vocabulary as vectors. These vectors can be reasoned about in terms of their "distance" from each other in the embedding space and visualized with dimensionality reduction techniques, similar to the concepts used in word2vec. The tutorial by Arthur Juliani here was used to build the embedding visualization.
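
One way to reason about "distance" in the embedding space is cosine similarity between subreddit vectors. The sketch below is an illustration on made-up 2-D vectors, not the learnt embedding; nearest_subreddits is a hypothetical helper, and it assumes no embedding row is all zeros.

```python
import numpy as np


def nearest_subreddits(embeddings, vocab, query, k=3):
    """Return the k subreddits whose embedding vectors are most cosine-similar to `query`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = [i for i in np.argsort(-sims) if vocab[i] != query]
    return [vocab[i] for i in order[:k]]


# Made-up 2-D embeddings, only to show the mechanics.
vocab = ["python", "learnpython", "gifs", "aww"]
embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0], [-0.1, 0.9]])
print(nearest_subreddits(embeddings, vocab, "python", k=1))  # ['learnpython']
```

Applied to the real embedding matrix, the same lookup gives a quick sanity check on whether related communities ended up close together.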

import random

from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models import ColumnDataSource, LabelSet

# lowDWeights is assumed to hold the 2-D projection of the learnt embedding weights,
# computed in a step not shown here

# control the number of labelled subreddits to display
sparse_labels = [lbl if random.random() <= 0.01 else '' for lbl in vocab]
source = ColumnDataSource({'x': lowDWeights[:, 0], 'y': lowDWeights[:, 1], 'labels': sparse_labels})

TOOLS = "hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select,"

p = figure(tools=TOOLS)
p.scatter("x", "y", radius=0.1, fill_alpha=0.6, line_color=None, source=source)

labels = LabelSet(x="x", y="y", text="labels", y_offset=8,
                  text_font_size="10pt", text_color="#555555", text_align='center',
                  source=source)
p.add_layout(labels)

# output_file("embedding.html")
output_notebook()
show(p)

Saving the Model

To save the model for use in making real-world predictions, potentially as part of a webapp, we need to freeze the tensorflow graph and transform the variables into constants to maintain the final network. The tutorial here walks us through how to accomplish this.

import tensorflow as tf
from tensorflow.python.framework import graph_util


def freeze_graph(model):
    # We specify the full file name of our frozen graph
    output_graph = "/tmp/frozen_model.pb"

    # Before exporting our graph, we need to specify what our output nodes are
    # This is how TF decides what part of the graph it has to keep and what part it can dump
    # NOTE: this variable is plural, because you can have multiple output nodes
    output_node_names = "InputData/X,FullyConnected/Softmax"

    # We clear devices to allow TensorFlow to control on which device it will load operations
    clear_devices = True

    # We import the meta graph and retrieve a Saver
    # saver = tf.train.import_meta_graph(input_checkpoint + '.meta', clear_devices=clear_devices)

    # We retrieve the protobuf graph definition
    graph = model.net.graph
    input_graph_def = graph.as_graph_def()

    # We reuse the trained model's session, which already holds the graph weights
    # We use a built-in TF helper to export variables to constants
    sess = model.session
    output_graph_def = graph_util.convert_variables_to_constants(
        sess,  # The session is used to retrieve the weights
        input_graph_def,  # The graph_def is used to retrieve the nodes
        output_node_names.split(",")  # The output node names are used to select the useful nodes
    )

    # Finally we serialize and dump the output graph to the filesystem
    with tf.gfile.GFile(output_graph, "wb") as f:
        f.write(output_graph_def.SerializeToString())
    print("%d ops in the final graph." % len(output_graph_def.node))

def load_graph(frozen_graph_filename):
    # We load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Then, we can again use a convenient built-in function to import a graph_def into the
    # current default Graph
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(
            graph_def,
            input_map=None,
            return_elements=None,
            name="prefix",
            op_dict=None,
            producer_op_list=None
        )
    return graph

grph=load_graph("/tmp/frozen_model.pb")

x = grph.get_tensor_by_name('prefix/InputData/X:0')
y = grph.get_tensor_by_name("prefix/FullyConnected/Softmax:0")

# We launch a Session
with tf.Session(graph=grph) as sess:
    # Note: we didn't initialize/restore anything, everything is stored in the graph_def
    y_out = sess.run(y, feed_dict={x: [[1] * sequence_chunk_size]})
    print(y_out)  # the softmax probabilities over the vocabulary. Yay, it works!

Final Recommender

Using the frozen model, we can predict the subreddits most likely to interest a specific user. We collect that user's Reddit comment data, split it into subreddit sequence chunks, and take the subreddit with the highest predicted probability from the RNN for each chunk. The final recommendations are the subreddits that appear most often among those per-chunk predictions.

import configparser
from collections import Counter

import numpy as np
import praw
import tensorflow as tf
from tflearn.data_utils import pad_sequences


def collect_user_data(user):
    # Import configuration parameters, user agent for PRAW Reddit object
    config = configparser.ConfigParser()
    config.read('secrets.ini')

    # load user agent string
    reddit_user_agent = config.get('reddit', 'user_agent')
    client_id = config.get('reddit', 'client_id')
    client_secret = config.get('reddit', 'client_api_key')

    # initialize the praw Reddit object
    r = praw.Reddit(user_agent=reddit_user_agent, client_id=client_id, client_secret=client_secret)
    # NOTE: get_redditor/get_comments are the PRAW 3 method names; PRAW 4 and later
    # expose the same data as r.redditor(user) and praw_user.comments.new(limit=None)
    praw_user = r.get_redditor(user)
    user_data = [(user_comment.subreddit.display_name, user_comment.created_utc)
                 for user_comment in praw_user.get_comments(limit=None)]
    return sorted(user_data, key=lambda x: x[1])  # sort by ascending utc timestamp


def user_recs(user, n_recs=10, chunk_size=sequence_chunk_size):
    user_data = collect_user_data(user)
    user_sub_seq = [vocab.index(data[0]) if data[0] in vocab else 0 for data in user_data]
    non_repeating_subs = []
    for i, sub in enumerate(user_sub_seq):
        if i == 0:
            non_repeating_subs.append(sub)
        elif sub != user_sub_seq[i - 1]:
            non_repeating_subs.append(sub)
    user_subs = set([vocab[sub_index] for sub_index in non_repeating_subs])
    sub_chunks = list(chunks(non_repeating_subs, chunk_size))
    user_input = pad_sequences(sub_chunks, maxlen=chunk_size, value=0., padding='post')
    x = grph.get_tensor_by_name('prefix/InputData/X:0')
    y = grph.get_tensor_by_name("prefix/FullyConnected/Softmax:0")
    with tf.Session(graph=grph) as sess:
        sub_probs = sess.run(y, feed_dict={x: user_input})
    # select the subreddit with highest prediction prob for each of the input subreddit sequences of the user
    recs = [np.argmax(probs) for probs in sub_probs]
    filtered_recs = [filt_rec for filt_rec in recs if filt_rec not in user_sub_seq]
    if not filtered_recs:
        # nothing new to recommend; the zip(*...) unpacking below would fail on an empty list
        return []
    top_x_recs, cnt = zip(*Counter(filtered_recs).most_common(n_recs))
    sub_recs = [vocab[sub_index] for sub_index in top_x_recs]
    return sub_recs

user_recs("ponderinghydrogen")

['fantasyfootball', 'PS3']
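
The call above returns two subreddits even though n_recs defaults to 10. Going by the code, a plausible explanation is that each chunk contributes only its single highest-probability subreddit, and any prediction the user has already visited is dropped, so the number of recommendations is bounded by the number of chunks. The toy sketch below reproduces that aggregation on made-up probability rows; aggregate_recs is a hypothetical stand-in for the tail end of user_recs.

```python
from collections import Counter


def aggregate_recs(chunk_probs, seen, n_recs=10):
    """Take the arg-max index of each chunk's probability row, drop indices the
    user has already visited, and rank the rest by how often they were picked."""
    picks = [max(range(len(row)), key=row.__getitem__) for row in chunk_probs]
    unseen = [p for p in picks if p not in seen]
    return [idx for idx, _ in Counter(unseen).most_common(n_recs)]


# Made-up softmax rows for four chunks over a five-entry vocabulary.
chunk_probs = [
    [0.1, 0.6, 0.1, 0.1, 0.1],  # arg-max 1 (already seen by the user)
    [0.1, 0.1, 0.1, 0.6, 0.1],  # arg-max 3
    [0.1, 0.1, 0.1, 0.6, 0.1],  # arg-max 3
    [0.1, 0.1, 0.1, 0.1, 0.6],  # arg-max 4
]
print(aggregate_recs(chunk_probs, seen={1, 2}))  # [3, 4]
```

Four chunks yield only two distinct recommendations here, which mirrors the short list returned for the real user.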

The Web App

Those are all the pieces required to build a functioning subreddit recommender system that users can try! Using Flask, a simple web app can be made that takes any valid Reddit username as input and outputs recommendations for that user. A minimal web app doing just that can be interacted with here.

Final Thoughts

The model being served in the above webapp is an under-tuned, proof-of-concept single-layer RNN trained on too little data, but it is still surprisingly capable of suggesting interesting subreddits to the testers who have tried the app. Neural networks really are powerful methods for tackling difficult problems, and with better machine learning research and tooling being released daily, along with increasingly powerful computers, the pool of problems solvable by a group of determined engineers keeps getting larger. I'm looking forward to tackling the next one.