Ten years ago, Netflix launched the Netflix Prize, a contest to see if the community could come up with a movie recommendation approach that beat their own by 10%. One of the primary modeling techniques to come out of the contest was a family of sparse matrix factoring models whose earliest description can be found at Simon Funk’s website. The basic idea is that the true ratings of movies for each user can be represented by a matrix, say with users along the rows and movies along the columns. We don’t have the full rating matrix; instead, we have a very sparse set of entries. But if we could factor the rating matrix into two separate matrices, say one that is Users by Latent Factors and one that is Latent Factors by Movies, then we could estimate a user’s rating for any movie by taking the dot product of the user’s row and the movie’s column.
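In numpy terms, the idea looks like this (the sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 6, 4, 3      # illustrative sizes

U = rng.normal(size=(n_users, n_factors))   # Users x Latent Factors
M = rng.normal(size=(n_factors, n_movies))  # Latent Factors x Movies

R = U @ M                                   # the full (dense) predicted rating matrix
# Any single rating is the dot product of a user row and a movie column:
r_23 = U[2] @ M[:, 3]
assert np.isclose(r_23, R[2, 3])
```

The trick, of course, is recovering U and M from only the sparse observed entries, which is what the learning algorithm below is for.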

One thing that is somewhat frustrating about coding Funk’s approach is that it uses Stochastic Gradient Descent as the learning mechanism, and the L2 regularization has to be coded up as well. Also, it’s fairly loop-heavy; in order to get any sort of performance, you need to implement it in C/C++. It would be great if we could use a machine learning framework that already has other learning algorithms, multiple types of regularization, and batch training built in. It would also be nice if the framework used Python on the front end but implemented most of the tight loops in C/C++. Sort of like coding the algorithm in Python and compiling with Cython.
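For reference, here is a rough numpy sketch of that loop-heavy SGD-plus-L2 scheme. This is not Funk’s actual code; the function name and hyperparameters are illustrative:

```python
import numpy as np

def funk_sgd(ratings, n_users, n_items, n_factors=10, lr=0.01, reg=0.02, epochs=20):
    """Sparse matrix factoring via SGD with L2 regularization (a sketch).

    ratings: iterable of (user_index, item_index, rating) triples.
    """
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        for u, i, r in ratings:                  # one update per observed rating
            err = r - U[u] @ V[i]
            # L2-regularized gradient step on both latent vectors
            U[u], V[i] = (U[u] + lr * (err * V[i] - reg * U[u]),
                          V[i] + lr * (err * U[u] - reg * V[i]))
    return U, V
```

Note the nested Python loops over epochs and individual ratings; on 100 million ratings, this is exactly the part you would want pushed down into C/C++.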

My current favorite framework is Keras. It uses the Theano tensor library for the heavy lifting, which also allows the code to run on a GPU, if available. So, here’s a question: can we implement a sparse matrix factoring algorithm in Keras? It turns out that we can:
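A minimal sketch of such a model, written here against the modern tf.keras API rather than the older standalone Keras of the original post (which used a Merge layer for the dot product), so the layer names and the regularization constant below are assumptions:

```python
import numpy as np
from tensorflow.keras import Model, regularizers
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot

n_users, n_movies, n_factors = 1000, 2000, 40   # illustrative sizes

user_id = Input(shape=(1,))
movie_id = Input(shape=(1,))

# Left embedding: Users x Latent Factors; right embedding: Movies x Latent
# Factors. L2 regularization is attached directly to the embedding weights.
u = Flatten()(Embedding(n_users, n_factors,
                        embeddings_regularizer=regularizers.l2(1e-6))(user_id))
m = Flatten()(Embedding(n_movies, n_factors,
                        embeddings_regularizer=regularizers.l2(1e-6))(movie_id))

# The predicted rating is just the dot product of the two latent-factor vectors.
rating = Dot(axes=1)([u, m])

model = Model(inputs=[user_id, movie_id], outputs=rating)
model.compile(loss="mse", optimizer="adamax")
```

For the save-on-improvement behavior described below, a `ModelCheckpoint` callback monitoring `val_loss` with `save_best_only=True` does the job.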

Ta da! We’ve just created a left embedding layer that creates a Users by Latent Factors matrix and a right embedding layer that creates a Movies by Latent Factors matrix. When the inputs to these are a user id and a movie id, they return the latent factor vectors for the user and the movie, respectively. The Merge layer then takes the dot product of these two things to return the rating. We compile the model using MSE as the loss function and the AdaMax learning algorithm (which is superior to Stochastic Gradient Descent). Our callbacks monitor the validation loss, and we save the model weights each time the validation loss improves.

The really nice thing about this implementation is that we can model hundreds of latent factors quite quickly. In my particular case, I can train on nearly 100 million ratings, half a million users and almost 20,000 movies in roughly 5 minutes per epoch, 30 epochs – 2.5 hours. But if I use the GPU (GeForce GTX 960), the epoch time decreases to 90s for a total training time of 45 minutes.

Let’s see, it’s been about 5 (?!) years since I’ve written anything here. Since then, I’ve gotten married and now have a two-year-old daughter. My old company was acquired by a large defense contractor; I left a few months ago and I’m now working on machine learning for a medical device company. This space may wind up being used to make machine learning notes to myself.

The past week has been a series of ups and downs, culminating in the saga for the last half of the week. My car. My Celica. The first new car I ever owned started making a noise on Wednesday. The noise was due to extremely low engine oil (hypothetical question – what’s the point of an oil pressure light if it doesn’t come on until after the car is making noise?). The good folks at Jiffy Lube wouldn’t actually work on the car, but were kind enough to fill up the oil and suggested that I take it to an automotive shop immediately. I called up Wasp Automotive and made an appointment for the next day. I had heard a lot of good things about Wasp and even had that confirmed after I made the appointment.

About 2:30 on Thursday, Wasp calls up and says that my car, my baby, my constant companion for the past ten and a half years was dead. Dead. Dead! DEAD!! Well, actually, it’s not dead yet, but it had a terminal injury and would be dead in maybe 100 miles. PANIC! No, don’t panic. I was expecting the noise to be fatal all day on Thursday, so I had checked out a few car places and was contemplating buying a car that night.

One really shouldn’t impulse buy a car. Instead, I made an appointment to see a Prius over at Carmax on Friday. I had a recommendation for Auction Direct and I saw an interesting Prius at another used car dealership nearby. I called Bill Rankin who generously offered to loan me their truck for a few days. I picked up my car from Wasp, drove it to Bill’s, borrowed his truck and was set for the night.

Friday, I left work early, picked up my title and spare key from K and drove into Raleigh. The Prius at the used car dealership was okay. It seemed to be in mediocre shape: the tires weren’t worn down, but they weren’t new either; however, it was a 2009. I left there and drove down to Carmax for my appointment to see a 2008 Prius. This one was about $1k more, same mileage, but had more features. The problem was that I really hated the service at Carmax. It was everything I had been warned about. Even though I had an appointment, and had been called about it a half hour before, the salesperson I had spoken to was busy, so they had to track down someone else. They kept pushing their warranties. They wanted to use their financing, which was about 3% higher than what I wound up with, etc. It turns out that Carmax salespeople work on commission, go figure. I asked if they could hold the car for a day or two so that I could get financing in order. They said, “no.” kthxbai

Finally, I went over to Auction Direct. They didn’t have any Priuses, but did have a very nice 2008 Honda Civic Hybrid that was about $3k less than the Carmax Prius. I drove it for a bit, decided I liked it and went ahead with the purchase. The Civic Hybrid feels more like a regular car than a Prius. There isn’t a big center LCD display with a graphic showing you when you are charging and when you are draining the batteries. Instead, there are a couple of discrete extra gauges that provide the exact same information. I like the fact that it uses regular tires, as opposed to the expensive, short-lived low-profile tires of my old Celica.

The only problem was that it still took forever to finish up. I didn’t get out of there until about a quarter to seven. After all of that, I was not very good company when I got a chance to see my friends. I hope y’all forgive me. I’m doing better this morning, but it has been a long week.

So, I’ve been in my apartment for an entire two weeks now. Sadly, it’s been a bit sparse. There’s a bedroom set and a recliner sitting in front of a television, which is resting on my stereo. But fortunately, that should all start changing. Yesterday I placed a fairly large furniture order. For the first time in my life, I’m going to have new furniture! Not a hand-me-down, nothing 20 (30?) years old, but honest-to-god new furniture.

My brother (the architect) and his wife (the interior decorator designer [sorry nikki!]) have been incredibly helpful. I showed them some things that I liked and they made recommendations for improvement. Perhaps the funniest part to me is that when I was showing them what I was thinking, they commented that they didn’t realize my furniture tastes were so contemporary. If they both weren’t so nice, and didn’t work in so many different styles, I would have sworn what they said was “we didn’t realize you had good taste.” 🙂

So, in case anyone’s interested, here’s what I’ve picked out.

From the first image, a sofa, love seat, coffee table and end tables. From the second, the dining room table, a bench and four chairs. From the third, just the entertainment center. Then the rug, a couple of lamps and a screen (no idea what I’ll use the screen for, but there it is 🙂 ).

All of it should look very nice in the loft apartment and isn’t unreasonable with respect to the art deco bedroom set. Now I can’t wait for it to arrive.

My local paper kindly reminds me that tomorrow’s moon rise begins the Muslim holy month of Ramadan and that the associated fasting and prayers begin on Wednesday.

Recently, I was talking to my neighbor, the very kind, interesting and extremely conservative Glenn Beck viewer. While we were talking, he brought up Muslims and his fear that they were trying to take over the country (apparently, religious appropriation of a country is only okay if it’s Christian – but that’s another story). According to my neighbor, no religious Muslim could ever possibly be up to any good. I don’t explicitly recall him calling Islam a false religion, but he very much believed that you couldn’t trust “them,” that they were all radicalized and out to destroy “our” way of life.

At the time, I countered by noting that while in graduate school, I knew several Muslims who were all very good people; moreover, I would be surprised if any of them have converted to a fundamentalist form of Islam. My advisor was a (I believe non-practising) Muslim. One of the best and kindest professors in my department was a practising Muslim – on his office wall, he had a discreet prayer calendar reminding him of the appropriate times to pray. Moreover, there were several Muslim graduate students in my research group, including a gentleman named Hakan.

Hakan was a great guy (presumably he still is, I just haven’t seen him in several years). At the time, K and I lived in a town house a couple of miles from my office, in a part of the city that was growing. Because of the population growth, the city decided to put in a new water pumping station just down the street. Unfortunately, we had plastic piping in and leading up to the town house. When they first started testing the new station, a pipe burst in the front yard. We had a plumber come out, dig it up and fix it. Shortly thereafter, a new spot went and I did the digging and repairing myself. Finally, after the third time, I had enough. I was going to (hand!) dig a trench between the water meter and the house and put in copper piping.

Well, I mention this to my officemates, and Hakan volunteers to help dig the ditch. He came out on Saturday morning and we spent several hours in the sun, digging the trench and tunneling under the sidewalk. After a couple of hours, I asked him if he wanted some lunch or something to drink. He said that he couldn’t – it was Ramadan and therefore, forbidden. I was horrified that I had him out in the sun, digging a ditch while he couldn’t eat or drink. It wasn’t the middle of summer, but I recall it being hot enough that I was worried about his health. Regardless, he persevered and we got the copper piping installed. Thinking back on it, my understanding is that a part of Ramadan is service and in taking his beliefs more seriously than many people I know, he was both abiding by the fasting requirements of Ramadan and helping his officemate, even if it was physically taxing.

I do understand why people would be concerned about fundamentalism in Islam; I’m concerned about it too. But I’m also concerned about fundamentalism in Christianity and most other religions, for that matter. But those who think that all Muslims are radical, or that all of Islam is somehow tainted by the radicals, need to spend some time with Hakan or any of the other Muslims who have positively affected my life.

Last Friday, hsarik pointed out an interesting web site: Echo Nest. They provide a web service that allows you to analyze and remix music. The API also can provide information (meta-data) about music, artists, songs etc. and has Python bindings. If you’ve seen the “More Cowbell” website where you can upload an mp3 and have more cowbell (and more Christopher Walken) added to it, well that site uses Echo Nest and if you download the python bindings for their API, you can see the script that adds the sounds. Personally, I’m fond of “Ob-la-di, Ob-la-da” with 80% cowbell and 20% Christopher Walken.

I started playing with the API and as a first cut thought it would be neat to use the “get_similar” function. So for each artist, you can get the top N similar artists. Now where can I get a list of artists I like? Well, I could type ’em in, but that sucks. So I wrote a small program which:

Opens the database on my iPod (or a directory of mp3 files)

Finds each artist, by either reading the iPod db or looking at the id3 tags in all of the files

For each artist, adds a node to a graph, where the area of the node is proportional to the number of songs that artist has on the iPod (or in the music folder)

For each artist, finds the top 50 similar artists

For each similar artist that is also in my collection, adds a graph edge between the two nodes

Plots the graph
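The steps above can be sketched roughly as follows. Here `get_similar` is a stand-in for the Echo Nest call (not the real binding’s signature), and the “graph” is just a node-weight Counter plus an edge set rather than a full igraph object:

```python
from collections import Counter

def artist_graph(song_artists, get_similar, top_n=50):
    """song_artists: one artist name per song, read from the iPod db or id3 tags."""
    counts = Counter(song_artists)               # node area ~ number of songs
    artists = set(counts)
    edges = set()
    for artist in artists:
        for similar in get_similar(artist, top_n):
            # Only link to similar artists we actually own.
            if similar in artists and similar != artist:
                edges.add(frozenset((artist, similar)))   # undirected edge
    return counts, edges
```

Feeding `counts` and `edges` into igraph (node sizes from the counts, one edge per pair) then reproduces the plot.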

What can I say, I’ve been working on a fair amount of graph-theory at work recently. So after processing my iPod, I came up with the following graph of my current music (click to embiggen):

Okay, that’s pretty cool. Almost completely illegible, but cool. FWIW, the graph has 15 connected components; unfortunately, 13 of them are “singles” (not connected to anything), along with one pair (“Louis Armstrong” paired with “Louis Armstrong and Duke Ellington”). Fortunately, the graphing tool I use (igraph) has built-in tools for doing community analysis (using the leading eigenvector method), i.e., we can automatically find tightly coupled subgraphs. A few examples from the 25 or so communities:

which arguably correspond to “Indie,” “Classic Rock,” “Jam Bands,” “Guitar Gods,” and “Alternative.” If I processed my complete music database, I suspect we would wind up with several other communities, e.g., Blues. But since Robert Johnson is the only blues I’ve got on there right now… he’s in a class by himself.

I suppose it goes w/o saying that my musical tastes aren’t everyone’s and that if you don’t like my musical tastes, you can keep it to yourself or go DIAF 🙂

So, what’s next? I was talking with M from my office and we’ve come up with another interesting project for the Echo Nest API. This one a) uses the audio analysis functions, and b) if we do it right might cause someone to send us a cease and desist. So, win all the way around.

Four years ago, I made the switch to digital SLR photography. The primary reason was the workflow. When I shot slide film, I would have to get the film developed, look at each image, scan the ones I liked, correct the color balance and then manually remove the dust spots from the scanned images.

When I first got the digital camera, the workflow became: auto-correct the color balance using the Nikon’s color profile, then select the images I liked. Great!

Unfortunately, over the years, my SLR has gotten dust on the sensor, because I was doing what Nikon said and not mucking with the sensor to try to clean it. So, first thing is that I should ignore Nikon and actually clean the sensor. But the second thing is that this has really screwed with my workflow. Last year, after identifying the “good” images, I had to manually go through them and use the Heal tool in the GIMP in order to get rid of a few dust spots. Well, dust is cumulative and this year it was worse than ever. In particular, the dust was more noticeable because I was shooting a lot of waterfalls… long exposures with a small aperture – dust city. Take a look at the following:

To some extent or another, that’s on every single image I took while K and I were on vacation.

I could repeat my old workflow, but that would take days. New idea: there is a tool in the GIMP called the Smart Remove Selection. It takes a selected bit of the image and replaces it with textures from the surrounding area. It’s comparable to Photoshop’s content-aware fill. So, if I can select all of the visible dust, I can clean it at one time. But that’s still slow.

Instead, I selected all of the dust from the image above, grew the selection by 10 pixels, converted it to a path and then saved the path as an SVG file. Since the dust is at the same location in each image, a single dust file is relevant to all of my images.

Now all I have to do is to open an image, import the path, convert the path to a selection and apply the smart remove. That’s a little better, but still means that I have to touch each file manually.

Enter GIMP scripting. Last night, I wrote a script that takes a file glob, converts it to a list of files, and for each file automatically removes the dust and color corrects the image. It still takes about a minute per file, but it’s completely automated. Unfortunately, the first version of the script only handled horizontal images. But since I always turn the camera clockwise when I shoot vertically, I was able to modify it to rotate the image appropriately, apply the dust removal, and then rotate the image back if the height of the image is greater than the width.
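The control flow of such a script looks roughly like this. The actual version runs inside GIMP’s Python-Fu, so the GIMP PDB calls are stubbed out here as callbacks, and the rotation direction is an assumption:

```python
import glob

def clean_batch(pattern, get_size, remove_dust, rotate):
    """Dust-clean every file matching the glob, leveling vertical shots first."""
    for path in sorted(glob.glob(pattern)):
        width, height = get_size(path)
        if height > width:            # a vertical shot: the camera was turned
            rotate(path, -90)         # level it so the saved dust path lines up
            remove_dust(path)         # import path -> selection -> smart remove
            rotate(path, 90)          # then turn it back upright
        else:
            remove_dust(path)
```

In the real script, `remove_dust` is where the SVG path is imported, converted to a selection, and handed to the Smart Remove plugin, followed by the color correction.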

Honestly, I can’t believe that we’re talking about a state where you might expect law enforcement to request your papers.

I did have a few questions, including what exactly does the law require? And how will it be enforced? In a nutshell, the law:

Requires officials and agencies to reasonably attempt to determine the immigration status of a person involved in a lawful contact where reasonable suspicion exists regarding the immigration status of the person, except if the determination may hinder or obstruct an investigation.

Okay, so, how does one form a reasonable suspicion? Well, good for Arizona, the law further:

Stipulates that a law enforcement official or agency cannot solely consider race, color or national origin when implementing these provisions, except as permitted by the U.S. or Arizona Constitution.

So, I don’t know what is permitted by the Arizona Constitution, maybe all of those forms of profiling, maybe none. But one final question: given that the mayor of Phoenix doesn’t support the law, how do you guarantee that it gets enforced? Well, a citizen can sue if there’s a policy that doesn’t support enforcement:

Allows a person who is a legal resident of this state to bring an action in superior court to challenge officials and agencies of the state, counties, cities, towns or other political subdivisions that adopt or implement a policy that limits or restricts the enforcement of federal immigration laws to less than the full extent permitted by federal law.

Requires the court to order that a violating entity pay a civil penalty of at least $1,000 and not to exceed $5,000 for each day that the policy has remained in effect after it has been found to be violating these provisions.

A few thoughts:

1. It’s not clear to me how one forms a reasonable suspicion about the immigration status of a person, except from their ethnic background

2. It’s not clear to me that ethnic background is even restricted as a category for consideration in the law, given that Latinos are classified as Caucasian and given the questions about what is permitted under Arizona’s constitution

3. Given #1 and #2 above, I don’t think people realize how ineffective ethnicity is in determining legal status

Fortunately, statistics gives us a good answer to #3. For the following, let’s consider that L indicates Latino, and I represents illegal.

Bayes’ rule tells us that the probability of being illegal given that you are Latino [ Pr(I | L) ] is the probability of being illegal (the prior, Pr(I)) times the probability of being Latino given that you are illegal (the likelihood, Pr(L | I)), divided by the evidence, i.e., the overall probability of being Latino: Pr(L | I)*Pr(I) + Pr(L | not I)*Pr(not I).

So, the probability of being illegal given that you are Latino is .14147, or ~14%, which in my mind is no reason to form a suspicion. Hell, more than 14% of the population are pot smokers; you wouldn’t want to give the police authority to stop and arrest everyone to find the subset who are.

CAVEATS AND NOTES

The data above are from 2000 and may not be current; however, illegal immigration shows a strong economic correlation and the economy is down compared to the boom year of 2000, so the numbers are probably in the right ballpark

I don’t believe that Pr(L | I) = 1.0. This says that all illegals are Latinos. That’s bull. There are plenty of Asian and European illegals. Adjusting this probability down will significantly decrease Pr(I | L). For example, assuming Pr(L | I) = 0.8 results in Pr(I | L) ≈ .116.
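Those two numbers are easy to sanity-check with the odds form of Bayes’ rule: the posterior odds equal the likelihood ratio times the prior odds, so dropping Pr(L | I) from 1.0 to 0.8 simply scales the posterior odds by 0.8, without needing the underlying population data at all:

```python
def rescale_posterior(p, new_likelihood, old_likelihood=1.0):
    """Rescale a posterior probability when only the likelihood term changes."""
    odds = p / (1.0 - p)                       # convert probability to odds
    odds *= new_likelihood / old_likelihood    # posterior odds scale with Pr(L|I)
    return odds / (1.0 + odds)                 # convert back to a probability

print(round(rescale_posterior(0.14147, 0.8), 3))   # 0.116, matching the note above
```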

The above analysis assumes that Bayes’ law is true, but if it isn’t then we’re all seriously screwed.

Finally, I would be uncomfortable with racial profiling even if Pr(I|L) > 0.5. It’s just not America when the cops stop and ask you for your citizenship papers. I can think of a few places where that did occur, but won’t risk the Godwin retraction by invoking them.

So, what to do? I was challenged a few years ago as to my solution to the illegal immigrant “problem.” My first response is that you are assuming it’s a problem. After all, studies have shown that low-wage legal and illegal immigrants actually grow an economy. Moreover, since many pay into Social Security and Medicare without receiving benefits, that helps those programs. OTOH, it’s not fair to have them pay in without receiving services; moreover, the social safety net should be expanded to help all those in our community. So, in that sense illegal immigration is a problem. However, the solution is straightforward: enforce current laws restricting a business’s ability to hire illegals. Illegals come here due to the draw of jobs. Businesses love ’em because they often work below minimum wage and don’t complain about things like OSHA requirements. Fine: regulate the businesses better, and we’ll have removed the draw.

This explains everything. Apparently those of us who are blessed with beards are deemed to be more credible than our clean-shaven brethren. At least that seems to be the case for neat, medium-length beards. And unless we’re trying to sell underwear… go figure.

I wonder if there’s a trustworthiness scale for beard type? Inter-webz to the rescue: