in a nutshell: This post will be a little different in nature as I’m still working on a couple hacks that will require a few more weeks of work. In the interim I’ll explain the very basics of web scraping/crawling. This is a great skill to have as it will open the doors to the limitless amounts of data from the web.

what you need:

Python

HTML parser (PyQuery, BeautifulSoup, etc.). Note that this is optional; you can do the parsing yourself with regexps.

implementation: To start we’ll go over the very basic idea. The goal is to have a few starter links, and either scrape more links off of them or scrape the data and store it in a database. To do this we’ll use the Python 2 module urllib2 (urllib.request in Python 3). Here’s a sample starter script:
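A minimal sketch of such a starter script, assuming Python 3, where urllib2's Request and urlopen live in urllib.request. The TIMEOUT value, the User-Agent string, and the fetch_page name are placeholders chosen for illustration, not values from the original script.

```python
import urllib.request  # Python 3 home of urllib2's Request and urlopen

TIMEOUT = 10  # seconds; placeholder value
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"  # placeholder header


def fetch_page(url, timeout=TIMEOUT):
    """Return the decoded body of `url`, or None if the request fails."""
    # The Request object is where custom headers (e.g. User-Agent) are set.
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        response = urllib.request.urlopen(request, timeout=timeout)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
    with response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page on one of your starter links gives back the page's HTML as a string, or None when the page could not be opened.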

To start, the Python Request object can specify a ton more stuff, and it is also how you would spoof your headers. You can read about it here. Then we make the HTTP request, which will time out after TIMEOUT seconds. If we’re successful in opening the page, we can read all the HTML by calling read(). There are also other functions supplied that allow you to get the website status code and other fun things. Now that we have this blob (this is where it is nice to have an HTML parser), we’ll use a regex to extract the data we want.

import re

linkPattern = r'href="(.*?)"'  # Regex for grabbing all the links on the page
linkRe = re.compile(linkPattern)
matchLinks = linkRe.findall(page)  # Finding all the links in page, the HTML returned by read()
print(matchLinks)

Bam. There you go: you now have a whole list of links, and the possibilities are endless. You can repeat the same process on each of these links to scrape their data and so on. This is a super basic example of web scraping but it can help you get off the ground.
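If you would rather not rely on a regex, here is a rough sketch of the same link extraction using the HTML parser that ships with Python 3. The regex approach also picks up href attributes on non-anchor tags such as link and misses single-quoted attributes; a parser avoids both. The LinkCollector and extract_links names are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(page, base_url):
    """Return absolute URLs for all anchor links found in the HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(page)
    return collector.links
```

Resolving links against the page's own URL matters once you start following them, since many sites use relative hrefs.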

areas to explore: Now that you know the basics you can do so much more:

Set up a cron job to run your scraper monthly, daily, or even every minute, the finest interval cron supports (easily save every article on Hacker News)

As I finish up my hack (largely based on searching and natural language processing), I wanted to step back and praise Google for their amazing search. Much of what they do is taken for granted, but over the course of the last two weeks I’ve really dug into the inner workings of Google’s search. I have renewed appreciation. Consider the fact that there are approximately 25 billion web pages on the Internet. Then consider that Google is able to not only come up with the most relevant articles but also do it in less than half a second. 25 billion web pages! Even naively, by taking the search query and just looking for matches, that’s a whole lot of data to go through.

But let’s take it a little further. Before you have even typed your second letter, Google is already guessing what you are searching for with, as best I can guess, some sort of Markov chain. Alright, so what? They are smart dudes after all and this should be expected. Now take a look under one of the resulting links: a nice little description made up of the sentences relevant to your query. As I looked to add this to my website, I began to wonder how they did this. First, they must store all that text from the articles somewhere (which was a deal breaker for me since I’m using one server that’s sitting under my bed). Then they have to sift through the text to find what’s relevant to the query using some natural language processing. It’s pretty neat when you think about it. Further, when I was building my search function I ran into the issue of stemmed words. If you stem a query (which is usually a good idea) you must also stem the text you’re searching in. So Google probably has 2 copies of every article, stemmed and unstemmed. Then what? Do you search the stemmed query in the stemmed text and then somehow relate that to the unstemmed text that needs to be displayed as a description? By this point, my mind is blown. But on top of all that, Google has to handle 400 million queries a day. That’s a lot of queries.
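To make the stemmed-versus-unstemmed problem concrete, here is a toy sketch of one way to relate a stemmed query back to the original text without keeping two full copies: index each stem by the positions of the words that produced it. This is only an illustration and says nothing about how Google actually does it. The toy_stem function is a crude stand-in for a real stemmer, and all names here are invented.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (e.g. Porter)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(text):
    """Map each stem to the positions of the original words that produced it."""
    words = re.findall(r"[A-Za-z']+", text)
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[toy_stem(word)].append(position)
    return words, index


def snippet(query, words, index, radius=3):
    """Return the unstemmed words around the first hit for any query term."""
    for term in query.split():
        positions = index.get(toy_stem(term))
        if positions:
            start = max(positions[0] - radius, 0)
            return " ".join(words[start : positions[0] + radius + 1])
    return None
```

The search runs on stems, but the description shown to the user is cut from the original, unstemmed word list.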

I could go on forever talking about all the stuff that needs to be done in that half of a second (advanced searches, etc), but I thought this was a good list to start with. Just to recap, here’s what goes on in just a day of searches:

for (int i = 0; i < 400 million; i++) {

Process query and predict what the user is going to query for

Handle funky searches and advanced searches

Find relevant articles out of the 25 billion pages indexed

Search articles for relevant description

Output unstemmed description

Finish in under half a second

}

I hope that next time you type a search into Google you take a second to think about the crazy technology that must be behind it. For that matter, take a second to think about any site you go to, and consider how painstakingly long it took the developer just to churn out one simple feature: how many bugs there were, and how many times he or she had to hit the refresh button on the browser in order to get that site working like a well-oiled machine. Thank you to all those brilliant developers out there creating amazing things every day.

I recently launched my new site HNInstant, and these are the nuggets of knowledge that I picked up.

What I learned

Everything I learned in this project is beyond the scope of this article because it’s just way too much, so I’ll hit the key non-technical points.

Build something and show it to the world

This is key. This is the first time I’ve ever gone all the way through on a project and put it up on Hacker News. Watching the points and feedback stream was indescribable. Nothing could take my eyes away from the screen as I watched the points and comments rack up on HN. Even the stinging comments from British people bashing the “colour” of my website felt amazing (nothing against British people, pretty much everyone bashed the color of my website). The 4 weeks of hard work paid off in those 2 hours. This is why it is great to be an engineer. Find a need and fill it.

Bust through the lulls

Every project has these, and I’ve hit them many times. It’s right when you finish coding all the “fun” stuff and you have to start the “boring” stuff. It reminds me of the days when I had gymnastics practice 6 times a week and my dad would tell me it’s the days you don’t want to go that you get stronger. In the same vein, pushing through all the boring and monotonous days of making uninteresting changes will open the door to success. For me, it was making small design changes, writing the about page (always extremely painful to write these), making the webpage compatible with all devices, and last but definitely not least, fixing darned bugs. Whatever it is though, push through it! I like writing a list of all the boring stuff that needs to get done and sticking it to my monitor. This way when I have a spare moment or two, I’ll be able to knock a few things off the list.

Make sure your website is pristine before posting to HN

I learned this one the hard way. People have no shame digging into the flaws of your project (as it should be). You have one shot on HN so do it right.

Appreciate

Bask in the glory of your hard work. There is nothing more satisfying. Then fix the site.

What I used:

Host – Me

I wiped my old desktop and installed Ubuntu server on it. This is nice because I didn’t have to pay ridiculous amounts to keep the site up. I’ll write about getting a server up and running in a later post since it wasn’t too trivial, but here’s a good guide to get you started.

Framework – Django

Easy choice for me since I was already familiar with it. Django is very easy to set up and quick to develop in, which was what I was looking for.

Database – MongoDB

This was a tough call since I’d never used it before and after reading this, I was unsure. However, I clearly wasn’t going to stress MongoDB as much as other sites do, and since I was looking to store a tree as a backend, I ultimately decided it was the obvious choice.

How I did it:

The scraper

The scraper was the most essential part of my site since it was the program that analyzed the data and organized it properly.

Parsing the article and discovering its relevance was the most difficult part for me. I’ve had limited exposure to natural language processing in my career at Stanford (thank goodness I found Python’s nltk package). After opening a link scraped off the site (which I did using PyQuery to grab the score, link, title), I used a simple parser to give me content rather than side bar information and such. To do this you can read this article here. Simply put, it measures the text to HTML tag ratio to decide whether the text is important or not (more text than HTML mark-up usually means it’s content rather than navigation bars).
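A rough sketch of the text-to-tag ratio idea, computed line by line. The threshold of 20 and the function names are assumptions picked for illustration; the article referenced above may compute or smooth the ratio differently.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # naive tag matcher, good enough for a sketch


def text_to_tag_ratio(line):
    """Characters of visible text per HTML tag on one line of markup."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(len(tags), 1)  # avoid dividing by zero


def extract_content(html, threshold=20.0):
    """Keep the text of lines whose ratio meets the (assumed) threshold."""
    kept = []
    for line in html.splitlines():
        if text_to_tag_ratio(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return "\n".join(kept)
```

Navigation bars are mostly tags wrapped around a word or two, so they score low, while article paragraphs score high.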

Once I had all the text I used the nltk package to tag all the words by their part of speech. After tagging all the words, I only took the ones that were nouns. Then I stored the stemmed version of the word along with the original word and the frequency in a dictionary. I repeated this same process with the title of the articles. This part I could definitely use some guidance on, so any suggestions would be welcomed.
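The noun-filtering step might look roughly like this. It assumes Penn Treebank style tags, where noun tags start with "NN", which is what nltk.pos_tag produces for English by default. The dictionary layout and the noun_frequencies name are guesses made for the example, not the site's actual schema.

```python
def noun_frequencies(tagged_words, stem):
    """Build {stem: {"originals": set of words, "count": frequency}} for nouns.

    tagged_words: (word, tag) pairs, e.g. the output of nltk.pos_tag.
    stem: any callable mapping a word to its stem.
    """
    table = {}
    for word, tag in tagged_words:
        if not tag.startswith("NN"):  # NN, NNS, NNP, NNPS are the noun tags
            continue
        key = stem(word.lower())
        entry = table.setdefault(key, {"originals": set(), "count": 0})
        entry["originals"].add(word)
        entry["count"] += 1
    return table
```

With NLTK installed and its tokenizer and tagger data downloaded, the inputs could come from nltk.pos_tag(nltk.word_tokenize(text)) and nltk.stem.PorterStemmer().stem.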

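Once I had parsed all my data, I constructed a suffix tree to store it in (strictly speaking a trie, or prefix tree, since each word is stored by walking its letters from the first one). The node structure was as follows: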

Then to store a word I’d simply walk down the tree. For example, suppose I was storing the word ‘web’:

‘w’ – search for the character ‘e’ in the children array of ‘w’; if it exists, load it; if it doesn’t, create a new child node ‘e’
‘e’ – search for the character ‘b’ in the children array of ‘e’; if it exists, load it; if it doesn’t, create a new child node ‘b’
‘b’ – no more letters in the word; if the node exists, add the doc dictionary to its _docs; if it doesn’t exist, create a new node and add the doc dictionary to its _docs
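The walk above can be sketched in a few lines. Only the children collection and the _docs field come from the description; the char field, the use of a dict in place of a children array, and the store_word name are assumptions. Note that what is described is a trie (prefix tree) rather than a suffix tree in the textbook sense.

```python
class Node:
    """One character in the tree, with the docs for words ending here."""

    def __init__(self, char):
        self.char = char
        self.children = {}  # char -> Node (a dict instead of an array)
        self._docs = []     # doc dictionaries for words that end at this node


def store_word(root, word, doc):
    """Walk down from root one letter at a time, creating nodes as needed."""
    node = root
    for char in word:
        child = node.children.get(char)
        if child is None:
            child = Node(char)
            node.children[char] = child
        node = child
    node._docs.append(doc)  # no more letters: attach the doc here
    return node
```

Words that share a prefix, such as "we" and "web", share the same path of nodes, which is what makes the per-letter lookup cheap.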

In addition to using a suffix tree, I also used another table that was a collection of nodes containing the title, score, link, and timestamp. This was used for the “what’s hot” button to quickly get the highest scoring articles.

The server side code

The server side code would simply handle two types of requests: an ajax call from a search, and an ajax call from the “what’s hot” button.

To handle the search, I would first break up the query and stem all the words. Then for each word, I’d traverse the suffix tree. If nothing was found, I’d return nothing; if I ended on a node with no docs, I would find the nearest node that had some using a breadth-first search (this was how I got my suggestions). I’d repeat this so I’d have a resultant set of docs for each word in the query. After processing all the words, I’d take the intersection of all the sets to get the set of titles that match all the query terms. Then simply hand over the data to the front end. This is convenient since it takes O(n), where n is the length of the query, to look up all the words, and each intersection (one per word) takes time roughly proportional to the size of the smaller set. Thus, it made for a very quick lookup while the result sets are small, but I fear that it’ll scale badly. I have to return all the results since it’s hard to return a minimal set and keep track of what articles have already been displayed. In other words, it’s really hard to do pagination since I’m working with sets and not an ordered data structure.
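Here is a compact sketch of that lookup, under a few assumptions: docs are represented by ids in a set rather than doc dictionaries, the breadth-first search only looks at descendants of the node where the walk ended, and the stem argument defaults to plain lowercasing in place of a real stemmer. All names are invented for the example.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self._docs = set()  # doc ids for words ending at this node


def insert(root, word, doc_id):
    node = root
    for char in word:
        node = node.children.setdefault(char, Node())
    node._docs.add(doc_id)


def docs_for_word(root, word):
    """Walk the tree for one word; fall back to the nearest node with docs."""
    node = root
    for char in word:
        node = node.children.get(char)
        if node is None:
            return set()  # nothing found for this word
    queue = deque([node])  # breadth-first search from where the walk ended
    while queue:
        current = queue.popleft()
        if current._docs:
            return set(current._docs)
        queue.extend(current.children.values())
    return set()


def search(root, query, stem=str.lower):
    """Intersect the doc sets of every word in the query."""
    words = [stem(word) for word in query.split()]
    if not words:
        return set()
    return set.intersection(*(docs_for_word(root, word) for word in words))
```

Because the breadth-first search visits nodes level by level, the first node with docs it reaches is the closest completion of what was typed, which is what makes it usable for suggestions.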

The “what’s hot” button utilized the other table where the nodes only contained the title, link, score, and timestamp. This made it easy to do a simple query to grab the highest scoring articles from the past two weeks and limit it to 100 results. This feature proved to be pivotal since it provides a function for the site if the user doesn’t know what to search for.

The front end

This was easy thanks to jQuery. All that was needed was an ajax request to the server when someone typed a letter or hit the “what’s hot” button. It was also necessary to limit the number of requests in a certain time period in case the user typed quickly. It would send at most one ajax request every 0.3 seconds; if keystrokes came faster than that, it would wait until the typing slowed or stopped. I also did a few cool tricks with CSS by looking at a few blogs here and there.

Future plans:

Have options to switch the orange to white and the white to orange (many complaints on this)

If you’re anything like me then pumping out 12-hour-a-day programming sessions is not out of the ordinary. It almost feels natural to sit for hours at a time staring into the terminal. This is actually quite an astonishing feat – it takes time to build up the endurance to pull this off 3, 4, 5, 6 and 7 (hopefully not 7) times a week. After all, it’s not everyone that has the dexterous fingers of a programmer and the mighty reach of the pinky finger to hit that esc key in all those vim sessions. If I am not programming, then I’m usually thinking of what I can program next – a vicious cycle that never seems to end. However, with all those hours spent typing away, the novelty and sheer joy of programming had begun to slip away without me noticing.

With the approach of Thanksgiving break, a spark of joy ran through me – 7 whole days of uninterrupted, purposeful programming. No dumb school programming assignments to do, just my own ideas projected onto the screen. Yet, when break came, I had no drive. Believe me, I tried. I felt like a monkey who no longer liked bananas. So I picked up a book and read my break away. Not a programming book, just a pure fantasy – dragons, dwarves, elves, humans and all things in between.

And man did it feel good! It was like wiping away the condensation that has slowly built up unnoticed on the windshield. Life had new meaning! Well maybe I won’t go that far, but it sure did re-energize me. Perhaps this was just a newbie programmer mistake and most know the wonders of a nice break. However, for those who, like the old me, believe that they are immune to the slow and eventual decay of motivation, or even those who do know but have forgotten the benefits, I implore you to take advantage of the upcoming winter break. And when you take a break, really take a break. It’s easy to take a small vacation physically and have your mind still coursing over the same-old-same-old.

This is very important to all, but especially the programmer. Computers are everywhere now and it is getting increasingly hard to unplug yourself. Every small inconvenience can be seen as the potential for a new app or new website. This kind of thinking has its place, but not when you’re on break. Let everything go and you’ll be amazed by the effects it has. For me, I got right back into the programming groove without missing a beat and had renewed appreciation. So do yourself a favor and with Christmas, Hanukkah, and <insert your winter holiday here> coming quickly, give yourself a real break.

I caved. I bought a WordPress theme. $45 I swore I would never spend, and what’s more, against the advice of my girlfriend (and she always gives good advice). I don’t know what overtook me: perhaps it was that ad that I have to look at every time I log into WordPress, perhaps I was bored with the old theme. Whatever it was, it beat my resolve. I feel like the guy who buys the Snuggie because “if you call now, we’ll give you another for free!”.

While purchasing the WordPress theme was probably an impulse buy, it does bring up an interesting question. Does the look of your site make that much difference? Unfortunately, much to the chagrin of hard-core techie programmers, the wrapper often matters almost as much as (if not more than) the content itself. People learn best by example, right? So let’s look at one.

Hipmunk vs. Expedia

In case you’re not familiar with Expedia or Hipmunk, they’re basically services that allow you to search for the lowest airfare. Both sites give roughly the same results back. Here’s a picture of the two homepages side-by-side:

Clutter

This is immediately obvious when you look at the two sites. When visiting Expedia my eyes seem to glaze over, darting from box to box trying to sift through the ads to find the actual product. I guess Expedia is hoping that by the time I actually find it, I’ll be convinced that if I buy a flight from Expedia I’ll have more money in my bank account (who wouldn’t after reading the word ‘save’ 17 times). In contrast, Hipmunk does away with all the useless ads (do people actually click those things?). The only thing that seems to distract me from finding a flight is the little chipmunk holding a slingshot. And let’s be honest, who doesn’t enjoy looking at the chipmunk? You can lose your audience in the few seconds it takes them to find your product, so although it’s tempting to rake in all that money from ads, refrain from having more space devoted to ads than content. I understand ads can be an important revenue source for websites, but do it in more inconspicuous ways.

Simplicity

If your site is easy and simple to use, people will come back. Think of Chipotle, they make burritos, but so does every other Mexican restaurant. Chipotle beats them on simplicity, there’s no easier way to pick out a burrito than to simply walk down a line of ingredients and point at an intriguing looking salsa. Other restaurants will often bombard you with menu choices where you’re forced to read the fine print of every item, only to be foiled when mushrooms somehow slip into your burrito. The same idea applies to products. Sometimes choices overwhelm the user rather than aid in the decision process. Expedia presents you with a sea of radio buttons. To be exact there are 12 radio buttons and 1 check box – not to mention that the form changes every time you select a different radio button. Comparatively, Hipmunk has the 4 essentials for planning a flight. They offer more complicated features but don’t force them on you unless you need them. If most of your users only need a subset of your available features, then by all means, only present them with such.

Pizzazz

That extra somethin’ somethin’. Every site can more or less do away with clutter and make their site simple, but to really stand out you need something extra. Hipmunk accomplishes this with its friendly and inviting chipmunk. Apple accomplishes this with their sleek and slender products. Expedia doesn’t accomplish this. There’s no intangible force guiding me back to Expedia. Apple can captivate crowds of people in their stores, produce unwavering fanboys and sell ridiculously priced products all because they have pizzazz. Despite all the banter about Mac vs PC, for the lay user, Mac and PC provide very similar products. They both edit text, browse the internet and save photos. But they are sold for dramatically different prices. Having pizzazz is difficult to obtain, but creating a product that has it will elevate you above your competitors. This post talks all about how to give your product pizzazz.

So what can you do?

Cover your bases

Take care of the easy ones first. Make sure your site is simple, direct and to the point. Often this is difficult when you’re the one making the site. Feedback is critical during this stage.

Discover your inner product

Finding that pizzazz in your website or product is often the hardest thing to do. Obviously it takes a little creativity, but there are good methodical ways to help you through.

Create a word cloud – One good tactic that works for me is to create a word cloud. Mapping out a nice web of all the words that are related to your website can help you understand what your website is really about. Sometimes staring at your web long enough will reveal a common theme. Use that theme to make design choices in your product.

Take a shower – It’s no surprise that some of the best ideas are discovered during a nice hot shower. There are no facebook notifications or text messages to disrupt your shower time – use the time to brainstorm how you want your product to feel (will it be clean and crisp? friendly?)

Don’t stop until you get enough – It’s not enough to simply brainstorm one idea for your product. Maybe your first idea will be your best idea, but many times it is not. Keep thinking of how you want your product to feel. After you come up with 2 or 3 ideas, you’ll have more perspective and be able to choose the best.

Be bold – Some crazy ideas work well, some don’t, but either way you’ll get yourself noticed. Play with something bold and new. The Hipmunk chipmunk is a perfect example. A chipmunk as a logo isn’t usually one’s first idea for an airfare finder website, but it was bold, and it succeeded in creating a new feel for the product.

These tips will by no means guarantee success, but they can help you get on the right path.

—-

As hard as it is to swallow, design is important. So do what it takes to make an awesome website – grab those CSS stylesheets by the throat and wrestle them into submission and curse Internet Explorer all the while. You have one shot at impressing the user, so use it wisely.

Recently Stanford has started a new initiative to bring free classes to the public. From the statistics I’ve seen, this venture has been extraordinarily successful with over 100,000 sign-ups. Most likely only a fraction went through with the class, but that’s still a lot of people, especially for the first time. There has been quite a lot of press about these classes, but none of it seems to take into account the effects they have on the students that attend Stanford. Despite the success and the rave reviews, I was not at all satisfied by CS229a: Applied Machine Learning, one of the three courses offered to the public fall quarter. Before I begin though, I want to say that I completely agree that education should not be locked up for only a few to use and I also agree that since education, in my mind, is a right, it should be provided for free. Thus the Stanford initiative to do this is a great thing. However, there are quite a few things that hopefully Stanford will change in the future.

Rigor

First and foremost, the academic rigor of Stanford classes should be upheld. Going into CS229a, I knew it was going to be easier than its counterpart, CS229, since 229a focused on the applied side of machine learning and thus we didn’t have to learn the nasty mathematical part. In case you’re not familiar, the format of the class is as follows: each week, watch 5-6 online videos (~10 minutes each), complete some review questions, and complete a programming assignment.

Since the video lectures were excellent in the class, I’ll start with the programming exercises. At the beginning, some of the programming assignments were challenging since I wasn’t used to MATLAB/Octave programming or machine learning. However, the level of difficulty dropped off drastically as the quarter progressed. At its worst, I completed a few programming assignments without even knowing that the corresponding lectures had been released (I have never done machine learning in the past). This is not a tribute to a stroke of brilliance I had, but rather to how worthless the assignments became. I completed the program without even knowing what I was doing. The PDFs and comments associated with the programming assignments became so informative and gave so many hints that almost no critical thinking was needed. After talking to a TA (teaching assistant), it seemed that the programming assignments were tailored to fit the needs of the public (apparently large streams of questions came in after the first assignment was released). It’s definitely fair that there would be a lot of questions, since people come from all kinds of different backgrounds, but to sacrifice critical thinking so that there are fewer questions is not something I’m OK with.

Next, there were the review questions. These were simple from the beginning to the end. I don’t have as much of a tiff against these as sometimes it’s good to just refresh what you learned in the lectures, but the questions hardly ever asked anything that the lecture didn’t explicitly state. A little thinking would have made these more interesting.

If these classes are going to be labeled as Stanford classes, then they should be taught as such. CS229a has been by far the easiest CS class (besides maybe the final project) I’ve taken at Stanford. Normally, I wouldn’t have had a problem with this, except now that Winter quarter registration has opened and I have found that half of my classes are now open to the public in the online format, I’m worried that the rest of the classes will follow this trend. If all of my classes suddenly become as easy as 229a, I will be seriously disappointed. I came to Stanford primarily to learn and study – classes like CS229a don’t satiate that desire. Perhaps it’s a fluke and the other online classes will be much more difficult, but it is still worrisome. Stanford needs to keep rigor even in their online courses – it’s useless to lower the bar so low that it only takes a small step to get over.

Separation

In the future, I think it’d be best for Stanford to separate the students from the public for a few reasons.

Online lectures suck. Sure, they’re great for rainy days or people learning at a distance or people that don’t go to Stanford. However, these new classes are getting rid of in-person lectures completely. I met barely anyone in my CS229a class. Everything was done alone in my room, which is kind of crappy especially when there is such a nice campus right outside. If Stanford is going to offer these classes, then by all means offer them, but don’t make students take them as well. Have the professors teach as many students as they can in-person and the rest can watch online.

Stanford “free” classes aren’t free. Stanford students have to pay for them. The fact that I’m paying for them doesn’t bother me, the fact that people who aren’t paying for them have changed the class more than the ones who have, does. I’m sorry, but if I’m going to have to pay $50,000 a year to go to Stanford then the classes should be tailored to fit the students – not a working professional who wants to learn a little machine learning on the side. That is why I propose that they should separate the classes. Then if the assignments aren’t clear enough or whatever, the online public version can tailor to suit their needs and the in-person version can tailor to suit the students.

If all of Stanford’s classes are to be open to the public, then all those classes will quickly lose their value. By establishing a separation between the students and the public, Stanford will maintain the value of the classes for its students and the public will still be able to learn about a variety of topics.

—-

The initiative that Stanford has taken to open up education is great. However, God help me if all my classes become 2 hour weekly online lectures with review questions and auto-graded programming exercises. Stanford can expect a letter from me asking to get a cut in my tuition if the classes begin to go the way of CS229a.

As a student, I’ve had a fair number of interviews and have discussed with others some of the ridiculous things companies do during the interview process. So I decided to put together a guide to concoct the most painful interview process based on things I’ve heard and experienced. Enjoy!

Step 1: Email potential interviewee

Start out normal. Email the victim interviewee explaining how interested you are in finding out more about him – how you’d like to see if he or she would be a good match for the company.

Keep the facade going, lure him in by seeming like you will actually schedule this interview. This is an important step, this gives the interviewee hope. That ‘hope’ has to be enough so that he will stick with you until the dirty end.

Step 3: Wait until after those dates have passed to respond

Perfect way to start the process. You’ve probably peeved him a little, but his emails are still full of niceties like “Thank you” and “Hello.”

Step 4: Repeat Steps 2 & 3 a sufficient number of times

Break his will by repeating steps 2 & 3. What’s a “sufficient” number of times? Some key indicators are: the failure to address you by your name at the beginning of the email, no more “Thank you”s, noticeable terseness in his responses.

Step 5: Give the wrong address to your interviewee

So you’ve finally worked out a date. He’s probably pretty relieved and he might even ask you where the office is. Nail him here by “accidentally” sending him the wrong address. He’s a busy guy so taking time to drive somewhere during the work day is most likely a hassle at the least. Even better, the interviewee is a student and doesn’t even have a car. Now that student has to pay an arm and a leg to get/borrow someone’s car. Score.

Step 6: Reschedule with correct address

You can lose him here, so your apology email must be sincere. Pretend like you’ve turned over a new leaf. Reschedule the interview with new vigor and assure him that you’ve given him the correct address.

Step 7: The interview

Give the guy hope. Toy with his emotions as much as possible. Tell him how much you need his skills and how he would be a great addition to the team. Throw him softball questions that you know he’ll knock out of the park. Make sure the interview lasts a long time – you want to make sure you really wasted his day. Then end the interview with a firm handshake and say “We’ll get back to you soon.”

Step 8: Never Respond

This is key. Up to this point, it’s all tolerable – the interview process is usually not fun. Miscommunications happen all the time, your interviewee can look past this. Long interviews are fine when your interviewee sees the value in it and got a chance to meet some of your team. However, simply not responding after he comes in for an interview, after taking time out of his day, after sweating through all your interview questions is a small push that will make him never want to come back. Maybe it’s because it’s so easy to write a response or maybe it’s because an answer would put to rest his thoughts – whatever it is, this usually does the trick.