While using reddit, I have noticed that many titles have "(Pic)", "[Picture]", "(Video)", and the like after them. It means that the content the link points to contains a picture or video. Sometimes I want to have fun with my friends and go through all the pics or vids. Unfortunately, reddit's search is broken and there is really no good way to browse the best pictures and videos voted up on reddit in the past.

I decided to create a reddit media site which will monitor reddit's front page, collect picture and video links, and build an archive of them over time.

The site has been launched.

On the about page of this blog I wrote about one of the methods I like to use when developing software (and this project required writing a few tools quickly; more about them below). It is called "the hacker's approach". The hacker's approach basically means writing software as fast as possible, using everything available, without thinking much about the best development practices and without worrying what others will think about your code. If you are a good programmer, the code quality produced is just a bit worse than if you wrote it carefully, but the time saved is enormous.

I will release the full source code of the website, along with all the programs that generate it. I will also blog about how the tools work and what ideas I used.

Reddit Media Website's Technical Design Sketch

I use DreamHost shared hosting to run this website. Overall it is a great hosting company and I have been with them for more than a year now! Unfortunately, since it is shared hosting, the server sometimes gets overloaded and serving dynamic pages can become slow (a few seconds to load).

I want the new website to be as fast as possible even when the server is somewhat loaded, and I do not want any dynamic page generation to be involved when the website is accessed. Because of this I will go with generating static HTML pages.

A Perl script will run every 30 minutes from crontab, fetch the reddit.com front page, and extract the titles and URLs. Another script will add the titles to a lightweight SQLite on-disk database, in case I ever want to make the website dynamic. And a third script will use the entries in the database to generate the HTML pages.

Technical Design

A knowledgeable reader might ask whether this design has a race condition at the moment a new static page is generated while a user requests that same page. The answer is no. New pages are first written to temporary files and then moved in place of the existing ones. The website runs on Linux, and looking up `man 2 rename' we find:

If newpath already exists it will be atomically replaced (subject to a few conditions - see ERRORS below), so that there is no point at which another process attempting to access(2) newpath will find it missing.

The rename system call is atomic, which means we have no trouble with race conditions!
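The scripts themselves are in Perl, but the write-to-a-temp-file-then-rename trick is the same in any language. Here is a minimal sketch in Python (the function name is mine):

```python
import os
import tempfile

def write_page_atomically(path, content):
    # Write the new page to a temporary file in the same directory
    # (rename(2) is only atomic within a single filesystem), then
    # move it over the old page in one atomic step.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.rename(tmp_path, path)  # atomically replaces the old page
    except Exception:
        os.remove(tmp_path)
        raise
```

A reader requesting the page always sees either the complete old version or the complete new one, never a half-written file.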

Reddit provides an RSS feed of the front page news. It contains the 25 latest posts, of which maybe 5 are media links. That is not enough links to launch the website with. People visiting the site would get bored with just 5 links and a few new ones added daily. I need more content right at the moment I launch the site. Alternatively, I could launch the site later, when articles have piled up. Unfortunately, I do not want to wait; I want to launch it ASAP! The hacker's approach!

First, I will create a script which goes through all the pages on reddit looking for picture and video links and inserts the found items into the database. It will match patterns in link titles, as well as domains which exclusively host media.
Here is the list of patterns I could come up with which describe pictures and videos:

picture, pic, image, photo, comic, chart, video, vid, clip, film, movie
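A sketch of how such title matching might look (this is my own illustration in Python, not the actual reddit_extractor.pl code; the function name and the grouping of keywords are assumptions):

```python
import re

# keywords that suggest a picture or a video in a post title
MEDIA_PATTERNS = {
    "picture": ["picture", "pic", "image", "photo", "comic", "chart"],
    "video":   ["video", "vid", "clip", "film", "movie"],
}

def classify_title(title):
    """Return 'picture', 'video', or None for a reddit post title."""
    for media_type, words in MEDIA_PATTERNS.items():
        for word in words:
            # match the keyword as a whole word, singular or plural
            if re.search(r"\b%ss?\b" % word, title, re.IGNORECASE):
                return media_type
    return None
```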

The script also matches domains which exclusively host media, such as youtube.com.

This script outputs the found items in a human-readable format, ready to be fed to another script which absorbs this information and puts it into the SQLite database.

This script is called 'reddit_extractor.pl'. It takes one optional argument: the number of reddit pages to extract links from. If no argument is specified, it goes through all of reddit's pages until it hits the last one. Specifying 1, for example, makes it parse just the front page. I can now run the script periodically to find links on the front page, with no need for parsing RSS.

There is one constant in this script which can be changed. This constant, VOTE_THRESHOLD, sets how many votes a reddit post must have received to be collected by the program. I had to add it because, when digging through reddit's older posts, you can find media with just 1 or 2 votes, which means it really wasn't that good.

The script outputs each media post matching a pattern or domain in the following format:

title (type, user, reddit id, url)

title is the title of the article

type is the media type. It can be one of 'video', 'videos', 'picture', 'pictures'; it is plural if the title contains the plural form ("pics", "videos") of the media word.

user is the reddit user who submitted the link, reddit id is reddit's internal id of the post, and url is the link itself.

In the database, the main table, reddit, stores these posts. Among its columns are:

type - the type of the media, with the same values as above

date_added - the date the entry was added to the database

The other table, reddit_status, contains just two columns:

last_id - the last id in the reddit table which the generator script used for generating the site

last_run - the date of the last successful run of the generator script

This script is called 'db_inserter.pl'. It takes no arguments but has one constant which must be changed before use. This constant, DATABASE_PATH, defines the path to the SQLite database. The database does not have to exist beforehand; the script will create it on the first invocation.

These two scripts, used together, can now be run periodically from crontab to monitor reddit's front page and insert the links into the database. It can be done with a command as simple as:
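The exact crontab entry is not shown above; assuming the scripts are installed under /path/to (a placeholder path), an entry running them together every 30 minutes might look like this:

```shell
# every 30 minutes: extract front-page links and pipe them to the inserter
*/30 * * * * /path/to/reddit_extractor.pl 1 | /path/to/db_inserter.pl
```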

Now that we have our data, we just need to display it in a nice manner. That is the job of the generator script.

The generator script runs after the previous two scripts and uses the information in the database to build the static HTML pages.

Since generating static pages is computationally expensive, the generator has to be smart enough to avoid regenerating already generated pages. I have carefully commented the (pretty simple) algorithm that minimizes regeneration; take a look at the 'generate_pages' function in the source.
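The source is the authoritative reference here, but the core idea can be sketched as follows, under the assumption that pages are numbered oldest-first so that newly appended entries only touch the last page or pages (the function below is my own Python illustration, not the actual Perl code):

```python
def pages_to_regenerate(old_count, new_count, per_page):
    # Entries are only ever appended, so pages that were already full
    # before the new entries arrived never change. Only the page the
    # last old entry fell on, and any pages after it, need rebuilding.
    first_dirty = old_count // per_page
    last_page = (new_count - 1) // per_page
    return list(range(first_dirty, last_page + 1))
```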

The script generates three kinds of pages at the moment - pages containing all pictures and videos, pages containing just pictures and pages containing just videos.

There is a lot of media featured on reddit, and since the script caches things, the directory sizes can grow pretty quickly. On a file system which performs badly with thousands of files in a single directory, the runtime of the script would degrade. To avoid this, the generator stores cached reddit posts in subdirectories named after the first character of the cached file's name. For example, a cached file named 'foo.bar' is stored as 'f/foo.bar'.
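A sketch of that cache-path scheme (in Python for illustration; the helper name is mine):

```python
import os

def cached_path(cache_root, filename):
    # Shard cache files into subdirectories named after the first
    # character of the file name, e.g. 'foo.bar' -> f/foo.bar.
    return os.path.join(cache_root, filename[0], filename)
```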

The other thing this script does is locate thumbnail images for the media. For YouTube videos, for example, it constructs the URL to their static thumbnails. For Google Video I could not find a public service for easily getting the thumbnail; the only way I found is to fetch the contents of the actual video page and extract the thumbnail from there. The same applies to many other video sites which do not tell developers how to get a video's thumbnail. Because of this I had to write a Perl module, 'ThumbExtractor.pm', which, given a link to a video or picture, extracts the thumbnail.
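For YouTube the thumbnail URL really can be built statically: a video with id ID has a thumbnail at img.youtube.com/vi/ID/default.jpg. A sketch of that construction (in Python for illustration; the function name is mine):

```python
from urllib.parse import urlparse, parse_qs

def youtube_thumbnail(url):
    # Pull the video id out of a watch URL and build the static
    # thumbnail URL that YouTube serves for every video.
    video_id = parse_qs(urlparse(url).query).get("v", [None])[0]
    if video_id is None:
        return None
    return "http://img.youtube.com/vi/%s/default.jpg" % video_id
```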

Some of the links on reddit point directly to images. I don't want the reddit media site to take long to load, which is why I set out to find a way to cache small thumbnails on the server where the website is generated.

I had to write another module, 'ThumbMaker.pm', which downloads the image, makes a thumbnail of it, and saves it to a known path accessible from the web server.

Netpbm is a toolkit for manipulation of graphic images, including conversion of images between a variety of different formats. There are over 300 separate tools in the package including converters for about 100 graphics formats. Examples of the sort of image manipulation we're talking about are: Shrinking an image by 10%; Cutting the top half off of an image; Making a mirror image; Creating a sequence of images that fade from one image to another.

You will need this software (either compile it yourself or get a precompiled package) if you want to run the reddit media website generator scripts!

I hit an interesting problem while developing the ThumbExtractor.pm and ThumbMaker.pm modules: what should they do if the link points to a regular website that merely contains images? There is no simple way to know which image the website meant to show its users.
I thought for a moment and came up with a simple but interesting algorithm which finds "the best" image on the page.
It retrieves ALL the images from the page, finds the one with the biggest dimensions, and makes a thumbnail out of it. The reasoning is pretty obvious: pictures posted on reddit are big and nice, so the biggest picture on the page is most likely the one that was meant to be shown.
A more advanced algorithm would also analyze each image's location on the page and weigh its score accordingly: the closer to the center of the page, the higher the score.
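The idea can be sketched in a few lines (a Python illustration; in the real module the dimensions come from downloading each image, which is omitted here):

```python
def find_best_image(images):
    """Given a list of (width, height, url) tuples, return the url of
    the image with the largest area, the one most likely meant to be
    the page's main picture."""
    if not images:
        return None
    return max(images, key=lambda img: img[0] * img[1])[2]
```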

For this purpose I developed yet another Perl module, 'ImageFinder.pm'. See the 'find_best_image' subroutine to see how it works!

The name of the generator script is 'page_gen.pl'. It takes one optional argument, 'regenerate', which clears the cache and regenerates all the pages anew. This is useful when the templates are updated or changes are made to the thumbnail generator.

While developing any piece of software, I like solving various problems on paper. For example, for this site I had to figure out how to minimally regenerate existing pages and how to resize thumbnails so they look nice.
Here is how the sheet on which I took small notes looked after the site got published:

(Sorry for the quality again; I took the picture with a camera phone in two shots and stitched them together with an image editor.)

The final website is at the redditmedia.com address (it has since moved to http://reddit.picurls.com). Click http://reddit.picurls.com to visit it!

For newcomers - What is reddit?

A source for what's new and popular on the web -- personalized for you. We want to democratize the traditional model by giving editorial control to the people who use the site, not those who run it. Your votes train a filter, so let reddit know what you liked and disliked, because you'll begin to be recommended links filtered to your tastes. All of the content on reddit is from users who are rewarded for good submissions (and punished for bad ones) by their peers; you decide what appears on your front page and which submissions rise to fame or fall into obscurity.

Have fun with the website, and please tell me what you think about it in the comments! Thanks :)

Comments

Great work! The site's awesome. I've been wanting to make a site like this for the longest time but am just now learning a bit of programming. The fact that you shared how you did it is really incredible for someone in my shoes. Thx