Grouping Related Trends with Hadoop and Hive

In a previous post, I outlined how to build a basic trend tracking site called trendingtopics.org with Cloudera's Distribution for Hadoop and Hive. TrendingTopics uses Hadoop to identify the top articles trending on Wikipedia and displays related news stories and charts. The data powering the site was pulled from an Amazon EBS Wikipedia Public Dataset containing 8 months of hourly pageview logfiles. In addition to the pageview logs, the EBS data volume also includes the full text content and link graph for all articles. This post will use that link graph data to build a new feature for our site: grouping related articles together under a single "lead trend" to ensure the homepage isn't dominated by a single news story.

Finding Related Trends Using Wikipedia Link Graph Data

For several weeks this summer, the death of Michael Jackson dominated the news and drove a large number of pageviews on Wikipedia. The hourly data charted on trendingtopics.org during the last week of June shows the spike.

This burst of interest in the Jackson family, Michael's albums, and his medical conditions pushed most topics unrelated to MJ off the front page of our site.

Ideally, TrendingTopics would continue to show a range of stories when this type of event happens. Automated news sites like Techmeme or Google News display clusters of rising articles across a number of topics to provide a better mix of content on the homepage. Techmeme, for example, shows a ranked list of related blog posts that link back to the lead article for each story.

We can build a simple version of this functionality with Hadoop by combining the article trend estimates we computed in the previous post with the Wikipedia link graph. Wikipedia provides a periodic database dump which includes a file with the outgoing pagelinks for each article. The June data dump includes a 12GB pagelink file named "enwiki-20090618-pagelinks.sql" containing roughly 13K SQL insert statements.

The first step in our data preparation uses a Hadoop Streaming job to convert the MediaWiki SQL insert format into a tab-delimited text file ready for further processing with Hive. The job uses a Python mapper called "parse_links.py" to convert the SQL insert statements into a tab-delimited format. The script uses a pair of regular expressions to extract the graph edges from each SQL statement and emits a tab-delimited record whenever the row belongs to namespace 0 (which indicates the page is an article).

parse_links.py

import sys, os, re

insert_regex = re.compile('''INSERT INTO `pagelinks` VALUES (.*)\;''')
row_regex = re.compile("""(.*),(.*),'(.*)'""")

for line in sys.stdin:
    match = insert_regex.match(line.strip())
    if match is not None:
        data = match.groups(0)[0]
        rows = data[1:-1].split("),(")
        for row in rows:
            row_match = row_regex.match(row)
            if row_match is not None:
                # >>> row_match.groups()
                # (12, 0, 'Anti-statism')
                # page_id, pl_namespace, pl_title
                if row_match.groups()[1] == '0':
                    page_id, pl_title = row_match.groups()[0], row_match.groups()[2]
                    sys.stdout.write('%s\t%s\n' % (page_id, pl_title))

You can test this script at the unix command line on a single machine before running it on Hadoop. Here we pipe a sample of the file into the mapper script and then examine the results with grep. The first column of the resulting output is the Wikipedia page_id and the second column contains other Wikipedia article titles linked to by that page_id.

$ head -35 enwiki-20090618-pagelinks.sql | ./parse_links.py > links.txt
$ grep 'Super' links.txt | more
303     Super_Outbreak
303     Talladega_Superspeedway
324     Super_Bowl
594     Tarquinius_Superbus
615     List_of_Super_Bowl_champions
615     List_of_Super_Bowl_records
615     Super_Bowl
615     Super_Bowl_XIX

We ran this script on the entire pagelinks file using a temporary Cloudera Hadoop cluster on Amazon EC2, and saved the output of the conversion process in an Amazon S3 folder called “links”. To work with these tab delimited link files on a new Hadoop cluster, we use distcp on the master node to pull the output data into HDFS from S3. We also pull in an archived export of our trendingtopics “pages” data which contains the page_id -> title mapping and monthly trend score for each article:

$ hadoop distcp s3n://trendingtopics/wikidump/links links
$ hadoop fs -rmr links/_distcp_logs*
$ hadoop distcp s3n://trendingtopics/archive/20090816/pages pages
$ hadoop fs -rmr pages/_distcp_logs*

We will use a set of Hive queries on the link graph data to generate the ranked list of top trending articles linking back to each Wikipedia article. After our data is in HDFS, we can start the Hive CLI on the Hadoop master node:

$ hive

In the Hive shell, we create a "links" table and load the raw data from HDFS:

hive> CREATE TABLE links (
        page_id BIGINT,
        pl_title STRING)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

hive> LOAD DATA INPATH 'links' OVERWRITE INTO TABLE links;

Now we have the “outlink” data for each page in Hive, but we want to reverse the keys so that we can find all trending page titles which link back to a given page_id. We can map the original page_id’s to article titles and pull in the trend data using a Hive JOIN:

hive> CREATE TABLE temp_backlinks (
        pl_title STRING,
        page_id BIGINT,
        bl_title STRING,
        monthly_trend DOUBLE)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
      STORED AS TEXTFILE;

hive> INSERT OVERWRITE TABLE temp_backlinks
      SELECT links.pl_title, pages.page_id,
             pages.redirect_title, pages.monthly_trend
      FROM links JOIN pages ON (pages.page_id = links.page_id);

A second join is used to convert the original outlink titles to page_ids for use in our web application’s MySQL database:

hive> CREATE TABLE backlinks (
        page_id BIGINT,
        bl_title STRING,
        monthly_trend DOUBLE)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
      STORED AS TEXTFILE;

hive> INSERT OVERWRITE TABLE backlinks
      SELECT pages.page_id,
             temp_backlinks.bl_title,
             temp_backlinks.monthly_trend
      FROM pages JOIN temp_backlinks
      ON (pages.redirect_title = temp_backlinks.pl_title);

The final stage of processing generates an array of the top trending backlink titles for each Wikipedia page_id. Later on, we can export the entire lookup table "backlinks_reduced" to MySQL for use in the trendingtopics Rails app. We use a Hive transform written in Python to rank the backlinks and concatenate the top 10 titles into a comma-delimited text string for each page_id.
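The original transform script isn't reproduced here, but a minimal sketch of the idea might look like the following. The script name "rank_backlinks.py", the function name, and the exact output format are assumptions for illustration, not the production code:

```python
#!/usr/bin/env python
# rank_backlinks.py -- illustrative sketch only; the name and output format
# are assumptions, not the original TrendingTopics script.
# Input: tab-delimited (page_id, bl_title, monthly_trend) rows grouped by page_id.
# Output: one "page_id<TAB>[title1,title2,...]" line per page_id holding the
# n highest-trend backlink titles.
import sys

def top_backlinks(lines, n=10):
    out = []
    current_id, links = None, []

    def flush():
        if current_id is not None:
            # rank this page's backlinks by monthly trend, highest first
            links.sort(key=lambda pair: pair[1], reverse=True)
            titles = ','.join(title for title, trend in links[:n])
            out.append('%s\t[%s]' % (current_id, titles))

    for line in lines:
        page_id, bl_title, trend = line.rstrip('\n').split('\t')
        if page_id != current_id:
            flush()  # emit the previous page's group
            current_id, links = page_id, []
        links.append((bl_title, float(trend)))
    flush()  # emit the final group
    return out

if __name__ == '__main__' and not sys.stdin.isatty():
    for row in top_backlinks(sys.stdin):
        print(row)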

The resulting lists look like reasonable sets of related pages. Here are a few backlink arrays generated for trending topics from the same time period:

Julia Child
[Julie_Powell,Julie_&_Julia,August_13,August_15,Meryl_Streep,
Mastering_the_Art_of_French_Cooking,Le_Cordon_Bleu,Amy_Adams,
Simone_Beck,The_French_Chef]

John Hughes
[The_Breakfast_Club,Molly_Ringwald,Sixteen_Candles,Pretty_in_Pink,
Weird_Science_(film),Some_Kind_of_Wonderful_(film),Curly_Sue,
Home_Alone_3,Uncle_Buck,Home_Alone_(film)]

Quentin Tarantino
[Inglourious_Basterds,Machete_(film),Pulp_Fiction_(film),
Eli_Roth,Bruce_Willis,Rosario_Dawson,Robert_Rodriguez,
Grindhouse_(film),Julie_Dreyfus,Death_Proof]

At this point, we have a reasonable approach for generating related links with relatively little effort. The Rails application can apply filters to remove titles like "August_15" from each list and display the related links around each trend.

Here are a few ideas for next steps with the Wikipedia link graph data and Hadoop:

- Show related articles by using the backlink ids as a feature vector to compute article similarity
- Compute TFIDF weights for Wikipedia articles and incorporate them into the similarity calculation
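As a sketch of the first idea: each article can be represented by the set of page_ids that link to it, and a cosine score over those binary feature vectors gives a simple similarity measure. The function name and data below are illustrative, not part of TrendingTopics:

```python
from math import sqrt

def cosine_similarity(backlinks_a, backlinks_b):
    """Cosine similarity between two articles, each represented as a set of
    backlink page_ids (a binary feature vector over linking pages)."""
    if not backlinks_a or not backlinks_b:
        return 0.0
    overlap = len(backlinks_a & backlinks_b)
    return overlap / (sqrt(len(backlinks_a)) * sqrt(len(backlinks_b)))

# e.g. two articles sharing two of their three backlinks each
# score 2 / (sqrt(3) * sqrt(3)) = 2/3:
# cosine_similarity({10, 20, 30}, {20, 30, 40})
```

TFIDF-style weights would slot in here by replacing the binary membership with a per-backlink weight that discounts pages linking to many articles.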

If you are interested in digging deeper into the link data with Python and Hadoop, you should also check out Viraj Bhat and Jake Hofman's talk "Cool Development Projects at Yahoo!: Automatic Tuning and Social Graph Analysis" next week at Hadoop World NYC. I'll also be giving a talk that afternoon called "Building Data Intensive Apps: A closer look at TrendingTopics.org".

That final SELECT into backlinks_reduced took 330 seconds, running the streaming code over 2,445,704 rows. The example Hive SELECTs against backlinks_reduced take around 160 seconds. For use in a web application, I would export those final key-value pairs to MySQL indexed by page_id, or load them into a non-RDBMS datastore. Hive is best for offline batch operations where you need to run against all the data or a certain partition.

A small correction found while reviewing the query timings: the related links displayed above are actually drawn from another table that finds the "mutual links", pages that are both outlinks and backlinks. That requirement yields a narrower set of backlink pages that are also pointed to by the target page.
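In Hive this amounts to one more join between the outlink and backlink tables; the set logic behind the mutual-links filter can be sketched as follows (a hypothetical helper, not the production query):

```python
def mutual_links(outlink_titles, backlink_titles):
    """Keep only pages that the target article links out to AND that also
    link back to it -- the "mutual links" filter described above."""
    return sorted(set(outlink_titles) & set(backlink_titles))

# e.g. the target links out to A, B, and C, but only B and C link back,
# so only B and C survive the filter:
# mutual_links(['A', 'B', 'C'], ['B', 'C', 'D'])
```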