
Saturday, February 20, 2016

AGBT16 Storify Completion & Rate Limits

AGBT16 ended a week ago, but for various reasons I'm just now catching up on my Storify project. A vacation was in there, but also some tool building. As I was griping about the pains of organizing the tweets manually, Brian Krueger suggested what was already dawning on me (but it helps to be poked -- professional embarrassment is often a stronger motivator than pure annoyance): I needed to stop doing this purely manually. So, off to deal with pulling in Tweets automatically and at least doing some of the organization programmatically.
I had some code I used before for pulling Tweets -- two years ago! Python, of all things -- if I remember correctly the Perl libraries were unable to deal with the newish Twitter authentication scheme. Someday I'll give up on Perl entirely, but this project would only be a partial break. Note that if you want to try the code yourself, you need to get API keys and other authentication doodads from Twitter and Storify -- it is easy, but a necessary step.
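
For the record, the Twitter half of that credential dance is only a few lines with tweepy -- this is just a sketch, and the four strings are placeholders for the keys and secrets Twitter issues when you register an app:

import tweepy

# Placeholders -- substitute the consumer key/secret and access token/secret
# that Twitter issues when you register an application
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)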

However, that previous program was dealing with favorites, not hashtags, and the interface for querying was a little different. I found the documentation for tweepy (one of a gaggle of Python libraries for dealing with Twitter -- I can't remember why I picked this one) frustrating, but a bit of Googling turned up a nearly ready-made program that would not only query Tweets but dump them to an SQLite database. Sweeeeet.

I tried that out, and quickly bumped into Twitter's rate limit on pulling tweets. I understand the need to avoid deliberate or inadvertent denial-of-service attacks, but on the other hand program development and debugging are seriously hindered when you hit these limits. I found some code using cursors and exception trapping to get around this -- and it kept failing to work. Since I wasn't in a hurry (the Tweet-pulling was during vacation evenings), I just had the program sleep for a second after each request -- this stays under the rate limit, though it is a dumb way to do things. The program also does a commit after each write -- this saved the tweets already fetched during earlier attempts that hit the rate limit. I also ran into trouble trying to generate plain-text output at the same time -- Python exited complaining that something I tried to print wasn't legal (I forget the exact message now) -- presumably some special characters.
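
For the curious, the shape of the final fetch-and-store loop was roughly the following -- a sketch rather than my exact script, but the sleep-after-each-request and commit-after-each-write tricks are as described above, and the table columns are the ones the Perl pre-processor below expects:

import sqlite3
import time
import tweepy

# Same placeholder credentials as in the snippet above
CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_TOKEN_SECRET = "...", "..."
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

conn = sqlite3.connect("data/agbt16.3.db")
cur = conn.cursor()
cur.execute("create table if not exists tweets "
            "(id text primary key, name text, timestamp text, text text)")

# api.search was the search call in the tweepy of the day
# (later releases renamed it search_tweets)
for page in tweepy.Cursor(api.search, q="#AGBT16", count=100).pages():
    for tweet in page:
        cur.execute("insert or ignore into tweets values (?,?,?,?)",
                    (tweet.id_str, tweet.user.screen_name,
                     str(tweet.created_at), tweet.text))
        conn.commit()  # commit every write, so nothing is lost if we get cut off
    time.sleep(1)      # dumb but effective: a one-second pause stays under the rate limit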

Next, some quick Perl to pre-process the data, parsing out candidates for author-identifying tags. I didn't get this quite right, but close. Parsing tags is hard, starting with the different styles. For example, if I were speaking, some tweeters would use "Robison:" and others "KR:", while still others might omit the colons. Inevitably there would be a "Robinson:" in the mix. Since I tweet, some would tag with "@omicsomicsblog". And more permutations.

Perhaps a future iteration (maybe in Python?) will try to use some more smarts here. Time information is a useful prior, though sometimes Tweeters post after a talk so they can give a more thought-out summary. I have a bunch of other ideas on how to try to enhance this, which I might play around with.

#!/usr/bin/perl
# Pre-process the tweet dump: strip the hashtag, drop retweets, and take a
# first guess at a speaker tag from the start of each tweet, emitting one
# tab-delimited line per tweet for import into Excel
use strict;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=data/agbt16.3.db")
    or die "Cannot open database: $DBI::errstr\n";
my $sth = $dbh->prepare("select id,name,timestamp,text from tweets");
$sth->execute;

while (my ($id, $name, $timestamp, $text) = $sth->fetchrow_array)
{
    $text =~ s/^\#AGBT16 +//i;      # drop the leading hashtag
    $text =~ s/\n/ /g;              # flatten multi-line tweets
    next if ($text =~ /^RT/ || $text =~ / RT /);            # skip retweets
    next if ($text =~ / mt /i && $name =~ /SeqComplete/);   # skip "modified" copies from one serial offender
    my $speaker = "";
    if ($text =~ /^([A-Z\-]+)[:,]/i)
    {
        $speaker = $1;                                 # leading "Robison:" / "KR:" style tag
        $speaker = "None" if ($speaker =~ /Chirp/);
    }
    $timestamp =~ s/ /\t/;          # split date and time into separate columns
    print join("\t", $id, $timestamp, $speaker, $name, $text), "\n";
}

That data went into Excel, which is pretty good for viewing this sort of data (as I have advocated before), though I did initially shoot myself in the foot by importing the data carelessly -- the Twitter ids are integers long enough to blow out Excel's numeric precision.
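
If you hit the same wall: Excel silently rounds integers beyond 15 significant digits, so the ids need to come in as text. One workaround is the ="..." wrapping trick, sketched here as a little filter over the tab-delimited output above:

import sys

# Wrap the first (tweet id) column in Excel's ="..." idiom so the ids are
# imported as text instead of being rounded to 15 significant digits
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    fields[0] = '="%s"' % fields[0]
    print("\t".join(fields))

Run it over the tab-delimited file before opening it in Excel; alternatively, the Text Import Wizard with the id column set to Text does the same job.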

Once in Excel, I could improve the column of tags to catch missed items. I also removed tags for modified tweets and for blatant copies of tweets -- there is one particular Tweeter who sometimes adds value, but much too often copies others' content without attribution -- in this day of quoting tweets there is no excuse. I won't fully shame them here, but let's just say in the world of Seq you ain't Complete without good faith efforts at attribution. And in that vein, let me heartily thank everyone who did live-tweet the meeting -- without you, none of this would be possible. It looks like 235 different individuals tweeted with the #agbt16 hashtag, only a handful of them spammers. My one request for the future would be to clearly tweet out when talks are scratched or changed in timing or speaker -- that would be useful to know with certainty. There's also a nice collection of blogs on AGBT16 pulled together by AllSeq.

Once that was done, I copied-and-pasted the lists of Twitter ids for each speaker back to Linux. Another bit of Perl programming then generated the Storify API upload file. This took a lot of iterations, which were slowed by Storify's own rate limit -- only 10 posts per hour, which seems a tad low. Partly the trouble was my inattention to detail -- it turns out an API key truncated by one letter won't work -- but it wasn't helped by the available resources being slightly out-of-date. For example, for the Storify URL one must now use an https address, not http. But, that's all solved.

#!/usr/bin/perl
# Turn a pasted list of tweet ids (one per line, id first on the line) into
# the upload file for the Storify API, one story per speaker
use strict;
use Getopt::Long;

my ($title) = ();
GetOptions("t|title=s", \$title);
die "Must supply a story title with -t\n" unless (defined $title);

my @elements = ();
while (<>)
{
    my ($id) = (/^([0-9]+)/);
    push(@elements, "\"http://twitter.com/#!/Storify/status/$id\"") if ($id > 0);
}
# Emit the story title plus the tweet permalink list as a minimal JSON payload
# (a simplified sketch of the upload file, not the full Storify format)
print "{\"title\":\"$title\",\"elements\":[", join(",", @elements), "]}\n";

So, here is the final result. I've copied the AGBT schedule, with each talk linked to its Storify. Editorial comments are in bold italics. I may continue to edit these a bit -- James Hadfield has blog entries on many of the talks that I really should add in, and I've also been pulling in appropriate pictures when Storify failed to add a thumbnail photo (sometimes because no tweets had images, but sometimes perfectly good images are mysteriously ignored).

Felicity Jones, Friedrich Miescher Laboratory of the Max Planck Society “Dissecting the genomic basis of adaptation in natural stickleback populations” I believe Jones' talk was a scratch -- would love to hear it some other time! Eddy Rubin spoke in this slot, but I've linked his talk at the slot where he was originally scheduled.

Stephen Lincoln, Invitae “Clinically important variants are often technically challenging for NGS: implications for NGS methods, validation and confirmation”

8:10 p.m. – 8:30 p.m.

Brendan Keating, University of Pennsylvania “Detection and validation of signatures of liver transplantation rejection diagnoses and successful minimization of immunosuppression from serum miRNA profiles”
