Friday, May 30, 2014

It is clear that I'm biased. I love Proteome Discoverer. I didn't always. When PD 1.0 came out, I was like "No thanks! I'll keep using Bioworks." I evaluated PD 1.1, but it didn't win me over until PD 1.2. I think 1.4 is on par with any shotgun proteomics package ever.

But, what if you don't have a Thermo mass spec? Do you miss out on all the fun? Not at all!

PD can support virtually any instrument. It has had this capability almost all along, but I have never had the chance to verify it. A friend of mine (who will remain nameless right now) recently left the Thermo-only lab he worked in for his postdoc and is now running a facility that has Orbitraps AND a bunch of Q-TOFs.

Where to start? Well, PD can't directly accept RAW files from other vendors, so you need to use Mascot Distiller or ProteoWizard to convert the data to a compatible file type. For simplicity, we chose MGF.
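If you have a pile of files to convert, ProteoWizard's msconvert tool is easy to script. A minimal sketch, assuming msconvert is installed and on your PATH (the directory names are made up, and the *.raw glob is just an example -- swap in whatever extension your vendor uses):

# Batch-convert vendor files to MGF with ProteoWizard's msconvert.
# Assumes msconvert is installed and on the PATH; paths are hypothetical.
import subprocess
from pathlib import Path

raw_dir = Path("D:/data/qtof_runs")   # hypothetical input directory
out_dir = Path("D:/data/mgf")         # hypothetical output directory
out_dir.mkdir(exist_ok=True)

for raw_file in raw_dir.glob("*.raw"):
    # --mgf selects the output format; -o sets the output directory
    subprocess.run(["msconvert", str(raw_file), "--mgf", "-o", str(out_dir)],
                   check=True)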

You'll also probably need to tell PD what instrument you are using in your method template.

That's what all these settings in the Spectrum Selector that we never use are for. Check out the blue box. TOFMS! Just go through those settings and match them to your instrument and save the template. Boom! Easy, right?

You'll need to set your mass tolerances according to what you normally would when running a search. But aside from that, PD will process your data just the same as described in all the tutorial videos on the right side of this screen, and you can process all of your data from all of your instruments in one (very good!) software package. One less variable to worry about!

Thursday, May 29, 2014

We have two first drafts of the Human Proteome! What did you expect me to do? Let's compare what they did and what we end up getting out of it!

First of all, both these studies are awesome and big and give our field a load of credibility, but they are very different.

Instrumentation: Both groups used Orbitraps, of course. Pandey's lab exclusively used high/high data, so their MS/MS spectra were high-resolution, accurate-mass. The Kuster lab used a mix of high/high and high/low data. Due to the increased sensitivity and speed of the high/low experiments, we'd expect Kuster to end up with more MS/MS spectra, and they do -- by a long shot -- but the overall quality of the data is probably a little better from the Pandey lab. Pandey's lab generated all of the data on their own. The Kuster study draws strongly on previously generated Orbitrap data.

Tissues analyzed:
The Pandey lab evaluated 30 histologically separate protein samples
The Kuster lab evaluated: 60 tissue samples, 13 body fluids and 147 cell lines...holy cow....this was 6,380 runs. I'm not joking. This study redefines what we consider a HUGE proteomics study
In defense of the Pandey lab, the Telegraph reported that the entire project was pulled off for under $700,000. That's pretty amazing, considering that they generated all of this data on their own!

Okay, so both of these studies kick ass. They took tons of individual tissues and painstakingly detailed them via shotgun proteomics using the world's best instrumentation. Next question? What's in it for me?!?!?

You can search by genes or by preloaded pathways, you can compare different tissues and cell lines. No instructions necessary.

The output is even more simple:

Perhaps...disappointingly simple. For this example protein we see that it is expressed in two tissues. Clicking on the gene identified doesn't help much:

We see that for this protein, the study identified one single peptide, and that it was identified in only 2 tissues. It was not identified in any other tissues, including the human pancreas. This doesn't mean that it wasn't there (not having it almost always means cancer, by the way...); it just wasn't detected.

Not as simple as the other interface, but there is a lot more that we can do here!

Searching for CDKN2A?

Wait a minute! ProteomicsDB knows that CDKN2A has important isoforms? We're looking at the data from a protein-centric level. Yes, it's less clean, but there is so much more data here! This makes me really happy. The Human Proteome Map looks at proteomics data as if it's analogous to genes, which is how we've always thought about it. ProteomicsDB looks at proteins the way Neil Kelleher and Albert Heck look at proteins, in that isoforms and variants are seriously, seriously important and we need to think about them, regardless of how much we don't want to.

What about expression profiles for this protein (I'm looking at isoform #1)?

Check out how much information is here! They must have been working on this for years! The expression tab is just one of 8 pages of information on this protein. Unreal! And the increased coverage here shows that we're seeing this protein isoform in tons of tissues (as we should... I won't show it here, but we're also seeing virtually every peptide for the protein). This is a mind-boggling amount of work and data. Unreal...

I can spend an hour looking through information on just this one protein. I'm not joking. Check it out. What if I said that you could directly examine the MS/MS spectra for every peptide identified? Would you believe me? Check it out. It's there. All of it at your fingertips. This might be the most thorough resource tool ever developed for human proteomics.

There is no way I have time to tell you everything that you can do on this page. Not without taking the day off from my real job. But I want to leave you with this bit of awesomeness:

Chromosome maps. Incredibly well curated proteomics data of every human chromosome. Expandable to just a crazy level. The amount of information here is unbelievable. Have we really come this far?!?

Let me sum this up. Both these studies obviously belong in Nature. They represent enormous undertakings that provide new information for everyone (I haven't even gotten into all the protein data we have from regions of DNA that genetics thought didn't make protein!!!! Which is a primary focus of Pandey's paper!). These are super powerful new tools that really demonstrate where proteomics is right now and where it's going.

The Pandey lab did an amazing amazing job with the resources they had to work with.
The Kuster lab just changed the scale. This may be the most thorough and sophisticated study anyone has ever done in our field, and an enormous amount of effort has gone into making all of this data available to everyone. Unbelievable.

Tuesday, May 27, 2014

While all of us were sitting around trying to decide whether to do IMAC phosphopeptide enrichment first or TiO2 enrichment, or even which one we would use if we didn't have time to do both, the Heck lab was making beads that use both.

Monday, May 26, 2014

At long last -- I can finally talk about the huge genomics project I've been working on!

Xenografts are human tumors that are grown in mice. They are important models for learning about the development of cancer and for testing in vivo drug efficacy and stuff. An increasingly common application is to take a piece of tumor from someone, grow the pieces up in a number of mice and figure out which chemotherapy drug (or combination of drugs) works best on this particular tumor. Powerful tools, right?

Well, we wondered 1) how different xenografted tumors are from cell lines growing in liquid culture and 2) how stable xenografts are through passages; in other words, how stable are xenografts if you grow one in a mouse and then take some of that and move it to another mouse? By passaging xenografts we can get more material while limiting the suffering of individual mice.

The easiest way to get a feel for how these cell lines differ? A shit ton of genomics via microarray. My job was analysis and quality control of the arrays. Did I mention there were a lot? Yeah... for just the first 49 cell lines completed (there were going to be 50, but we dropped HeLa on ethical grounds) there are 823 arrays, each representing about 14,000 genes. This project is ongoing. Eventually there will be thousands of arrays, with a series for every cell line studied by the National Cancer Institute. This paper is an introduction to the project and an overview of the first cohort.

What we've found out so far: some cells are super stable. Some cells differentiate like crazy and will be very different after a few passages. PCA can be used to determine which cell lines are permeated by mouse tissue, and it may be possible to track sensitivity to some drugs across cohorts this large when the pathways of sensitivity/resistance are well understood.
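If you want to try the PCA part on your own expression matrix, it's a few lines with scikit-learn. A minimal sketch, assuming a samples-by-genes matrix in a tab-separated file (the file name and layout are made up, not the actual project data):

# Minimal PCA sketch for an expression matrix (one row per array/sample,
# one column per gene). File name and layout are hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

expr = np.loadtxt("arrays.tsv", delimiter="\t")
expr = StandardScaler().fit_transform(expr)   # center and scale each gene

pca = PCA(n_components=2)
coords = pca.fit_transform(expr)              # one (PC1, PC2) point per array
print("variance explained:", pca.explained_variance_ratio_)

# Arrays that drift along a component dominated by mouse-specific probes
# would flag xenografts heavily permeated by mouse tissue.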

Sunday, May 25, 2014

Like a lot of scientists, I'm pretty concerned with liver regeneration -- for a variety of reasons. It's nice to know some people are concerned enough about it to do interesting proteomics studies to check it out.

In this new paper at JPR, Thilo Bracht et al. dive right in to this subject. They compare the proteomics of liver regeneration in "normal" mice to that in mice with a knockout of BIRC5, the gene encoding a protein called survivin that I hope you can tell from the name is pretty useful. They go in, chop out some of the mouse liver, let the mice grow some of it back (or try), and take the rest of the livers for proteomics studies later.

Interesting for a lot of reasons, partially because the changes in STAT1/STAT2 are at the expression level rather than at the phospho level (surprise to this guy!). Direct link to the paper is here.

Saturday, May 24, 2014

I'm posting this cause I get questions on this both on this forum and through my day job. As you can tell by flipping through these posts, RawMeat is one of my favorite little programs for evaluating RAW data.

If you have a Q Exactive or Orbitrap Fusion, you will find that some of the features won't work here. These include the bar charts and time plots for ion injection times and anything that is calculated from your injection times.

I've contacted the author of the software and there are no plans at this time to update it. It is freeware, after all, and, like most of us, the author has a taxing day job. The other features of this awesome software package should, however, work just fine for these instruments. If you see other things missing, let me know and I'll update this.
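In the meantime, if you really need the injection time plots for these instruments, you can pull the numbers yourself. A rough sketch, assuming you've converted the RAW file to mzML (msconvert again) and that the file retains the 'ion injection time' value on each scan -- pyteomics and matplotlib do the rest:

# Rough sketch: rebuild a RawMeat-style MS2 fill-time histogram from an mzML.
# Assumes the mzML retains the 'ion injection time' cvParam on each scan;
# the file name is hypothetical.
from pyteomics import mzml
import matplotlib.pyplot as plt

fill_times = []
with mzml.read("run.mzML") as reader:
    for spectrum in reader:
        if spectrum.get("ms level") != 2:
            continue
        scan = spectrum["scanList"]["scan"][0]
        it = scan.get("ion injection time")
        if it is not None:
            fill_times.append(float(it))

plt.hist(fill_times, bins=40)
plt.xlabel("MS2 ion injection time (ms)")
plt.ylabel("number of scans")
plt.show()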

Wednesday, May 21, 2014

It seems like all my genetics/biologist friends know about this, but whenever I mention it to my proteomics friends no one is familiar... and it's surprisingly relevant to human proteomics.

23andMe is a personalized genetics service that uses top-notch technology to sequence important sections of your DNA and give that data to you. I've thought about it for years (in 2008 Time Magazine named it the invention of the year), but I was pretty sure it would end up disqualifying me from health insurance cause of some pre-existing condition I didn't know about. Thanks to the Affordable Care Act, I don't have to worry about pre-existing conditions anymore, so I ordered a kit and had it shipped to my parents in WV cause you can't get them delivered here in MD.

This service used to provide you with disease information, but they can't anymore due to some lawsuits from doctors or something. They do still provide you with your RAW sequencing data, though, and there are plenty of genomics tools out there. I've been digging through my big genetics file with the Interpretome software (article here, and open access!). There is so much data. But that isn't the real story.
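Quick technical aside: the raw file itself is just tab-separated text -- rsid, chromosome, position, genotype -- with comment lines starting with '#', so it's easy to script against. A minimal sketch; the file name and SNP list are just examples:

# Minimal sketch for digging through a 23andMe raw data export.
# Format: tab-separated rsid, chromosome, position, genotype,
# with '#' header/comment lines. File name and SNPs are examples.
of_interest = {"rs4698412", "rs356220"}   # example Parkinson's-associated SNPs

with open("genome_data.txt") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        if rsid in of_interest:
            print(rsid, "chr" + chrom + ":" + pos, genotype)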

The real story is the huge amount of genetic variation that we have. In proteomics, I don't think we consider it a lot. Seriously. We take the peptide MS/MS data from every human sample and we compare it against UniProt. We're making the assumption that the proteins from my plasma are going to match the protein sequences in the UniProt FASTA, which came from whoever happened to be sequenced. And for the most part, we're probably right. Albumin is albumin. It is pretty well conserved among all mammals. But what about other proteins? Just randomly selecting some data from my genomics data and a pathway that is pretty well annotated:

Phenotype              SNP         My genotype
Parkinson's disease    4698412     AG
Parkinson's disease    9917256     GG
Parkinson's disease    356220      CC
Parkinson's disease    12726330    GG
Parkinson's disease    1296028     AG
Parkinson's disease    11026412    GG
Parkinson's disease    2839398     CC
Parkinson's disease    6430538     CT
Parkinson's disease    10877840    TT
Parkinson's disease    12456492    AA
Parkinson's disease    199515      CG
Parkinson's disease    11865038    CT
Parkinson's disease    2395163     TT
Parkinson's disease    11248060    CC
Parkinson's disease    356220      CC
Parkinson's disease    3129882     AG
Parkinson's disease    356219      AA
Parkinson's disease    11724635    AC
Parkinson's disease    1491942     CC
Parkinson's disease    199533      AG
Parkinson's disease    17115100    GT
Parkinson's disease    1223271     AG
Parkinson's disease    6532197     AA
Parkinson's disease    2736990     AA
Parkinson's disease    393152      AG
Parkinson's disease    6812193     CT
Parkinson's disease    7077361     TT

Check this out! Every row here is a position where humans are known to differ genetically within the pathways we currently know lead to Parkinson's disease -- all tested by this $100 kit. Those letters at the end? Those are the bases that vary between different human beings. Look at the first one. At that point in the DNA you can have either an A or a G. If that is the third letter in a codon, this is probably not a problem:

A shift from A to G in the third letter rarely results in a different amino acid being put into place. Look at isoleucine, though. ATA means isoleucine. ATG means methionine! And if the variant sits at the first position of the codon, it commonly means that your peptide sequence has a different amino acid in it than mine does. And if I'm searching my peptides against your protein sequence, there is nothing for that spectrum to match and it goes into the trash.
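You can convince yourself with a toy check against the standard genetic code (just a hand-picked excerpt of the codon table here, not the whole thing):

# Toy check: third-position changes are usually silent, first-position
# changes usually are not. Codons excerpted from the standard genetic code.
CODONS = {"ATA": "Ile", "ATG": "Met", "GTA": "Val",
          "GGA": "Gly", "GGG": "Gly"}

# Third-position A -> G is usually silent...
print("GGA ->", CODONS["GGA"], "| GGG ->", CODONS["GGG"])   # Gly -> Gly
# ...but not always: the isoleucine/methionine case from above.
print("ATA ->", CODONS["ATA"], "| ATG ->", CODONS["ATG"])   # Ile -> Met
# First-position A -> G almost always swaps the residue.
print("ATA ->", CODONS["ATA"], "| GTA ->", CODONS["GTA"])   # Ile -> Val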

I find this 1) a little scary and 2) super interesting and 3) exciting, cause someone out there is going to solve this stuff and I can't wait to see how you do it!

I've been pretty philosophical here the last two days. It's because I'm doing some hard technical stuff all day. Results are on the way, I promise!

Tuesday, May 20, 2014

This is a really interesting article Alexis found in BioEssays. And it got me to philosophizing. Where are we these days as a field? We've come pretty far, for sure! But are we at a point where we can beat the other techniques out there?

For example:

Do we trust an LC-MS obtained quantitative value
or
One obtained by horseradish peroxidase stuck to some stuff from mouse blood
or
Densitometry of a gel band stained with silver
or
A hand-plotted enzyme kinetics curve where the slope is calculated as the limit approaching something or other?

This essay takes into account the measurements of protein counts per cell via various techniques and then compares them to the values obtained by mass spectrometers. Is it time to readjust the values in our textbooks? Even if the mass spec says that they're off by a log value compared to these other assays?

Monday, May 19, 2014

This post comes from an excellent suggestion for a post in the comments section of something I wrote last week. What about injection times vs. different sample loads and complexities?

For this I'm going to brazenly steal from my absolute favorite talk of 2014 so far. Tara Schroeder of the Thermo NJ demo labs put together a talk for the iORBI tour on obtaining maximum peptide coverage with the QE. I'm going to refer to this slide deck often.

One of the many excellent bits of info is a general starting point for target values and injection times:

Again, this is a starting point. Chromatography conditions will vary a lot, as will the true definition of "simple" and "complex" mixtures, depending on what you think you are looking at (also compared to what you actually are looking at, right? Sometimes they don't exactly line up!)

For something complex and high load (segue: isn't it awesome that we are at a point in time where we consider over 100 nanograms a high load?!?!), we aren't as concerned with hitting our target values as we are with getting as many MS/MS events as possible.

When we drop into the low nanogram range, we are truly concerned that 50ms is not going to be enough time to hit that magic level for each individual peptide that will give us high scores. We sacrifice the number of MS/MS events that we can get in order to increase our chances of getting good ones.

Now, for simple stuff, we simply treat it like low load. By "simple" what we really mean is: that we can easily obtain a single MS/MS event for at least a few peptides for every protein that is present. I think that is a fair starting point. The stress here isn't in getting enough MS/MS events. The real concern is converting every possible MS/MS event into a peptide ID. Again, we sacrifice the number of possible MS/MS events a little in return for giving us twice the possible signal to convert these peptides into high quality MS/MS spectra that the search engine would love.

For the complex stuff at high load: Almost always, we want to use the soloist approach. 1 MS/MS event and then put the ion onto the exclusion list.

For the complex stuff at low load: This is a toss-up. The soloist approach will give you more MS/MS events, but at lower efficiency than the two-timer approach. Tough to say which will be more effective for a given experiment without more information.

For the simple stuff: Two timer! If you've got plenty of cycle time to fragment peptides from every protein present, give yourself a chance to get each peptide at least twice. Yes, the number of unique MS/MS targets decreases (by half), but by claiming your sample is "simple" you've already said that isn't a concern. For single protein pull-downs, I'd allow as many MS/MS events for each peptide as possible. I worked with a group recently where our #1 goal was maximum sequence coverage of a single small protein and its PSMs. Best coverage occurred when we allowed 4 fragmentations of each ion before dynamic exclusion kicked in (this gave us a pesky phosphopeptide that just wouldn't ionize well!)

If we go lower, we will need to increase these fill times. But this gives us a crude starting point.
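To put rough numbers on the tradeoff: on a QE the fill for the next scan happens in parallel with the previous transient, so each MS/MS event costs roughly the larger of the transient time and the max fill time. A back-of-the-envelope sketch (the ~64 ms transient for 17,500-resolution MS/MS is approximate, and all per-scan overhead is ignored):

# Back-of-the-envelope: how many MS/MS events fit in a duty cycle if each
# event costs ~max(transient, max fill time)? Numbers are approximations.
MS2_TRANSIENT_MS = 64          # ~17,500-resolution MS/MS transient on a QE

def ms2_per_cycle(cycle_ms, max_fill_ms):
    per_event = max(MS2_TRANSIENT_MS, max_fill_ms)
    return cycle_ms // per_event

for fill in (50, 120, 250):    # high-load, low-load, very-low-load fills
    print("max fill %3d ms -> ~%d MS/MS per 2 s cycle"
          % (fill, ms2_per_cycle(2000, fill)))

Longer fills buy each peptide more ions (better spectra) at the direct cost of fewer MS/MS events per cycle -- the whole tradeoff above in one line of arithmetic.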

Not sure about your sample complexity or want to double check your run to see if you set it up right?
RAW Meat time!

This is the analysis of the MS2 fill times from an IP run my friend Patricia and I did. The maximum injection time for each MS/MS event was 100 ms. We hit the maximum almost every single time. This suggests that we are loading so little sample that we need significantly more than 100 ms of fill time. After a pull-down it is currently just about impossible to determine your peptide load. We often don't have nearly enough material for a standard protein measurement assay (hopefully someone will come up with something more sensitive soon...). If this were a monoclonal pull-down, I'd say crank up that fill time and try running it again!

What do we want to see?

This. This is a low-load complex run from my friend Rosa. She used a maximum fill time of ~150 ms for this run, and it was perfectly appropriate for this sample. The first bar represents fill times of <50 ms, the second ~50 ms, the third ~100 ms, and the 4th bar is maxing out. The vast majority of peptides hit target value in less than the maximum -- in fact, in less than half of the max fill time. But there were a large number of MS/MS events that required at least half the max, and about 1/8 of the peptides needed the full 150 ms.

I hope this is clear. Thanks to Kristian for the questions and Tara, Patricia, and Rosa for the data to let me put this together this weekend.

Saturday, May 17, 2014

I bet a lot of y'all know this trick. But I don't think everyone does.

I didn't know it until I joined my current employer. All of my LTQ and LTQ Orbitrap methods looked like this:

Where I'd have 21 scan events for my "top 20" experiment. On sample #4 I'd realize that I screwed up scan event number 12 and I was really doing MS3 on the ion from scan event 11, or something else that would seem really stupid later. The worst was if I wanted to change a single parameter! Then I'd have to go through every one of these stupid things and edit them individually.

The trick? Nth order doubleplay:

If I build my method like this I get just 2 scan events:

And whatever settings I put in for this dependent scan can be carried over for every other dependent scan that I do. No more: take the 6th most intense from scan number 1 bologna.

BTW, a new colleague of mine, Donna Earley, suggested that I try to do a Ben's Application tip of the week. If I can come up with more than this one, I'm going to count this as number 1, lol!

Since the work week is over and this is a very work-heavy post, I present this crowd surfing pug!

Thursday, May 15, 2014

Do you love FASP but desperately wish it was faster to prep a ton of samples that way? Well, Yanbao Yu et al. have a solution for you: FASP in a 96 well plate!

Benefits? FAST (per sample), reproducible (check out that correlation factor!), and since there are lots of robots, workflows, special pipetmen, etc., for automatic processing of 96 well plates, it is very friendly for automation.

Wednesday, May 14, 2014

Recently, I worked with a couple of labs that use single protein digests and % coverage as a QC metric. Lots of people do this. This isn't my favorite QC, but as long as people are benchmarking their instruments with some sort of constant standard, I'm sure not going to stand in the way. A question occurred to me when I saw a very high % of peptide coverage: how much can we actually see with a single enzyme digest and mass spectrometry?

  1 MKWVTFISLL LLFSSAYSRG VFRRDTHKSE IAHRFKDLGE EHFKGLVLIA
 51 FSQYLQQCPF DEHVKLVNEL TEFAKTCVAD ESHAGCEKSL HTLFGDELCK
101 VASLRETYGD MADCCEKQEP ERNECFLSHK DDSPDLPKLK PDPNTLCDEF
151 KADEKKFWGK YLYEIARRHP YFYAPELLYY ANKYNGVFQE CCQAEDKGAC
201 LLPKIETMRE KVLASSARQR LRCASIQKFG ERALKAWSVA RLSQKFPKAE
251 FVEVTKLVTD LTKVHKECCH GDLLECADDR ADLAKYICDN QDTISSKLKE
301 CCDKPLLEKS HCIAEVEKDA IPENLPPLTA DFAEDKDVCK NYQEAKDAFL
351 GSFLYEYSRR HPEYAVSVLL RLAKEYEATL EECCAKDDPH ACYSTVFDKL
401 KHLVDEPQNL IKQNCDQFEK LGEYGFQNAL IVRYTRKVPQ VSTPTLVEVS
451 RSLGKVGTRC CTKPESERMP CTEDYLSLIL NRLCVLHEKT PVSEKVTKCC
501 TESLVNRRPC FSALTPDETY VPKAFDEKLF TFHADICTLP DTEKQIKKQT
551 ALVELLKHKP KATEEQLKTV MENFVAFVDK CCAADDKEAC FAVEGPKLVV
601 STQTALA
Take this coverage map, for example. This is the Mascot coverage output for one of these QC proteins. Mascot says 79% coverage (in the original output, the matched sequence is shown in red).

Something that I've started to be very concerned about, due to the amount of intact and top-down analysis I've been doing, is the signal and pro-peptide sequences. This protein is BSA, but the first 24 amino acids are not actually part of the mature BSA sequence. They are the signal and pro-peptide, cleaved off during processing, so I don't think they should count.

Let's look at what is left. If we assume 100% cleavage, we have:

DTHK
SEIAHR
FK
DLGEEHFK
VASLR
FWGK
IETMR
EK
VLASSAR
QR
LR
CASIQK
FGER
ALK
AWSVAR
LK
CCDK
PLLEK
NYQEAK
SLGK
AFDEK
HKPK

What are our requirements for settings for our instruments? I, for one, almost never look at ions with a mass to charge of <400. I also ignore anything with less than 2 charges, because they don't sequence in most cases. Ignoring the fact that not all amino acids can/will accept protons, if I only use the requirement that my peptide has a mass >800 Da, only DLGEEHFK makes the cut. It also has two basic amino acids, so it should charge to at least +2. If it charges to +3 or above, this would explain why we didn't see it, as it won't meet our >400 m/z cutoff as a +3.
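If you want to run that filter over the whole digest instead of eyeballing it, a quick in-silico version is easy to sketch (simple cleave-after-K/R-but-not-before-P rule and standard monoisotopic residue masses; the sequence string is a placeholder for the full 583-residue mature BSA):

# Minimal sketch: in-silico tryptic digest, then apply the size/charge
# cutoffs described above. Replace the placeholder with the full mature
# BSA sequence.
import re

RES = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
       'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
       'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
       'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
       'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER, PROTON = 18.01056, 1.00728

mature_bsa = "DTHKSEIAHRFKDLGEEHFK"   # placeholder: paste full sequence here

def tryptic(seq):
    # cleave after K or R, but not when the next residue is P
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]

for pep in tryptic(mature_bsa):
    mass = sum(RES[aa] for aa in pep) + WATER
    if mass < 800:
        continue                      # too small to sequence reliably
    for z in (2, 3):
        mz = (mass + z * PROTON) / z
        note = "  <-- below the 400 m/z cutoff!" if mz < 400 else ""
        print("%-12s %8.2f Da  %d+ m/z %7.1f%s" % (pep, mass, z, mz, note))

Run on the placeholder above, DLGEEHFK survives the mass cutoff but drops below 400 m/z as a +3 -- exactly the scenario described.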

So, what if we actually consider our coverage of what is possible? If we start with the FASTA BSA sequence of 607 a.a. and subtract our non-expressed region (24 a.a.), we get 583 amino acids in the mature protein. There are 109 amino acids in the peptides I just deemed too short for my mass spec analysis. 583 - 109 = 474. Let's assume that DLGEEHFK will charge +2, so it counts as one that we could have seen but didn't, giving (474-8)/474 = 98% of the achievable coverage of BSA in this example.

Real achievable coverage (RAC? is that in use?) is 474/607 = 78% of the FASTA sequence. I wonder if that is anywhere near consistent across natural proteins?

Tuesday, May 13, 2014

UCSF has brought us tons of great mass spec resources over the years. The first that pops into my head is the great Protein Prospector.

What if you are new to proteomics, or are considering moving your biological problem over to let proteomics take a look? UCSF hosts a great presentation by Dr. Chris Walsh that provides a clear and thorough overview of PTMs and their analysis by mass spec.

Saturday, May 10, 2014

I get this question a lot, get the answers mixed up, and wonder "why haven't I just posted this on the blog where I can find it?"

Solution!

Can I install PD 1.4 on a Windows Server?
Answer: Yes! People all over have PD 1.4 running on their servers. We know that it works for sure on Windows Server 2008 and Windows Server 2012.

We also know that PD 1.4 works well with multiple cores. It appears to work best, however (according to anecdotal evidence), when up to 24 cores are being used. I have heard of two situations (which, honestly, may have been the same situation that I heard from different people) where devoting more cores from a cluster to PD 1.4 didn't improve speed the way we might expect. For example, 48 threads from the cluster weren't twice as fast as 24 threads (maybe it was 80% faster, I don't know, but I've heard that it wasn't 2x, that's all). But I only run 4 threads on my laptop and on my team's server, and I can knock out a 1.5 GB RAW file with 3 dynamic mods in 6 minutes (SSD buffers... woooo!), so 24 threads sounds hella fast to me!
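That kind of less-than-2x scaling is exactly what Amdahl's law predicts whenever part of the job (reading files, merging results, protein grouping) can't be parallelized. A toy illustration -- the 95% parallel fraction is invented for the example, not measured from PD:

# Toy Amdahl's law illustration: why 48 threads aren't 2x faster than 24.
# The parallel fraction p is hypothetical, not measured from PD 1.4.
def speedup(p, n):
    # Amdahl's law: overall speedup on n threads if fraction p parallelizes
    return 1.0 / ((1.0 - p) + p / n)

p = 0.95
s24, s48 = speedup(p, 24), speedup(p, 48)
print("24 threads: %.1fx  48 threads: %.1fx  48 vs 24: %.2fx (not 2x)"
      % (s24, s48, s48 / s24))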

Keep in mind that PD 1.4 is a couple of years old now. Maybe there will be a new version soon!

Thursday, May 8, 2014

Something I consider a major goal for proteomics? Moving from gene ontology (GO) analysis of our datasets to true PRO (protein ontology). Genes are great. Grouping genes by functional categories is also great. But proteins are what we study and there is a point where we don't have a complete overlap. Consider the number of proteins out there vs the number of unique gene identifiers. WAY more proteins than genes to correlate them to.
Protein Center is making leaps and bounds all the time toward bridging that gap.

The Protein Information Resource (PIR) is an additional resource that focuses on true PRO. For example, unique identifiers exist for various known cleavage events of gene products. If we have the whole protein or half the protein, do they do the same thing? Probably not. And it sure would be nice to know which one is present, right?

Wednesday, May 7, 2014

Caspases are proteases that chew things up. If you didn't know better, you'd think they chew things up completely indiscriminately -- every protein gets chopped up. This is really useful for controlling diseased cells during things like apoptosis: this cell is bad, trigger programmed cell death, destroy everything.

To decipher this, the team employed an unbiased n-terminal approach they call TAILS and analyzed the rate of degradation of proteins in caspase activated systems. The analysis was performed on an Orbitrap Velos and the data was analyzed with MaxQuant.

They find that some phosphorylation events lead to increased cleavage rates and some lead to decreased rates. Interesting, right? It seemed strange that we would just self-destruct cells indiscriminately. Even in this worst-case scenario, it appears that we have some measure of control after all.

I stole this image from their paper because it's cool and describes their approach.

In this one, the authors take a serious run at the pluses and minuses of depletion (like the HUGE protein losses that come with depletion -- quantified across different depletion techniques!). Then they take the depleted, undepleted, etc., samples and run them all out on an Orbitrap Elite using the same settings.

The results? Well, that's a surprise. It's open access, and they made some nice Venn diagrams with Venny. Worth checking out! (And easier to take seriously than David Tennant doing Hamlet!)

Sunday, May 4, 2014

I've known about this one for quite a while, but I just realized I don't have anything up on the new blog (and I never fully transferred everything I've written from the old one...and probably won't at this point, lol!)

Quick reference info at the top: class, GO function, and process, followed by a quick graphical breakdown of what we know of the protein at this point.

Followed up with info on aliases for the protein, substrates, etc.

You have other options as well, such as looking up canonical pathways. One minor criticism: the pathways come up as an HTML list and must be downloaded for viewing/linking. I definitely prefer when KEGG or ball-and-stick pathway models pop up, but I can get those from other sites. Those other sites won't give me graphical breakdowns of my protein domains, and I think that is the real advantage of this site over the others out there.