Thursday, November 30, 2017

Hey!! You're a mass spectrometrist (or a weirdo) if you're on this blog. And I bet you use Skyline. If you don't I'd almost guarantee you will some day in the very near future -- and the best part about Skyline is that the developers support the heck out of it. It is always getting better and you can always get help on the forums if you have issues.

Skyline has always been free for everyone, but without some more letters of support from our community it may not be able to stay that way.

Wednesday, November 29, 2017

I was having an awful day. Fo' real, yo. Microsoft forced an update to my favorite PC (my Destroyer Gen1; I got the new "Max Destroyer", but I haven't migrated anything but PD to it yet) and knocked out all of my software that had a dependency on the .NET framework (a lot of it -- watch out, Windows people -- 4.7 might not be the culprit, but the timing is suspicious). But then this paper turned my whole day around!

If you've had the misfortune of reading any of this blog you might know that I've got a hangup for the malaria parasite. Malaria is a protein problem. If you throw all of the world's genomics resources at it, you'll never solve the malaria puzzle. Proteomics, glycoproteomics, and (my firm belief) PTMs we don't yet know about are what will finally help us understand P. falciparum and why it kills human beings with such relative ease.

During some stages the malaria parasite lives in your red blood cells (RBCs). It looks super gross under a microscope. There are a bunch of bugs in your red blood cells. Our immune systems are nearly perfect. Organisms generally have to evolve against our immune systems for millions of years to have a shot. Yet these huge parasites live in our red blood cells and our immune things pass right by. We know there are proteins that the parasite pushes through the RBCs and that these help protect them from targeted destruction.

Sandra Bark et al., decided to see what the surface of the infected RBC looks like to proteomics by shaving off the proteins and doing proteomics on them. Big deal. Both Laurence Florens and Michal Fried did this independently years ago. The devil is in the details.

This paper is Bad. Michael Jackson (1987) Bad.

They biotinylate the surface proteins.
AND
They iTRAQ label everything (RBCs with great big parasites living inside them and eating everything tend to be leaky and populations of them are untrustworthy in terms of proteins leaked).

Check out the figure at the top. The number of negative controls in the study is what gives it so much power. Leaky proteins from damaged cells can be eliminated -- they're all leaky. Comparing shaved to unshaved membrane protein preps allows you to tell what is really important.

Deep proteomes were obtained by normal means -- 30 high pH reversed phase fractions were run on a Q Exactive classic using 110 minute gradients and a Top12 method. An NCE of 28 was used (+1 since it is iTRAQ 4-plex).

Did I type this already? The power is in what can be eliminated from consideration! So much of the background noise can be tossed because of how good the controls are!

What did we get? Well -- to date -- the most comprehensive picture of what is going on at the RBC surface when the malaria parasite is hanging out and eating all the hemoglobin. And new surface proteins(?) -- definitely peptides -- that might make great new vaccine candidates!!

All the RAW data has been deposited at MASSIVE, but the details in the paper for the user name and password appear to be incorrect. I'm sure it's only password locked for the review process and will be available once the study is officially accepted.

Saturday, November 25, 2017

If TAFT works as well as these researchers claim in this new study, this could be a game changer!

They directly enrich phosphopeptides and then do high pH reversed phase fractionation -- all in the same tube/tip and it's ready to go!

Did the number of phosphopeptides they identify break records set by techniques like SCX or high pH RP fraction collection followed by TiO2/IMAC enrichment? No. It's not bad, but we have seen better.

However -- 4 hours from start to finish! Compare that to most of the deep phosphoproteomics studies we've seen -- they point out several -- and you're looking at a full day to prep the samples and then up to 48 hours of instrument run time. 3 hours vs. a long week?!?

If you use TAFT you could use the other 4.5 days of your work week to try and make sense of the 14,000 phosphopeptides you identified.

On the method details -- they do use an Orbitrap Fusion for their analysis. It appears to be all HCD high resolution MS/MS on the mass spec side for this study and the data processing is in MaxQuant / Perseus.

The reproducibility of the TAFT protocol, run to run, looks pretty spectacular as well. One of the benefits of keeping everything simple and all in one tube!

Over a 7 month period, this team collected xenograft samples. In all experiments they used a single reference sample and used that reference channel for normalization and in a lot of the other downstream work.

They demonstrate a surprising amount of power and reproducibility over this time. This is despite the use of 2D offline fractionation (via high pH reversed phase) into 24 fractions and keeping the fractions stored at -80C. (96 fractions were collected, but they were concatenated into 24, using the same concatenation pattern for every fractionated sample -- for example, fractions 1,25,49 & 73 were combined for every set of peptides)
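That concatenation pattern is easy to sketch in a few lines of Python. This is only an illustration of the 96-into-24 scheme as described above (fraction 1 pooled with 25, 49 and 73, and so on); the function name and its defaults are made up for the example and don't come from the paper.

```python
def concatenation_pools(n_fractions=96, n_pools=24):
    """Map each pool number to the list of fractions combined into it."""
    pools = {pool: [] for pool in range(1, n_pools + 1)}
    for fraction in range(1, n_fractions + 1):
        # Fractions are numbered from 1, so shift to 0-based before the modulo
        pool = (fraction - 1) % n_pools + 1
        pools[pool].append(fraction)
    return pools


pools = concatenation_pools()
print(pools[1])   # [1, 25, 49, 73]
print(pools[24])  # [24, 48, 72, 96]
```

Each pool ends up with one early, two middle and one late fraction, which is the whole point of concatenating this way.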

All the work for this study is performed on the same Orbitrap Velos instrument using a 2Da MS/MS isolation window on a top10 experiment with HCD fragmentation set at 35.

Interesting touch here is a confidence metric that takes into account the hydrophobicity of the peptides and the reporter ion intensity.

The real star of this huge amount of work (at least for me) is the 22 separate metrics that were tested to determine the most reproducible way to roll the reporter ion and PSM data up to the protein level. As hard as this Orbitrap was working, it looks like the statisticians may have put in more time. In the end, they conclude that if you put the work in and do it right, there is no reason you can't utilize reporter ion quan in large scale clinical studies!
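To make "rolling PSMs up to the protein level" a bit more concrete, here's a minimal Python sketch of just one possible approach: divide each reporter channel by the reference channel for every PSM, then take the median log2 ratio per protein. This is an assumption for illustration and not necessarily any of the 22 metrics the paper tested. The channel names ("ref", "s1", "s2") and the intensities are toy values.

```python
import math
from collections import defaultdict
from statistics import median


def protein_log2_ratios(psms, reference_channel="ref"):
    """Roll PSM reporter intensities up to per-protein median log2 ratios.

    psms: list of dicts with a 'protein' key and an 'intensities' dict
    mapping channel name -> reporter ion intensity.
    """
    per_protein = defaultdict(lambda: defaultdict(list))
    for psm in psms:
        ref = psm["intensities"].get(reference_channel, 0)
        if ref <= 0:
            continue  # no usable reference signal, skip this PSM
        for channel, value in psm["intensities"].items():
            if channel == reference_channel or value <= 0:
                continue
            per_protein[psm["protein"]][channel].append(math.log2(value / ref))
    return {
        protein: {ch: median(vals) for ch, vals in channels.items()}
        for protein, channels in per_protein.items()
    }


# Toy data: three PSMs from one hypothetical protein
example_psms = [
    {"protein": "P1", "intensities": {"ref": 100.0, "s1": 200.0, "s2": 50.0}},
    {"protein": "P1", "intensities": {"ref": 100.0, "s1": 400.0, "s2": 50.0}},
    {"protein": "P1", "intensities": {"ref": 100.0, "s1": 800.0, "s2": 50.0}},
]
rollup = protein_log2_ratios(example_psms)
print(rollup["P1"])  # {'s1': 2.0, 's2': -1.0}
```

Swap the median for a sum, a mean, or an intensity-weighted average and you have a different roll-up, which is roughly why there were so many metrics to compare.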

We keep seeing more and more awesome R tools for proteomics. I'm sure there is some nerd in your department who is doing something with R -- especially if you have transcriptomics or metabolomics people around. It is starting to seem rare around here for me to run into a grad student or postdoc who isn't proficient in it.

In case you aren't aware, however:
R is a programming language that has a few serious limitations:
1) Every object R is working with must be stored in the RAM on your computer. If you've got 1e6 MS/MS spectra that you don't filter down to something very small immediately and you've only got 4GB of free RAM on your PC -- you're going to have a bad time.

2) R is super ugly. Apparently nothing can be done about the Times New Roman 1.5 spaced text and grainy plots.

3) Anyone can write a package in R and upload it where it can be accessed and used.

I'm honestly just being a jerk about number 2. Since I mostly saw younger people using R, I assumed at first that the output I saw was some sort of retro hipster MacBook protest thing. "Only the data matter, who cares about the resolution of the pixels on the plot?" he said as he chained his single speed bike to the rack. (You can actually make it a really pretty output if you make it a priority to do so. I obviously can't do it, but I've seen experts who can do it)

Number 3 can be a huge advantage. R is true open source software. You can get a new R package loaded into your text editor, change it however you want (cite it if you publish with it of course!) and it's yours to run. This is only a disadvantage if you download some R package and run it when you don't understand how it works. If it sucks, your output also sucks.

I've been really excited about all the Shiny Apps recently as well. Shiny is a way for someone to take an R package and make a GUI (user interface) so the end user doesn't have to load R and know the right formats to use the programs.

all appear to be Shiny apps (I'm not 100% sure, but they look like them) that I've used and really like (especially GiaPronto -- which I'm using the heck out of!)

There aren't yet Shiny Apps for everything and this leaves you with 2 very good reasons to spend some time using R:

1) There are some amazing tools out there for proteomics already (too many to name, but the package R for proteomics is a great place to start!)
2) R provides you with what can only be described as super powers for digging through data. Things you've always wanted to do but seemed impossible are things you can get R to do just by writing a single word in the right place in a bracket. Got a thousand text files (.mgfs, maybe?) and you want to pull one data point out of each one of them and get the mean and standard deviation across all of them? One sentence. For real, I had to do something similar for a homework assignment recently.

Which brings us around to -- how do you learn this oddity? I signed up for this program a couple years ago:

I was feeling really confident about it through the first class, then I failed the one called "R programming" and decided to spend my time doing something else. Unfortunately, I've got something on my desktop I'd like to finish and R looks like the easiest way, so I buckled down and -- at the half-way point -- I still haven't failed out of it. The program is online -- stupid hard (at least for me) and I think you can pay by the class or by the month now. I paid a lump sum up front for a number of classes I could complete (maybe 15 or something? I forget, but I think that was a really good deal, maybe $30 a class?) anytime in a 3 year period. You can audit the class for free, but you can't take the quizzes/homework assignments.

There are loads of other ways to learn R -- there are great videos on YouTube, books with condescending titles like the one at the top of this blog post, and -- my favorite -- SWIRL.

SWIRL is a game you play in R that teaches you how to use R by playing the game. It also provides you with excessive positive feedback when you get things right. Get an answer right (finally...ugh...) and you'll be rewarded by a statement like "you are truly exceptional!" and all of a sudden you don't feel quite as dumb as you did before.

This is a lot of words for this point -- if you see an awesome new paper with tools in R -- don't dismiss it out of hand. If you're smart enough to run a mass spectrometer, I guarantee you can totally download R and use that tool! Just don't get down on yourself if a 500 page book suggests you'll do it by tomorrow.

Metaproteomics is the use of our normal proteomics tools that can give us a snapshot of information about an entire population of different organisms. I think we're still only just scratching the surface of the potential of this technique and every paper I've seen I think I've liked better than the last one.

In this study, these researchers work with an extremely well characterized mouse model. This model develops colitis. This is an unhappy situation for the mice, but it has been a really effective model for understanding how to better treat the disease in humans. I know some humans with this and it sounds terrible, so I'm on board with the use of this model.

I don't understand the model completely, but this is as close as I'll get. They use Rag1-/- immunocompromised mice and add T-cells to the intestines of these mice. These T-cells are wonky somehow and it causes the colitis condition. A quick Google Scholar search finds tons of references.

The important part (to me at least) -- we have wild-type mice, Rag1-/- immunocompromised mice, and immunocompromised mice that have ulcerative colitis.

TIME TO DO FECAL METAPROTEOMICS!!

The colon flora of these mice is filtered and the microbes captured -- then they lyse them all and just do proteomics on them (this is all described well in the picture at the top).

A unique aspect of this study over other ones I've read is the use of MudPIT for the LC-MS. They utilize 11 steps using a standard high-low top8 method on an Orbitrap XL system.

Now you've got a RAW file with MS/MS spectra from peptides from a ton of organisms. There is a growing number of tools for metaproteomics, but none may be more powerful right now than ComPIL, which I rambled about here a few months ago. Essentially, though, the approaches are the same -- you are looking for peptides that can tell you what species/genus/or family is present from these MS/MS spectra.

Making the link between PSMs and the genus/family of prokaryotes that are around requires some bioinformagic that I'm not in a rush to try myself. However, I'm pretty familiar with a PCA plot and what that means --

This is a zoom in of the principal component analysis from these 3 groups of mice. Keep in mind -- this is the peptide profile of the bacteria (and other things?) inhabiting the colons of these mice. Mutations in the mice can lead to completely different bacterial populations!

That's weird, right? We've heard a lot about the microbiomes and everything...but... isn't it seriously crazy that changing one gene in a mammal (yeah -- this is an important gene) can completely change the population of the bacteria in that mammal? I'm stunned and I'm not sure what to do with what this implies.

The authors go further than this, of course, they're looking for the bacteria that are linked to the disease they study and they make some interesting findings on the populations and the specific proteins that these populations are really up-regulating. These could turn out to be great biomarkers that could be used to diagnose and treat the disease early.

Seriously -- an awesome study with results I think are both far-reaching and surprising. I highly recommend you check it out.

Tuesday, November 21, 2017

(This is the first hit for "embarrassed Pug" that was funny to me. I am concerned about that dog and how it got there...)

#1 Thanks to everyone who pointed out the fact that I completely got the results of the last paper I put on the blog completely wrong! I've approved your comments and put them in the post.

#2 I was just checking to see if anyone was paying attention. This is not true. I think I looked at the graphs that clearly detail the findings, drew an imaginary crooked line between the 2 techniques and then went ahead with my interpretation of that line despite the clear statements in the body of the text and abstract that said I was talking about the wrong thing.

#3 I'm putting an erratum at the bottom of that post that will link back to here.

#4 Hmmm....I feel like there was a number 4....but I forgot what I meant to type.

Friday, November 17, 2017

I don't have to tell anyone doing proteomics that the demands on us are getting harder as time goes by. We're seeing a lot less "how many proteins can you find in this unlimited amount of cell culture I have here" and more -- "what PTM changes the most in the membrane prep of these 4,000 cells my grad student spent 11 years isolating from the earlobes of these 2 gnat species."

Full disclaimer here -- I've never personally tried SP3 or iST -- I'm not even sure if I've heard of the latter. SP3 was originally described here, in case you're interested (they validated it by digesting a single Drosophila embryo. I'm no expert on the subject, but that sounds like it might be comparable in size to 4,000 gnat earlobes).

These authors start with low amounts of protein and/or cell counts for all 3 methods. I'm talking low -- their high point is 20ug of starting material!

They compare the number of peptides identified after using each digestion methodology and, perhaps more importantly (at least to me) the reproducibility of the peptides ID'ed.
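Reproducibility in comparisons like this usually gets boiled down to a coefficient of variation (CV): the standard deviation across replicates divided by the mean. Here's a minimal Python sketch of that arithmetic. The replicate counts are invented for the example and are not numbers from the paper.

```python
from statistics import mean, stdev


def percent_cv(values):
    """Coefficient of variation as a percentage (sample standard deviation)."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical peptide ID counts from three replicate digests
replicate_counts = [1000, 1100, 900]
cv = percent_cv(replicate_counts)
print(round(cv, 1))  # 10.0
```

A lower CV means the replicates agree better, so a method that IDs slightly fewer peptides but with a tighter CV can still be the better choice.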

Most of this is given away in the abstract so I don't feel bad telling you about it here.

At "high" load (20ug of protein is high load these days, LOL!) FASP is right there with the other 2 methods. Both in peptides ID'ed and it looks like it has the best CV -- though iST is right there with it.

However -- as you drop down in starting amount, the effectiveness of FASP drops faster than the resale value of a BMW i8 after Tesla's new roadster was revealed. (Maybe not that fast)

SP3 looks good in terms of peptides ID'ed, but the CV gets wonky as the load drops. The clear winner in both categories is the iST methodology.

The authors go on to validate this by flow sorting some cells (!!! awesome !!!) to just a few thousand and reveal that the sorting still left a heterogeneous mixture of whatever cells there are.

I want to give a big shoutout to Dr. M for sending me this ultracool paper. It is quite seriously my favorite thing I read this week.

ERROR/EDIT: It has been brought to my attention by some readers that I completely misinterpreted the results of this paper! Looking at it, I definitely concur. I did not interpret the main figures and -- really -- the central findings of this paper correctly. Swap what I said about SP3 and iST in terms of peptide IDs and you're on the right track.

I'm going to leave my mistakes here for the sake of posterity -- if you'd like an accurate description of the results of this paper, please see the comments on this post!

However -- we HAVE to do something about this antimicrobial resistance stuff and bacteria are simple organisms in comparison to poplar trees and 60,000 gallons of salt water and the other things some proteomics researchers out there are working on. We should totally be the ones solving all these problems with the "simple" organisms!

If you're writing a grant right now that has anything to do with bacteria and proteomics and you don't drop the term "survival proteomes" into the title or abstract right now, it's certainly not my fault now.

Monday, November 13, 2017

At some point a new column called "WikiPathways" appeared in my annotation columns in Proteome Discoverer -- and I've found the column pretty useful in helping me make sense of these filtered down differential protein lists.

As of yet, I haven't seen this cool resource make its way to Compound Discoverer, but I'm gonna guess I'll see it soon.

In the end, if any of these big data tools like gene ontology, annotated pathway analysis, yeast 2 hybrid (...I'm stretching it a little bit here...) are going to help me get to the question from a global dataset, I'm probably going to try them. And if they can integrate genetic observations with the ones our instruments are making -- even better!

Friday, November 10, 2017

Sometimes we have to stand on the shoulders of giants. This is more polite than just up and stating that you are going to steal someone's super cool idea and show people how you can do something similar without using their uber powerful software.

I recently discovered a MaxQuant feature called "dependent peptide search." It has been in place for years, but I've been motivated for a number of years to spend my time on another software package.

Dependent search goes kinda like this (not exact, but you didn't come to a blog for exactness):

There are modified peptides present in your RAW file.
If you looked for them all (with traditional search engines) it would take FOREVER.
However, if there are modified peptides from a protein there -- there are definitely unmodified peptides from that protein -- and they're almost always easier to detect.

SO -- it's time to reduce some variables.

MaxQuant is fancy enough to do all this with some button presses. You, my friend, paid for your software (unless you are using IMP-PD) so you have to do a bit more work.

Open your processed report. Right click anywhere on it and check ALL THE THINGS!

You'll notice I have a Contaminant flagged. I'm gonna leave it in there. I'm not getting paid for this. Actually, it will just be a redundant entry in the later steps and won't matter.

Now that everything is checkmarked -- File > Export > To FASTA. Then you'll discover you don't actually have to do the right click checkmark thing.

Now you have a FASTA that is only made up of the proteins that you actually discovered. If you are using a big database this could be a massive search space reduction. You'll notice my Filters are open. I'm running some stuff as I'm writing this to see what filters are the most effective. First run, I filtered down to just the "Master" proteins and things with >1 unique peptide ID. If you're going to find a phosphopeptide you're sure as heck gonna find at least 2 peptides from that protein first -- right?
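If you'd rather do the same database reduction outside of PD, the idea is just "keep the FASTA entries whose accession is in my list of identified proteins." Here's a minimal Python sketch of that. It is not what PD does internally. It assumes UniProt-style headers (sp|ACCESSION|NAME), and the toy entries and accessions are invented for the example.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    """Pull the accession out of a 'sp|P12345|NAME ...' style header."""
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[1]
    words = header.split()
    return words[0] if words else ""


def subset_fasta(text, keep):
    """Return FASTA text containing only entries whose accession is in keep."""
    keep = set(keep)
    out = [
        f">{header}\n{seq}"
        for header, seq in parse_fasta(text)
        if accession_of(header) in keep
    ]
    return "\n".join(out) + ("\n" if out else "")


# Toy database with made-up entries
toy_fasta = """>sp|P00001|TOY1_HUMAN Toy protein 1
MKTAYIAK
QRQISFVK
>sp|P00002|TOY2_HUMAN Toy protein 2
PEPTIDEK
"""
kept = subset_fasta(toy_fasta, ["P00002"])
print(kept)
```

In practice you'd read the big FASTA from disk and pass in the accessions from your filtered protein list, but the filtering step itself is this simple.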

Now you can input this new FASTA database and go crazy.

Check the phosphoSTY, add all the acetylations, throw in some GlyGly. If you've got Byonic or Mascot you can get closer to dependent peptide search by actually doing a deltaM or wildcard search.

If you're concerned about FDR considerations -- you definitely should be. That's why I don't have data from this to look at. You're lowering your database size and potentially forcing the search engine, artificially, to make some matches that might not be the best ones.

I'm dealing with it (for now) by allowing the Peptide FDR in the consensus to work things out. If I take my first run data and my new stuff that just processed with the 10 PTMs I care about and combine them into a new (Multi) Consensus report, I have a lot of settings (hit "Advanced") under the peptide validator that I can toy around with:

And I think that optimization of these is the trick to getting the best data out of this. You can always go back to deltaM or search engine PSM score (and manual validation of the ugly ones) if you need to.

Thursday, November 9, 2017

I don't know if I've ever sat down and mastered such a powerful data analysis tool in 4 times the time it took to get beautiful plots out of GiaPronto. I seriously almost didn't write anything on the blog about it, because I was afraid more people using it would mean it wouldn't run so lightning quick for me.

I've been impressed with the ingenuity of GIA before. But now I have all these tools in a ridiculously easy web-interface?

You pull the quantitative protein list you want to analyze (it only supports pairwise comparisons, but it's so fast, just do a bunch of them!), put it in the format it wants (I just exported my PD 2.2 result reports as Excel and deleted anything I didn't need), save it as a tab-delimited .txt file, input it into the interface, and hit the Go button. If you're using MaxQuant it's even easier. You don't have to change the titles of your Rows (columns?) to say iBAQ. They might already include this program-critical term!

It normalizes your data (as shown above) it makes PCA and volcano plots for you, it pulls out your list of significantly differential proteins which you can just export as full tables and ---

it does really powerful and intuitive biomarker analysis! "Hey, what proteins are my important biomarkers?" Don't change anything. Just hit the Go button and tab over!

It also does GO (Gene Ontology), but I don't have my data formatted correctly for that I think. AND I haven't even tried the PTM analysis functionality.

Wednesday, November 8, 2017

Some peptides are invisible to mass spectrometry. One of my favorite pathways is a phosphorylation cascade where the active sites are something like KXS(p)XK -- if you try to study this pathway and are smart enough to use trypsin, you'll enrich a lot of stuff that is singly charged and/or too small to ever identify.
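You can see the problem with a quick in silico digest. The sketch below is a toy Python example, assuming the simple trypsin rule (cleave after K or R unless the next residue is P) and an arbitrary minimum length of 6 residues for "probably identifiable." The sequence is invented to contain a K-X-S-X-K style site and isn't a real protein.

```python
import re


def tryptic_digest(sequence):
    """Fully cleave after K or R, except when the next residue is P.

    Splitting on a zero-width pattern needs Python 3.7 or newer.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


# Made-up sequence with a KGSAK site in the middle
toy_sequence = "LLEDAVKGSAKTTWEELR"
peptides = tryptic_digest(toy_sequence)
print(peptides)  # ['LLEDAVK', 'GSAK', 'TTWEELR']

# Illustrative cutoff only; pick whatever your search engine actually uses
MIN_LENGTH = 6
too_short = [p for p in peptides if len(p) < MIN_LENGTH]
print(too_short)  # ['GSAK']
```

The peptide carrying the serine is only 4 residues long, so the site you care about lands in exactly the peptide you're least likely to identify.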

X!Tandem at the GPM has a really neat function where instead of getting percent coverage of your total protein, you can choose to get percent coverage of the protein for what should be visible in your protein (singly charged and super small peptides don't count against you). It's cool to run it at least once to see that you really have been getting 100% coverage of BSA in every sample you've run -- for years and years.

Please keep in mind that this tool, PPA, is far more powerful than what I'm about to use it for and I'm going to return to use its more advanced functionality in the future, for sure. However, it just did a really neat trick and that's why I'm talking about it here.

PPA is a fancy machine learning algorithm that can figure out how likely your peptides, including modified versions of them, are to show up in your MS/MS analysis. The authors validate it with some really complex datasets using files from several instruments. You can load in your theoretical databases and your experimental data, and that's all the advanced stuff.

The neat trick that I'm very impressed by is that you can just give it the FASTA file and it will predict the likelihood of each individual peptide of being detected by MS/MS using 15 known properties of peptides in general and give you a likelihood of detection for that peptide on a scale of 0 to 1 (with 1 being very good).

And...this could be small sample size...but I've got some data I've been trying to help troubleshoot on my desktop in my off hours. The problem has been the decrease in total % protein coverage of the protein of interest as the experiment has progressed...and PPA is surprisingly predictive of the peptides that are still around late in the experiment. The authors of this software have more lofty goals for this algorithm, but seeing it do something simple that matches experimental data really well lends it a ton of credibility in my mind.

It has these advantages:
1) Vendor neutral (it looks like it takes any mzML converted data)
2) It runs in Java (which you probably already have)
3) It is heavily optimized for speed and usability
4) It's free! (You can get it here).

The authors focus on comparing it to TOPPView, which is a really nice program that operates within the OpenMS framework. JS-MS doesn't require the installation of a full program like OpenMS, but it does appear to require a configured Maven environment. If you're already using OpenMS, you'll probably want to stick with your TOPPView. Heck, if you've got an LC-MS background, you might still be better off. OpenMS is built for you and you'll be able to figure it out.

However -- If you're a bioinformatics person or a programmer in general, you may already have Maven set up and you'd be better off going the JS-MS route. It's great to have more options as our field continues to expand!

The comparison I was really curious about was how JS-MS would compare to BatMass, which seems to have a lot of similar functionality, but I've only seen it utilize Thermo data and I'm too lazy to look up what files it is capable of accepting.

BatMass has the additional advantage of being the coolest looking icon and startup screen on your desktop -- maybe even your personal one, even if it's full of old 2D arcade games!

In this particular study they use this awesome technology to show what proteins change when mice are bored or when they have fun things to play with. The list is surprisingly large, but it is the promise of this technology that I find really exciting!

If you can label and enrich things as small and complex as different areas in mice brains -- accurately -- and you have cell-type specific promoters that are truly cell-type specific -- you can do loads of awesome stuff with this! Maybe settle some of these tumor vs. stromal cell/protein identifications once and for all, for example....

Downstream analysis was all done with a quadrupole Orbitrap (Plus) system and an LTQ Orbitrap Elite. Data analysis all appears to be MaxQuant/Perseus.

Thursday, November 2, 2017

If you aren't, I bet there is some guy in your department who won't stop talking about them if you get on the topic!

If you can identify the antigens that are present on the surface of cells that you don't want in a system with a functioning immune system (for example cancer cells, intracellular parasites, those sorts of things) then you can utilize the immune system to go ahead and get rid of those cells for you. It's totally probably not really that simple.

In all seriousness, they do make this sound quite straight-forward and present some very positive findings in the tumors that they analyze here. I don't know enough about the biology to contribute anything meaningful, but the authors seem excited about it. What I do know is that if I wanted to start looking for immunogenic antigen presenting peptides (or whatever they're called -- come on, my job would just be to identify a bunch of them, right?) I'd start by following the very clear method section in this paper!

Wednesday, November 1, 2017

I got an email from a reader who is having some trouble assessing digestion efficiency with the awesome free Preview node from Protein Metrics. While they get that sorted out, I suggested doing it the way we did before that team gave us all free software that would do it for us. Then I realized that it is even easier in the new Proteome Discoverererers than it was in the past.

Quick, lazy tutorial time!!

You need a couple of things.
A representative FASTA of the most abundant proteins in your samples (you probably have to switch this if you run lots of different organisms)
RAW files
PD 2.x

Firstly, you need to process one or more of your representative RAW files in Proteome Discoverer and get a result report.

I just grabbed a quick file. This is some HeLa digest I recently ran on a CE-QEHF (ZipChip 8 minute runs), processed vs. UniProt/SwissProt and sorted by highest number of PTMs

Alternatively (and probably more validly [wait..."validly" is a word?!?!]) you should go with the intensity or something if you've got it from LFQ or whatever.

Next you'll want to "Check" these highest hits. I just grabbed everything on the front page of the 46 inch TV someone was throwing away(!!) that I now use as a PC monitor. As long as you select more than 25 proteins, you'll be fine. You can draw a box around all of them and then right click "Check selected", or you can checkmark each one of them.

Then make a new .FASTA from these proteins. File>Export>To FASTA> Checked Proteins Only

BOOM! Tiny FASTA file.

Now you can import that FASTA and then use it to search the data whose digestion efficiency you're concerned about.

With a database this small it doesn't matter if you allow your search engine to run with 10 missed cleavages. It still won't take very long.

Once you have an output report, find where you can plot your data. The icon looks different in PD 2.2 than in the other versions, but it's at the top. Then toggle over to your histogram, choose PSMs #missed cleavages and hit Refresh (cut off in this screenshot).
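If you export the PSM sequences and want the same histogram outside of PD, counting missed cleavages is simple. Here's a minimal Python sketch, assuming the basic trypsin rule (an internal K or R not followed by P counts as a missed cleavage). The peptide list is made up, and this isn't necessarily how PD counts them internally.

```python
from collections import Counter


def missed_cleavages(peptide):
    """Count internal K/R residues not followed by P (simple trypsin rule)."""
    count = 0
    # Skip the last residue: a C-terminal K/R is the expected cleavage site
    for i, residue in enumerate(peptide[:-1]):
        if residue in "KR" and peptide[i + 1] != "P":
            count += 1
    return count


# Hypothetical PSM sequences
psm_peptides = ["LLEDAVK", "GSAKTTWEELR", "AKPLVR", "KKDDSR"]
histogram = Counter(missed_cleavages(p) for p in psm_peptides)

for n_missed in sorted(histogram):
    print(n_missed, histogram[n_missed])
# 0 2
# 1 1
# 2 1
```

If most of your PSMs sit in the 0 bin you're in good shape; a fat tail at 2 and above is the sign your digestion needs work.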

Now you have a simple representative FASTA and a quick way to use the search engine of your choice to get a picture of your sample digestion efficiency.

Of course -- this is all assuming that the digestion efficiency will universally affect the proteins by abundance in the same way, but this is an assumption that seems reasonably safe to make. To be super thorough you could just run your whole FASTA with 10 missed cleavages, but this could take a really long time....

Thanks to Dr. A.H. for the informed questions and really interesting problem I haven't seen before (and still don't know how to solve) that led me around to putting this together.