thinks that if we mine the existing data and use all the proteins that have phylogenetic relationships with one another we can get to the answer of who interacts with whom.

The results are impressive. And it is worth noting that even though they didn't cite the BioPlex resource, 57% of the data points they incorporated came from direct experiments with human proteins.

BTW, InWeb_IM is their resource in the Venn Diagram above, so even if they aren't right about all of them, it is a whole lot more protein-protein interactions for us to look through!

Monday, November 28, 2016

The CSF-PR resource has been around for a couple of years, but needed a revamp since...you know...something like 90% of the proteomic data EVER generated has shown up in the last 4 years or so. (Do I have that pie chart on here? I LOVE that pie chart!) For details on the updates, check out this new one from MCP here!

There are some other databases out there on CSF. This one stands out cause of the following criteria:

Has to be LC-MS/MS from living humans
Has to contain at least 20 patients per study
The data from the study has to be publicly accessible AND in some way be open to quantification between 2 disease groups or between 1 disease group and 1 control group with an n greater than 3 in both groups.

Unsurprisingly, considering how relatively few people happily line up for CSF withdrawals, they whittle a relatively large number of published studies down to a much smaller set of super high quality (and medically interesting!) studies.
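Those inclusion criteria are basically a filter function. Here's a minimal sketch of what that whittling-down looks like in code -- the field names, thresholds, and example studies are all my own invented assumptions for illustration, not anything from the actual CSF-PR implementation:

```python
# Hypothetical sketch of a CSF-PR-style inclusion filter.
# All field names and the example studies are assumptions for illustration.

def meets_criteria(study):
    """Return True if a study passes the inclusion criteria above."""
    return (
        study["method"] == "LC-MS/MS"          # has to be LC-MS/MS
        and study["living_human"]               # from living humans
        and study["n_patients"] >= 20           # at least 20 patients
        and study["public_data"]                # publicly accessible
        and all(n > 3 for n in study["group_sizes"])  # n > 3 in both groups
    )

studies = [
    {"method": "LC-MS/MS", "living_human": True, "n_patients": 40,
     "public_data": True, "group_sizes": [20, 20]},
    {"method": "LC-MS/MS", "living_human": True, "n_patients": 12,
     "public_data": True, "group_sizes": [6, 6]},   # too few patients
]
kept = [s for s in studies if meets_criteria(s)]
print(len(kept))  # 1
```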

Sunday, November 27, 2016

Yes, I know, this belongs somewhere else, but I promise it is really super cool. (Link to paper here!)

From our perspective, it probably seems pretty straight-forward, right? If you've got MS/MS data that you are saying is this small molecule, maybe you'd want to do some sort of a false discovery measurement, right?!? And...maybe if you've jumped head-first into doing metabolomics cause it's super interesting, you might be put off a little at first cause you don't have FDR measurements.

Turns out it isn't quite so easy with the small molecules, because they don't fragment as predictably as peptides do, and we can't just move down the line to the next truly unique peptide sequence -- there isn't a second peptide. You get one shot at identifying and quantifying it.

This paper introduces two ideas -- JumpM and MISSILE -- that are a little incongruous, but together they assemble a full methodology for how they think metabolomics should be done: with heavy standards, Orbitrap data, and target-decoy based FDR. And...it is honestly way smarter than the way I do it....
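In case target-decoy FDR is new to you: the core idea is just counting how many known-wrong (decoy) hits sneak in above your score cutoff. This toy sketch shows the generic calculation only -- the paper's actual scoring and decoy generation for metabolites are their own, and the numbers below are made up:

```python
# Generic target-decoy FDR sketch. This is the general idea the paper
# applies to metabolites; scores and decoy labels here are toy data.

def fdr_at_threshold(scores, threshold):
    """scores: list of (score, is_decoy). FDR ~= decoys / targets above cutoff."""
    targets = sum(1 for s, d in scores if s >= threshold and not d)
    decoys = sum(1 for s, d in scores if s >= threshold and d)
    return decoys / targets if targets else 0.0

hits = [(0.95, False), (0.90, False), (0.85, True), (0.80, False), (0.40, True)]
print(fdr_at_threshold(hits, 0.75))  # 1 decoy / 3 targets ~= 0.33
```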

Tuesday, November 22, 2016

Need a paper to mull on while avoiding discussing politics with your family this holiday weekend? Think on this one!

What is it? Wait, you can't tell from the title? Come on!

In all seriousness, it is a really unique (to me, at least) way of thinking about what that unmatched spectra might be in that organism you don't have a good database for. And it might just be brilliant. I can't tell.

I gave it a good read and then thought about it in my car while I enjoyed the combination of normal D.C. and possibly early holiday traffic(?), and this is what I think is going on. (And I might totally have this wrong.)

Imagine we're starting off with this organism that no one has sequenced before and we need to do proteomics on it. The mass spec side is the same as always (as long as it wasn't hard to lyse or whatever, of course) but then we've got no database for it. We could de novo it or use BICEPS, but these are both going to be super computationally expensive, full of false discoveries or require that you spend 2 years studying Python to use it (this approach may fail in one of these regards as well, I'll have to check).

Spectral networking goes sideways here. What if you could lower your bioinformatic load (what?!?) by running more samples? They go the easy route here and take 3 bacteria and do dd-MS2 on them. Then they take the spectra that are the most similar (by MS/MS fragments) and network them together. In this way you can 1) find the most important features and 2) start to limit what you're going to have to search.

I know this is wacky. Who has spare mass spec time?!? To this, I answer -- who can find a good bioinformatician for that salary you can't seem to find a good mass spectrometrist for? Nobody, that's who!

Seriously -- what choice do we have when we're told to get some proteomics data on this organism? Wait and hope the genomics people are considering it a priority, will sequence it this year, and will annotate it by 2020?

Example set: They start with 3 species (or strains) of Cyanothece that biofuels people are seriously interested in that someone has done proteomics on. Serious proteomics:

Start with:
>1e6 spectra/organism
Cluster the completely homologous peptides (identical ones from each run AND organism)
-- heck, if you search just those conserved ones you're gonna have a massive reduction in search space (but you're going to miss what makes that organism why it isn't the other)
Cluster the MS/MS spectra that only differ by one mass shift. For example, the y ions are awesome till you get to the high mass ones and then each one is off by 8 in species 2 and 14 in species 3. (Or whatever.) Then move onto the next pairing!
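That mass-shift pairing step might be easier to see in code. Here's a toy sketch of the idea -- the tolerance, the matching logic (which assumes the fragment lists line up one-to-one), and the example fragment masses are all my own simplifications, nowhere near the paper's actual algorithm:

```python
# Toy sketch of the mass-shift pairing step: two spectra "network" together
# if their fragments match after allowing one constant offset on some ions.
# Assumes equal-length, one-to-one fragment lists; tolerance is invented.

def shifted_match(frags_a, frags_b, tol=0.02):
    """Return the constant shift if b's fragments equal a's plus one offset."""
    shifts = {round(b - a, 2) for a, b in zip(sorted(frags_a), sorted(frags_b))}
    # identical low-mass ions give shift 0; one extra constant offset is allowed
    nonzero = {s for s in shifts if abs(s) > tol}
    if len(nonzero) <= 1:
        return nonzero.pop() if nonzero else 0.0
    return None  # more than one distinct shift: not a simple homolog pair

species1 = [175.12, 262.15, 375.23, 503.29]
species2 = [175.12, 262.15, 383.23, 511.29]  # high-mass y ions +8 Da
print(shifted_match(species1, species2))  # 8.0
```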

As a side effect here, btw, you're going to get a quick understanding of evolutionary relatedness -- without any genomic information on these guys! Most of these MS/MS spectra are the same and you didn't get the samples mixed up? These things are related for sure!

In this run through they break their spectra into something like 16,000 networks. So....this is just a little more complex than the example 2 paragraphs up, but that was for illustration purposes only.

But check this out -- you now have these networks, where this spectrumA is equal to spectrumB (+8Da at y7/y8/y9) and spectrumC is equal to spectrumB (- something). Now that it is all linked you dump in some matched spectral data. Some stuff that is ID'ed and perfect. The MS/MS spectra are linked to IDs and it falls together like dominoes.
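The dominoes part is basically graph traversal: seed a few nodes with real IDs, then walk the mass-shift edges. Here's a minimal sketch of that idea -- the graph, the shifts, and the labeling scheme are toy inventions of mine, not the paper's data structures:

```python
# Sketch of the "dominoes" step: once a few spectra in a network get real
# IDs, propagate identity across the mass-shift edges. Toy graph and IDs.
from collections import deque

def propagate_ids(edges, seed_ids):
    """edges: {spectrum: [(neighbor, shift)]}; seed_ids: {spectrum: peptide}."""
    ids = dict(seed_ids)
    queue = deque(seed_ids)
    while queue:
        node = queue.popleft()
        for neighbor, shift in edges.get(node, []):
            if neighbor not in ids:
                ids[neighbor] = f"{ids[node]} ({shift:+} Da variant)"
                queue.append(neighbor)
    return ids

network = {"A": [("B", +8)], "B": [("A", -8), ("C", +6)], "C": [("B", -6)]}
print(propagate_ids(network, {"B": "PEPTIDEK"}))
```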

Does it work? They probably wouldn't have sent it to MCP if it didn't, but it definitely looks like it works. I find it makes more sense to me the more I think about it....

Monday, November 21, 2016

...and it put how bad malaria sucks into perspective. One expert she references throws out a figure that more than half the human beings who have ever lived may have died...of malaria... If you are into a super depressing read on how a gross parasite has shaped our history, mostly by killing us by the millions and billions, I couldn't come up with a better suggestion.

For a more uplifting tale, I suggest this nice recent paper from Bertin et al. In this work, these authors take a proteomic run at some patient sera with non-complicated (bad) and cerebral (really really really bad) Plasmodium falciparum (the species that generally kills you the first go 'round) malaria. They used label-free quan on an Orbitrap Velos, some clever bioinformatic tricks (compound databases with lots of the Var sequences), and sweet downstream statistics to try and find some differences.

While there are tons of challenges with this monster of a disease, like crappy databases and poor annotation and mutations all over the place, they are still able to find some really interesting conclusions. Several of the differentially regulated proteins they find appear linked and may even work together in functional complexes.

Sunday, November 20, 2016

After the reader responses to a previous post I constructed regarding ion mobility, it is clear that I don't understand this technology enough to really go on about it (though...that doesn't always stop me.)

Again, I don't know the difference, but I have got to see this ion mobility Q Exactive that ExcellIMS has...

...that they were using quite successfully to separate petroleum products. And I also got to see this monster in person (cause it is here in Baltimore!)

...so it isn't a particularly novel idea, but PNNL seems to know something about ion mobility and I'm gonna guess it is something special.

Edit: Changed the title. I thought the original came off more sarcastic than I meant it to.

In this study, they take a huge historic dataset from PRIDE and borrow an R package from BioConductor that was designed for RNA-Seq based quan and apply it to spectral counting. It turns out that it is stinking fast on a normal desktop PC and it appears to work very well on this dataset. All good news!

I guess my hesitation comes from my normal issue with spectral counting -- that it only works well on huge datasets with lots and lots of PSMs per protein. It would be nice to see this applied to another set that hasn't been fractionated and repeated so many times. It is a seriously interesting concept, though!

The authors take an interesting approach to the code as well. Unless I'm overlooking it, they didn't publish it anywhere. Instead, they produce a Supplemental text file with the example R Script.

Initially, this made me laugh. Partially cause I'm a little under half awake. Then I realized how thorough the script was!

I'll be darned if it isn't complete enough to copy-paste in and just run, because it does reference all the prerequisites. This probably sounds dumb, but...people aren't always this thorough with free resources that they put out there....and when they are, it is something that should be appreciated!
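For the curious, the spectral counting idea itself is simple enough to sketch in a few lines. This toy version (in Python, though the paper's actual work uses an R package from BioConductor) just normalizes PSM counts per sample and takes a ratio -- the protein names and counts are invented, and real count-based stats are far more sophisticated than this:

```python
# Rough sketch of the spectral-counting idea: PSM counts per protein,
# normalized per sample, then compared. Toy data; the paper uses a real
# RNA-Seq statistics package (from BioConductor) instead of a raw ratio.

def normalized_counts(psm_counts):
    """Scale each sample's counts so totals match (simple total-count norm)."""
    total = sum(psm_counts.values())
    return {prot: n / total for prot, n in psm_counts.items()}

control = {"ALBU": 500, "KINASE_X": 10}
treated = {"ALBU": 480, "KINASE_X": 40}
c, t = normalized_counts(control), normalized_counts(treated)
ratio = t["KINASE_X"] / c["KINASE_X"]
print(round(ratio, 1))  # 3.9
```

You can also see from the toy numbers why lots of PSMs per protein matters -- a protein with 1-2 PSMs would swing wildly on a single missed spectrum.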

Friday, November 18, 2016

When you're running PD you often run your nodes down two distinct pipelines -- one that is your peptide ID pipeline and one that is your quantification pipeline.

I'm no expert in what is going on behind the scenes here in the magic binary land, but I find it really useful from a logical sense to keep this in mind. Our end report is going to bring it back together, but these nodes are functioning, for the most part, as distinct and separate executables. The results are all brought back together into the SQLite (is it still this in PD 2.2, I think so, but I'm not 100% sure) table that is our .MSF and .PDRESULT file.

As such -- it is completely possible for our friends Sequest and Percolator to find something that the other side of the aisle did not. Honestly -- dig deep -- it happens quite a bit.

Check this out --

This is an iTRAQ run from a friend who studies possibly one of the hardest things (that isn't a plant) that you can do proteomics on. (14,000 MS/MS events -- 90 PSMs in this file, seriously.) But check this out -- in the processed data I can find thousands of spectra with iTRAQ quan -- but no IDs.

This can only occur when we've seriously separated out the 2 processing pathways.

This isn't the most common question I get about PD, though! The question that comes in is -- wait, I ID'ed this! Where's the quan?

First off -- this is gonna be significantly less common in reporter ion quan. If you've got a good fragmentation spectrum, chances are you're going to have reporter ions down there. Even if you spike in a good heavy labeled standard -- like the PRTCs -- you'll probably see reporter ions. (Thought I had proof of this, but I can't find it right now.) This is isolation interference. We're never fragmenting a 100% pure population of just our ion of interest. Other stuff sneaks in. But it does happen.
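Since isolation interference comes up a couple of times below, here's a minimal sketch of the underlying calculation: what fraction of ion current inside the isolation window does NOT belong to your precursor's isotope cluster. The peak list, window, and matching tolerance are all made-up illustration values, not how any vendor software actually computes it:

```python
# Sketch of the isolation interference idea: percent of ion current in the
# isolation window that isn't the target precursor's isotopes. Toy values.

def isolation_interference(peaks, precursor_mzs, window):
    """peaks: [(mz, intensity)]; precursor_mzs: m/z of the target's isotopes;
    window: (low, high) isolation bounds. Returns percent interference."""
    lo, hi = window
    in_window = [(mz, i) for mz, i in peaks if lo <= mz <= hi]
    total = sum(i for _, i in in_window)
    precursor = sum(i for mz, i in in_window
                    if any(abs(mz - p) < 0.01 for p in precursor_mzs))
    return 100.0 * (total - precursor) / total if total else 0.0

peaks = [(500.25, 1000.0), (500.75, 500.0), (500.40, 270.0)]  # last one sneaks in
print(round(isolation_interference(peaks, [500.25, 500.75], (499.5, 501.5)), 1))
```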

If you see something like this, you'll want to look at the Peptide Group level for the "Quan Info" column. This will give you a short statement about why you didn't get quantification.

It is significantly more common to find ID with no quan in the Event Detector MS1 quantification. (SILAC and PIAD). Example...

Check out this SILAC dataset and the stuff we find waaaaay down in the noise. We get some info on why there isn't quan when we look at the Peptide Group Level. "Not used" and "Excluded by Method"

To figure this out, you need to check out this troubleshooting chart from the manual.

This is what PD considers behind the scenes. In PD 2.x we've got control over some of these parameters (in the MSF and Quantification nodes). It might take some detective work to determine what you are looking for. But the Quan Info columns can help you chase it down.

It is a little more manual in the PIAD workflow. Example...

We've got a protein ID with 55% coverage and no quan? What? As the highlighter and misspelled word indicate, you see that this protein only has one Unique peptide. What we need to do is find out why that peptide didn't get quan.

If we check that protein and then Expand the Associated Tables.... (click to see full-size)

We can find that 1 PSM that is unique just to that protein...If we go one layer down...we find an absolute kick in the pants. Remember when you built that method and you said "No Quantification" (cause in PD 2.x the PIAD isn't considered a "real" quan method)?

PD 2.2 has "real" label free quan, as do the PeakJuggler and OpenMS community nodes. But PIAD doesn't get some of the troubleshooting benefits that SILAC does.

But we can figure out why this thing didn't get quan.

If we highlight this peptide and then show "the spectra used for quantification" and "show the XIC" we might get to the answer. Check out the XIC at the bottom. Even with a 6ppm mass tolerance cutoff, this is an ugly peak. If we look at the precursor, we're seeing an awful lot of interference here. (It says 64% isolation interference...which...honestly, is a measurement of something else entirely, but is useful for illustration purposes here.)

The Event detector is seriously strict. Remember, the maximum cutoff you can put in is 4ppm.
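To put that 4 ppm in perspective, here's a quick sketch of what a ppm tolerance check actually is. The m/z values are invented examples; the point is just how tiny the allowed window gets at real peptide masses:

```python
# What a ppm tolerance cutoff means: at 4 ppm, an observed m/z has to sit
# within ~0.003 Da of theoretical for a ~785 m/z precursor. Toy values.

def within_ppm(observed, theoretical, ppm=4.0):
    """True if the observed m/z is inside +/- ppm of theoretical."""
    return abs(observed - theoretical) / theoretical * 1e6 <= ppm

mz = 785.8421
print(within_ppm(785.8446, mz))  # ~3.2 ppm off -> True
print(within_ppm(785.8500, mz))  # ~10 ppm off -> False
```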

Check out another peptide (the next one down the list) and what it does look like

Again...isolation interference shouldn't be my metric, but it's only 22% for this one and it shows in the peak. The PIAD has no problem working with this one.

I guess the moral of the story is -- PD 2.1 quan has a logical pipeline and you can almost always figure out why you got an ID and no quan. Honestly, it is probably harder to figure out why you got quan but no ID.

Thursday, November 17, 2016

There is a lot of good stuff in here -- like arguments for why having an array for 30,000 transcripts isn't as good as looking at the proteome -- and clear descriptions of where all of our "proteoforms" are coming from.

I don't want to steal all the highlights from a review that is this good, but I seriously just printed out Figure 1 and hung it on my wall so I can just look at it....

...cause I need more time to think on this.

You should check this review out, for real, cause I think we've already got the capabilities to fully assess a lot of the mechanisms they describe here in terms of proteomic imbalances hiding in our RAW files -- and we're just not systematically looking for them...

If you don't know about this and English grammar isn't the thing you're best at -- maybe you come from an area of the U.S. that is famous for the poor quality of its public schools -- you might want to check out this free Chrome extension!

(The pay-for version appears to be much more powerful, but at $12/month the free one is going to have to prove itself first.)

Wednesday, November 16, 2016

While the paper is definitely geared toward a specific nefarious disease, it is a beautiful exercise in optimization of how to extract the most proteins/peptides out of a tiny amount of fixed tissue. Cause...if you can take a block of wax that has something as tiny as neurons in it and detect the proteins that ought to be there, you're doing it right!

The procedure isn't trivial, and the authors are quick to warn you -- the laser microdissection needs to be done really really well. Sample preparation methods and instrument methods need to be employed that focus on minimizing peptide loss above all else.

And...they pull it off (of course!) and get almost 2,000 protein IDs from some areas they cut out with the laser. Others aren't nearly as high...but honestly, maybe that isn't a limitation of their methodology, it may really be biology. When you are looking at areas of anatomy this specialized, will we even see expression of the whole proteome?

Some details -- the LC-MS was single shot (no fractionation, but technical replicates when possible) analysis on 50cm EasySpray columns onto a Q Exactive. The QE was geared up for sensitivity -- allowing up to 120ms for MS1 fill time and up to 250ms at the MS/MS level. They were willing to take a massive hit in cycle time, if necessary, to get good fragmentation spectra.
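You can see the cycle time hit they accepted with some back-of-the-envelope math. The 120 ms MS1 and 250 ms MS/MS fill times are from the paper as described above; the top-N count and per-scan overhead below are my own illustrative assumptions:

```python
# Back-of-the-envelope cycle time for a sensitivity-first method: fill
# times dominate when dilute samples never hit the AGC target early.
# top_n and overhead_ms are assumptions; fill times are from the text.

def worst_case_cycle_s(ms1_fill_ms, ms2_fill_ms, top_n, overhead_ms=10):
    """Rough cycle time if every scan uses its maximum injection time."""
    total_ms = (ms1_fill_ms + overhead_ms) + top_n * (ms2_fill_ms + overhead_ms)
    return total_ms / 1000.0

# 120 ms MS1 + 10 x 250 ms MS/MS: roughly a 2.7 s cycle
print(worst_case_cycle_s(120, 250, top_n=10))  # 2.73
```

At a 2.7 s worst-case cycle you only get a handful of points across a chromatographic peak, which is exactly the tradeoff they were willing to make for good fragmentation spectra.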

While it seems like I'm focusing on this as if it's just a methods paper, they did serious runs on tissues from Alzheimer's diseased and control brains and have all the differential data in the open access supplemental info!

Wait -- this deserves an extra sentence or two -- the supplemental tables are so freaking logical. I might be biased since I've been looking at PD output tables for...a long time...but they are so smart. For example, table S4 has the protein IDs charted against the surface area of the tissue that was dissected out! S2 is the actual comparison between the normal and diseased tissues and whether the proteins were detected or not.

This is a killer little study showing how much we can get from those little waxy blocks of tissue those pathologists have been stockpiling!

Tuesday, November 15, 2016

Ribo-Seq is a powerful technique the genomics world has now. It gets them even closer to the proteome by characterizing what messenger RNAs (mRNAs) are currently protected by ribosomes. They are protected by the ribosomes because they are, right that second, making some proteins!

I'm a little foggy still, but I think they pull out all the RNA like they would for RNA-Seq but then degrade everything that is free floating. Then they just have to destroy the ribosomes to release the protected mRNA.

In bacteria, the central dogma applies pretty clearly. The regulation systems are pretty simple...cause you've (normally) got one tiny chromosome. And it's been shown that RNA-Seq and proteomics line up pretty great in bacteria, but we're a little more complicated.

This team uses pulsed SILAC (pSILAC!) and this technology together to assess stress response in human cells treated with bortezomib. It's a proteasome inhibitor that is used for some cancers. Actually, on its own this drug is super fascinating. Some cancer cells protect themselves from immune response by making tons of proteasomes and just eating up the immune response. This drug drops a boron right in the catalytic site of one of the major proteasome proteins and shuts 'em down. Now...proteasomes are pretty important to our cells functioning so this creates a lot of stress! Hence this paper!

Here is the setup from the main experiment in the paper. On the proteomics side, they do something interesting with the pulsed SILAC. They use HCD and high/high mode on an Orbitrap Velos to develop their SILAC SRM transitions for their heavy and light peptides. Then they do the rest of the study with targeted SRMs. I've never tried this approach, but it does seem kind of smart. They pick 4 transitions for each peptide to monitor -- and since the heavy label should be on the y ions, they should be reasonably easy to get to. With 4 separate SRMs, it's hard to argue about interference effects, even on a device with quads that can only isolate 0.7/1.0 Da at Q1/Q3, respectively.
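The "heavy label on the y ions" point is worth a tiny illustration. With heavy Lys/Arg labeling on a tryptic peptide, the label sits on the C-terminal residue, so every y ion carries it and every heavy/light transition pair differs by a fixed, known offset. The shift values below are the standard SILAC monoisotopic deltas; the peptide is made up:

```python
# Why the heavy label lands on the y ions: with C-terminal heavy Lys
# (+8.0142 Da, 13C6 15N2) or Arg (+10.0083 Da, 13C6 15N4), every y ion
# carries the label. The example peptide is invented.

HEAVY_SHIFT = {"K": 8.0142, "R": 10.0083}

def heavy_y_offset(peptide):
    """Mass offset between heavy and light y ions for a tryptic peptide."""
    return HEAVY_SHIFT[peptide[-1]]  # label sits on the C-terminal residue

print(heavy_y_offset("ELVISLIVESK"))  # every y ion shifts by 8.0142
```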

What did they find out? Human biology is seriously complicated! Even this fancy-pants new Ribo-Seq thing can't accurately tell you how much protein is there or how much is being made. However, using the two together can give you an understanding of the stress response in the cell. They propose the use of this methodology for understanding other chemotherapeutics -- get a deep mRNA-Seq, go get a deep Ribo-Seq, and then get accurate information on protein levels from the proteomics to make the rest of it make sense.

If you're wondering why you wouldn't just cut out the two expensive genomics techniques from the experiment and just do good proteomics on it -- well...so am I, but you don't just want that sequencer sitting there, do ya?......ummm...got one! The pSILAC can tell you how much protein is there at each point in your time course, but the Ribo-Seq can give you an extremely rough measurement of what is being produced, so you could better infer whether the change in total protein concentration is because of a change in production (translation) or degradation!
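The production-vs-degradation question boils down to a simple kinetic picture: total protein follows dP/dt = k_syn - k_deg * P, and you can't tell synthesis from degradation changes by looking at P alone. This toy model (entirely my own illustration, not anything from the paper) shows two different rate combinations heading toward different levels from the same start:

```python
# Toy kinetics: dP/dt = k_syn - k_deg * P. pSILAC's heavy channel reports
# new synthesis and Ribo-Seq hints at k_syn; total P alone is ambiguous.
# All rates and times here are pure illustration.

def protein_level(p0, k_syn, k_deg, t, dt=0.01):
    """Numerically integrate dP/dt = k_syn - k_deg * P from P(0) = p0."""
    p = p0
    for _ in range(int(t / dt)):
        p += (k_syn - k_deg * p) * dt
    return p

# Both head toward the same steady state (k_syn/k_deg = 50), but by
# different routes: one via less synthesis, one via more degradation.
print(round(protein_level(100, k_syn=5, k_deg=0.1, t=10), 1))
print(round(protein_level(100, k_syn=10, k_deg=0.2, t=10), 1))
```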

Monday, November 14, 2016

First of all -- shoutout to the team at OmicsPCs for allowing me remote access to one of their "mid-level" computers (actually, this might be one of the bigger non-server rack ones above, lol!) after seeing my complaints about having trouble crunching >1k RAW files on my own desktop a while back.

On my first run, I hit a striking revelation -- if I just leave the Administration properties at default -- this monster with 6x more processing cores than mine -- wasn't any faster on a 24 fraction human lysate digest. They ran neck and neck -- which is weird cause these Xeons are only running 3-4GHz (I forget what he said) and my Destroyer runs at 5GHz and my solid state drive is recent and was the fastest on the market when I got it. (Don't ask him cause he starts talking in fractions about latencies or caches or something.)

Time to mess around -- especially cause when I looked at the Task Manager, it didn't look like the one above (going hard on all cores!) at all.

If you go into the Admin section on PD 2.1 you've got a lot you can optimize. Way more than in any previous version of PD. You can tell Sequest, MSAmanda, Byonic, and even ptmRS how many processing threads they are allowed to use. There may be more. At this point, I'm going to leave them all at default so they will use the maximum available.

Then -- I'm going to modify the Parallel Job Execution thing -- this is a shot of the defaults from my desktop. I can only allow 4 workflows to proceed at once and up to 4 consensus workflows.

On the PC above (shoulda took a screenshot before they changed the password...) I could set each to up to 24!! Messing around with changing these parameters and a workflow like this....

...didn't change things! Until you clicked the "As Batch" button -- and then the high core box finished over 3x faster!! WTFourier?!?

Now...this probably bears further examination. Might be some things going on here, like maybe setting Sequest to 0 (to recognize all cores) doesn't work on these components? But my thought here is that PD doesn't need all these threads to finish each individual RAW file, so it doesn't use them. But if you are doing them in Batch -- then you are using all sorts of different Processing and Consensus workflows and you can use all sorts of extra resources. So....

Why don't we batch them and then recompile the results? I've always been told that if you fractionate your samples you should search and Percolate them altogether for the best statistics. But...PD 2.1 allows you to do peptide group level and protein FDR. So...maybe it's okay?

First off, we've got to get those Consensus workflows out of there -- cause we only want to do 1 of them.

Did you know (I didn't) that if you only want PD to stop at making an MSF file you can? You just make a Consensus workflow that looks like this!

BOOM! No consensus!

So...I messed around trying a few different things. But, ultimately, 12 and 24 processing workflows with 24 consensus workflows (it still fires up the Consensus -- even if it doesn't actually do anything) appeared to be the fastest setups. It was enough faster per processing workflow with 12 workflows (4 threads each on this PC) that it made up for the 24 processing workflows at 2 threads each. Honestly, I was watching Cavs/Celtics while running this remotely...so...within the error margin of an NBA time out? (3-7 minutes at most.) But they were pretty close.

Then you need to take the MSFs that you generated and make one Consensus workflow!

When you get your 24 processed MSFs you highlight them and now this option (highlighted) is no longer greyed out!

All you need to do then is to make a normal consensus workflow -- here, to keep all things the same (and cause Marcus Smart is hitting 3's!) I just used the same "Basic Consensus" that I used for the previous runs...and....!

...from a pure number level...they are pretty darned close...the top is them batched. The bottom is the RAW files run in one processing workflow....

Oh...wait...I was talking about speed as well. The top one? Under 30 minutes (this is only 2 dynamic mods -- MetOx and Acetyl-proteinN). The bottom one? Can't find my notes, but 1.5-2 hours.

Venny isn't set to draw "to scale" so this looks extreme...but we're talking a total difference of about 2.7%. Now...I actually don't know if Venny is reading my upper/lower case (has PTMs) correctly. And I'm too lazy to check.

More importantly...how's it do at the protein group level?

Interesting! A little bit more! But still within a few percent. I wonder what those few percent are? I've got a hunch...

If I drop out the 1 hit wonders (at least 2 unique peptides, which is kinda harsh) we're looking at almost 99% agreement here between these 2 datasets and dropping the net processing time by at least 3-fold.
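If you want to run this same agreement check on your own lists, it's a two-liner. The accessions below are placeholder examples, not from these runs:

```python
# Percent agreement between two protein lists, as a fraction of the union
# (what a to-scale Venn would show). Accessions are placeholders.

def percent_overlap(set_a, set_b):
    """Overlap as a percent of the union of the two sets."""
    return 100.0 * len(set_a & set_b) / len(set_a | set_b)

batched = {"P04406", "P68871", "P02768", "Q9Y6K9"}
single = {"P04406", "P68871", "P02768", "P01857"}
print(percent_overlap(batched, single))  # 3 shared / 5 total = 60.0
```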

Not saying this is the way to do this, but it might be something to check out if your data processing time is a serious bottleneck!