Monday, July 18, 2016

Lab series #14: MS-based proteomics


Now that I'm back from the break, let us continue an unfinished discussion. I had earlier talked about some basic instrumentation principles of mass spectrometry (MS) and left off with a note that I would talk about using mass spectrometry for proteomics. In essence, proteomics refers to the large-scale study of proteins. The original definition includes detecting proteins, analysing their structure and studying their functions. MS-based proteomics can be used for detecting and quantifying proteins.

Fig 1: Edman's sequencing method for peptides. Source
It's worth mentioning that before MS was used for protein sequencing, Edman's protein sequencing was the method of choice. The method uses cyclic degradation of peptides based on the reaction of phenylisothiocyanate with the free amino group of the N-terminal residue, such that amino acids are removed one at a time and identified as their phenylthiohydantoin derivatives via HPLC. It requires a free amino terminus, so the reaction fails if the amino terminus is modified. It is also notoriously slow, and interpretation requires great expertise. Edman's method is still important since it offers some specific advantages. For example, amino acids with identical molecular weights can be identified. Isoleucine and Leucine both have a residue mass of 113.08 Da, so mass alone cannot tell them apart. Glutamine (128.05 Da) and Lysine (128.09 Da) have nearly the same mass but different HPLC retention times. However, most modern mass spectrometers can differentiate Glutamine and Lysine. Reference masses of all amino acids can be found here.
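To make the mass argument concrete, here is a minimal Python sketch. The residue masses are standard monoisotopic values; the function names and the 0.01 Da tolerance are my own illustrative assumptions, not properties of any particular instrument.

```python
# Monoisotopic residue masses in Da (standard reference values).
RESIDUE_MASS = {
    "L": 113.08406,  # leucine
    "I": 113.08406,  # isoleucine
    "Q": 128.05858,  # glutamine
    "K": 128.09496,  # lysine
}


def mass_gap(a, b):
    """Absolute mass difference between two residues, in Da."""
    return abs(RESIDUE_MASS[a] - RESIDUE_MASS[b])


def resolvable(a, b, tolerance_da):
    """True if the two residues differ by more than the assumed mass tolerance."""
    return mass_gap(a, b) > tolerance_da


for pair in (("L", "I"), ("Q", "K")):
    # 0.01 Da is a hypothetical tolerance chosen only for illustration.
    print(pair, round(mass_gap(*pair), 5), resolvable(*pair, tolerance_da=0.01))
```

With a coarse tolerance (say 0.5 Da) the Q/K pair also becomes unresolvable, while L/I cannot be separated by mass at any tolerance.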

Shotgun sequencing is a term derived from genomic science. In this method, the parental protein is cut into fragments by an enzyme and the peptides generated are then sequenced. The peptides are then realigned using software search algorithms. As an analogy, imagine you have several copies of a hundred-page book. You shred the copies, mix up all the fragments, and then reassemble the original text by finding fragments of text that overlap and piecing the book back together. Chances are you will end up with most of the pages in the right sequence, except for some errors here and there. This method requires computational power. Shotgun sequencing has its own set of problems. For example, a sequence shared between proteins can match any of them. There are smart algorithms to avoid this issue, which we will discuss later. The advantage of the shotgun method is that it is rapid and high-throughput, allowing maximum coverage in a short time.

So the first question. Why not directly sequence the whole protein? Why make peptides, sequence them and then realign them back to the original sequence, which can bring in errors? The answer is that there is more than one reason to do so.

The best way to get information about a protein is to find its sequence. MS is most efficient at obtaining sequence information from peptides that are less than 20 residues long, which is far shorter than most parental sequences. Creating peptides also means that properties of the intact protein, such as solubility, become largely irrelevant. As long as a protein gives rise to some peptides which can be sequenced, we can still find the protein. Though this affects the coverage, we still get results which otherwise could never be had. That means the first step in MS-proteomics is to create a peptide library.

Once the peptides are injected into an LC-MS/MS platform, the first thing that happens is that they are read for their m/z values in the first MS. Subsequently, the peptides are broken by CID (collision-induced dissociation). In this method, the peptide ions are accelerated to high kinetic energy and then allowed to collide with a neutral gas (such as Helium, Nitrogen or Argon). Some of the energy is converted into internal energy, which results in bond breakage and the fragmentation of the molecular ion into smaller fragments. CID gives random cleavage, and different types of ions can be produced from the same peptide.

Fig 2: Peptide Fragmentation Nomenclature
The ions can be named based on the bond cleaved and the ion produced. Roepstorff P and Fohlman J proposed a nomenclature for sequence ions in mass spectra of peptides, now known as the Roepstorff–Fohlman–Biemann nomenclature. There are 3 possible bonds that can fragment along the amino acid backbone under the influence of CID: (i) NH-CH (ii) CH-CO and (iii) CO-NH.

On successful breakage, two fragments of the molecule will be generated. Hence there are six possible ion types, as shown in the diagram. The a, b, and c ions have the charge retained on the N-terminal fragment, and the x, y and z ions have the charge retained on the C-terminal fragment. The most common cleavage sites are the CO-NH bonds, which give rise to the b and y ions. In the light of this understanding, consider the following MS spectrum for a peptide sequence.
Fig 3: CID MS/MS, many copies of the same peptide are fragmented at the peptide backbone to form b and y ions. The spectrum consists of peaks at the m/z (mass to charge) values of the corresponding fragment ions. Source
By knowing the mass difference between the parent ion and b8, we can calculate the mass of the piece that was lost (which corresponds to the y1 fragment). In this case, the mass correlates with K. So the first amino acid from the C-terminal side is K. By calculating the difference between b7 and b8 (which is also the difference between y1 and y2) we can calculate the mass of the 2nd residue, V. That means the sequence ends in VK. This can go on until the whole sequence is identified. The whole process can also be done the other way around, using differences between y ions, to yield the same result.
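The ladder-reading idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: singly charged b ions, no modifications, a complete ion series, and a made-up tolerance of 0.01 Da. The function names are hypothetical, and isoleucine is left out of the table because it has the same mass as leucine.

```python
WATER = 18.01056   # H2O, Da
PROTON = 1.00728   # Da

# Standard monoisotopic residue masses (Da); "L" stands for both Leu and Ile.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def match_residue(delta, tol=0.01):
    """Return the residue whose mass matches delta within tol, or '?'."""
    for aa, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tol:
            return aa
    return "?"


def read_ladder(b_ions, tol=0.01):
    """Read residues from the gaps between consecutive b-ion m/z values."""
    ordered = sorted(b_ions)
    return [match_residue(hi - lo, tol) for lo, hi in zip(ordered, ordered[1:])]


def last_residue(precursor_mh, largest_b, tol=0.01):
    """C-terminal residue from the singly protonated precursor and the largest b ion."""
    return match_residue(precursor_mh - largest_b - WATER, tol)


def b_ions_of(peptide):
    """Helper for the demo: ideal singly charged b ions b1..b(n-1)."""
    total, ions = PROTON, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(total)
    return ions


b = b_ions_of("ANELLLNVK")
precursor = b[-1] + RESIDUE_MASS["K"] + WATER  # [M+H]+ of the intact peptide
print("".join(read_ladder(b)), last_residue(precursor, b[-1]))
```

The gaps b1 to b8 give residues 2 to 8, and the gap between the precursor and b8 (minus water) gives the C-terminal K. The first residue would come from b1 itself.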

So the second question. How does MS know if the ion generated is b or y? For an MS it's just ions, and in reality b or y ions can be produced at random. The answer lies in the difference in cleavage products depending on where the peptide is cleaved.

Fig 4: Generalised structure of a polypeptide. Source
As I already said, a b ion is the fragment that contains the N-terminus and a y ion is the fragment that contains the C-terminus. (See Fig 4 for general peptide terminal nomenclature.) As you can make out from the structure, there will be a difference in the mass of the ion generated depending on whether it has the NH3 group or the COO group. The same estimation also helps in knowing the directionality of the sequence.

For example, let's compare two peptides with different sequences.

ANELLLNVK  ->  ANELLLNV + K
KANELLLNV  ->  K + ANELLLNV

In both cases a K is cleaved off, and it could have come from either end, which would totally alter the sequence. However, the correct sequence is easily identified by knowing the mass of the K fragment. If the K has come from the 1st case, it carries the C-terminus and has no N-terminal component; in the second case, it carries the N-terminus and has no C-terminal component. In each case the mass is different. To calculate the mass of a specific b-type ion, sum the residue masses and add the mass of one proton. For y-type ions, the mass of the C-terminal -OH group is added, plus two additional protons (one for the N-terminus and one to provide the charge). There is a very detailed explanation of the calculation given in this link.
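As a rough illustration of that calculation, the sketch below computes singly charged b and y ions for the two example peptides. It assumes monoisotopic masses, charge +1 and no modifications; the function name is hypothetical and the mass table only covers the residues used here.

```python
WATER = 18.01056   # H2O, Da (the C-terminal OH plus the N-terminal H)
PROTON = 1.00728   # Da, provides the charge

# Standard monoisotopic residue masses (Da), only for the residues used below.
RESIDUE_MASS = {
    "A": 71.03711, "N": 114.04293, "E": 129.04259,
    "L": 113.08406, "V": 99.06841, "K": 128.09496,
}


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_i = residues 1..i + proton
    y_i = last i residues + water + proton
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b_first, y_first = fragment_ions("ANELLLNVK")    # K is the C-terminal residue
b_second, y_second = fragment_ions("KANELLLNV")  # K is the N-terminal residue

print("K seen as y1:", round(y_first[0], 4))
print("K seen as b1:", round(b_second[0], 4))
```

The same residue shows up about 18 Da (one water) heavier as y1 than as b1, which is how the two ends are told apart.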

In an ideal condition, all the y and b ions are produced for every peptide. But in reality, this doesn't happen. Only some of the ions show up in the MS data. The challenge is now to deduce the sequence. Let us again consider the same sequence from Fig 3. Suppose that the b6 and y3 ions are not detected. Now how will you get the sequence? In this case, the mass difference between b5 and b7 doesn't correspond to any single amino acid, with or without modification. But if you insert 2 amino acids in the combination LN, the mathematics fits perfectly, in which case you can argue that LN is the right combination, giving the sequence ANELLLNVK. Keep in mind that the mass alone fixes neither the order of the two residues nor rules out other pairs with the same total mass, so this step is a prediction. The more ions are missing, the more prediction is involved, and the more easily errors creep in.
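The gap-filling step can be sketched as a search for residue pairs whose masses add up to the observed gap. This is an illustration only: the function name and the 0.005 Da tolerance are my assumptions, modifications are ignored, and "L" stands for both leucine and isoleucine.

```python
from itertools import combinations_with_replacement

# Standard monoisotopic residue masses (Da); "L" stands for both Leu and Ile.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_pairs(delta, tol=0.005):
    """Unordered residue pairs whose summed mass matches delta within tol."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(sorted(RESIDUE_MASS), 2)
        if abs(RESIDUE_MASS[a] + RESIDUE_MASS[b] - delta) <= tol
    ]


# Gap between b5 and b7 of ANELLLNVK when b6 is missing: one L plus one N.
gap = RESIDUE_MASS["L"] + RESIDUE_MASS["N"]
print(round(gap, 5), residue_pairs(gap))
```

The search returns L+N, but also Q+V, which has the same elemental composition and therefore the same mass. It also says nothing about the order (LN or NL). This is one reason why missing ions make the prediction less certain, and why other ions in the spectrum or a database search are needed to settle it.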

Fig 5: Methods to identify the peptide.
It is simply a pain to search the peptide sequence against everything possible to come up with a protein identification. The number of possibilities to compute in such a case is practically unlimited. It is always easier if you can narrow down the possibilities. So let's say I'm doing a proteomic study on E. coli. I can search the results against the E. coli protein database and identify the protein.

The exact algorithm for finding the protein differs based on the method followed. For example, the MASCOT search program uses probability-based matching. The program first identifies the possible cleavage sites for peptide generation. This depends on the enzyme used during peptide preparation. For example, trypsin, a serine protease, has very specific cleavage properties. Trypsin cleaves peptides on the C-terminal side of lysine and arginine residues. If a proline residue is on the carboxyl side of the cleavage site, the cleavage will not occur. If an acidic residue is on either side of the cleavage site, the rate of hydrolysis will be slow. Based on these rules, the peptides expected from a given protein can be hypothesised. Certain amino acids (such as proline) also make the backbone break more easily when CID is applied. For a given peptide sequence, the ideal spectrum can therefore be computed. This hypothetical spectrum is the computational spectrum. The next step is to match the computational and experimental spectra. The better they match each other, the greater the confidence in reporting the protein identification.
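The trypsin rules described above can be sketched as a small in silico digestion. This is a simplified illustration, not how MASCOT implements it: it only applies the "after K or R, but not before P" rule, ignores the slower cleavage next to acidic residues, and the function name, the missed_cleavages parameter and the example sequence are all made up.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Cut after K or R unless the next residue is P.

    Returns peptides in order of their start position, including
    peptides that span up to `missed_cleavages` uncut sites.
    """
    if not protein:
        return []
    # A site at index i means "cut between residue i-1 and residue i".
    sites = [
        i + 1
        for i, aa in enumerate(protein[:-1])
        if aa in "KR" and protein[i + 1] != "P"
    ]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        last_end = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last_end + 1):
            peptides.append(protein[bounds[start]:bounds[end]])
    return peptides


# Hypothetical sequence: the K before P is not cut.
example = "MKANELLLNVKPGRAAK"
print(tryptic_digest(example))
print(tryptic_digest(example, missed_cleavages=1))
```

Each peptide produced this way would then get its theoretical b and y ions computed, giving the computational spectrum to compare against the experimental one.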

Fig 6: Venn diagrams comparing A) peptide identifications
and B) protein identifications. Source
Other methods of identification include Peptide Sequence Tags, based on the fact that fragmentation spectra usually contain at least a short series of easily interpretable sequence. Another method, usually called cross-correlation (the basis of the SEQUEST score), mathematically determines the overlap between the experimental spectrum and a theoretical spectrum derived from every sequence in the database. Different search strategies can give slightly different results, and it is usually advised to search using more than one algorithm. In a study published by Joao A. Paulo in 2013 (see Fig 6), it is clear that there is a significant difference in identification based on the search strategy. Over the years the search engines have improved, though there is still a significant difference in identification. MASCOT and SEQUEST are probably the most commonly used tools.
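To give a feel for what "overlap between a theoretical and an experimental spectrum" means, here is a deliberately crude sketch that simply counts shared peaks. It is a stand-in of my own, not the scoring used by MASCOT or SEQUEST; the function names, the tolerance and all m/z values are hypothetical.

```python
def shared_peak_count(theoretical, observed, tol=0.02):
    """Number of theoretical peaks that have an observed peak within tol (Da)."""
    return sum(
        any(abs(t - o) <= tol for o in observed)
        for t in theoretical
    )


def best_match(observed, candidates, tol=0.02):
    """Name of the candidate whose theoretical peaks overlap most with observed."""
    return max(
        candidates,
        key=lambda name: shared_peak_count(candidates[name], observed, tol),
    )


# Made-up peak lists, purely for illustration.
observed = [147.11, 246.18, 360.22, 541.30, 768.43]
candidates = {
    "peptide_A": [147.113, 246.181, 360.224, 473.308],
    "peptide_B": [175.119, 246.181, 400.200, 500.300],
}

for name, peaks in candidates.items():
    print(name, shared_peak_count(peaks, observed))
print("best:", best_match(observed, candidates))
```

Real search engines weigh peak intensities, correct for random matches and report a probability or correlation score, which is part of why they can disagree with each other.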

Fig 7: Target decoy method for estimating FDR.
One inherent problem in shotgun sequencing is the error creeping in due to possible mistakes in matching spectra to sequences. This error needs to be kept as low as possible. By convention, a 1% error is the accepted academic standard. It is known as the FDR (False Discovery Rate). The FDR is related to the Type I error. There are several reasons for a mismatched identification, such as a low-quality spectrum. In practice, it is impossible to tell which PSM (peptide spectrum match) is false. If there was a definitive method, we could have designed an algorithm to remove false discoveries. The target-decoy method is commonly used to estimate the FDR. In this method, the software searches both the target database and a decoy database. Hits in the decoy database are considered false IDs. The decoy database is usually made of the reversed sequences of the database entries, but that needn't always be the case.

FDR = Number of Decoy Hits / Number of target hits
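The target-decoy idea can be sketched as follows. This is a minimal illustration under my own assumptions: decoys are made by plain sequence reversal, each PSM is represented as a (score, is_decoy) pair, and the function names, the "DECOY_" prefix and the example entries are hypothetical.

```python
def reversed_decoys(targets, prefix="DECOY_"):
    """Build a decoy database by reversing every target sequence."""
    return {prefix + name: seq[::-1] for name, seq in targets.items()}


def estimate_fdr(psms):
    """FDR = number of decoy hits / number of target hits.

    `psms` is a list of (score, is_decoy) pairs that passed the cut-off.
    """
    decoy_hits = sum(1 for _, is_decoy in psms if is_decoy)
    target_hits = len(psms) - decoy_hits
    return decoy_hits / target_hits if target_hits else 0.0


# Hypothetical database entry and hypothetical search results.
targets = {"protein_1": "MKANELLLNVK"}
print(reversed_decoys(targets))

accepted = [(50.0, False)] * 200 + [(31.0, True)] * 2
print(estimate_fdr(accepted))
```

With 200 target hits and 2 decoy hits, the estimate comes out at 1%.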

Fig 8: Setting up FDR cut off.
Keeping the FDR at a very ambitious level (let's say 0.1%) will bring the number of identifications down to a very low number, and keeping it high (let's say 2%) will let in too many wrong identifications. It must be understood that when we say 1% FDR, out of 5000 proteins identified about 50 are likely to be wrong identifications. As can be seen from Fig 8, shifting the cut-off to the left will increase the number of peptides retained, which means the number of false positives increases. Shifting it the other way decreases the number of identifications. This also explains why proteomics cannot identify every peptide present (see Fig 8 again). A lot of protein identifications are discarded because of a lack of confidence in reporting the peptide.
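Choosing the cut-off can be sketched as walking down the score-sorted list of PSMs and keeping the lowest score at which the estimated FDR is still acceptable. This is only a simplified illustration: real tools handle tied scores and use refinements such as q-values. The function name and the example scores are hypothetical.

```python
def score_cutoff(psms, max_fdr=0.01):
    """Lowest score at which decoys / targets is still <= max_fdr.

    `psms` is a list of (score, is_decoy) pairs; higher score means better.
    Returns None if no cut-off satisfies the limit.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Hypothetical results: 100 target hits with scores 100..1 and two decoy hits.
psms = [(s, False) for s in range(1, 101)] + [(50.5, True), (10.5, True)]

for limit in (0.01, 0.05):
    cut = score_cutoff(psms, max_fdr=limit)
    kept = sum(1 for score, is_decoy in psms if score >= cut and not is_decoy)
    print(limit, cut, kept)
```

In this toy data a strict limit keeps only the hits above the first decoy, while a looser limit keeps all of them, which is the trade-off between the number of identifications and the number of false positives.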

Let's come to the 3rd question. Can we try to reduce the FDR through the search strategy? Each search strategy has an inherent error, which is fixed at 1%. Since the strategies are different, it can be argued that the erroneous identifications are different proteins in different strategies. So the ones which are identified by both strategies are less likely to be errors, thereby reducing the FDR for that shared set. But since the 2 methods have different errors, the error actually adds up when it comes to peptides identified exclusively by one method.

Now that you have some basic idea of how MS determines protein sequence, we can take the next step of talking about the different types of proteomics experiments. Many different types are possible, depending on the strategy and intent of the experiment.

I will give you some scenarios on how advanced MS is useful for studying microbiology.

Huge numbers of genomes are known, thanks to sequencing capabilities. We have identified new microbial pathogens and want to study their molecular biology and how they interact with human cells, develop markers for the infection and so on. For all these purposes, we need to know what kind of proteins the organism has. I was amazed to hear recently that in a big majority of cases (even for well-known pathogens) it is really not known if the microbe produces a particular predicted protein. In other words, there is no experimental evidence. For example, a proteogenomic study of Mycobacterium tuberculosis (MTB) identified 3176 proteins with approx 250 novel peptides. Even for an organism as heavily studied as MTB, it is surprising to learn that we never knew about those 250 peptide sequences. It simply opens up new avenues to study what those proteins do. Proteogenomics is useful for annotating the protein-coding regions of the genome and provides experimental evidence for the existence of proteins. It helps create the database of protein sequences for that particular organism. Usually, a lot of the proteins are similar to what has already been studied. Occasionally, the new proteins discovered by this method are something of interest, such as a potential marker or a previously unidentified virulence factor.

Once a protein map is developed, we can further work on the protein expression profile under different conditions (using quantitative proteomics), or study signalling mechanisms (for example, phospho-proteomics). For example, quantitative proteomics has been used to identify several different host factors that interact with pathogens, thus increasing our understanding of the process. In these cases, a single experiment yields data that would otherwise have been collected over a huge number of experiments.

Proteogenomics, targeted proteomics and quantitative proteomics are each a huge topic in themselves, and maybe I will talk about them in a future blog post so that I don't overburden this one. The idea of this post was to couple with an earlier post on MS principles to give you an idea of how MS can give you protein identification data.

Reinders J, Lewandrowski U, Moebius J, Wagner Y, Sickmann A. Challenges in mass spectrometry based proteomics. Proteomics. 2004;4(12):3686-3703.

Steen H, Mann M. The abc's (and xyz's) of peptide sequencing. Nature Reviews Molecular Cell Biology. 2004;5(9):699-711.

Microbial Proteomics. Proteomics. 2011;11(15):2941-2942.
