Digitizing the proteomes from big tissue biobanks
Analyzing 24 proteomes per day by micro-flow SWATH® Acquisition and Spectronaut Pulsar analysis
Jan Muntel1, Roland M. Bruderer1, Lukas Reiter1, Christie Hunter2
1Biognosys, Schlieren, Switzerland, 2SCIEX, USA
Tissue biopsies have been preserved and stored in biobanks for more than a century in the hope that their future analysis will provide a better understanding of health and disease. One of the most common methods of preserving these tissue samples is formalin fixation followed by paraffin embedding (FFPE). These samples are often very well characterized by classical pathological methods and offer great potential for precision medicine and the discovery of new diagnostic/stratification markers and therapeutic targets.
A powerful way to take advantage of this repository is to quantify large numbers of proteins across all the samples so that correlations can be made with respect to various health and disease states. Such an endeavor requires highly reproducible sample preparation, a robust analytical platform for high throughput sample analysis, and robust data analysis. Current LC-MS/MS proteomics tools allow for the reproducible quantitation of thousands of proteins in a single run. In particular, SWATH® Acquisition has been shown to provide very good data completeness, reproducibility, and quantitative precision in comparative studies.1 When coupled with microflow chromatography, sample throughput and assay robustness are improved while overall workflow sensitivity remains similar to that of nanoflow separations.
Here, microflow SWATH Acquisition2 was used to generate quantitative proteomics data on a cohort of colon cancer samples from a biobank. This study demonstrates how high throughput proteomics can be used to interrogate these precious samples from biobanks and how this research can pave the way to a better understanding of health and disease.
Sample preparation: The formalin-fixed paraffin-embedded (FFPE) colon tissue samples were ordered from a public repository. These samples were classified as healthy or diseased according to clinically accepted protocols. Protein extraction and tryptic digestion of a 10 μm slice were performed using an adapted protocol3 and yielded on average 140 μg of protein per slice. Prior to LC-MS analysis, digests were spiked with iRT peptides (Biognosys) for retention time normalization. Six μg of protein digest was used per run.
Chromatography: Separation was performed using a Triart C18 150 x 0.3 mm column (YMC) coupled to a NanoLC™ 425 System (SCIEX). A non-linear 43 min gradient was used at a flow rate of 5 μL/min.
Mass spectrometry: Data acquisition was performed using a TripleTOF® 6600 System (SCIEX) with a Turbo V™ Ion Source plumbed with microflow hybrid electrodes. The SWATH Acquisition method consisted of 120 variable Q1 windows, 18 msec MS/MS accumulation time, and one 250 msec MS scan. The cycle time of the method was 2.4 sec, resulting in 6 data points per LC peak. Total run time per sample was ~1 hour, such that the whole project was completed in <5 days. A spectral library was generated from a pooled sample (pooled digests from 10 healthy and 20 cancer samples) that was fractionated by high pH reverse phase fractionation. These fractions were analyzed with the same LC setup using a standard data dependent acquisition (DDA) method.
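The timing figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the method parameters stated in the text. It ignores any inter-scan overhead, and the implied chromatographic peak width is derived here from the reported 6 points per peak rather than stated in the study.

```python
# Acquisition parameters taken from the method description above.
N_WINDOWS = 120        # variable Q1 windows per cycle
MSMS_ACCUM_MS = 18     # MS/MS accumulation time per window (ms)
MS1_ACCUM_MS = 250     # single MS scan per cycle (ms)
POINTS_PER_PEAK = 6    # data points reported per LC peak
N_SAMPLES = 105        # cohort size
RUN_TIME_H = 1.0       # approximate total run time per sample (h)


def cycle_time_s(n_windows, msms_ms, ms1_ms):
    """Sum of accumulation times in one cycle, in seconds (overhead ignored)."""
    return (n_windows * msms_ms + ms1_ms) / 1000.0


cycle = cycle_time_s(N_WINDOWS, MSMS_ACCUM_MS, MS1_ACCUM_MS)

# Assumption: peak width is inferred as points per peak times cycle time.
implied_peak_width_s = POINTS_PER_PEAK * cycle

# Pure instrument time for the cohort, excluding library runs and downtime.
total_days = N_SAMPLES * RUN_TIME_H / 24.0

print(f"cycle time: {cycle:.2f} s")
print(f"implied peak width: {implied_peak_width_s:.1f} s")
print(f"instrument time: {total_days:.1f} days")
```

The accumulation times sum to about 2.4 sec per cycle, and 105 one-hour runs fit within the stated <5 days at 24 samples per day.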
Data processing: DDA data were searched against the human UniProt database using the Pulsar search engine (Biognosys) and a library was generated using 3-6 fragment ions per precursor. The library comprised 49,176 precursors, 44,807 peptides and 5,499 protein groups. This library was then used for data analysis of the SWATH Acquisition data in Spectronaut Pulsar (Biognosys) using default parameters. The analysis of the whole dataset took ~30 hours. All data were filtered by 1% FDR on precursor and protein level. Protein grouping was performed based on the ID picker algorithm. Data were normalized by local regression normalization. Statistical testing for differential protein abundance was done using the Spectronaut pipeline (t-test, multiple testing correction after Storey).
Highly reproducible sample preparation
Colon and rectal cancers rank third in the USA with respect to the number of new cases and number of deaths per year,4 and 1 out of 17 people develops colorectal cancer, which highlights the importance of biomarker research in this disease area. This study comprised 105 FFPE tissue samples: 95 cancer samples and 10 healthy samples from various resection sites of the colon (Figure 2). The sample preparation produced a high peptide yield without bias towards any colon region. On average, one slice yielded 140 μg of total protein, with slightly lower yields for the hepatic flexure and sigmoid regions (Figure 3). Overall, the sample preparation was highly reproducible, with a CV of ~10%.
A highly robust workflow
SWATH Acquisition uses a library for data analysis. In this study, the spectral library that was generated comprised 5,499 protein groups, 44,807 peptides and 49,176 precursors. As highlighted in Figure 4, a median of 3,644 protein groups was quantified in the cancer samples and a median of 2,882 in the healthy samples. These numbers remained constant across the entire sample set, within 1 standard deviation as indicated by the colored areas. In total, 4,565 protein groups were quantified in this sample set across the two sample types.
Data normalization is often applied to correct for small differences in sample starting amounts, or variation in the sample preparation steps and LC-MS analysis. In this study, a small amount of variation was observed across the samples as well as a slight decrease in overall intensity over time (Figure 5, top). This variation was corrected for by normalization in Spectronaut software (Figure 5, bottom).
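To illustrate what run-level normalization does, the sketch below applies a global median centering in log2 space. This is a deliberately simpler technique than the local regression normalization used by Spectronaut in this study, and it is shown only to convey the idea of removing run-to-run intensity shifts. The function name and the toy intensities are invented for illustration.

```python
import math
from statistics import median


def median_normalize(runs):
    """Scale each run so its median log2 intensity matches the global median.

    `runs` maps a run name to a list of positive intensities.
    """
    log_runs = {name: [math.log2(v) for v in vals] for name, vals in runs.items()}
    run_medians = {name: median(vals) for name, vals in log_runs.items()}
    # Target is the median of the per-run medians.
    target = median(run_medians.values())
    return {
        name: [2 ** (v - run_medians[name] + target) for v in vals]
        for name, vals in log_runs.items()
    }


# Invented toy intensities: run_b is uniformly 20% lower than run_a.
toy = {
    "run_a": [1000.0, 2000.0, 4000.0],
    "run_b": [800.0, 1600.0, 3200.0],
}
normalized = median_normalize(toy)
for name, vals in normalized.items():
    print(name, [round(v, 1) for v in vals])
```

After centering, both toy runs share the same median intensity. A global shift like this cannot correct intensity-dependent or time-dependent trends, which is one reason a local regression approach may be preferred for long acquisition series.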
Finally, reproducibility of chromatography is important for targeted data extraction when using the spectral libraries. Using microflow LC, highly reproducible retention times were observed (Figure 6), with a median variation of 0.4% RSD.
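A retention time reproducibility figure of this kind can be computed as sketched below. The sketch assumes the reported value is the median of per-peptide RSDs across runs, which the text suggests but does not state explicitly. Peptide names and retention times are invented.

```python
from statistics import mean, median, stdev


def percent_rsd(values):
    """Relative standard deviation (sample SD / mean) in percent."""
    return 100.0 * stdev(values) / mean(values)


def median_rt_rsd(rt_by_peptide):
    """Median of per-peptide retention time RSDs across runs."""
    return median(percent_rsd(rts) for rts in rt_by_peptide.values())


# Invented retention times (min) for three hypothetical peptides in three runs.
toy_rts = {
    "PEPTIDE_A": [10.0, 10.1, 9.9],
    "PEPTIDE_B": [20.0, 20.0, 20.0],
    "PEPTIDE_C": [30.0, 30.3, 29.7],
}
result = median_rt_rsd(toy_rts)
print(f"median RT RSD: {result:.2f}%")
```

Using the median rather than the mean keeps a few poorly behaved peptides from dominating the summary.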
Many proteins are significantly altered between cancer and healthy tissues
Figure 1 shows a volcano plot of the data after a t-test for this quantitative dataset. The results revealed 1,023 proteins with significantly altered abundance between the healthy and the cancer samples (Q value < 0.01, absolute log2 fold-change > 0.58). The majority of these significantly altered proteins (703) were increased in abundance in the cancer tissue samples.
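The significance criteria can be expressed as a simple filter. The sketch below uses the two cutoffs stated above. The sign convention (positive log2 fold-change means higher in cancer), the function name, and the example values are assumptions made for illustration.

```python
Q_CUTOFF = 0.01
LOG2FC_CUTOFF = 0.58  # about a 1.5-fold change, since 2 ** 0.58 is roughly 1.49


def classify(log2_fc, q_value):
    """Label a protein as 'up', 'down', or 'unchanged' in cancer vs. healthy.

    Assumes log2_fc = log2(cancer / healthy).
    """
    if q_value < Q_CUTOFF and abs(log2_fc) > LOG2FC_CUTOFF:
        return "up" if log2_fc > 0 else "down"
    return "unchanged"


# Invented (log2 fold-change, Q value) pairs for hypothetical proteins.
examples = [(1.2, 0.001), (-0.9, 0.005), (0.3, 0.0001), (2.0, 0.2)]
labels = [classify(fc, q) for fc, q in examples]
print(labels)

# Counts reported in the text: 1,023 significant, of which 703 increased.
decreased = 1023 - 703
print(f"decreased in cancer: {decreased}")
```

Both conditions must hold, so a protein with a large fold-change but a weak Q value, or a strong Q value but a small fold-change, is left unclassified. Under the assumption that every significant protein is either increased or decreased, the reported counts leave 320 proteins decreased in the cancer samples.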
Using principal component analysis (PCA), healthy and cancer samples were clearly separated (PC1 - Figure 7).
Gene ontology information aids in biological interpretation
The PCA findings were supported by gene ontology analysis in Spectronaut in which proteins involved in translation initiation, translation and RNA metabolism were highly enriched in the cancer cohort (Figures 8 and 9). These findings demonstrated an increased protein synthesis capacity in the cancer cells compared to the healthy cells, which has been described as a key physiological task for cancer cells.4
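The source does not describe how Spectronaut computes gene ontology enrichment. A common approach for this kind of analysis is a one-sided hypergeometric test, sketched below as an assumption rather than as the method used in the study. All counts in the example are invented.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when drawing n proteins from N, of which K carry the term.

    k: regulated proteins annotated with the term
    n: all regulated proteins
    K: background proteins annotated with the term
    N: all background proteins
    """
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Invented toy counts: 3 of 5 regulated proteins carry a term that
# annotates 5 of 20 background proteins.
p_value = hypergeom_enrichment_p(k=3, n=5, K=5, N=20)
print(f"enrichment p-value: {p_value:.4f}")
```

In practice the p-values for all tested terms would also need multiple testing correction, and the choice of background (all quantified proteins rather than the whole proteome) affects the result.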
Proteomic profiles can be used to identify potential cancer subtypes
A cluster analysis was performed in order to find patterns in the data, and this revealed three cancer subtypes based on their proteome profiles (Figure 10). These potential sub-populations within the cancer cohort are labeled as proteomic subtype A, B and C. Healthy samples are clustered within the unboxed area to the left. Ribosomal proteins are boxed in red at the bottom of the plot and these show various levels of increasing abundance with respect to the healthy samples for all cancer sub-populations.
Another interesting protein cluster is labeled with a blue arrow. Its expression levels were high in subtype A and in healthy samples, but low in subtypes B and C. This cluster consisted primarily of cell adhesion proteins, which were previously shown to play a significant role in the metastatic potential of colon cancer.6
A previous colon cancer study showed that Hepatocyte Nuclear Factor 4-α (HNF4α) expression levels differ between subtypes.7 In this study, HNF4α was significantly higher in abundance in subtype B (Figure 11) indicating an HNF4α amplification in this tumor subtype.
This study demonstrates how high throughput proteomics can be used to analyze large sample sets from tissue biobanks. Microflow SWATH Acquisition on the TripleTOF 6600 System combined with Spectronaut data analysis generates data from these large resources rapidly and with high analytical depth. The proteomic analysis of the large sample sets available in today’s biobanks will enable a better understanding of the molecular pathways behind health and disease and pave the way for better personalized treatment of cancer.
- High throughput analysis with microflow SWATH Acquisition: 105 samples analyzed in <5 days
- High analytical depth: more than 4,500 protein groups quantified across the two sample types (healthy and colon cancer)
- Fast data analysis and results processing with Spectronaut software
- Three cancer subtypes were found in this study based on proteomic profiles