Data import

Abstract

This vigenette introduces how DEP2 receives results from upstream quantitative tools like MaxQuant, DIA-NN, Spectronaut, or Fragpipe.

Introduction

The DEP2 package utilizes the make_unique function to format protein/PTM-peptide-level tables and then imports them into S4 objects via make_se/pe function. The parameters in these functions can be adjusted to accommodate different result tables generated from search and quantitative software. Here, we present the workflow and provide detailed information on the inputs from four different quantitative software programs.

Example data

We obtained a benchmark proteomics dataset by spiking Yeast and E.coli lysate into a Hela background. The raw data was obtained using an Orbitrap Fusion Lumos Tribrid equipment in either DDA or DIA mode. The DDA data was searched and quantified using MaxQuant or FragPipe, while the DIA data was analyzed using DIA-NN or Spectronaut in library-free mode.

The example tables used in this paper can be found in this in this repository.

# Download OmicsExample firstly
# Change the path to Omics Example
example_path = "the/path/to/OmicsExample"
knitr::opts_knit$set(
  root.dir = example_path
  )

library(DEP2)
library(dplyr)

Constructing S4 object

The DEP2 analysis workflow is build upon S4 object, so the initial step is to format the table and convert it into an S4 container.

Converting protein-level data into SE format

For protein-level data, the gene name or protein ID should be used as the feature identifier. DEP2 provides make_unique and make_unique_ptm functions to format the feature identifiers in proteomics results. The make_unique function cleans the identifiers, keeps the first name/ID in each group, and makes repeated identifiers unique by adding a suffix (‘ProXXX’, ‘ProXXX.1’, ‘ProXXX.2’). Repeated identifiers are common when library search is performed with isoforms or unreviewed sequences. The names and ids parameters should be gene/protein names or IDs. The unique identifiers are saved in the new columns “name” and “id”. Then, the make_se function can convert the ‘uniqued’ table into a SE object.

Conversion from MaxQuant proteinGroups

In this pipeline, DEP2 use the table proteinGroups.txt as input.

# Load proteinGroups table
mq_pgfile <- "./A_spikeIn/MaxQuant_output/proteinGroups.txt.gz"

pg <- read.csv(mq_pgfile,sep = "\t")

The names and ids are in the columns “Gene.names” and “Gene.names”, respectively.

## Generate unique names and ids
unique_pg <- make_unique(pg, 
                         name = "Gene.names", #gene 
                         ids = "Protein.IDs"   #protein
                         )

## New columns "name" and "id", which is necessary to function make_se
head(unique_pg[,c("name","ID","Gene.names","Protein.IDs")])
#>         name         ID
#> 1      NUDT4     Q9NZJ9
#> 2     P0DPI2     P0DPI2
#> 3 A0A0U1RRE5 A0A0U1RRE5
#> 4 A0A0U1RRL7 A0A0U1RRL7
#> 5     JW5951 A0A385XJE6
#> 6       UBA6     A0AVT1
#>                                                                                                             Gene.names
#> 1                                                                                                                NUDT4
#> 2                                                                                                                     
#> 3                                                                                                                     
#> 4                                                                                                                     
#> 5 JW5951;JW5936;JW5935;JW5928;JW5914;JW5895;JW1882;insH11;insH10;insH9;insH8;insH7;insH6;insH4;insH3;insH2;insH1;insH5
#> 6                                                                                                                 UBA6
#>                                                                                                                                Protein.IDs
#> 1                                                                                                          Q9NZJ9;A0A024RBG1;Q96G61;Q8NFP7
#> 2                                                                                                                        P0DPI2;A0A0B4J2D5
#> 3                                                                                                                               A0A0U1RRE5
#> 4                                                                                                                               A0A0U1RRL7
#> 5 A0A385XJE6;P0CE65;P0CE64;P0CE63;P0CE62;P0CE61;P0CE60;P0CE59;P0CE58;P0CE57;P0CE56;P0CE55;P0CE54;P0CE53;P0CE52;P0CE51;P0CE50;P0CE49;P76071
#> 6                                                                                                                                   A0AVT1

In this pipeline DEP2 uses SE object as the container to store expression assay, features information (row data) and experiment design (columns data). After make_unique identifiers, DEP2 provides the make_se/make_se_parse functions to convert SummarizedExperiment object. The former required an input experiment design table to annotate samples, which must contain three columns: ‘label’, ‘condition’, and ‘replicate’. The ‘parse’ functions can automatically assign experiment design table by parsing the column names of the abundance assay.

## Expression columns in the input data
ecols <- grep("LFQ.intensity.", colnames(unique_pg))

## Experiment design table
expDesign_file <- "./A_spikeIn/MaxQuant_output/expdesign.txt.gz"
expdesign <- read.table(expDesign_file, sep = "\t", header = T)
expdesign
#>   label condition replicate
#> 1  A1_1        A1         1
#> 2  A1_2        A1         2
#> 3  A1_3        A1         3
#> 4  B1_1        B1         1
#> 5  B1_2        B1         2
#> 6  B1_3        B1         3

## Convert SE with expdesign table
se <- make_se(unique_pg, columns = ecols, expdesign = expdesign)

## Convert SE with expdesign table using parse function
se <- make_se_parse(unique_pg, columns = ecols, mode = "delim")

How experiment design is extracted from colnames:

## How experiment design is extracted from colnames. Split by delim or by character number 
sample_names = c("Quantity.A_1","Quantity.A_2","Quantity.B_1",
                 "Quantity.B_3","Quantity.B_2","Quantity.A_3")
get_exdesign_parse(sample_names,mode = "delim", sep = "_", remove_prefix = TRUE)
#>   label  ID condition replicate
#> 1   A_1 A_1         A         1
#> 2   A_2 A_2         A         2
#> 3   B_1 B_1         B         1
#> 4   B_3 B_3         B         3
#> 5   B_2 B_2         B         2
#> 6   A_3 A_3         A         3

sample_names = c("Quantity.A1","Quantity.A2","Quantity.B1",
                 "Quantity.B3","Quantity.B2","Quantity.A3")
get_exdesign_parse(sample_names,mode = "char", chars = 1, remove_prefix = TRUE)
#>   label ID condition replicate
#> 1    A1 A1         A         1
#> 2    A2 A2         A         2
#> 3    B1 B1         B         1
#> 4    B3 B3         B         3
#> 5    B2 B2         B         2
#> 6    A3 A3         A         3

Conversion from FragPipe combined_protein

Next, we use the table combined_protein.tsv from FragPipe as input.

# Load combined_protein.tsv table
FragP_pgfile <- "./A_spikeIn/FragPipe_output/combined_protein.tsv.gz"

FragP_pg <- read.csv(FragP_pgfile,sep = "\t")

Unique identifier. The names and ids are in the columns “Gene” and “Protein.ID”.

## Generate unique names and ids
unique_FragP_pg <- make_unique(FragP_pg, 
                               name = "Gene",  #gene 
                               ids = "Protein.ID"   #protein
                               )

## new columns name and id, which is necessary to make_se
head(unique_FragP_pg[,c("name", "ID", "Gene", "Protein.ID")])
#>     name     ID   Gene Protein.ID
#> 1   ADH1 P00330   ADH1     P00330
#> 2 P00761 P00761            P00761
#> 3 CSN1S1 P02662 CSN1S1     P02662
#> 4 CSN1S2 P02663 CSN1S2     P02663
#> 5   CSN2 P02666   CSN2     P02666
#> 6   CSN3 P02668   CSN3     P02668

Convert SE. Here, We used the “MaxLFQ Intensity” values. “Intesity” is also practicable, but may bring some effect on the test result.

## Expression col in the combined_protein.tsv
ecols <- grep(".MaxLFQ.Intensity", colnames(unique_FragP_pg),value = T)
ecols
#> [1] "A_1.MaxLFQ.Intensity" "A_2.MaxLFQ.Intensity" "A_3.MaxLFQ.Intensity"
#> [4] "B_1.MaxLFQ.Intensity" "B_3.MaxLFQ.Intensity"

## Convert SE with expdesign table
se <- make_se_parse(unique_FragP_pg, 
                    columns = ecols, 
                    mode = "delim",
                    remove_suffix = TRUE    ## remove the col suffix ".MaxLFQ.Intensity"
                    )

## The suffix is removed in SE
colData(se)
#> DataFrame with 5 rows and 4 columns
#>           label          ID   condition   replicate
#>     <character> <character> <character> <character>
#> A_1         A_1         A_1           A           1
#> A_2         A_2         A_2           A           2
#> A_3         A_3         A_3           A           3
#> B_1         B_1         B_1           B           1
#> B_3         B_3         B_3           B           3
colnames(se)
#> [1] "A_1" "A_2" "A_3" "B_1" "B_3"

Conversion from DIA-NN result

As in the example above, make_se and make_se_parse accept wide-format tables which each row represents a features (a protein or a proteingroup), and rows contain the identification information and quantification information of samples. However, long-format tables are also widely used in many cases, such as the report.tsv output from DIA-NN or the MSstats output. Function reshape_long2wide can turn long-format tables into wide tables. Next, we will show the difference between handling wide- or long-tables.

Wide table output

The report.pg_matrix.tsv from DIA-NN is a wide-format table similar to proteinGroups.

# Load combined_protein.tsv table
Diann_pgfile <- "./A_spikeIn/Diann_output/report.pg_matrix.tsv.gz"

Diann_pg <- read.csv(Diann_pgfile,sep = "\t", fileEncoding="latin1")

Unique identifier. The name and ids are in the columns “Genes” and “Protein.Group”.

## Generate unique names and ids
unique_diann_pg <- make_unique(Diann_pg, 
                               name = "Genes",  #gene 
                               ids = "Protein.Group"   #protein
                               )

## New columns ”name“ and "ID", which is necessary to make_se
head(unique_diann_pg[,c("name", "ID", "Genes", "Protein.Group")])
#>        name         ID           Genes     Protein.Group
#> 1 YER079C-A A0A023PZB8       YER079C-A        A0A023PZB8
#> 2   PPIAL4C A0A0B4J2A2 PPIAL4C;PPIAL4G A0A0B4J2A2;P0DN37
#> 3     GATD3 A0A0B4J2D5    GATD3;GATD3B A0A0B4J2D5;P0DPI2
#> 4   PIGBOS1 A0A0B4J2F0         PIGBOS1        A0A0B4J2F0
#> 5      SIK1 A0A0B4J2F2      SIK1;SIK1B A0A0B4J2F2;P57059
#> 6   CENPVL1 A0A0U1RR11 CENPVL1;CENPVL2 A0A0U1RR11;P0DPI3

Convert SE. The expression columns are directly named by the file names of samples. We recommend to renaming the columns or renaming the MS files before DIAN-NN search.

## Expression col in the DIA-NN report.pg_matrix
ecols <- grep(".raw$", colnames(unique_diann_pg),value = T)
ecols
#> [1] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_1.raw"
#> [2] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_2.raw"
#> [3] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_3.raw"
#> [4] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_1.raw"
#> [5] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_2.raw"
#> [6] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_3.raw"

## Convert SE with expdesign table
se <- make_se_parse(unique_diann_pg, 
                    columns = ecols, 
                    mode = "delim",
                    remove_prefix = TRUE,   ## remove the file prefix 
                    remove_suffix = TRUE    ## remove the col suffix ".raw"
                    )

## The file prefix and suffix is removed in SE
colData(se)
#> DataFrame with 6 rows and 4 columns
#>            label          ID   condition   replicate
#>      <character> <character> <character> <character>
#> A6_1        A6_1        A6_1          A6           1
#> A6_2        A6_2        A6_2          A6           2
#> A6_3        A6_3        A6_3          A6           3
#> B6_1        B6_1        B6_1          B6           1
#> B6_2        B6_2        B6_2          B6           2
#> B6_3        B6_3        B6_3          B6           3
colnames(se)
#> [1] "A6_1" "A6_2" "A6_3" "B6_1" "B6_2" "B6_3"

Long table output

Unlike the above cases, the report.tsv from DIA-NN is in long format. Function reshape_long2wide can reshape long tables to wide format, before constructing SE object.

## Load combined_protein.tsv table
Diann_repfile <- "./A_spikeIn/Diann_output/report.tsv.gz"
# Diann_repfile <- "./example_data/OmicsExample/A_spikeIn/Diann_output/report.tsv.gz"
Diann_rep <- read.csv(Diann_repfile,sep = "\t") #
dim(Diann_rep) # 390479 rows, each row is a precursor.
#> [1] 390479     64

## Filter out proteingroups exceeding the threshold value
Diann_rep = dplyr::filter(Diann_rep, PG.Q.Value < 0.01)


## Table report.tsv stores both Precursor- and PG-level quality
DT::datatable(head(Diann_rep) %>% mutate_if(is.character, utf8::utf8_encode), options = list(scrollX = T,pageLength = 6))

Reshape long-table.


Diann_rep_wided = reshape_long2wide(Diann_rep,
                                    sample_col = "File.Name",      # the column labeling sample names
                                    feature_col = "Protein.Group", # PG IDs
                                    expression_col = "PG.MaxLFQ",  # PG quantity. Normalized one is also ok。
                                    shrink_ident_cols  = "Genes",  # Gene names
                                    extend_ident_cols =  "Protein.Q.Value",  # optional, some identification info.
                                    remove_sample_prefix = FALSE,  # remove prefix in sample_col
                                    remove_sample_suffix = FALSE   # remove suffix in sample_col
                                    )

The reshaping result has ‘Genes’ in a column and extends the ‘Protein.Q.Value’ to 6 columns, each representing a different sample. The variables in shrink_ident_cols are combined into a string by concatenating all the values for each feature and separating them with a ‘;’. The variables in extend_ident_cols are expanded into multiple columns, with each column containing the values for each sample (e.g., “Q.Value”).

DT::datatable(head(Diann_rep_wided,3), options = list(scrollX = T,pageLength = 3))

Unique identifier. The names and ids are in the columns “Genes” and “Protein.Group”.

## Generate unique names and ids
unique_diann_pg2 <- make_unique(Diann_rep_wided, 
                                name = "Genes",  #gene 
                                ids = "Protein.Group"   #protein
                                )

## new columns name and id, which is necessary to make_se
head(unique_diann_pg2[,c("name", "ID", "Genes", "Protein.Group")],4)
#>      name         ID                                           Genes
#> 1   GATD3 A0A0B4J2D5                                    GATD3;GATD3B
#> 2 PIGBOS1 A0A0B4J2F0                                         PIGBOS1
#> 3    SIK1 A0A0B4J2F2                                      SIK1;SIK1B
#> 4   insA1 A0A385XJ53 insA1;insA2;insA3;insA4;insA5;insA6;insA8;insA9
#>                                                 Protein.Group
#> 1                                           A0A0B4J2D5;P0DPI2
#> 2                                                  A0A0B4J2F0
#> 3                                           A0A0B4J2F2;P57059
#> 4 A0A385XJ53;P0CF07;P0CF08;P0CF09;P0CF10;P0CF11;P0CF12;P0CF13

Convert SE. The expression columns are also directly named by the file names of samples.

## Expression col in the DIA-NN report.pg_matrix
ecols <- grep(".raw$", colnames(unique_diann_pg2),value = T)
ecols # contains Protein.Q.Value columns
#>  [1] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_1.raw"                
#>  [2] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_2.raw"                
#>  [3] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_3.raw"                
#>  [4] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_1.raw"                
#>  [5] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_2.raw"                
#>  [6] "F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_3.raw"                
#>  [7] "Protein.Q.Value.F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_1.raw"
#>  [8] "Protein.Q.Value.F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_2.raw"
#>  [9] "Protein.Q.Value.F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_A6_3.raw"
#> [10] "Protein.Q.Value.F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_1.raw"
#> [11] "Protein.Q.Value.F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_2.raw"
#> [12] "Protein.Q.Value.F..DIA_5.20211207_LM_SA_ZXF_LY_FZH_Hela_AH109_DH5.Á_FASP_DIA_B6_3.raw"

ecols = ecols[1:6]

## Convert SE with expdesign table
se <- make_se_parse(unique_diann_pg2, 
                    columns = ecols, 
                    mode = "delim",
                    remove_prefix = TRUE,   ## remove the file prefix 
                    remove_suffix = TRUE    ## remove the col suffix ".raw"
                    )

## The file prefix and suffix is removed in SE
colData(se)
#> DataFrame with 6 rows and 4 columns
#>            label          ID   condition   replicate
#>      <character> <character> <character> <character>
#> A6_1        A6_1        A6_1          A6           1
#> A6_2        A6_2        A6_2          A6           2
#> A6_3        A6_3        A6_3          A6           3
#> B6_1        B6_1        B6_1          B6           1
#> B6_2        B6_2        B6_2          B6           2
#> B6_3        B6_3        B6_3          B6           3
colnames(se)
#> [1] "A6_1" "A6_2" "A6_3" "B6_1" "B6_2" "B6_3"

Conversion from Spectronaut Report

Spectronaut offers the ability to customize the output table. In this example, we will use the built-in MSstats style table, which is in a long-format.

Spe_repfile <- "./A_spikeIn/Spectronaut_output/DIA_MSStats_Report.xls.gz"
Spe_rep <- read.csv(Spe_repfile,sep = "\t") 
dim(Spe_rep) # 318132 rows, each row is a precursor.
#> [1] 449106     12

## Store both Precursor-, Peptides- and PG-level quality
DT::datatable(head(Spe_rep) %>% mutate_if(is.character, utf8::utf8_encode), options = list(scrollX = T,pageLength = 6))

Reshape long table.

Spe_rep_wided = reshape_long2wide(Spe_rep,
                                  sample_col = "R.FileName",      # the column labeling sample names
                                  feature_col = "PG.ProteinGroups", # PG IDs
                                  expression_col = "PG.Quantity",   # PG quantity.
                                  shrink_ident_cols  = c("PG.Genes","PG.Qvalue"),  # Gene names
                                  remove_sample_prefix = TRUE,    # remove prefix in sample_col
                                  remove_sample_suffix = FALSE  
                                  )

DT::datatable(head(Spe_rep_wided), options = list(scrollX = T,pageLength = 6))

Unique identifier. The names and Protein.Group are located in the “Genes” and “Protein.Group” columns.

## Generate unique names and ids
unique_spe_pg <- make_unique(Spe_rep_wided, 
                             name = "PG.Genes",  #gene 
                             ids = "PG.ProteinGroups"   #protein
                             )

## new columns name and id, which is necessary to make_se
head(unique_spe_pg[,c("name", "ID", "PG.Genes", "PG.ProteinGroups")],4)
#>      name         ID                                        PG.Genes
#> 1  NUDT4B A0A024RBG1                                    NUDT4B;NUDT4
#> 2  GATD3B A0A0B4J2D5                                    GATD3B;GATD3
#> 3 MMP24OS A0A0U1RRL7                                         MMP24OS
#> 4   insA9 A0A385XJ53 insA9;insA1;insA2;insA3;insA4;insA5;insA6;insA8
#>                                              PG.ProteinGroups
#> 1                                           A0A024RBG1;Q9NZJ9
#> 2                                           A0A0B4J2D5;P0DPI2
#> 3                                                  A0A0U1RRL7
#> 4 A0A385XJ53;P0CF07;P0CF08;P0CF09;P0CF10;P0CF11;P0CF12;P0CF13

Convert SE. The expression columns is directly named by the file names of samples.

## Expression columns in the Spectronaut output
ecols <- 2:7
colnames(unique_spe_pg)[ecols]
#> [1] "A5_1" "A5_2" "A5_3" "B5_1" "B5_2" "B5_3"

## Convert SE with expdesign table
se <- make_se_parse(unique_spe_pg, 
                    columns = ecols, 
                    mode = "delim",
                    remove_prefix = TRUE,   ## remove the file prefix 
                    remove_suffix = FALSE   ## no suffix in this case
                    )

## The file prefix and suffix is removed in SE
colData(se)
#> DataFrame with 6 rows and 4 columns
#>            label          ID   condition   replicate
#>      <character> <character> <character> <character>
#> A5_1        A5_1        A5_1          A5           1
#> A5_2        A5_2        A5_2          A5           2
#> A5_3        A5_3        A5_3          A5           3
#> B5_1        B5_1        B5_1          B5           1
#> B5_2        B5_2        B5_2          B5           2
#> B5_3        B5_3        B5_3          B5           3
colnames(se)
#> [1] "A5_1" "A5_2" "A5_3" "B5_1" "B5_2" "B5_3"

Converting peptide-level data to a QFeatures object

DEP2 provides a pipeline, using the QFeature package, to aggregate and summarize peptide quantities into protein-level.The protein-level quantities used in the preceding section were counted by upstream software and mostly summarized using the maxLFQ algorithm. DEP2 offers other aggregation strategies in this peptide-to-protein analysis pipeline.

The following code snippet demonstrates the conversion of peptide-level data to a QFeatures object using different software outputs.

Conversion from MaxQuant peptides

The first step is to load in peptide-level data into QFeatures class via the make_pe or make_pe_pars function. We start by reading the peptides.txt file from the MaxQuant txt directory.

mq_pepfile <- "./A_spikeIn/MaxQuant_output/peptides.txt.gz"

mq_pep <- read.csv(mq_pepfile,sep = "\t")

To convert the data into a QFeatures object, we extract the ‘intensity’ columns.

ecols <- grep("Intensity.", colnames(mq_pep)) ## the peptides intensity cols
mq_pe <- make_pe(mq_pep, columns = ecols, expdesign = expdesign)
mq_pe
#> An instance of class QFeatures containing 1 assays:
#>  [1] peptideRaw: SummarizedExperiment with 69582 rows and 6 columns

Conversion from FragPipe combined_peptides

Next, we process the combined_peptide.tsv file generated by FragPipe.

FragP_pepfile <- "./A_spikeIn/FragPipe_output/combined_peptide.tsv.gz"

FragP_pep <- read.csv(FragP_pepfile,sep = "\t")

We use the ‘Intensity’ columns as expression columns.

ecols <- grep("[0-9].Intensity", colnames(FragP_pep),value = T)  ## the peptides intensity cols

FragP_pe <- make_pe_parse(FragP_pep, columns = ecols, 
                          mode = "delim", 
                          remove_suffix = T ## remove suffix
                          )
colData(FragP_pe)
#> DataFrame with 5 rows and 4 columns
#>           label          ID   condition replicate
#>     <character> <character> <character> <integer>
#> A_1         A_1         A_1           A         1
#> A_2         A_2         A_2           A         2
#> A_3         A_3         A_3           A         3
#> B_1         B_1         B_1           B         1
#> B_3         B_3         B_3           B         3

Conversion from DIA-NN report

We reshape the peptide quantities into a wide table using the reshape_long2wide function. The ‘Stripped.Sequence’ column represents the peptide IDs. A stripped peptide may have multiple precursors due to variable modifications or different charge states. For these peptides, the function retains the maximum expression values.


## filter out proteingroups exceed the threshold value
Diann_rep = dplyr::filter(Diann_rep, PG.Q.Value < 0.01)

Diann_pep_wided = reshape_long2wide(Diann_rep,
                                    sample_col = "File.Name",      # the column labeling sample names
                                    feature_col = "Stripped.Sequence", # PG IDs
                                    expression_col = "Precursor.Quantity",  # PG quantity. Normalized one is also ok。
                                    shrink_ident_cols  = c("Protein.Group","Genes"),  # Gene names
                                    extend_ident_cols =  c("Global.Q.Value"),  # optional, some identification info.
                                    remove_sample_prefix = FALSE,  # remove prefix in sample_col
                                    remove_sample_suffix = FALSE   # remove suffix in sample_col
                                    )

DT::datatable(head(Diann_pep_wided), options = list(scrollX = T,pageLength = 6))

ecols = 2:7

pe <- make_pe_parse(Diann_pep_wided, columns = ecols, 
                    mode = "delim", 
                    remove_suffix = T ## remove suffix
                    )

Conversion from Spectronaut result

The Spectronaut result also store the peptide-level quantities.

Spe_repfile <- "./A_spikeIn/Spectronaut_output/DIA_MSStats_Report.xls.gz"
Spe_rep <- read.csv(Spe_repfile,sep = "\t") #
dim(Spe_rep) # 318132 rows, each row is a precursor.
#> [1] 449106     12

## Store both Precursor-, Peptides- and PG-level quality
DT::datatable(head(Spe_rep) %>% mutate_if(is.character, utf8::utf8_encode), options = list(scrollX = T,pageLength = 6))

Reshape the long-table.

Spe_rep_wided2 = reshape_long2wide(Spe_rep,
                                   sample_col = "R.FileName",      # the column labeling sample names
                                   feature_col = "PEP.StrippedSequence", # peptide sequence
                                   expression_col = "PEP.Quantity",      # peptide quantity.
                                   shrink_ident_cols  = c("PG.Genes","PG.ProteinAccessions"),  # Gene names and protien IDs
                                   remove_sample_prefix = TRUE,    # remove prefix in sample_col
                                   remove_sample_suffix = FALSE  
                                   )

Convert QFeatures object

ecols = 2:7

pe <- make_pe_parse(Spe_rep_wided2, columns = ecols, 
                    mode = "delim", 
                    remove_suffix = T ## remove suffix
                    )
colData(pe)
#> DataFrame with 6 rows and 4 columns
#>            label          ID   condition replicate
#>      <character> <character> <character> <integer>
#> A5_1        A5_1        A5_1          A5         1
#> A5_2        A5_2        A5_2          A5         2
#> A5_3        A5_3        A5_3          A5         3
#> B5_1        B5_1        B5_1          B5         1
#> B5_2        B5_2        B5_2          B5         2
#> B5_3        B5_3        B5_3          B5         3

Session information

#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
#>  [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
#>  [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Asia/Shanghai
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.2                 DEP2_0.3.7.3               
#>  [3] R6_2.5.1                    limma_3.56.2               
#>  [5] MSnbase_2.26.0              ProtGenerics_1.32.0        
#>  [7] mzR_2.34.1                  Rcpp_1.0.11                
#>  [9] MsCoreUtils_1.12.0          SummarizedExperiment_1.30.2
#> [11] Biobase_2.60.0              GenomicRanges_1.52.0       
#> [13] GenomeInfoDb_1.36.1         IRanges_2.34.1             
#> [15] S4Vectors_0.38.1            BiocGenerics_0.46.0        
#> [17] MatrixGenerics_1.12.2       matrixStats_1.0.0          
#> [19] BiocStyle_2.28.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.15.0          
#>   [3] jsonlite_1.8.7              shape_1.4.6                
#>   [5] umap_0.2.10.0               MultiAssayExperiment_1.26.0
#>   [7] magrittr_2.0.3              MALDIquant_1.22.1          
#>   [9] rmarkdown_2.23              GlobalOptions_0.1.2        
#>  [11] fs_1.6.2                    zlibbioc_1.46.0            
#>  [13] ragg_1.2.5                  vctrs_0.6.3                
#>  [15] memoise_2.0.1               Rsamtools_2.16.0           
#>  [17] RCurl_1.98-1.12             askpass_1.1                
#>  [19] BiocBaseUtils_1.2.0         itertools_0.1-3            
#>  [21] htmltools_0.5.5             S4Arrays_1.0.4             
#>  [23] missForest_1.5              mzID_1.38.0                
#>  [25] sass_0.4.6                  bslib_0.5.0                
#>  [27] htmlwidgets_1.6.2           desc_1.4.2                 
#>  [29] plyr_1.8.8                  impute_1.74.1              
#>  [31] cachem_1.0.8                GenomicAlignments_1.36.0   
#>  [33] igraph_1.5.0                lifecycle_1.0.3            
#>  [35] iterators_1.0.14            pkgconfig_2.0.3            
#>  [37] Matrix_1.5-4.1              fastmap_1.1.1              
#>  [39] GenomeInfoDbData_1.2.10     clue_0.3-64                
#>  [41] digest_0.6.33               fdrtool_1.2.17             
#>  [43] pcaMethods_1.92.0           colorspace_2.1-0           
#>  [45] DESeq2_1.40.2               rprojroot_2.0.3            
#>  [47] RSpectra_0.16-1             crosstalk_1.2.0            
#>  [49] textshaping_0.3.6           randomForest_4.7-1.1       
#>  [51] fansi_1.0.4                 compiler_4.3.1             
#>  [53] rngtools_1.5.2              proxy_0.4-27               
#>  [55] withr_2.5.0                 doParallel_1.0.17          
#>  [57] downloader_0.4              BiocParallel_1.34.2        
#>  [59] MASS_7.3-60                 openssl_2.0.6              
#>  [61] DelayedArray_0.26.6         rjson_0.2.21               
#>  [63] tools_4.3.1                 glue_1.6.2                 
#>  [65] QFeatures_1.10.0            grid_4.3.1                 
#>  [67] Rtsne_0.16                  cluster_2.1.4              
#>  [69] reshape2_1.4.4              generics_0.1.3             
#>  [71] gtable_0.3.3                class_7.3-22               
#>  [73] preprocessCore_1.62.1       tidyr_1.3.0                
#>  [75] data.table_1.14.8           utf8_1.2.3                 
#>  [77] XVector_0.40.0              foreach_1.5.2              
#>  [79] pillar_1.9.0                stringr_1.5.0              
#>  [81] circlize_0.4.15             splines_4.3.1              
#>  [83] lattice_0.21-8              survival_3.5-5             
#>  [85] tidyselect_1.2.0            ComplexHeatmap_2.16.0      
#>  [87] locfit_1.5-9.8              Biostrings_2.68.1          
#>  [89] knitr_1.43                  bookdown_0.34              
#>  [91] edgeR_3.42.4                xfun_0.39                  
#>  [93] DT_0.28                     stringi_1.7.12             
#>  [95] lazyeval_0.2.2              yaml_2.3.7                 
#>  [97] evaluate_0.21               codetools_0.2-19           
#>  [99] tibble_3.2.1                BiocManager_1.30.21        
#> [101] cli_3.6.1                   affyio_1.70.0              
#> [103] reticulate_1.30             systemfonts_1.0.4          
#> [105] munsell_0.5.0               jquerylib_0.1.4            
#> [107] TCseq_1.23.0                png_0.1-8                  
#> [109] XML_3.99-0.14               parallel_4.3.1             
#> [111] ellipsis_0.3.2              pkgdown_2.0.7              
#> [113] ggplot2_3.4.2               assertthat_0.2.1           
#> [115] doRNG_1.8.6                 AnnotationFilter_1.24.0    
#> [117] bitops_1.0-7                glmnet_4.1-7               
#> [119] scales_1.2.1                affy_1.78.1                
#> [121] e1071_1.7-13                ncdf4_1.21                 
#> [123] purrr_1.0.1                 crayon_1.5.2               
#> [125] GetoptLong_1.0.5            rlang_1.1.1                
#> [127] vsn_3.68.0

Zhenhuan Feng