Author: Sonali Arora (sarora@fredhutch.org)
 Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2
This lab will walk you through an end-to-end RNA-Seq differential expression workflow, using DESeq2 along with other Bioconductor packages.
Note: a number of other Bioconductor packages can also be used for statistical inference of differential expression at the gene level including edgeR, BaySeq, DSS and limma.
Using the data from airway, design and implement
an end-to-end RNA-Seq differential expression analysis, using DESeq2
Steps include -
The data used in this Lab is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample.
The reference for the experiment is:
Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.
For our analysis, we wil use data from the data package airway.
library("airway")
data(airway)The data stored inside airway is a SummarizedExperiment object.
library("airway")
data(airway)
se <- airway
se## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSampleOnce we have our fully annotated SummarizedExperiment object, we can construct a DESeqDataSet object from it, which will then form the starting point of the actual DESeq2 package, described in the following sections. We add an appropriate design for the analysis.
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)It will be convenient to make sure that untrt is the first level in the dex factor, so that the default log2 fold changes are calculated as treated over untreated (by default R will chose the first alphabetical level, remember: computers don’t know what to do unless you tell them). The function relevel achieves this:
dds$dex <- relevel(dds$dex, "untrt")Finally, we are ready to run the differential expression pipeline. With the data object prepared, the DESeq2 analysis can now be run with a single call to the function DESeq:
dds <- DESeq(dds)## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testingThis function will print out a message for the various steps it performs. These are described in more detail in the manual page for DESeq, which can be accessed by typing ?DESeq. Briefly these are: the estimation of size factors (which control for differences in the library size of the sequencing experiments), the estimation of dispersion for each gene, and fitting a generalized linear model.
A DESeqDataSet is returned which contains all the fitted information within it, and the following section describes how to extract out results tables of interest from this object.
Calling results without any arguments will extract the estimated log2 fold changes and p values for the last variable in the design formula. If there are more than 2 levels for this variable, results will extract the results table for a comparison of the last level over the first level.
(res <- results(dds))## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 64102 rows and 6 columns
##                  baseMean log2FoldChange      lfcSE       stat       pvalue        padj
##                 <numeric>      <numeric>  <numeric>  <numeric>    <numeric>   <numeric>
## ENSG00000000003 708.60217    -0.37424998 0.09873107 -3.7906000 0.0001502838 0.001251416
## ENSG00000000005   0.00000             NA         NA         NA           NA          NA
## ENSG00000000419 520.29790     0.20215551 0.10929899  1.8495642 0.0643763851 0.192284345
## ENSG00000000457 237.16304     0.03624826 0.13684258  0.2648902 0.7910940556 0.910776144
## ENSG00000000460  57.93263    -0.08523371 0.24654402 -0.3457140 0.7295576905 0.878646940
## ...                   ...            ...        ...        ...          ...         ...
## LRG_94                  0             NA         NA         NA           NA          NA
## LRG_96                  0             NA         NA         NA           NA          NA
## LRG_97                  0             NA         NA         NA           NA          NA
## LRG_98                  0             NA         NA         NA           NA          NA
## LRG_99                  0             NA         NA         NA           NA          NAAs res is a DataFrame object, it carries metadata with information on the meaning of the columns:
mcols(res, use.names=TRUE)## DataFrame with 6 rows and 2 columns
##                        type                               description
##                 <character>                               <character>
## baseMean       intermediate mean of normalized counts for all samples
## log2FoldChange      results  log2 fold change (MAP): dex trt vs untrt
## lfcSE               results          standard error: dex trt vs untrt
## stat                results          Wald statistic: dex trt vs untrt
## pvalue              results       Wald test p-value: dex trt vs untrt
## padj                results                      BH adjusted p-valuesThe first column, baseMean, is a just the average of the normalized count values, dividing by size factors, taken over all samples. The remaining four columns refer to a specific contrast, namely the comparison of the trt level over the untrt level for the factor variable dex. See the help page for results (by typing ?results) for information on how to obtain other contrasts.
The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of \(2^{1.5} \approx 2.82\).
We can also summarize the results with the following line of code, which reports some additional information
summary(res)## 
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 2641, 7.9% 
## LFC < 0 (down)   : 2242, 6.7% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 15441, 46% 
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?resultsA quick way to visualize the counts for a particular gene is to use the plotCounts function, which takes as arguments the DESeqDataSet, a gene name, and the group over which to plot the counts.
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene=topGene, intgroup=c("dex"))For a much detailed analysis see
- Case Study- How to build a SummarizedExperiment - airway dataset
- Differential Expression Lab
If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015
sessionInfo()sessionInfo()## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2                      gplots_2.17.0                          
##  [3] DESeq2_1.9.23                           RcppArmadillo_0.5.200.1.0              
##  [5] Rcpp_0.11.6                             airway_0.103.1                         
##  [7] Rattus.norvegicus_1.3.1                 org.Rn.eg.db_3.1.2                     
##  [9] GO.db_3.1.2                             OrganismDbi_1.11.42                    
## [11] BSgenome.Rnorvegicus.UCSC.rn5_1.4.0     BSgenome_1.37.3                        
## [13] rtracklayer_1.29.12                     TxDb.Rnorvegicus.UCSC.rn5.refGene_3.1.3
## [15] org.Hs.eg.db_3.1.2                      RSQLite_1.0.0                          
## [17] DBI_0.3.1                               TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
## [19] GenomicFeatures_1.21.13                 AnnotationDbi_1.31.17                  
## [21] AnnotationHub_2.1.30                    RNAseqData.HNRNPC.bam.chr14_0.7.0      
## [23] GenomicAlignments_1.5.11                Rsamtools_1.21.14                      
## [25] Biostrings_2.37.2                       XVector_0.9.1                          
## [27] SummarizedExperiment_0.3.2              Biobase_2.29.1                         
## [29] GenomicRanges_1.21.16                   GenomeInfoDb_1.5.8                     
## [31] IRanges_2.3.14                          S4Vectors_0.7.10                       
## [33] BiocGenerics_0.15.3                     ggplot2_1.0.1                          
## [35] BiocStyle_1.7.4                        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                 httr_1.0.0                   tools_3.2.1                 
##  [4] R6_2.1.0                     KernSmooth_2.23-15           rpart_4.1-10                
##  [7] Hmisc_3.16-0                 colorspace_1.2-6             nnet_7.3-10                 
## [10] gridExtra_2.0.0              curl_0.9.1                   graph_1.47.2                
## [13] formatR_1.2                  labeling_0.3                 caTools_1.17.1              
## [16] scales_0.2.5                 genefilter_1.51.0            RBGL_1.45.1                 
## [19] stringr_1.0.0                digest_0.6.8                 foreign_0.8-65              
## [22] rmarkdown_0.7                htmltools_0.2.6              BiocInstaller_1.19.8        
## [25] shiny_0.12.1                 BiocParallel_1.3.34          gtools_3.5.0                
## [28] acepack_1.3-3.3              RCurl_1.95-4.7               magrittr_1.5                
## [31] Formula_1.2-1                futile.logger_1.4.1          munsell_0.4.2               
## [34] proto_0.3-10                 stringi_0.5-5                yaml_2.1.13                 
## [37] MASS_7.3-43                  zlibbioc_1.15.0              plyr_1.8.3                  
## [40] grid_3.2.1                   gdata_2.17.0                 lattice_0.20-33             
## [43] splines_3.2.1                annotate_1.47.1              locfit_1.5-9.1              
## [46] knitr_1.10.5                 geneplotter_1.47.0           reshape2_1.4.1              
## [49] codetools_0.2-14             biomaRt_2.25.1               futile.options_1.0.0        
## [52] XML_3.98-1.3                 evaluate_0.7                 latticeExtra_0.6-26         
## [55] lambda.r_1.1.7               httpuv_1.3.2                 gtable_0.1.2                
## [58] mime_0.3                     xtable_1.7-4                 survival_2.38-3             
## [61] cluster_2.0.2                interactiveDisplayBase_1.7.0