---
title: "Introduction to the DChIPRep package"
author: "Bernd Klaus and Christophe Chabbert"
bibliography: references.bib
date: "`r doc_date()`"
output:
   BiocStyle::html_document:
      toc: true
vignette: >
  %\VignetteIndexEntry{DChIPRepVignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---


# Preparations

```{r style, echo=FALSE, results="asis", cache=FALSE, message=FALSE}
library(knitr)
library(rmarkdown)
BiocStyle::markdown()

opts_chunk$set(fig.width=14, fig.height=10, cache=TRUE, 
error=FALSE, message = FALSE)
options(digits = 4, scipen = 10, stringsAsFactors = FALSE, width=100)
#options("citation_format" = "pandoc")
```


We first set global chunk options and load the
packages that are necessary  to run the vignette. 

```{r setup, cache = FALSE}
library(DChIPRep)
library(ggplot2)
library(DESeq2)
```


# Introduction to the DChIPRep package

The package implements the analysis strategy of  [@Chabbert_2015]. 
Along with an experimental protocol to multiplex ChIP--Seq experiments, 
[@Chabbert_2015] developped a methodology to assess differences 
between chromatin modification profiles in replicated ChIP--Seq studies 
that builds on the packages `r Biocpkg("DESeq2")` and 
`r CRANpkg("fdrtool")`. The package also includes a preprocessing script in Python
that allows the user to import sam files into data structures than can be used 
with the  package.

# Introduction to the pre--processing script

Together with the release of the R package *DChIPRep*, we provide a python framework 
that should allow users to generate count tables. These may then be directly 
used to evaluate metagene enrichment profiles. This framework is 
entirely contained in a script that only requires the installation 
of Python 2.7 and of the packages HTSeq and Numpy, 
which are free to download. This section describes briefly the various 
possibilities of analysis offered by 
this tool together with the various input options and parameters. 
Additional information may be found directly 
in the source code or via the usual help framework offered by Python.

The script comes with the package and can be located on your system by
the following commands: 

```{r pythonScripts ,cache=FALSE}

pythonScriptsDir <- system.file( "exec" , package = "DChIPRep" ) 
pythonScriptsDir
list.files(pythonScriptsDir)

```

## General principle

This script will process alignments of paired-end reads, filter out 
divergent pairs and low quality alignments and ultimately 
identify potential PCR duplicates. For the pairs of read retained after filtering, 
the center of the genomic loci determined by each pair is estimated using mapping 
coordinates. This center constitutes an approximation of the center of each observed 
nucleosome. The provided annotation file is then used to estimate the nucleosome 
counts around the features of interest. These counts are ultimately used to provide 
a count table in which each row corresponds to one feature and each column 
to a genomic position around the feature start. This table may then be imported 
in R using the **importData** function of the *DChIPRep* package.

## Quick start

Here we show how to quickly generate a count table from an alignment file. 
The *DChIPRep.py* script only requires 4 arguments (all other options have a default setting, 
cf below):

- An alignment file in the SAM format (some HTSeq functions are not currently 
supported for BAM files) that may be zipped. It is specified by the '-i' option on the command line.

- An annotation file in the gff format (specifications for the gff format may be found [here](https://www.sanger.ac.uk/resources/software/gff/spec.html#t_1)) specified in the '-a' option

- A tabulated file containing the names of each of the considered chromosomes and the size:
  + chrI  1304563
  + chrII 6536634
  + ...
  
  +  It is crucial to ensure that the chromosome names used in the alignment file and 
    this information file are identical. This information is used by the 
    program to pre--allocate memory to store the counts. A valid path to this 
    text file should be specified under the '-g' option.

- A valid path for an output file specified under the option '-o'. 
A new file will be created automatically - 
if it already exists, its content will be erased and replaced by the script output.

Here is an example of the command line to be run

```
python DChIPRep.py -i alignment.sam -a my_gff.gff -g chromosome_sizes.txt -o my_count_table.txt
```


## Advanced options

In order to provide greater flexibility in data pre-processing, 
we have added a panel of additional options that may be specified 
when running the script. The script is executable, With the *--help* option,
one can get an overview of the available options. They are given 
below.

```{r engine = 'bash', echo = FALSE, eval= FALSE}
./../exec/DChIPRep.py --help 
```

```
usage: DChIPRep.py [-h] -i SAM/BAM -a GFF -g Chromosome Sizes File -o Count Table [-v] [-q TH_QC] [-l TH_LOW] [-L TH_HIGH] [-d TH_PCR] [-f FEATURETYPE]
                   [-w DOWNSTREAM] [-u UPSTREAM]

optional arguments:
  -h, --help            show this help message and exit
  -i SAM/BAM, --input SAM/BAM
                        The alignment file to be used for to generate the count table. The file may be in the sam (zipped or not)format. The extension of
                        the file should contain either the '.sam' indication. Bam files are not supported at the moment due to soome instability in the BAM
                        reader regarding certain aligner formats.
  -a GFF, --annotation GFF
                        The annotation file that will be used to generate the counts. The file should be in the gff format (see
                        https://www.sanger.ac.uk/resources/software/gff/spec.html for details).
  -g Chromosome Sizes File, --genome_details Chromosome Sizes File
                        A tabulated file containing the names and sizes of each chromosome. !!! The the chromosome names should be identical to the ones
                        used to generate the alignment file !!! The file should look like this example (no header): chromI 1304563 chromII 6536634 ...
  -o Count Table, --output_file Count Table
                        The output file where the count table should be stored. If the specified file does not already exist it will be created
                        automatically. Otherwise, it will be overwritten
  -v, --verbose         When specified, the option will result in the printing of informations on the alignments and the filtering steps (number of
                        ambiguous alignments...etc). Default: OFF
  -q TH_QC, --quality_threshold TH_QC
                        The quality threshold below which alignments will be discarded. The alignment quality index typically ranges between 1 and 41.
                        Default: 30.
  -l TH_LOW, --lowest_size TH_LOW
                        The lowest possible size accepted for DNA fragments. Any pair of reads with an insert size below that value will be discarded.
                        Default: 130.
  -L TH_HIGH, --longest_size TH_HIGH
                        The longest possible size accepted for DNA fragments. Any pair of reads with an insert size above this value will be discarded.
                        Default: 180.
  -d TH_PCR, --duplicate_filter TH_PCR
                        The number of estimated PCR duplicates for accepted for a given genomic location. Default: 1.
  -f FEATURETYPE, --feature_type FEATURETYPE
                        The feature types to be used when generating the count table. The feature type will be matched 3rd column of the GFF file. Default:
                        'Transcript
  -w DOWNSTREAM, --downstream_window DOWNSTREAM
                        The window size used to obtain the counts downstream of the TSS. Default: 1000bp
  -u UPSTREAM, --upstream_window UPSTREAM
                        The window size used to obtain the counts upstream of the TSS. Default: 1500bp

```


## Example

As an example, let us assume that we want to process an alignment file with the following criteria:

- Remove reads with an alignment quality below 20

- Focus on the pairs with a spanning size ranging between 120 and 160 bp

- Get the counts information around the TSS of the coding sequences (CDS) in a symmetrical window of 500bp

- Only tolerate two copies on the same molecules upstream

The command line would then be:

```
python DChIPRep.py -i alignment.sam -a my_gff.gff 
-g chromosome_sizes.txt -o my_count_table.txt -v -q 20 -l 120 
-L 160 -d 2 -f CDS -u 500 -w 500
```

### A note on the count table dimensions

The default number of upstream position is 1500bp, the default number of downstream 
positions is 1000bp. This results in count tables with 2502 columns in total, 
corresponding to the basepairs up-- and downstream as well as one column for  
for the 0 position and one column for the feature IDs.

# Importing the data into R

In what follows, we assume that we count nucleosomes around transcription start
sites (TSS), that is the start of transcripts, which is also the default option
in the preprocessing script.

## Importing the output from the Python script

After the  data has been preproccesed, we first need an annotation table for our
samples that looks like this:

```{r sampleTable, markup = 'asis', message = TRUE}
data(exampleSampleTable)
exampleSampleTable
```

It gives the names for the input count files in the columns **ChIP** and **Input**
respectively and the  number of base pairs used for analysis upstream 
and downstream of the TSS  in the columns **upstream** and **downstream**.
Note that the number of upstream and downstream positions needs to be the same
for all samples.

The input files must be tab separated files with genomic features in
the rows and the positions around the TSS in the columns. In addition to the position
columns, a column containing a feature ID must be present. 

As mentioned above, the tables have upstream + downstream + 2 columns in total,
the two extra columns correspond to the center of the profile (e.g. a TSS) and 
a column containing the feature IDs.

The **sampleID** column contains unique sample IDs.

We can then import the data as follows using the function **importData** which needs
the sample annotation  table, a `data.frame` and the directory where the raw data files
are stored.
We use the down--sampled raw data that come with the package to illustrate 
the function use.

By default the feature ID column is assumed to have the name **name**, 
however this can specified in the call to  **importData** via the **ID**
argument.

```{r data_import_R_Py, dependson="sampleTable"}
directory <- file.path(system.file("extdata", package="DChIPRep"))
importedData <- importData(exampleSampleTable, directory)
```

The imported data is a DChIPRepResults object that contains the data as a DESeqDataSet.
It can be accessed via the function **DESeq2Data**. The DESeqDataSet also contains
normalization factors that are equal to the counts from the chromatin 
inputs, after being multiplied by the coverage ratio between the input and the ChIP data sets.

```{r show_imported_data, dependson="data_import_R_Py"}
importedData
DESeq2Data(importedData) 
head(normalizationFactors(DESeq2Data(importedData)))
```

## Importing count matrices

If your data is already available within R, you can import it directly via 
**importDataFromMatrices** from an input and a ChIP Matrix. The rows of the matrices
should correspond to the positions and the columns to the samples. You furthermore need
a sample table as described above, however the columns **Input** and **ChIP** are not
needed.

If you have data that is still on the level of the individual features, you
can use the helper function **summarizeCountsPerPosition** to summarize the counts
for individual features per position.

The package comes with example data sets that correspond to the example sample table
shown above.

We first show 10 random rows from the two data matrices and then use the function
**importDataFromMatrices** to import them to a **DChIPRepResults** object.

As you can see the rows should be named according to the positions up-- and downstream
of the TSS that they contain and the columns should be named after the samples.

```{r inspect_example_data, dependson="sampleTable"}

data(exampleInputData)
data(exampleChipData)

exampleSampleTable

exampleInputData[sample(nrow(exampleInputData), 10), ]
exampleChipData[sample(nrow(exampleChipData), 10), ]

imDataFromMatrices <- importDataFromMatrices(inputData = exampleInputData, 
                                              chipData = exampleChipData, 
                                              sampleTable = exampleSampleTable)
  
```

The imported data is again a **DChIPRepResults** object that contains 
the data as a **DESeqDataSet**.


# Importing .bam files directly using `importData_soGGi`

It is also possible to use the function `importData_soGGi`, which is 
based on the function `regionPlot` from the package  `r Biocpkg("soGGi") `
to import data from .bam files directly.

It will return a matrix with one column per .bam file and the respective counts per
postion in the rows. 

In the example below, we use a subsampled .bam file (0.1 \% of the reads)
from the Galonska et. al. WCE (whole cell extract) H3Kme3 data and associated
TSS near identified peaks. For additional details on the data, see the help pages
on `input_galonska` and `TSS_galonska`. The code below is not evaluated,
since it takes some time to compute.


```{r import_soGGi, eval = FALSE}

data(sample_table_galonska)
data(TSS_galonska)

bam_dir <- file.path(system.file("extdata", package="DChIPRep"))
wce_bam <- "subsampled_0001_pc_SRR2144628_WCE_bowtie2_mapped-only_XS-filt_no-dups.bam"
mat_wce <- importData_soGGi(bam_paths = file.path(bam_dir, wce_bam),
                         TSS = TSS_galonska,
                         fragment_lengths = sample_table_galonska$input_fragment_length[1],
                         sample_ids =  sample_table_galonska$input[1],
                      paired = FALSE,
                        removeDup=FALSE
)

head(mat_wce)

```


# Perform the tests

After the data import, we are ready to perform the tests for differential enrichment. 
The tests are  performed position--wise and wrap   `r Biocpkg("DESeq2") ` 
and `r CRANpkg("fdrtool") `. Briefly the DChIPRep 
testing workflow is as follows:


1. The function **estimateDispersions** from `r Biocpkg("DESeq2") ` is 
called and the dispersions are estimated.

2. Then the position--wise tests to compare the experimental conditions are performed. 
A minimum fold change is used for the null hypothesis, the default value used is 0.05 
on a log2 scale.

  A possible strategy to infer this threshold from the data is 
  to look a  the average fold--change between technical
  replicates. 

3. The p--values are then passed to `r CRANpkg("fdrtool")` and the local FDR values
are computed.


```{r perform_Tests, dependson="inspect_example_data"}
imDataFromMatrices  <- runTesting(imDataFromMatrices, plotFDR = TRUE)
```

The results can now be accessed via the function **resultsDChIPRep**. 

```{r accessResults, dependson="perform_Tests"}

res <- resultsDChIPRep(imDataFromMatrices)
head(res)

table( res$lfdr < 0.2)
```


At an lfdr of 0.2 we identify `r table( res$lfdr < 0.2)[2] ` significant positions.

# Plots implemented in the package

## Plot the significant positions

We can first of all plot the TSS profiles by coloring the individual points by
significance.

Points corresponding to significant positions are colored black in both of the conditions. 
The replicate--samples are sumarized by using a positionwise robust Huber estimator 
for the mean [@Hampel_2011]. 

The function returns the plot as a ggplot2 object that can be modified afterwards.

```{r plot_Sig, dependson="accessResults"}

sigPlot <- plotSignificance(imDataFromMatrices)
sigPlot
```

This plot is similar to Figure S17B of [@Chabbert_2015]. We see an 
enrichment for significant position near the end of the downstream window considered. 

## Produce TSS plots 

We can produce the typical, smoothed plots of the TSS profiles as well. Here we use again
the smoothed Huber estimator for the mean to compute a summary per experimental 
group.

```{r plot_TSS, dependson="accessResults"}
profilePlot <- plotProfiles(imDataFromMatrices)
profilePlot
```

This plot is similar to Figure 5B of [@Chabbert_2015].

# Session information

```{r, cache=FALSE}
sessionInfo()
```


# References