Bioconductor Newsletter
posted by Valerie Obenchain, January 2016
The Bioconductor newsletter is a quarterly review of core infrastructure developments, community projects and future directions. We aim for topics of general interest as well as those with the greatest impact on the software.
In this issue we follow development of the InteractionSet class and review
tips for managing package repositories.  Mike Love talks about constructing
design matrices for gene expression experiments and Jim MacDonald takes us on
tour of Bioconductor annotation packages.
Contents
- Design matrices for differential gene expression
- Getting started with Bioconductor annotation packages
- Reproducible Research
- Infrastructure
- Project Activities
- Acknowledgements
Design matrices for differential gene expression
Mike Love is a postdoc in Rafael Irizarry’s lab in the Department of Biostatistics and Computational Biology at the Dana Farber Cancer Institute and Harvard T.H. Chan School of Public Health. He develops quantitative methods for genomics and epigenetics, teaches edX courses and occasionally tweets about biostatistics and R. Many know him as the author and primary supporter of the very popular DESeq2 package for differential gene expression of RNASeq data.
A visit to the support site shows the number of questions he answers on a daily basis related not only to the DESeq2 package but about gene expression analysis in general.
Of the many DESeq2-related posts on the support site, creating an appropriate design matrix is a regular one and appears to cause a fair bit of confusion. Below Mike shares some of his observations and thoughts about what key concepts cause the most problems.
A little background about ‘experimental design’
Experimental design refers to the inter-relationships between samples, including the biological and experiment information (clone 1, treatment B, etc.) and technical information (the batch in which the sample was prepared and processed). Getting the experimental design correct – this happens before the experiment takes place – is very important as we will see below, because the wrong experimental design can lead to uninterpretable data.
It is best practice to keep track of the experimental design (including preparation batches) in a table, either a CSV or TSV file, or an Excel spreadsheet, which can be exported to CSV when it comes time to do bioinformatic analysis. This is the easiest way to explain the experimental design to someone for either a planned experiment or an experiment that has already taken place.
Design matrix or model matrix is a matrix, typically represented in statistics by an X, that will be used in statistical modeling. Every row in the design matrix describes a sample, and every column provides pieces of information about that sample, such as, whether the sample was treated, whether the sample was in batch 1, 2, or 3, etc. For every column of the design matrix, the model has a matching coefficient, usually denoted by β’s, to describe differences in expression across samples. These coefficients are additive differences on the log scale, so multiplicative differences (fold changes) in RNA-seq counts or microarray expression values, hence they are log fold changes. These coefficients are then estimated using the experimental data.
A single experimental design does not imply a single design matrix, but there are often many choices involved. For example, one could include a coefficient for technical batches (typically a good idea) or not, which would give a design matrix with an addition column. The design matrix encodes assumptions the investigator wants to make regarding the samples, and is formed with respect to the biological question of interest. During significance testing, the biological question is phrased as a null hypothesis, that one or more of the coefficients are equal to zero, resulting in a p value. The p value is a meaningful estimate of a probability only if the assumptions are reasonable and the model is well specified for the data.
I should say, I learned about these topics both from textbooks (John Rice’s Mathematical Statistics and Data Analysis, Sanford Weisberg’s Applied Linear Regression, and Bioconductor books), as well as from reading lots of posts on the Bioconductor support forum from people like Wolfgang Huber, Gordon Smyth, Simon Anders, James MacDonald, Aaron Lun and others.
Case vs control
Simple designs don’t seem to pose much issue. For example, control and treated
samples, or control, treatment 1 and treatment 2. These are easily modeled
using R’s built-in formula and model.matrix functions, and then input to
limma,
edgeR,
or other Bioconductor packages.
DESeq2
directly takes formula expressions and converts to design matrices internally.
Confounding and batch effects
Sometimes, quantitative/computational problems arise in the form of error messages which indicate inherent problems in the experimental design. One of these is when comparisons of interest are confounded with technical factors, such as the sample preparation batch. There is the canonical case of confounding when control and treatment samples are prepared in their own batches, but also common are cases of bad experimental design such as:
| condition | batch | 
|---|---|
| control | 1 | 
| control | 1 | 
| treat. A | 1 | 
| treat. A | 1 | 
| treat. B | 2 | 
| treat. B | 2 | 
| treat. C | 2 | 
| treat. C | 2 | 
While treatment A can be compared against control, and treatment C can be compared against treatment B, no comparisons can be made across the batches. The reason some comparisons cannot be made is that the difference in gene expression due to, for example the effect of treatment B compared to control, cannot separated from the differences which could arise between different sample preparation batches. The most effective solution here is to use a block design, where the batches each include all of the possible conditions. At the least, control samples should be included in each batch, so that the batch effect can be estimated using these samples.
These two links explain why batch effects pose a big problem for high-throughput experiments (or any experiments):
- http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/
- http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html
Blocking, interactions and nested designs
Block experimental designs, and others, such as those where the significance of interactions between conditions is tested, or nested interactions, can be read about in the excellent limma User’s Guide, in the section on Single-Channel Experimental Designs
The Guide describes in detail how the design matrix can be formulated in different ways to answer the same question and explains how the different parameterizations affect interpretation of the results. The approaches recommended by the limma authors can be applied to other Bioconductor packages as well.
Advanced designs
Then there are some very complicated designs with many technical and biological factors, where the investigator has many comparisons to make and not a solid sense how to make them. In these cases I highly recommend, for people who find themselves not knowing what design to use or how to interpret the coefficients, that they consider partnering with a local statistician or someone with a background in linear modeling or quantitative analysis.
Interpreting quantitative analyses is hard stuff, and while Bioconductor simplifies the analysis of high-throughput assays to a large degree, it’s not necessarily reasonable to expect that complicated results can be compiled or interpreted by someone without a quantitative background. I think it’s safer and more reasonable to find a collaborator who can assist, and such collaborators can help identify issues with experimental design if they are included on projects from the outset.
Getting started with Bioconductor annotation packages
Jim MacDonald is a biostatistician at the University of Washington Department of Environmental and Occupational Health Sciences. He has analyzed the gamut of HTS data from expression (microarray, RNA-Seq), to genomics (SNP arrays, DNA-Seq, ChIP-Seq, methylation arrays, BS-Seq) and other ‘omics’ data. He has been heavily involved in the direction of the Bioconductor project since inception and contributes and maintains a large number of annotation packages.
During the October 2015 release we were short-handed after losing staff to the Buffalo move. Jim stepped in and took responsibility for building all internal Bioconductor annotation packages. Jim’s comprehensive understanding of the annotation world is evident in his numerous posts on the support site. In this section we’ve teamed up (90% Jim, 10% Val) to give an overview of key packages and how they can be used to answer some common analysis questions.
The primary packages
This section highlights the most heavily used Bioconductor annotation packages.
- 
    OrgDb:The OrgDb packages encapsulate all the information we know about a given organism’s genes as of a given date, except for location information. This includes GO terms and ontology, Entrez IDs, RefSeq ID, Ensembl IDs and many others. Most of these data are updated by the annotation services on a regular basis; our semi-annual releases ‘freeze’ the data as of the release date. This is a tradeoff we make between being completely updated at all times and being able to generate reproducible research, based on a static set of annotations. 
- 
    ChipDb:The ChipDbpackages are lightweight packages that contain a single mapping: probe ID to Entrez gene ID. These packages work in concert with theOrgDbpackage for the same species to provide mappings from the probe ID to all other annotation data, using the internal probe ID to Entrez Gene mapping as a starting point.
- 
    TxDb:TxDbpackages contain location information of transcripts, genes, exons and other gene-related features for a specific organism based on a given build of the genome.
- 
    BSgenome:These packages contain full genome sequences for a specific organism based on a given build of the genome. 
- 
    SNPlocs:SNP locations and alleles for a specific organism extracted from a particular dbSNP build which is based on genome build. 
- 
    AnnotationHub:This package provides an interface to browse and download a wide collection of annotation packages and individual resources. Much of the data are pre-parsed into R / Bioconductor objects. 
- 
    OrganismDb:The OrganismDbpackages encapsulate multiple annotation packages in a single wrapper to enable inter-package queries. The encapsulated packages are theGO.dbpackage, which provides mappings to Gene Ontology data, as well as anOrgDbandTxDbpackage for a particular species. Examples include theHomo.sapiensandMus.musculuspackages.
It’s worth noting that some annotation packages are tied to specific genome
builds and others are not. The TxDb family contain the location of
genes/transcripts/exons/etc. based on a given build. BSgenome and SNPlocs
are other examples of build-specific packages. Because genome assembly requires
piecing together the structure of the whole genome it follows that new builds
are only released every few years.  Data in the build-specific annotation
packages can be quite stable and stay current for years.
Other packages have nothing to do with where a gene is found and are therefore
not related to a genome build, e.g., the OrgDb family. These packages can be
thought of as encapsulating all information we have about the genes of a given
organism on a given date, knowing that it can become obsolete, at least
in part, the very next week. They contain such information as RefSeq, GenBank,
or UniGene IDs which represent provisional transcripts. These are a work in
progress and are constantly being updated and modified based on public submissions.
Bioconductor updates these packages every 6 months at release time. The
AnnotationForge package offers functions to build your own OrgDb
(or other package) if you want something more current.
The OrganismDb packages contain a combination of build-specific and
non-build-specific packages, which may not be correct for your use
case. However, it is simple to switch the TxDb package for a more
appropriate version, using the TxDb<- function.
Common tasks
Before considering particular tasks, we should first cover the
question of how to determine what input (keys) and output
(columns) are available for a particular annotation package. The
keytypes function will return all types of annotation that can be
used as input. As an example, let’s use the org.Hs.eg.db package.
> library(org.Hs.eg.db)
> keytypes(org.Hs.eg.db)
[1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
[6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"
[11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"
[26] "UNIPROT"
We can list all the available keys for a given keytype using the
keys function.
> head(keys(org.Hs.eg.db))
[1] "1"  "2"  "3"  "9"  "10" "11"
> head(keys(org.Hs.eg.db, "ENSEMBLPROT"))
[1] "ENSP00000263100" "ENSP00000470909" "ENSP00000323929" "ENSP00000438599"
[5] "ENSP00000445717" "ENSP00000385710"
And we can get all the available columns, or annotation data that we
can map our keys to.
> columns(org.Hs.eg.db)
[1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
[6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"
[11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"
[26] "UNIPROT"
Map manufacturer IDs to gene symbol
One common task is to annotate a microarray experiment by mapping the manufacturer’s IDs to something more general, such as a HUGO gene symbol, or an NCBI (Gene, GenBank, RefSeq, UniGene) or Ensembl (Ensembl gene, Ensembl transcript) ID. As an example, we can map an Affymetrix ID from the Human Gene 1.0 ST array to the corresponding HUGO symbol.
> library(hugene10sttranscriptcluster.db)
> hugene <- hugene10sttranscriptcluster.db ## minimize typing
> select(hugene, "8012257", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
 PROBEID SYMBOL
1 8012257   TP53
This is a very simple example, and probably not that useful, except
as a quick interactive query. Note that we did not specify the
keytype argument. The default keytype argument for select is the
central ID for the package being used. In this case, that is the
PROBEID, and since we are using probeset IDs as keys, it is not
necessary to specify the keytype.
A more common use case is to annotate a vector of keys, returning
one or more output IDs (or columns). As an example, we will use just
five keys from the hugene10sttranscriptcluster.db package, and query
for the HUGO symbol and Entrez Gene IDs.
> ids <- keys(hugene)[15000:15005]
> ids
[1] "8005171" "8005191" "8005200" "8005202" "8005204"
> annot <- c("SYMBOL","ENTREZID")
> select(hugene, ids, annot)
'select()' returned 1:many mapping between keys and columns
   PROBEID       SYMBOL  ENTREZID
1  8005171        TRPV2     51393
2  8005191  LRRC75A-AS1    125144
3  8005191     SNORD49A     26800
4  8005191     SNORD49B    692087
5  8005191      SNORD65    692106
6  8005200  LRRC75A-AS1    125144
7  8005200     SNORD49A     26800
8  8005200     SNORD49B    692087
9  8005200      SNORD65    692106
10 8005202     SNORD49A     26800
11 8005202  LRRC75A-AS1    125144
12 8005202     SNORD49B    692087
13 8005202      SNORD65    692106
14 8005204     CCDC144A      9720
15 8005204    CCDC144CP    348254
16 8005204     CCDC144B    284047
17 8005204    CCDC144NL    339184
18 8005204 LOC101929141 101929141
19 8005221         <NA>      <NA>
Please note three things about the above results. First, the PROBEID
column in the returned data.frame has the same order as the input
ids. Second, some of the Affymetrix IDs map to more than one
gene. All of the mappings are returned, with a message that there
was a 1:many mapping for some of the keys. Because of the 1:many
mappings, the dimensions of the returned data.frame do not match the
dimensions of the data we would like to annotate (e.g., we wanted
information for five IDs, and got 19 rows of data returned). Third, if
one of the keys has no annotation (the last one), an NA value is
returned.
If we want to guarantee that the returned data are in the same order
and are the same length as the input keys vector, we can use
mapIds instead. However, mapIds can only do one keytype at a
time, and returns a vector or list rather than a data.frame. Unlike
select, which has a default value for the keytype, mapIds requires
a fourth argument, specifying the keytype of the keys we are
using.
> mapIds(hugene, ids, "SYMBOL", "PROBEID")
  8005171       8005191       8005200       8005202       8005204
  "TRPV2" "LRRC75A-AS1" "LRRC75A-AS1"    "SNORD49A"    "CCDC144A"
  8005221
  NA
We can easily wrap this in a small script to return a data.frame
with just one row per key.
> d.f <- as.data.frame(lapply(annot, function(x) mapIds(hugene, ids, x, "PROBEID")))
> names(d.f) <- annot
> d.f
	         SYMBOL ENTREZID
8005171       TRPV2    51393
8005191 LRRC75A-AS1   125144
8005200 LRRC75A-AS1   125144
8005202    SNORD49A    26800
8005204    CCDC144A     9720
8005221        <NA>     <NA>
The default for mapIds is to take the first instance for any 1:many
mappings. This is fine for some use cases (e.g., a RefSeq ID), but is
less useful in other situations (e.g., GO IDs), where we want all
values returned. We can use the multiVals argument to control what
is returned. Please note that this argument comes after an ellipsis
(...) argument, so you cannot use positional arguments, and must
instead specify the multiVals argument directly.
> mapIds(hugene, ids, "SYMBOL", "PROBEID", multiVals = "list")
$`8005171`
[1] "TRPV2"
$`8005191`
[1] "LRRC75A-AS1" "SNORD49A"    "SNORD49B"    "SNORD65"
$`8005200`
[1] "LRRC75A-AS1" "SNORD49A"    "SNORD49B"    "SNORD65"
$`8005202`
[1] "SNORD49A"    "LRRC75A-AS1" "SNORD49B"    "SNORD65"
$`8005204`
[1] "CCDC144A"     "CCDC144CP"    "CCDC144B"     "CCDC144NL"    "LOC101929141"
$`8005221`
[1] NA
If we want to have a rectangular format for our annotation, where we
keep all the 1:many mappings while ensuring that each row maps
directly to our array of expression values, we can use a DataFrame
instead, telling mapIds to return a CharacterList.
> lst <- lapply(annot, function(x)
mapIds(hugene, ids, x, "PROBEID", multiVals = "CharacterList")
> d.f <- as(lst, "DataFrame")
> names(d.f) <- annot
> d.f
DataFrame with 6 rows and 2 columns
                               SYMBOL                ENTREZID
                      <CharacterList>         <CharacterList>
8005171                             TRPV2                   51393
8005191 LRRC75A-AS1,SNORD49A,SNORD49B,... 125144,26800,692087,...
8005200 LRRC75A-AS1,SNORD49A,SNORD49B,... 125144,26800,692087,...
8005202 SNORD49A,LRRC75A-AS1,SNORD49B,... 26800,125144,692087,...
8005204   CCDC144A,CCDC144CP,CCDC144B,...  9720,348254,284047,...
8005221                                NA                      NA
Map Entrez gene ID to TRPV2 chromosomal location
Given the above data, perhaps we are interested in TRPV2, and want to
know its chromosomal location. We can use the Homo.sapiens package
to get that information. While it is possible to use the HUGO symbol
for this gene to get the location, it is a better idea to use the
Entrez Gene ID, which is more likely to be unique.
> select(Homo.sapiens, "51393", c("TXCHROM","TXSTART","TXEND"), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
   ENTREZID TXCHROM  TXSTART    TXEND
1  51393     chr17 16318856 16340317
This just tells us the start and stop positions for the transcript. If we want exonic locations, we can get those as well.
> select(Homo.sapiens, "TRPV2", c("EXONCHROM","EXONSTART","EXONEND"), "SYMBOL")
'select()' returned 1:many mapping between keys and columns
    SYMBOL EXONCHROM EXONSTART  EXONEND
1   TRPV2     chr17  16318856 16319147
2   TRPV2     chr17  16320876 16321182
3   TRPV2     chr17  16323429 16323562
4   TRPV2     chr17  16325913 16326203
5   TRPV2     chr17  16326783 16327081
6   TRPV2     chr17  16329413 16329583
7   TRPV2     chr17  16330036 16330191
8   TRPV2     chr17  16330763 16330861
9   TRPV2     chr17  16331631 16331701
10  TRPV2     chr17  16332131 16332296
11  TRPV2     chr17  16335098 16335164
12  TRPV2     chr17  16335280 16335614
13  TRPV2     chr17  16336888 16337012
14  TRPV2     chr17  16338204 16338283
15  TRPV2     chr17  16340103 16340317
16  TRPV2     chr17  16325913 16326207
17  TRPV2     chr17  16330177 16330191
18  TRPV2     chr17  16330766 16330861
While this is useful for a single gene, it can get unwieldy for large
numbers of genes. We can instead use the transcriptsBy or exonsBy
functions with the TxDb.Hsapiens.UCSC.hg19.knownGene package, to get
information about all genes at once, and subset to those we care
about.
> trscpts <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> trscpts[["51393"]]
GRanges object with 2 ranges and 2 metadata columns:
    seqnames               ranges strand |     tx_id     tx_name
       <Rle>            <IRanges>  <Rle> | <integer> <character>
[1]    chr17 [16318856, 16340317]      + |     60527  uc002gpy.3
[2]    chr17 [16318856, 16340317]      + |     60528  uc002gpz.4
> exns <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> exns[["51393"]]
GRanges object with 18 ranges and 2 metadata columns:
     seqnames               ranges strand   |   exon_id   exon_name
        <Rle>            <IRanges>  <Rle>   | <integer> <character>
 [1]    chr17 [16318856, 16319147]      +   |    217024        <NA>
 [2]    chr17 [16320876, 16321182]      +   |    217025        <NA>
 [3]    chr17 [16323429, 16323562]      +   |    217026        <NA>
 [4]    chr17 [16325913, 16326203]      +   |    217027        <NA>
 [5]    chr17 [16325913, 16326207]      +   |    217028        <NA>
...      ...                  ...    ... ...       ...         ...
[14]    chr17 [16335098, 16335164]      +   |    217037        <NA>
[15]    chr17 [16335280, 16335614]      +   |    217038        <NA>
[16]    chr17 [16336888, 16337012]      +   |    217039        <NA>
[17]    chr17 [16338204, 16338283]      +   |    217040        <NA>
[18]    chr17 [16340103, 16340317]      +   |    217041        <NA>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Using *Ranges objects is beyond the scope of this newsletter, so we
won’t explore them further. For more information, please see the
IRanges
vignette, as well as the
GRanges
vignettes.
Reproducible Research
Managing package versions with biocLite()
Bioconductor follows a biannual schedule with one release in Spring and one in Fall. R has a single major release per year, usually in the Spring. Because each Bioconductor release is tied to a version of R this asymmetrical schedule creates some confusion.
When releases coincide in the Spring, the development branches of both R and Bioconductor become release branches. For the next 6 months, packages in both the Bioconductor ‘devel’ and release branch are built against the ‘release’ version of R.
In Fall, Bioconductor has a release but R does not. The Bioconductor ‘devel’ branch becomes the current ‘release’ and uses the release version of R. The new Bioconductor ‘devel’ branch uses the ‘devel’ version of R. The purpose of building Bioconductor ‘devel’ against R devel is to allow for a smooth transition in Fall, specifically, it allows the Bioconductor ‘release’ branch to always be in sync with the R ‘release’ branch.
The BiocInstaller package has several functions to help manage clean ‘release’ and ‘devel’ package repositories. Below are a few troubleshooting tips for common install and version mis-match problems.
- 
    Confirm a single writeable installation path. Make sure only a single installation directory reported by .libPaths()can be ‘written to’ by an ordinary (i.e., not administrator) user. If multiple paths are reported, remove one.
- 
    Check the version of BiocInstaller:packageVersion("BiocInstaller")reports the version of the BiocInstaller package in use. The ‘correct’ version will depend on whether you are using the ‘devel’ or ‘release’ branch of Bioconductor. You can check the current version ofBiocInstalleron the release and devel landing pages.If you have the wrong package version (or multiple versions) installed, remove them with repeated calls to remove.packages("BiocInstaller")untilRsays there is no package to remove. RestartR, verify there is no BiocInstaller package and install the correct version withsource("https://bioconductor.org/biocLite.R")Invoking biocLite()with no arguments will update all packages. When asked whether to update old packages, choose ‘a’ for ‘all’.biocLite()
- 
    Identify mis-matched package versions with biocValid():Use biocValid()to identify version mis-matches between packages:BiocInstaller::biocValid()Resolve by calling remove.packages()on the offending package, confirm the correct version of BiocInstaller and reinstall withbiocLite().
- 
    Upgrade to the most recent Bioconductor for a version of R: When release and devel versions of Bioconductor are both built against the release version of R (i.e., after the Spring Bioconductor release). You can upgrade to the most current Bioconductor (devel) with BiocInstaller::biocLite("BiocUpgrade")This installs the most recent Bioconductor packages without having to reinstall R. 
More information on keeping your versions in sync can be found at the Why use biocLite()? section of the web site.
Infrastructure
InteractionSet package
Aaron Lun, Liz Ing-Simmons and Malcolm Perry have been working on an InteractionSet package to store and manipulate data from ChIA-PET and Hi-C experiments.
ChIA-PET stands for Chromatin Interaction Analysis with Paired-End Tags. These experiments probe for genome-wide interactions brought about or associated with some protein of interest. An essential step in this technology that differentiates it from Hi-C is the antibody-driven immunoprecipitation step to enrich for chromatin bound by a specific protein. Chromatin interactions can only be determined for parts of the genome that have a binding site for the protein of interest. Interaction networks can be elucidated for transcription factors, insulator proteins or transcription machinery. A ChIA-PET experiment gives information about the potential role of proteins in structuring 3D genome organization.
The Hi-C method provides information about 3D genome structure by identifying long range chromatin interactions on a genome-wide scale. These data are often used to study aspects of genome architecture such as chromosome territories, topological domains, open/closed compartments and chromatin structure.
Data from both technologies enable the study of physical interactions between pairs of genomic regions. The InteractionSet package provides classes to represent these interactions and store associated experimental data. The aim is to provide package developers with stable class definitions that can be manipulated through a large set of methods.
The package defines the following classes:
- GInteractions: represents pairwise interactions between genomic regions
- InteractionSet: contains experimental data relevant to each interaction
- ContactMatrix: stores a data matrix where each row and column represent a genomic region.
The classes have methods for sorting and duplicate detection; for performing
one- or two-dimensional overlaps; for calculating distances between interacting
loci; and for calculating the minimum bounding box for groups of interactions.
Methods are also available to convert between classes, or to standard
Bioconductor objects like a RangedSummarizedExperiment or GRangesList.
New functions in R / Bioconductor
New functions added to R (3.3) and Bioconductor (3.3) this quarter:
- 
    SummarizedExperiment::readKallisto() Reads kallisto data into a SummarizedExperiment. (contributed by Martin Morgan) 
- 
    GenomicFeatures::mapToIds() and GenomicFeatures::mapToRanges() Maps between genomic identifiers (gene names, symbols etc.) and the genomic ranges they represent. (contributed by Jim Hester) 
Project Activities
Recognizing community contributions
At the European Bioconductor Developers meeting last December, a prize was awarded to recognize individual(s) contributing to the Bioconductor support site forum.
The prize was sponsored by F1000 research which recently launched a dedicated Bioconductor channel. The terms of the award were ‘greatest contribution to the support site’ and ‘those attending the European developer conference’.
Congratulations to winners Aaron Lun and Michael Love! Each were awarded the prize of waived publication costs for an article appearing in the F1000 Bioconductor channel. Other contributors with substantial posts to their credit are Jim MacDonald, Gordon Smyth, Ryan Thompson and Steve Lianoglou. Thanks to everyone who takes the time to answer questions and share their experience on the support site.
Thanks to Mark Dunning and Laurent Gatto for suggesting the prize (and organizing the conference!) and to Thomas Ingraham and F1000 Research for sponsoring it.
October release
Bioconductor 3.2 was released on October 14, consisting of 1104 software packages, 257 experiment data packages, and 917 annotation packages. There are 80 new software packages.
NOTE This is the last version of Bioconductor to be supported on Snow Leopard. Snow Leopard users should plan to migrate to Mavericks or newer before the next release in Spring 2016.
There are 80 new software packages included in this release. Package summaries and the official release schedule can be found on the web site.
Website traffic
The following compares the number of sessions and new users from the fourth quarter of 2015 (November 1 - December 28) with the fourth quarter of 2014.
The fourth quarter of 2015 saw an increase over 2014 in the number of Sessions (18.05%), Users (11.27%) and Pageviews (12.11%). Decreases were seen in the Average number of pages viewed (-5.39%), Average session duration (-0.54%), Bounce rate (aka single page visits; -3.30%) and Percent of new sessions (-8.86%).
The majority of users are still on desktops (and laptops) but the number of mobile users is steadily increasing. In the fourth quarter 2015, the number of new users increased by 39.8% for mobile devices vs 6.7% for desktops.
| Desktop (includes laptops) | |
| Nov 1, 2015 - Dec 27, 2015 | 71,111 (92.86%) | 
| Nov 1, 2014 - Dec 27, 2014 | 66,622 (93.96%) | 
| % change | 6.74% | 
| Mobile | |
| Nov 1, 2015 - Dec 27, 2015 | 4,230 (5.52%) | 
| Nov 1, 2014 - Dec 27, 2014 | 3,026 (4.27%) | 
| % change | 39.79% | 
| Tablet | |
| Nov 1, 2015 - Dec 27, 2015 | 1,241 (1.62%) | 
| Nov 1, 2014 - Dec 27, 2014 | 1,260 (1.78%) | 
| % change | -1.51% | 
Statistics were generated with Google Analytics.
Package downloads and new submissions
The number of unique IP downloads of software packages for October, November and December of 2015 were 40085, 41499 and 34216, respectively. For the same time period in 2014, numbers were 44593, 32728 and 30622. Numbers must be compared by month (vs quarterly sum) because some IPs are the same between months. See the web site for a full summary of download stats.
A total of 17 software, 1 annotation and 5 experimental data packages were added in the fourth quarter of 2015.
| Bioconductor Version | Software | Annotation | Experimental Data | |||
|---|---|---|---|---|---|---|
| 3.2 | 1104 | 895 | 257 | |||
| 3.3 | 1121 | 896 | 262 | 
Upcoming Events
See the events page for a listing of all courses and conferences.
- BioC 2016 24-26 June, Stanford, USA.
- CSAMA 2016 (14th edition) - Statistics and Computing in Genome Data Science: 10-15 of July in Bressanone-Brixen, Italy.
Acknowledgements
Thanks to Jim MacDonald and Mike Love for contributing sections, Aaron Lun for
proofing the InteractionSet section and the Bioconductor core team for
editorial review.
Send comments or questions to Valerie at valerie.obenchain@roswellpark.org.
