This package represents a manually curated data collection for gene expression meta-analysis of patients with colorectal cancer. This resource provides uniformly prepared microarray data with curated and documented clinical metadata. It allows a computational user to efficiently identify studies and patient subgroups of interest for analysis and to run such analyses immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods, and clinical data formats.
In this vignette, we give a short tour of the package and will show how to use it efficiently.
Loading a single dataset is very easy. First we load the package:
library(curatedCRCData)
To get a listing of all the datasets, use the data function:
data(package="curatedCRCData")
Now to load a single dataset, we use the data function again:
data(TCGA.COAD_eset)
TCGA.COAD_eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 17814 features, 130 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
## varLabels: unique_patient_ID alt_sample_name ...
## uncurated_author_metadata (59 total)
## varMetadata: labelDescription
## featureData
## featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
## fvarLabels: probeset gene
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 22810696
## Annotation: agilent-014850 whole human genome microarray 4x44k g4112f
The datasets are provided as Bioconductor ExpressionSet objects and
we refer to the Bioconductor documentation for users unfamiliar with this data
structure.
For a meta-analysis, we typically want to filter datasets and patients to
get a population of patients we are interested in. We provide a short but
powerful R script that does the filtering and provides the data as a list of
ExpressionSet objects. One can use this script within R by first sourcing a config
file which specifies the filters, like the minimum numbers of patients in each
dataset. It is also possible to filter samples by annotation, for example to remove
early stage and normal samples.
source(system.file("extdata",
"patientselection_all.config",package="curatedCRCData"))
ls()
## [1] "TCGA.COAD_eset" "keep.common.only" "meta.required"
## [4] "min.number.of.events" "min.sample.size" "quantile.cutoff"
## [7] "rescale" "strict.checking"
See what the values of these variables we have loaded are. The
variable names are fairly descriptive, but note that rule.1 is a
character vector of length 2, where the first entry is the name of a
clinical data variable, and the second entry is a Regular Expression
providing a requirement for that variable. Any number of rules can be
added, with increasing identifiers, e.g. rule.2, rule.3, etc.
Here strict.checking is FALSE, meaning that samples not annotated for the variables in these rules are allowed to pass the filter. If strict.checking == TRUE, samples missing this annotation will be removed.
sapply(ls(), get)
## $TCGA.COAD_eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 17814 features, 130 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
## varLabels: unique_patient_ID alt_sample_name ...
## uncurated_author_metadata (59 total)
## varMetadata: labelDescription
## featureData
## featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
## fvarLabels: probeset gene
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 22810696
## Annotation: agilent-014850 whole human genome microarray 4x44k g4112f
##
## $keep.common.only
## [1] FALSE
##
## $meta.required
## NULL
##
## $min.number.of.events
## [1] 0
##
## $min.sample.size
## [1] 1
##
## $quantile.cutoff
## [1] 0
##
## $rescale
## [1] FALSE
##
## $strict.checking
## [1] FALSE
Now that we have defined the sample filter, we create a list of
ExpressionSet objects by sourcing the createEsetList.R file:
source(system.file("extdata", "createEsetList.R", package = "curatedCRCData"))
## 2026-04-21 11:23:03.071297 INFO::Inside script createEsetList.R - inputArgs =
## 2026-04-21 11:23:03.081659 INFO::None provided
## 2026-04-21 11:23:03.120143 INFO::Loading curatedCRCData 2.43.1
## 2026-04-21 11:23:49.381831 INFO::Clean up the esets.
## 2026-04-21 11:23:49.479896 INFO::including GSE11237_eset
## 2026-04-21 11:23:49.511652 INFO::including GSE12225.GPL3676_eset
## 2026-04-21 11:23:49.555642 INFO::including GSE12945_eset
## 2026-04-21 11:23:49.603887 INFO::including GSE13067_eset
## 2026-04-21 11:23:49.705368 INFO::including GSE13294_eset
## 2026-04-21 11:23:50.524375 INFO::including GSE14095_eset
## 2026-04-21 11:23:50.633713 INFO::including GSE14333_eset
## 2026-04-21 11:23:50.702488 INFO::including GSE16125.GPL5175_eset
## 2026-04-21 11:23:50.771919 INFO::including GSE17536_eset
## 2026-04-21 11:23:50.828039 INFO::including GSE17537_eset
## 2026-04-21 11:23:50.906623 INFO::including GSE17538.GPL570_eset
## 2026-04-21 11:23:51.50024 INFO::including GSE18105_eset
## 2026-04-21 11:23:51.600216 INFO::including GSE2109_eset
## 2026-04-21 11:23:51.702346 INFO::including GSE21510_eset
## 2026-04-21 11:23:51.775983 INFO::including GSE21815_eset
## 2026-04-21 11:23:51.844356 INFO::including GSE24549.GPL5175_eset
## 2026-04-21 11:23:51.876413 INFO::including GSE24550.GPL5175_eset
## 2026-04-21 11:23:51.932876 INFO::including GSE2630_eset
## 2026-04-21 11:23:51.962722 INFO::including GSE26682.GPL570_eset
## 2026-04-21 11:23:52.720757 INFO::including GSE26682.GPL96_eset
## 2026-04-21 11:23:52.766761 INFO::including GSE26906_eset
## 2026-04-21 11:23:52.806504 INFO::including GSE27544_eset
## 2026-04-21 11:23:52.851373 INFO::including GSE28702_eset
## 2026-04-21 11:23:52.903906 INFO::including GSE3294_eset
## 2026-04-21 11:23:52.947043 INFO::including GSE33113_eset
## 2026-04-21 11:23:53.062207 INFO::including GSE39582_eset
## 2026-04-21 11:23:53.17594 INFO::including GSE3964_eset
## 2026-04-21 11:23:53.203278 INFO::including GSE4045_eset
## 2026-04-21 11:23:53.255177 INFO::including GSE4526_eset
## 2026-04-21 11:23:53.291909 INFO::including GSE45270_eset
## 2026-04-21 11:23:53.377233 INFO::including TCGA.COAD_eset
## 2026-04-21 11:23:53.428694 INFO::including TCGA.READ_eset
## 2026-04-21 11:23:53.465168 INFO::including TCGA.RNASeqV2.READ_eset
## 2026-04-21 11:23:53.549343 INFO::including TCGA.RNASeqV2_eset
## 2026-04-21 11:23:53.991302 INFO::Ids with missing data: GSE2630_eset, GSE3294_eset, TCGA.COAD_eset, TCGA.READ_eset
It is also possible to run the script from the command line and then load the R data file within R:
R --vanilla "--args patientselection.config crc.eset.rda tmp.log" < createEsetList.R
Now we have 34 datasets with samples that passed our filter in a list of
ExpressionSet objects called esets:
names(esets)
## [1] "GSE11237_eset" "GSE12225.GPL3676_eset"
## [3] "GSE12945_eset" "GSE13067_eset"
## [5] "GSE13294_eset" "GSE14095_eset"
## [7] "GSE14333_eset" "GSE16125.GPL5175_eset"
## [9] "GSE17536_eset" "GSE17537_eset"
## [11] "GSE17538.GPL570_eset" "GSE18105_eset"
## [13] "GSE2109_eset" "GSE21510_eset"
## [15] "GSE21815_eset" "GSE24549.GPL5175_eset"
## [17] "GSE24550.GPL5175_eset" "GSE2630_eset"
## [19] "GSE26682.GPL570_eset" "GSE26682.GPL96_eset"
## [21] "GSE26906_eset" "GSE27544_eset"
## [23] "GSE28702_eset" "GSE3294_eset"
## [25] "GSE33113_eset" "GSE39582_eset"
## [27] "GSE3964_eset" "GSE4045_eset"
## [29] "GSE4526_eset" "GSE45270_eset"
## [31] "TCGA.COAD_eset" "TCGA.READ_eset"
## [33] "TCGA.RNASeqV2.READ_eset" "TCGA.RNASeqV2_eset"
In the standard version of curatedCRCData (the version available on
Bioconductor), we collapse manufacturer probesets to official HGNC symbols
using the Biomart database. Some probesets are mapped to multiple HGNC symbols
in this database. For these probesets, we provide all the symbols. For example
220159_at maps to ABCA11P and ZNF721 and we
provide ABCA11P///ZNF721 as probeset name. If you have an array of
gene symbols for which you want to access the expression data, “ABCA11P” would
not be found in curatedCRCData in this example. The following function
will create a new ExpressionSet in which both ZNF721 and
ABCA11P are features with identical expression data:
expandProbesets <- function (eset, sep = "///")
{
x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
eset <- eset[order(sapply(x, length)), ]
x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
idx <- unlist(sapply(1:length(x), function(i) rep(i, length(x[[i]]))))
xx <- !duplicated(unlist(x))
idx <- idx[xx]
x <- unlist(x)[xx]
eset <- eset[idx, ]
featureNames(eset) <- x
eset
}
X <- TCGA.COAD_eset[head(grep("AA", featureNames(TCGA.COAD_eset))),]
exprs(X)[,1:3]
## TCGA.AA.3520 TCGA.AA.3532 TCGA.AA.3553
## AAAS -0.72125 -1.51150 -1.01250
## AACS 0.02225 0.82375 -0.08500
## AADAC 0.02775 -0.42900 1.12525
## AADACL1 1.06600 1.85550 0.92800
## AADACL2 0.08750 -0.61825 -0.47525
## AADACL3 0.38100 0.31100 0.33950
exprs(expandProbesets(X))[,1:3]
## TCGA.AA.3520 TCGA.AA.3532 TCGA.AA.3553
## AAAS -0.72125 -1.51150 -1.01250
## AACS 0.02225 0.82375 -0.08500
## AADAC 0.02775 -0.42900 1.12525
## AADACL1 1.06600 1.85550 0.92800
## AADACL2 0.08750 -0.61825 -0.47525
## AADACL3 0.38100 0.31100 0.33950
Figure 1: Available clinical annotation
This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset.
This example provides a table summarizing the datasets being used, and is useful when publishing analyses based on curatedCRCData. First, define some useful functions for this purpose:
source(system.file("extdata", "summarizeEsets.R", package = "curatedCRCData"))
## Warning in min(which(km.fit$surv < 0.5)): no non-missing arguments to min;
## returning Inf
## Warning in min(which(km.fit$surv < 0.5)): no non-missing arguments to min;
## returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
Optionally write this table to file, for example ( replace myfile <- tempfile() with something like myfile <- "nicetable.csv" )
(myfile <- tempfile())
## [1] "/tmp/RtmpBIjWr7/file77e5f31611a7c"
write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",")
| Study | PMID | N samples | msi status | G | Platform | median.survival | median follow-up | percent censoring | binarized OS (long/short) | |
|---|---|---|---|---|---|---|---|---|---|---|
| GSE11237 | Auman, Mcleod 2008 | 18653328 | 23 | 0/0/23 | 3/16/4/0/0 | Affymetrix HG_U95Av2 | NA | NA | NA | NA |
| GSE12225.GPL3676 | Lips, Morreau 2008 | 18959792 | 42 | 0/0/42 | 0/0/0/0/42 | NA | NA | NA | NA | NA |
| GSE12945 | Staub, Rosenthal 2009 | 19399471 | 62 | 0/0/62 | 0/31/31/0/0 | Affymetrix HG-U133A | NA | 49 | 81 | NA |
| GSE13067 | Jorissen and Sieber 2008 | 19088021 | 74 | 0/0/74 | 0/0/0/0/74 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE13294 | Jorissen and Sieber 2008 | 19088021 | 155 | 0/0/155 | 0/0/0/0/155 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE14095 | Watanabe, Hashimoto 2008 | 21680303 | 189 | 0/0/189 | 0/0/0/0/189 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE14333 | Jorissen and Sieber 2008 | 19996206 | 290 | 0/0/290 | 0/0/0/0/290 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE16125.GPL5175 | Reid, Pierotti 2009 | 19672874 | 36 | 0/0/36 | 0/0/0/0/36 | Affymetrix HuEx-1_0-st | 74 | 22 | 69 | NA |
| GSE17536 | Smith JJ,??Beauchamp RD 2009 | 19914252 | 177 | 0/0/177 | 16/134/27/0/0 | Affymetrix HG-U133Plus2 | 133 | 60 | 59 | NA |
| GSE17537 | Smith JJ,??Beauchamp RD 2009 | 19914252 | 55 | 0/0/55 | 1/32/3/0/19 | Affymetrix HG-U133Plus2 | NA | 57 | 64 | NA |
| GSE17538.GPL570 | Smith JJ,??Beauchamp RD 2009 | 19914252 | 232 | 0/0/232 | 17/166/30/0/19 | Affymetrix HG-U133Plus2 | 133 | 59 | 60 | NA |
| GSE18105 | Matsuyama, Sugihara 2009 | 20162577 | 111 | 0/0/111 | 0/0/0/0/111 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE2109 | expO, IGC 2005 | PMID unknown | 427 | 0/0/427 | 10/260/71/4/82 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE21510 | Tsukamoto, Sugihara 2010 | 21270110 | 148 | 0/0/148 | 0/0/0/0/148 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE21815 | Mori M,??Mimori K,??Yokobori T 2010 | 21862635 | 141 | 0/0/141 | 32/33/1/0/75 | Agilent G4112F | NA | NA | NA | NA |
| GSE24549.GPL5175 | Sveen A,????esen TH,??Rognum TO,??Lothe RA,??Skotheim RI 2011 | 21619627 | 83 | 0/0/83 | 0/0/0/0/83 | Affymetrix HuEx-1_0-st | NA | NA | NA | NA |
| GSE24550.GPL5175 | Sveen A,????esen TH,??Rognum TO,??Lothe RA,??Skotheim RI 2011 | 21619627 | 90 | 0/0/90 | 0/0/0/0/90 | Affymetrix HuEx-1_0-st | NA | NA | NA | NA |
| GSE2630 | Bandr??E,??Malumbres R,??Cubedo E,??Sola J,??Garc??Foncillas J,??Labarga A 2005 | 17390049 | 16 | 0/0/16 | 0/0/0/0/16 | NA | NA | NA | NA | NA |
| GSE26682.GPL570 | Vilar E,??Morgan MA 2011 | 21300766 | 156 | 0/0/156 | 0/0/0/0/156 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE26682.GPL96 | Vilar E,??Morgan MA 2011 | 21300766 | 155 | 0/0/155 | 0/0/0/0/155 | Affymetrix HG-U133A | NA | NA | NA | NA |
| GSE26906 | Olschwang S 2011 | 22496922 | 90 | 0/0/90 | 0/0/0/0/90 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE27544 | Bernal M,??Garc??Alcalde F,??Concha ????Blanco A,??Garrido F,??Ruiz-Cabello F 2011 | PMID unknown | 22 | 0/0/22 | 0/0/0/0/22 | Affymetrix HT HG-U133+ PM | NA | NA | NA | NA |
| GSE28702 | Tsuji 2011 | 22095227 | 83 | 0/0/83 | 62/16/5/0/0 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE3294 | Bianchini 2005 | 16773188 | 24 | 0/0/24 | 10/11/3/0/0 | UHN SS-Human 19Kv7 | NA | NA | NA | NA |
| GSE33113 | Medema JP,??Tanis PJ 2011 | 22496204 | 96 | 0/0/96 | 0/0/0/0/96 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE39582 | Marisa, Boige 2012 | 23700391 | 566 | 0/0/566 | 0/0/0/0/566 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE3964 | Graudens, Imbeaud 2006 | 16542501 | 29 | 0/0/29 | 0/0/0/0/29 | NA | NA | NA | NA | NA |
| GSE4045 | Laiho, Aaltonen 2007 | 16819509 | 37 | 0/0/37 | 4/28/4/0/1 | Affymetrix HG-U133A | NA | NA | NA | NA |
| GSE4526 | Watanabe T,??Kobunai T,??Toda E,??Oka T 2006 | 19016304 | 36 | 0/0/36 | 0/0/0/0/36 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE45270 | Medema JP,??de Sousa E Melo F,??Vermeulen L,??Jansen M,??Dekker E,??Van Noesel C,??Fessler E 2013 | PMID unknown | 13 | 0/0/13 | 0/0/0/0/13 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| TCGA.COAD | The Cancer Genome Atlas Network 2012 | 22810696 | 130 | 0/0/130 | 0/0/0/0/130 | Agilent G4502A-07-3 | 17 | 25 | 17 | NA |
| TCGA.READ | The Cancer Genome Atlas Network 2012 | 22810696 | 51 | 0/0/51 | 0/0/0/0/51 | Agilent G4502A-07-3 | 10 | NA | 0 | NA |
| TCGA.RNASeqV2.READ | The Cancer Genome Atlas Network 2012 | 22810696 | 6 | 0/0/6 | 0/0/0/0/6 | 41 | NA | 0 | NA | |
| TCGA.RNASeqV2 | The Cancer Genome Atlas Network 2012 | 22810696 | 195 | 0/0/195 | 0/0/0/0/195 | 15 | NA | 14 | NA |
If you are not doing your analysis in R, and just want to get some data you have identified from the curatedCRCData manual, here is a simple way to do it. For one dataset:
library(curatedCRCData)
library(Biobase)
data(TCGA.COAD_eset)
write.csv(exprs(TCGA.COAD_eset), file="TCGA.COAD_eset_exprs.csv")
write.csv(pData(TCGA.COAD_eset), file="TCGA.COAD_eset_clindata.csv")
Or for several datasets:
data.to.fetch <- c("TCGA.COAD_eset", "GSE37317_eset")
for (onedata in data.to.fetch){
print(paste("Fetching", onedata))
data(list=onedata)
write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep=""))
write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep=""))
}
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.51 logging_0.10-108 survival_3.8-6
## [4] curatedCRCData_2.43.1 Biobase_2.71.0 BiocGenerics_0.57.1
## [7] generics_0.1.4 xtable_1.8-8 sva_3.59.0
## [10] BiocParallel_1.45.0 genefilter_1.93.0 mgcv_1.9-4
## [13] nlme_3.1-169 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 RSQLite_2.4.6 lattice_0.22-9
## [4] magrittr_2.0.5 digest_0.6.39 evaluate_1.0.5
## [7] grid_4.6.0 bookdown_0.46 fastmap_1.2.0
## [10] blob_1.3.0 jsonlite_2.0.0 Matrix_1.7-5
## [13] AnnotationDbi_1.73.1 limma_3.67.1 DBI_1.3.0
## [16] tinytex_0.59 BiocManager_1.30.27 httr_1.4.8
## [19] XML_3.99-0.23 Biostrings_2.79.5 codetools_0.2-20
## [22] jquerylib_0.1.4 cli_3.6.6 rlang_1.2.0
## [25] crayon_1.5.3 XVector_0.51.0 bit64_4.6.0-1
## [28] splines_4.6.0 cachem_1.1.0 yaml_2.3.12
## [31] otel_0.2.0 parallel_4.6.0 tools_4.6.0
## [34] annotate_1.89.0 memoise_2.0.1 locfit_1.5-9.12
## [37] vctrs_0.7.3 R6_2.6.1 png_0.1-9
## [40] magick_2.9.1 matrixStats_1.5.0 stats4_4.6.0
## [43] lifecycle_1.0.5 Seqinfo_1.1.0 KEGGREST_1.51.1
## [46] edgeR_4.9.8 S4Vectors_0.49.2 IRanges_2.45.0
## [49] bit_4.6.0 bslib_0.10.0 Rcpp_1.1.1-1
## [52] statmod_1.5.1 xfun_0.57 MatrixGenerics_1.23.0
## [55] htmltools_0.5.9 rmarkdown_2.31 compiler_4.6.0