1 curatedCRCData: Clinically Annotated Data for the Colorectal Cancer Transcriptome

This package represents a manually curated data collection for gene expression meta-analysis of patients with colorectal cancer. This resource provides uniformly prepared microarray data with curated and documented clinical metadata. It allows a computational user to efficiently identify studies and patient subgroups of interest for analysis and to run such analyses immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods, and clinical data formats.

In this vignette, we give a short tour of the package and show how to use it efficiently.

2 Load data sets

Loading a single dataset is very easy. First we load the package:

library(curatedCRCData)

To get a listing of all the datasets, use the data function:

data(package="curatedCRCData")
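If the dataset names are needed as a character vector, for example to loop over them, they can be pulled out of the object returned by data(). This is a minimal sketch that assumes curatedCRCData is installed; the names available and dataset_names are illustrative.

```r
library(curatedCRCData)

# data(package = ...) returns a "packageIQR" object whose $results matrix
# has one row per dataset; the "Item" column holds the dataset names
available <- data(package = "curatedCRCData")$results
dataset_names <- available[, "Item"]

head(dataset_names)
```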

Now to load a single dataset, we use the data function again:

data(TCGA.COAD_eset)
TCGA.COAD_eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 17814 features, 130 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
##   varLabels: unique_patient_ID alt_sample_name ...
##     uncurated_author_metadata (59 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
##   fvarLabels: probeset gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22810696 
## Annotation: agilent-014850 whole human genome microarray 4x44k g4112f

The datasets are provided as Bioconductor ExpressionSet objects and we refer to the Bioconductor documentation for users unfamiliar with this data structure.
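For orientation, the following sketch shows the standard Biobase accessors applied to the dataset loaded above. It assumes Biobase and curatedCRCData are installed; expr_mat and clin are illustrative names, and the two clinical columns are the ones visible in the varLabels output.

```r
library(Biobase)
library(curatedCRCData)
data(TCGA.COAD_eset)

# Dimensions: features (genes) by samples
dim(TCGA.COAD_eset)

# Expression matrix: one row per feature, one column per sample
expr_mat <- exprs(TCGA.COAD_eset)
expr_mat[1:3, 1:3]

# Curated clinical annotation: one row per sample
clin <- pData(TCGA.COAD_eset)
head(clin[, c("unique_patient_ID", "alt_sample_name")])

# Feature annotation (probeset and gene columns)
head(fData(TCGA.COAD_eset))
```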

3 Load datasets based on rules

For a meta-analysis, we typically want to filter datasets and patients to obtain the patient population of interest. We provide a short but powerful R script that does the filtering and returns the data as a list of ExpressionSet objects. To use this script within R, first source a config file that specifies the filters, such as the minimum number of patients in each dataset. It is also possible to filter samples by annotation, for example to remove early-stage and normal samples.

source(system.file("extdata", "patientselection_all.config",
                   package = "curatedCRCData"))
ls()
## [1] "TCGA.COAD_eset"       "keep.common.only"     "meta.required"       
## [4] "min.number.of.events" "min.sample.size"      "quantile.cutoff"     
## [7] "rescale"              "strict.checking"

Inspect the values of the variables we have just loaded. The variable names are fairly descriptive. Sample filters based on annotation are written as rules: a rule such as rule.1 is a character vector of length 2, where the first entry is the name of a clinical data variable and the second entry is a regular expression that the variable must match. Any number of rules can be added, with increasing identifiers, e.g. rule.2, rule.3, etc. The config file sourced above defines no rules, which is why none appear in the ls() output.

Here strict.checking is FALSE, meaning that samples not annotated for the variables in these rules are allowed to pass the filter. If strict.checking == TRUE, samples missing this annotation will be removed.

sapply(ls(), get)
## $TCGA.COAD_eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 17814 features, 130 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
##   varLabels: unique_patient_ID alt_sample_name ...
##     uncurated_author_metadata (59 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
##   fvarLabels: probeset gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22810696 
## Annotation: agilent-014850 whole human genome microarray 4x44k g4112f 
## 
## $keep.common.only
## [1] FALSE
## 
## $meta.required
## NULL
## 
## $min.number.of.events
## [1] 0
## 
## $min.sample.size
## [1] 1
## 
## $quantile.cutoff
## [1] 0
## 
## $rescale
## [1] FALSE
## 
## $strict.checking
## [1] FALSE
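The rule format and the effect of strict.checking can be illustrated with a small base R sketch. It is not the code used by createEsetList.R: the data frame, the sample_type variable, its values, and the apply_rule helper are all hypothetical and only mimic the behaviour described above.

```r
# Hypothetical clinical annotation for five samples (not real package data)
pheno <- data.frame(
  sample_type = c("tumor", "tumor", "adjacent normal", NA, "tumor"),
  row.names = paste0("sample", 1:5)
)

# A rule: name of a clinical variable, then a regular expression it must match
rule.1 <- c("sample_type", "^tumor$")

# Hypothetical helper: returns TRUE for samples that pass the rule
apply_rule <- function(pheno, rule, strict.checking = FALSE) {
  values <- as.character(pheno[[rule[1]]])
  keep <- grepl(rule[2], values)
  # Samples without annotation pass only when strict.checking is FALSE
  keep[is.na(values)] <- !strict.checking
  keep
}

lenient <- apply_rule(pheno, rule.1, strict.checking = FALSE)
strict <- apply_rule(pheno, rule.1, strict.checking = TRUE)
rownames(pheno)[lenient]
rownames(pheno)[strict]
```

In this toy example the unannotated sample4 is kept under the lenient setting and dropped under the strict one, while the sample annotated as adjacent normal is removed in both cases.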

Now that we have defined the sample filter, we create a list of ExpressionSet objects by sourcing the createEsetList.R file:

source(system.file("extdata", "createEsetList.R", package = "curatedCRCData"))
## 2026-04-16 11:12:32.674264 INFO::Inside script createEsetList.R - inputArgs =
## 2026-04-16 11:12:32.684735 INFO::None provided
## 2026-04-16 11:12:32.721212 INFO::Loading curatedCRCData 2.43.1
## 2026-04-16 11:13:18.743068 INFO::Clean up the esets.
## 2026-04-16 11:13:18.832569 INFO::including GSE11237_eset
## 2026-04-16 11:13:18.858108 INFO::including GSE12225.GPL3676_eset
## 2026-04-16 11:13:18.892389 INFO::including GSE12945_eset
## 2026-04-16 11:13:18.930048 INFO::including GSE13067_eset
## 2026-04-16 11:13:19.006383 INFO::including GSE13294_eset
## 2026-04-16 11:13:19.47858 INFO::including GSE14095_eset
## 2026-04-16 11:13:19.569769 INFO::including GSE14333_eset
## 2026-04-16 11:13:19.628888 INFO::including GSE16125.GPL5175_eset
## 2026-04-16 11:13:19.687736 INFO::including GSE17536_eset
## 2026-04-16 11:13:19.740675 INFO::including GSE17537_eset
## 2026-04-16 11:13:19.812373 INFO::including GSE17538.GPL570_eset
## 2026-04-16 11:13:20.25375 INFO::including GSE18105_eset
## 2026-04-16 11:13:20.348391 INFO::including GSE2109_eset
## 2026-04-16 11:13:20.440227 INFO::including GSE21510_eset
## 2026-04-16 11:13:20.503528 INFO::including GSE21815_eset
## 2026-04-16 11:13:20.548581 INFO::including GSE24549.GPL5175_eset
## 2026-04-16 11:13:20.576108 INFO::including GSE24550.GPL5175_eset
## 2026-04-16 11:13:20.605044 INFO::including GSE2630_eset
## 2026-04-16 11:13:20.630424 INFO::including GSE26682.GPL570_eset
## 2026-04-16 11:13:21.045229 INFO::including GSE26682.GPL96_eset
## 2026-04-16 11:13:21.082334 INFO::including GSE26906_eset
## 2026-04-16 11:13:21.111844 INFO::including GSE27544_eset
## 2026-04-16 11:13:21.151021 INFO::including GSE28702_eset
## 2026-04-16 11:13:21.186451 INFO::including GSE3294_eset
## 2026-04-16 11:13:21.22219 INFO::including GSE33113_eset
## 2026-04-16 11:13:21.333061 INFO::including GSE39582_eset
## 2026-04-16 11:13:21.430076 INFO::including GSE3964_eset
## 2026-04-16 11:13:21.446424 INFO::including GSE4045_eset
## 2026-04-16 11:13:21.470465 INFO::including GSE4526_eset
## 2026-04-16 11:13:21.499032 INFO::including GSE45270_eset
## 2026-04-16 11:13:21.555116 INFO::including TCGA.COAD_eset
## 2026-04-16 11:13:21.594059 INFO::including TCGA.READ_eset
## 2026-04-16 11:13:21.619469 INFO::including TCGA.RNASeqV2.READ_eset
## 2026-04-16 11:13:21.673183 INFO::including TCGA.RNASeqV2_eset
## 2026-04-16 11:13:22.02208 INFO::Ids with missing data: GSE2630_eset, GSE3294_eset, TCGA.COAD_eset, TCGA.READ_eset

It is also possible to run the script from the command line and then load the R data file within R:

R --vanilla "--args patientselection.config crc.eset.rda tmp.log"  < createEsetList.R 
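Back in R, the saved file can then be read with load(). The sketch below loads it into a separate environment so that nothing in the workspace is overwritten, and reports which objects the file contained. The helper load_rda is hypothetical, and the file name is simply the second argument of the command above.

```r
# Hypothetical helper: load an .rda file into its own environment and
# return its contents as a named list
load_rda <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)
  mget(object_names, envir = env)
}

rda_path <- "crc.eset.rda"
if (file.exists(rda_path)) {
  loaded <- load_rda(rda_path)
  # Names of the objects stored in the file
  print(names(loaded))
}
```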

Now we have 34 datasets with samples that passed our filter in a list of ExpressionSet objects called esets:

names(esets)
##  [1] "GSE11237_eset"           "GSE12225.GPL3676_eset"  
##  [3] "GSE12945_eset"           "GSE13067_eset"          
##  [5] "GSE13294_eset"           "GSE14095_eset"          
##  [7] "GSE14333_eset"           "GSE16125.GPL5175_eset"  
##  [9] "GSE17536_eset"           "GSE17537_eset"          
## [11] "GSE17538.GPL570_eset"    "GSE18105_eset"          
## [13] "GSE2109_eset"            "GSE21510_eset"          
## [15] "GSE21815_eset"           "GSE24549.GPL5175_eset"  
## [17] "GSE24550.GPL5175_eset"   "GSE2630_eset"           
## [19] "GSE26682.GPL570_eset"    "GSE26682.GPL96_eset"    
## [21] "GSE26906_eset"           "GSE27544_eset"          
## [23] "GSE28702_eset"           "GSE3294_eset"           
## [25] "GSE33113_eset"           "GSE39582_eset"          
## [27] "GSE3964_eset"            "GSE4045_eset"           
## [29] "GSE4526_eset"            "GSE45270_eset"          
## [31] "TCGA.COAD_eset"          "TCGA.READ_eset"         
## [33] "TCGA.RNASeqV2.READ_eset" "TCGA.RNASeqV2_eset"
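Because esets is an ordinary named list, the usual list tools apply. As a minimal sketch, the hypothetical helper below tabulates the number of features and samples per dataset. It is demonstrated on toy matrices so that it runs without the package, on the assumption that nrow() and ncol() behave the same way on ExpressionSet objects.

```r
# Hypothetical helper: one row per dataset with its feature and sample counts
count_features_samples <- function(esets) {
  data.frame(
    features = vapply(esets, nrow, integer(1)),
    samples = vapply(esets, ncol, integer(1))
  )
}

# Toy stand-in for the list of ExpressionSet objects (not real data)
toy_esets <- list(
  studyA = matrix(0, nrow = 3, ncol = 2),
  studyB = matrix(0, nrow = 3, ncol = 5)
)

size_table <- count_features_samples(toy_esets)
print(size_table)
```

With the real list, the call would be count_features_samples(esets).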

4 Non-unique gene symbols

In the standard version of curatedCRCData (the version available on Bioconductor), we collapse manufacturer probesets to official HGNC symbols using the Biomart database. Some probesets map to multiple HGNC symbols in this database, and for these we provide all the symbols. For example, 220159_at maps to ABCA11P and ZNF721, so we provide ABCA11P///ZNF721 as the probeset name. As a consequence, if you look up expression data from a vector of gene symbols, “ABCA11P” on its own would not be found in curatedCRCData. The following function creates a new ExpressionSet in which both ZNF721 and ABCA11P are features with identical expression data:

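expandProbesets <- function (eset, sep = "///") 
{
    # Split each feature name into its individual gene symbols
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    # Put features with fewer symbols first, so that they are the ones kept
    # when a symbol occurs in more than one feature
    eset <- eset[order(sapply(x, length)), ]
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    # Row index of the originating feature for every individual symbol
    idx <- unlist(sapply(seq_along(x), function(i) rep(i, length(x[[i]]))))
    # Keep only the first occurrence of each symbol
    xx <- !duplicated(unlist(x))
    idx <- idx[xx]
    x <- unlist(x)[xx]
    # Repeat rows as needed and name each row by a single symbol
    eset <- eset[idx, ]
    featureNames(eset) <- x
    eset
}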

X <- TCGA.COAD_eset[head(grep("AA", featureNames(TCGA.COAD_eset))),]
exprs(X)[,1:3]
##         TCGA.AA.3520 TCGA.AA.3532 TCGA.AA.3553
## AAAS        -0.72125     -1.51150     -1.01250
## AACS         0.02225      0.82375     -0.08500
## AADAC        0.02775     -0.42900      1.12525
## AADACL1      1.06600      1.85550      0.92800
## AADACL2      0.08750     -0.61825     -0.47525
## AADACL3      0.38100      0.31100      0.33950
exprs(expandProbesets(X))[,1:3]
##         TCGA.AA.3520 TCGA.AA.3532 TCGA.AA.3553
## AAAS        -0.72125     -1.51150     -1.01250
## AACS         0.02225      0.82375     -0.08500
## AADAC        0.02775     -0.42900      1.12525
## AADACL1      1.06600      1.85550      0.92800
## AADACL2      0.08750     -0.61825     -0.47525
## AADACL3      0.38100      0.31100      0.33950
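None of the six features shown above contains the /// separator, so expandProbesets returns them unchanged. To see the expansion itself, here is a minimal base R sketch of the same idea applied to a plain matrix. The helper expand_rows and the toy values are hypothetical, and this is an illustration rather than a replacement for the function above.

```r
# Hypothetical helper mirroring expandProbesets for a plain matrix
expand_rows <- function(mat, sep = "///") {
  symbols <- strsplit(rownames(mat), sep, fixed = TRUE)
  # Rows with fewer symbols first, so they win when a symbol is duplicated
  ord <- order(lengths(symbols))
  mat <- mat[ord, , drop = FALSE]
  symbols <- symbols[ord]
  # One row index per individual symbol
  idx <- rep(seq_along(symbols), lengths(symbols))
  flat <- unlist(symbols)
  keep <- !duplicated(flat)
  out <- mat[idx[keep], , drop = FALSE]
  rownames(out) <- flat[keep]
  out
}

# Toy expression values (not real data)
toy <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE,
              dimnames = list(c("ABCA11P///ZNF721", "AAAS"), c("s1", "s2")))

expanded <- expand_rows(toy)
print(expanded)
```

After expansion, ABCA11P and ZNF721 are separate rows carrying the same values, so either symbol can be looked up directly.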

5 Appendix

5.1 Available Clinical Characteristics

Figure 1: Available clinical annotation
This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset.

5.2 Summarizing the List of ExpressionSets

This example provides a table summarizing the datasets being used, and is useful when publishing analyses based on curatedCRCData. First, define some useful functions for this purpose:

source(system.file("extdata", "summarizeEsets.R", package = "curatedCRCData"))
## Warning in min(which(km.fit$surv < 0.5)): no non-missing arguments to min;
## returning Inf
## Warning in min(which(km.fit$surv < 0.5)): no non-missing arguments to min;
## returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf

Optionally write this table to a file. To keep a permanent copy, replace myfile <- tempfile() with something like myfile <- "nicetable.csv":

(myfile <- tempfile())
## [1] "/tmp/Rtmp3NSnvg/filededf6e4ad13a"
write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",")

Table 1: Datasets provided by curatedCRCData.
| Dataset | Study | PMID | N samples | msi status | G | Platform | median.survival | median follow-up | percent censoring | binarized OS (long/short) |
|---|---|---|---|---|---|---|---|---|---|---|
| GSE11237 | Auman, Mcleod 2008 | 18653328 | 23 | 0/0/23 | 3/16/4/0/0 | Affymetrix HG_U95Av2 | NA | NA | NA | NA |
| GSE12225.GPL3676 | Lips, Morreau 2008 | 18959792 | 42 | 0/0/42 | 0/0/0/0/42 | NA | NA | NA | NA | NA |
| GSE12945 | Staub, Rosenthal 2009 | 19399471 | 62 | 0/0/62 | 0/31/31/0/0 | Affymetrix HG-U133A | NA | 49 | 81 | NA |
| GSE13067 | Jorissen and Sieber 2008 | 19088021 | 74 | 0/0/74 | 0/0/0/0/74 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE13294 | Jorissen and Sieber 2008 | 19088021 | 155 | 0/0/155 | 0/0/0/0/155 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE14095 | Watanabe, Hashimoto 2008 | 21680303 | 189 | 0/0/189 | 0/0/0/0/189 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE14333 | Jorissen and Sieber 2008 | 19996206 | 290 | 0/0/290 | 0/0/0/0/290 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE16125.GPL5175 | Reid, Pierotti 2009 | 19672874 | 36 | 0/0/36 | 0/0/0/0/36 | Affymetrix HuEx-1_0-st | 74 | 22 | 69 | NA |
| GSE17536 | Smith JJ, Beauchamp RD 2009 | 19914252 | 177 | 0/0/177 | 16/134/27/0/0 | Affymetrix HG-U133Plus2 | 133 | 60 | 59 | NA |
| GSE17537 | Smith JJ, Beauchamp RD 2009 | 19914252 | 55 | 0/0/55 | 1/32/3/0/19 | Affymetrix HG-U133Plus2 | NA | 57 | 64 | NA |
| GSE17538.GPL570 | Smith JJ, Beauchamp RD 2009 | 19914252 | 232 | 0/0/232 | 17/166/30/0/19 | Affymetrix HG-U133Plus2 | 133 | 59 | 60 | NA |
| GSE18105 | Matsuyama, Sugihara 2009 | 20162577 | 111 | 0/0/111 | 0/0/0/0/111 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE2109 | expO, IGC 2005 | PMID unknown | 427 | 0/0/427 | 10/260/71/4/82 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE21510 | Tsukamoto, Sugihara 2010 | 21270110 | 148 | 0/0/148 | 0/0/0/0/148 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE21815 | Mori M, Mimori K, Yokobori T 2010 | 21862635 | 141 | 0/0/141 | 32/33/1/0/75 | Agilent G4112F | NA | NA | NA | NA |
| GSE24549.GPL5175 | Sveen A, Ågesen TH, Rognum TO, Lothe RA, Skotheim RI 2011 | 21619627 | 83 | 0/0/83 | 0/0/0/0/83 | Affymetrix HuEx-1_0-st | NA | NA | NA | NA |
| GSE24550.GPL5175 | Sveen A, Ågesen TH, Rognum TO, Lothe RA, Skotheim RI 2011 | 21619627 | 90 | 0/0/90 | 0/0/0/0/90 | Affymetrix HuEx-1_0-st | NA | NA | NA | NA |
| GSE2630 | Bandrés E, Malumbres R, Cubedo E, Sola J, García-Foncillas J, Labarga A 2005 | 17390049 | 16 | 0/0/16 | 0/0/0/0/16 | NA | NA | NA | NA | NA |
| GSE26682.GPL570 | Vilar E, Morgan MA 2011 | 21300766 | 156 | 0/0/156 | 0/0/0/0/156 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE26682.GPL96 | Vilar E, Morgan MA 2011 | 21300766 | 155 | 0/0/155 | 0/0/0/0/155 | Affymetrix HG-U133A | NA | NA | NA | NA |
| GSE26906 | Olschwang S 2011 | 22496922 | 90 | 0/0/90 | 0/0/0/0/90 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE27544 | Bernal M, García-Alcalde F, Concha Á, Blanco A, Garrido F, Ruiz-Cabello F 2011 | PMID unknown | 22 | 0/0/22 | 0/0/0/0/22 | Affymetrix HT HG-U133+ PM | NA | NA | NA | NA |
| GSE28702 | Tsuji 2011 | 22095227 | 83 | 0/0/83 | 62/16/5/0/0 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE3294 | Bianchini 2005 | 16773188 | 24 | 0/0/24 | 10/11/3/0/0 | UHN SS-Human 19Kv7 | NA | NA | NA | NA |
| GSE33113 | Medema JP, Tanis PJ 2011 | 22496204 | 96 | 0/0/96 | 0/0/0/0/96 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE39582 | Marisa, Boige 2012 | 23700391 | 566 | 0/0/566 | 0/0/0/0/566 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE3964 | Graudens, Imbeaud 2006 | 16542501 | 29 | 0/0/29 | 0/0/0/0/29 | NA | NA | NA | NA | NA |
| GSE4045 | Laiho, Aaltonen 2007 | 16819509 | 37 | 0/0/37 | 4/28/4/0/1 | Affymetrix HG-U133A | NA | NA | NA | NA |
| GSE4526 | Watanabe T, Kobunai T, Toda E, Oka T 2006 | 19016304 | 36 | 0/0/36 | 0/0/0/0/36 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| GSE45270 | Medema JP, de Sousa E Melo F, Vermeulen L, Jansen M, Dekker E, Van Noesel C, Fessler E 2013 | PMID unknown | 13 | 0/0/13 | 0/0/0/0/13 | Affymetrix HG-U133Plus2 | NA | NA | NA | NA |
| TCGA.COAD | The Cancer Genome Atlas Network 2012 | 22810696 | 130 | 0/0/130 | 0/0/0/0/130 | Agilent G4502A-07-3 | 17 | 25 | 17 | NA |
| TCGA.READ | The Cancer Genome Atlas Network 2012 | 22810696 | 51 | 0/0/51 | 0/0/0/0/51 | Agilent G4502A-07-3 | 10 | NA | 0 | NA |
| TCGA.RNASeqV2.READ | The Cancer Genome Atlas Network 2012 | 22810696 | 6 | 0/0/6 | 0/0/0/0/6 | | 41 | NA | 0 | NA |
| TCGA.RNASeqV2 | The Cancer Genome Atlas Network 2012 | 22810696 | 195 | 0/0/195 | 0/0/0/0/195 | | 15 | NA | 14 | NA |

5.3 For non-R users

If you are not doing your analysis in R, and just want to get some data you have identified from the curatedCRCData manual, here is a simple way to do it. For one dataset:

library(curatedCRCData)
library(Biobase)
data(TCGA.COAD_eset)
write.csv(exprs(TCGA.COAD_eset), file="TCGA.COAD_eset_exprs.csv")
write.csv(pData(TCGA.COAD_eset), file="TCGA.COAD_eset_clindata.csv")

Or for several datasets:

# Names must match datasets shipped with curatedCRCData
data.to.fetch <- c("TCGA.COAD_eset", "GSE17536_eset")
for (onedata in data.to.fetch){
    print(paste("Fetching", onedata))
    # Load the dataset by name, then export expression and clinical data
    data(list=onedata)
    write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep=""))
    write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep=""))
}

5.4 Session Info

## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.51            logging_0.10-108      survival_3.8-6       
##  [4] curatedCRCData_2.43.1 Biobase_2.71.0        BiocGenerics_0.57.0  
##  [7] generics_0.1.4        xtable_1.8-8          sva_3.59.0           
## [10] BiocParallel_1.45.0   genefilter_1.93.0     mgcv_1.9-4           
## [13] nlme_3.1-169          BiocStyle_2.39.0     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10           RSQLite_2.4.6         lattice_0.22-9       
##  [4] magrittr_2.0.5        digest_0.6.39         evaluate_1.0.5       
##  [7] grid_4.6.0            bookdown_0.46         fastmap_1.2.0        
## [10] blob_1.3.0            jsonlite_2.0.0        Matrix_1.7-5         
## [13] AnnotationDbi_1.73.1  limma_3.67.1          DBI_1.3.0            
## [16] tinytex_0.59          BiocManager_1.30.27   httr_1.4.8           
## [19] XML_3.99-0.23         Biostrings_2.79.5     codetools_0.2-20     
## [22] jquerylib_0.1.4       cli_3.6.6             rlang_1.2.0          
## [25] crayon_1.5.3          XVector_0.51.0        bit64_4.6.0-1        
## [28] splines_4.6.0         cachem_1.1.0          yaml_2.3.12          
## [31] otel_0.2.0            parallel_4.6.0        tools_4.6.0          
## [34] annotate_1.89.0       memoise_2.0.1         locfit_1.5-9.12      
## [37] vctrs_0.7.3           R6_2.6.1              png_0.1-9            
## [40] magick_2.9.1          matrixStats_1.5.0     stats4_4.6.0         
## [43] lifecycle_1.0.5       Seqinfo_1.1.0         KEGGREST_1.51.1      
## [46] edgeR_4.9.7           S4Vectors_0.49.1-1    IRanges_2.45.0       
## [49] bit_4.6.0             bslib_0.10.0          Rcpp_1.1.1-1         
## [52] statmod_1.5.1         xfun_0.57             MatrixGenerics_1.23.0
## [55] htmltools_0.5.9       rmarkdown_2.31        compiler_4.6.0