1 curatedCRCData: Clinically Annotated Data for the Colorectal Cancer Transcriptome
2 Load data sets
3 Load datasets based on rules
4 Non-unique gene symbols
5 Appendix

1 curatedCRCData: Clinically Annotated Data for the Colorectal Cancer Transcriptome

This package represents a manually curated data collection for gene expression meta-analysis of patients with colorectal cancer. This resource provides uniformly prepared microarray data with curated and documented clinical metadata. It allows a computational user to efficiently identify studies and patient subgroups of interest for analysis and to run such analyses immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods, and clinical data formats.

In this vignette, we give a short tour of the package and will show how to use it efficiently.

2 Load data sets

Loading a single dataset is very easy. First we load the package:

library(curatedCRCData)

To get a listing of all the datasets, use the data function:

data(package="curatedCRCData")

Now to load a single dataset, we use the data function again:

data(TCGA.COAD_eset)
TCGA.COAD_eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 17814 features, 130 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
##   varLabels: unique_patient_ID alt_sample_name ...
##     uncurated_author_metadata (59 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
##   fvarLabels: probeset gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22810696 
## Annotation: agilent-014850 whole human genome microarray 4x44k g4112f

The datasets are provided as Bioconductor ExpressionSet objects and we refer to the Bioconductor documentation for users unfamiliar with this data structure.

3 Load datasets based on rules

For a meta-analysis, we typically want to filter datasets and patients to get a population of patients we are interested in. We provide a short but powerful R script that does the filtering and provides the data as a list of ExpressionSet objects. One can use this script within R by first sourcing a config file which specifies the filters, like the minimum numbers of patients in each dataset. It is also possible to filter samples by annotation, for example to remove early stage and normal samples.

source(system.file("extdata", 
"patientselection_all.config",package="curatedCRCData"))
ls()

## [1] "TCGA.COAD_eset"       "keep.common.only"     "meta.required"       
## [4] "min.number.of.events" "min.sample.size"      "quantile.cutoff"     
## [7] "rescale"              "strict.checking"

See what the values of these variables we have loaded are. The variable names are fairly descriptive, but note that rule.1 is a character vector of length 2, where the first entry is the name of a clinical data variable, and the second entry is a Regular Expression providing a requirement for that variable. Any number of rules can be added, with increasing identifiers, e.g. rule.2, rule.3, etc.

Here strict.checking is FALSE, meaning that samples not annotated for the variables in these rules are allowed to pass the filter. If strict.checking == TRUE, samples missing this annotation will be removed.

sapply(ls(), get)

## $TCGA.COAD_eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 17814 features, 130 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
##   varLabels: unique_patient_ID alt_sample_name ...
##     uncurated_author_metadata (59 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
##   fvarLabels: probeset gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22810696 
## Annotation: agilent-014850 whole human genome microarray 4x44k g4112f 
## 
## $keep.common.only
## [1] FALSE
## 
## $meta.required
## NULL
## 
## $min.number.of.events
## [1] 0
## 
## $min.sample.size
## [1] 1
## 
## $quantile.cutoff
## [1] 0
## 
## $rescale
## [1] FALSE
## 
## $strict.checking
## [1] FALSE

Now that we have defined the sample filter, we create a list of ExpressionSet objects by sourcing the createEsetList.R file:

source(system.file("extdata", "createEsetList.R", package = "curatedCRCData"))

## 2026-04-30 11:30:02.748845 INFO::Inside script createEsetList.R - inputArgs =
## 2026-04-30 11:30:02.759232 INFO::None provided
## 2026-04-30 11:30:02.795908 INFO::Loading curatedCRCData 2.45.1
## 2026-04-30 11:30:49.334537 INFO::Clean up the esets.
## 2026-04-30 11:30:49.426177 INFO::including GSE11237_eset
## 2026-04-30 11:30:49.454124 INFO::including GSE12225.GPL3676_eset
## 2026-04-30 11:30:49.491708 INFO::including GSE12945_eset
## 2026-04-30 11:30:49.530485 INFO::including GSE13067_eset
## 2026-04-30 11:30:49.613579 INFO::including GSE13294_eset
## 2026-04-30 11:30:50.090391 INFO::including GSE14095_eset
## 2026-04-30 11:30:50.194601 INFO::including GSE14333_eset
## 2026-04-30 11:30:50.257255 INFO::including GSE16125.GPL5175_eset
## 2026-04-30 11:30:50.321212 INFO::including GSE17536_eset
## 2026-04-30 11:30:50.376991 INFO::including GSE17537_eset
## 2026-04-30 11:30:50.460367 INFO::including GSE17538.GPL570_eset
## 2026-04-30 11:30:50.913183 INFO::including GSE18105_eset
## 2026-04-30 11:30:51.030655 INFO::including GSE2109_eset
## 2026-04-30 11:30:51.134426 INFO::including GSE21510_eset
## 2026-04-30 11:30:51.203719 INFO::including GSE21815_eset
## 2026-04-30 11:30:51.250534 INFO::including GSE24549.GPL5175_eset
## 2026-04-30 11:30:51.280319 INFO::including GSE24550.GPL5175_eset
## 2026-04-30 11:30:51.309708 INFO::including GSE2630_eset
## 2026-04-30 11:30:51.338835 INFO::including GSE26682.GPL570_eset
## 2026-04-30 11:30:51.757845 INFO::including GSE26682.GPL96_eset
## 2026-04-30 11:30:51.798867 INFO::including GSE26906_eset
## 2026-04-30 11:30:51.830607 INFO::including GSE27544_eset
## 2026-04-30 11:30:51.873937 INFO::including GSE28702_eset
## 2026-04-30 11:30:51.910318 INFO::including GSE3294_eset
## 2026-04-30 11:30:51.948017 INFO::including GSE33113_eset
## 2026-04-30 11:30:52.062692 INFO::including GSE39582_eset
## 2026-04-30 11:30:52.16108 INFO::including GSE3964_eset
## 2026-04-30 11:30:52.177329 INFO::including GSE4045_eset
## 2026-04-30 11:30:52.201149 INFO::including GSE4526_eset
## 2026-04-30 11:30:52.230719 INFO::including GSE45270_eset
## 2026-04-30 11:30:52.289726 INFO::including TCGA.COAD_eset
## 2026-04-30 11:30:52.330087 INFO::including TCGA.READ_eset
## 2026-04-30 11:30:52.356704 INFO::including TCGA.RNASeqV2.READ_eset
## 2026-04-30 11:30:52.410317 INFO::including TCGA.RNASeqV2_eset
## 2026-04-30 11:30:52.764018 INFO::Ids with missing data: GSE2630_eset, GSE3294_eset, TCGA.COAD_eset, TCGA.READ_eset

It is also possible to run the script from the command line and then load the R data file within R:

R --vanilla "--args patientselection.config crc.eset.rda tmp.log"  < createEsetList.R

Now we have 34 datasets with samples that passed our filter in a list of ExpressionSet objects called esets:

names(esets)

##  [1] "GSE11237_eset"           "GSE12225.GPL3676_eset"  
##  [3] "GSE12945_eset"           "GSE13067_eset"          
##  [5] "GSE13294_eset"           "GSE14095_eset"          
##  [7] "GSE14333_eset"           "GSE16125.GPL5175_eset"  
##  [9] "GSE17536_eset"           "GSE17537_eset"          
## [11] "GSE17538.GPL570_eset"    "GSE18105_eset"          
## [13] "GSE2109_eset"            "GSE21510_eset"          
## [15] "GSE21815_eset"           "GSE24549.GPL5175_eset"  
## [17] "GSE24550.GPL5175_eset"   "GSE2630_eset"           
## [19] "GSE26682.GPL570_eset"    "GSE26682.GPL96_eset"    
## [21] "GSE26906_eset"           "GSE27544_eset"          
## [23] "GSE28702_eset"           "GSE3294_eset"           
## [25] "GSE33113_eset"           "GSE39582_eset"          
## [27] "GSE3964_eset"            "GSE4045_eset"           
## [29] "GSE4526_eset"            "GSE45270_eset"          
## [31] "TCGA.COAD_eset"          "TCGA.READ_eset"         
## [33] "TCGA.RNASeqV2.READ_eset" "TCGA.RNASeqV2_eset"

4 Non-unique gene symbols

In the standard version of curatedCRCData (the version available on Bioconductor), we collapse manufacturer probesets to official HGNC symbols using the Biomart database. Some probesets are mapped to multiple HGNC symbols in this database. For these probesets, we provide all the symbols. For example 220159_at maps to ABCA11P and ZNF721 and we provide ABCA11P///ZNF721 as probeset name. If you have an array of gene symbols for which you want to access the expression data, “ABCA11P” would not be found in curatedCRCData in this example. The following function will create a new ExpressionSet in which both ZNF721 and ABCA11P are features with identical expression data:

expandProbesets <- function (eset, sep = "///") 
{
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    eset <- eset[order(sapply(x, length)), ]
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    idx <- unlist(sapply(1:length(x), function(i) rep(i, length(x[[i]]))))
    xx <- !duplicated(unlist(x))
    idx <- idx[xx]
    x <- unlist(x)[xx]
    eset <- eset[idx, ]
    featureNames(eset) <- x
    eset
}

X <- TCGA.COAD_eset[head(grep("AA", featureNames(TCGA.COAD_eset))),]
exprs(X)[,1:3]

##         TCGA.AA.3520 TCGA.AA.3532 TCGA.AA.3553
## AAAS        -0.72125     -1.51150     -1.01250
## AACS         0.02225      0.82375     -0.08500
## AADAC        0.02775     -0.42900      1.12525
## AADACL1      1.06600      1.85550      0.92800
## AADACL2      0.08750     -0.61825     -0.47525
## AADACL3      0.38100      0.31100      0.33950

exprs(expandProbesets(X))[,1:3]

##         TCGA.AA.3520 TCGA.AA.3532 TCGA.AA.3553
## AAAS        -0.72125     -1.51150     -1.01250
## AACS         0.02225      0.82375     -0.08500
## AADAC        0.02775     -0.42900      1.12525
## AADACL1      1.06600      1.85550      0.92800
## AADACL2      0.08750     -0.61825     -0.47525
## AADACL3      0.38100      0.31100      0.33950

5 Appendix

5.1 Available Clinical Characteristics

Available clinical annotation. This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset.

Figure 1: Available clinical annotation
This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset.

5.2 Summarizing the List of ExpressionSets

This example provides a table summarizing the datasets being used, and is useful when publishing analyses based on curatedCRCData. First, define some useful functions for this purpose:

source(system.file("extdata", "summarizeEsets.R", package = "curatedCRCData"))

## Warning in min(which(km.fit$surv < 0.5)): no non-missing arguments to min;
## returning Inf
## Warning in min(which(km.fit$surv < 0.5)): no non-missing arguments to min;
## returning Inf

## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf
## Warning in min(which(reverse.km.fit$surv < 0.5)): no non-missing arguments to
## min; returning Inf

Optionally write this table to file, for example ( replace myfile <- tempfile() with something like myfile <- "nicetable.csv" )

(myfile <- tempfile())

## [1] "/tmp/RtmprX86Dc/file1075d247efaf59"

write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",")

Table 1: Datasets provided by curatedCRCData.
	Study	PMID	N samples	msi status	G	Platform	median.survival	median follow-up	percent censoring	binarized OS (long/short)
GSE11237	Auman, Mcleod 2008	18653328	23	0/0/23	3/16/4/0/0	Affymetrix HG_U95Av2	NA	NA	NA	NA
GSE12225.GPL3676	Lips, Morreau 2008	18959792	42	0/0/42	0/0/0/0/42	NA	NA	NA	NA	NA
GSE12945	Staub, Rosenthal 2009	19399471	62	0/0/62	0/31/31/0/0	Affymetrix HG-U133A	NA	49	81	NA
GSE13067	Jorissen and Sieber 2008	19088021	74	0/0/74	0/0/0/0/74	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE13294	Jorissen and Sieber 2008	19088021	155	0/0/155	0/0/0/0/155	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE14095	Watanabe, Hashimoto 2008	21680303	189	0/0/189	0/0/0/0/189	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE14333	Jorissen and Sieber 2008	19996206	290	0/0/290	0/0/0/0/290	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE16125.GPL5175	Reid, Pierotti 2009	19672874	36	0/0/36	0/0/0/0/36	Affymetrix HuEx-1_0-st	74	22	69	NA
GSE17536	Smith JJ,??Beauchamp RD 2009	19914252	177	0/0/177	16/134/27/0/0	Affymetrix HG-U133Plus2	133	60	59	NA
GSE17537	Smith JJ,??Beauchamp RD 2009	19914252	55	0/0/55	1/32/3/0/19	Affymetrix HG-U133Plus2	NA	57	64	NA
GSE17538.GPL570	Smith JJ,??Beauchamp RD 2009	19914252	232	0/0/232	17/166/30/0/19	Affymetrix HG-U133Plus2	133	59	60	NA
GSE18105	Matsuyama, Sugihara 2009	20162577	111	0/0/111	0/0/0/0/111	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE2109	expO, IGC 2005	PMID unknown	427	0/0/427	10/260/71/4/82	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE21510	Tsukamoto, Sugihara 2010	21270110	148	0/0/148	0/0/0/0/148	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE21815	Mori M,??Mimori K,??Yokobori T 2010	21862635	141	0/0/141	32/33/1/0/75	Agilent G4112F	NA	NA	NA	NA
GSE24549.GPL5175	Sveen A,????esen TH,??Rognum TO,??Lothe RA,??Skotheim RI 2011	21619627	83	0/0/83	0/0/0/0/83	Affymetrix HuEx-1_0-st	NA	NA	NA	NA
GSE24550.GPL5175	Sveen A,????esen TH,??Rognum TO,??Lothe RA,??Skotheim RI 2011	21619627	90	0/0/90	0/0/0/0/90	Affymetrix HuEx-1_0-st	NA	NA	NA	NA
GSE2630	Bandr??E,??Malumbres R,??Cubedo E,??Sola J,??Garc??Foncillas J,??Labarga A 2005	17390049	16	0/0/16	0/0/0/0/16	NA	NA	NA	NA	NA
GSE26682.GPL570	Vilar E,??Morgan MA 2011	21300766	156	0/0/156	0/0/0/0/156	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE26682.GPL96	Vilar E,??Morgan MA 2011	21300766	155	0/0/155	0/0/0/0/155	Affymetrix HG-U133A	NA	NA	NA	NA
GSE26906	Olschwang S 2011	22496922	90	0/0/90	0/0/0/0/90	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE27544	Bernal M,??Garc??Alcalde F,??Concha ????Blanco A,??Garrido F,??Ruiz-Cabello F 2011	PMID unknown	22	0/0/22	0/0/0/0/22	Affymetrix HT HG-U133+ PM	NA	NA	NA	NA
GSE28702	Tsuji 2011	22095227	83	0/0/83	62/16/5/0/0	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE3294	Bianchini 2005	16773188	24	0/0/24	10/11/3/0/0	UHN SS-Human 19Kv7	NA	NA	NA	NA
GSE33113	Medema JP,??Tanis PJ 2011	22496204	96	0/0/96	0/0/0/0/96	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE39582	Marisa, Boige 2012	23700391	566	0/0/566	0/0/0/0/566	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE3964	Graudens, Imbeaud 2006	16542501	29	0/0/29	0/0/0/0/29	NA	NA	NA	NA	NA
GSE4045	Laiho, Aaltonen 2007	16819509	37	0/0/37	4/28/4/0/1	Affymetrix HG-U133A	NA	NA	NA	NA
GSE4526	Watanabe T,??Kobunai T,??Toda E,??Oka T 2006	19016304	36	0/0/36	0/0/0/0/36	Affymetrix HG-U133Plus2	NA	NA	NA	NA
GSE45270	Medema JP,??de Sousa E Melo F,??Vermeulen L,??Jansen M,??Dekker E,??Van Noesel C,??Fessler E 2013	PMID unknown	13	0/0/13	0/0/0/0/13	Affymetrix HG-U133Plus2	NA	NA	NA	NA
TCGA.COAD	The Cancer Genome Atlas Network 2012	22810696	130	0/0/130	0/0/0/0/130	Agilent G4502A-07-3	17	25	17	NA
TCGA.READ	The Cancer Genome Atlas Network 2012	22810696	51	0/0/51	0/0/0/0/51	Agilent G4502A-07-3	10	NA	0	NA
TCGA.RNASeqV2.READ	The Cancer Genome Atlas Network 2012	22810696	6	0/0/6	0/0/0/0/6		41	NA	0	NA
TCGA.RNASeqV2	The Cancer Genome Atlas Network 2012	22810696	195	0/0/195	0/0/0/0/195		15	NA	14	NA

5.3 For non-R users

If you are not doing your analysis in R, and just want to get some data you have identified from the curatedCRCData manual, here is a simple way to do it. For one dataset:

library(curatedCRCData)
library(Biobase)
data(TCGA.COAD_eset)
write.csv(exprs(TCGA.COAD_eset), file="TCGA.COAD_eset_exprs.csv")
write.csv(pData(TCGA.COAD_eset), file="TCGA.COAD_eset_clindata.csv")

Or for several datasets:

data.to.fetch <- c("TCGA.COAD_eset", "GSE37317_eset")
for (onedata in data.to.fetch){
    print(paste("Fetching", onedata))
    data(list=onedata)
    write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep=""))
    write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep=""))
}

5.4 Session Info

## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.51            logging_0.10-108      survival_3.8-6       
##  [4] curatedCRCData_2.45.1 Biobase_2.73.1        BiocGenerics_0.59.0  
##  [7] generics_0.1.4        xtable_1.8-8          sva_3.61.0           
## [10] BiocParallel_1.47.0   genefilter_1.95.0     mgcv_1.9-4           
## [13] nlme_3.1-169          BiocStyle_2.41.0     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10           RSQLite_2.4.6         lattice_0.22-9       
##  [4] magrittr_2.0.5        digest_0.6.39         evaluate_1.0.5       
##  [7] grid_4.6.0            bookdown_0.46         fastmap_1.2.0        
## [10] blob_1.3.0            jsonlite_2.0.0        Matrix_1.7-5         
## [13] AnnotationDbi_1.75.0  limma_3.69.0          DBI_1.3.0            
## [16] tinytex_0.59          BiocManager_1.30.27   httr_1.4.8           
## [19] XML_3.99-0.23         Biostrings_2.81.0     codetools_0.2-20     
## [22] jquerylib_0.1.4       cli_3.6.6             rlang_1.2.0          
## [25] crayon_1.5.3          XVector_0.53.0        bit64_4.8.0          
## [28] splines_4.6.0         cachem_1.1.0          yaml_2.3.12          
## [31] otel_0.2.0            parallel_4.6.0        tools_4.6.0          
## [34] annotate_1.91.0       memoise_2.0.1         locfit_1.5-9.12      
## [37] vctrs_0.7.3           R6_2.6.1              png_0.1-9            
## [40] magick_2.9.1          matrixStats_1.5.0     stats4_4.6.0         
## [43] lifecycle_1.0.5       Seqinfo_1.3.0         KEGGREST_1.53.0      
## [46] edgeR_4.11.0          S4Vectors_0.51.1      IRanges_2.47.0       
## [49] bit_4.6.0             bslib_0.10.0          Rcpp_1.1.1-1.1       
## [52] statmod_1.5.1         xfun_0.57             MatrixGenerics_1.25.0
## [55] htmltools_0.5.9       rmarkdown_2.31        compiler_4.6.0

curatedCRCData: Clinically Annotated Data for the Colorectal Cancer Transcriptome

2013

Contents