The purpose of this package is to make it easy to query the Human Cell Atlas Data Portal via their data browser API. Visit the Human Cell Atlas for more information on the project.
Evaluate the following code chunk to install packages required for this vignette.
## install from Bioconductor if you haven't already
pkgs <- c("httr", "dplyr", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)
Load the packages into your R session.
library(httr)
library(dplyr)
library(LoomExperiment)
library(hca)
To illustrate use of this package, consider the task of downloading a ‘loom’ file summarizing single-cell gene expression observed in an HCA research project. This could be accomplished by visiting the HCA data portal (at https://data.humancellatlas.org/explore) in a web browser and selecting projects interactively, but it is valuable to accomplish the same goal in a reproducible, flexible, programmatic way. We will (1) discover projects available in the HCA Data Coordinating Center that have loom files; and (2) retrieve a file from the HCA and import the data into R as a ‘LoomExperiment’ object. For illustration, we focus on the first project whose title starts with ‘Single’; in the output shown here, that is ‘Single cell RNA sequencing of multiple myeloma II’.
Use projects() to retrieve the first 200 projects in the HCA’s
default catalog.
projects(size = 200)
## # A tibble: 200 × 14
##    projectId            projectTitle genusSpecies sampleEntityType specimenOrgan
##    <chr>                <chr>        <list>       <list>           <list>       
##  1 74b6d569-3b11-42ef-… 1.3 Million… <chr [1]>    <chr [1]>        <chr [1]>    
##  2 53c53cd4-8127-4e12-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  3 7027adc6-c9c9-46f3-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  4 94e4ee09-9b4b-410a-… A Human Liv… <chr [1]>    <chr [2]>        <chr [1]>    
##  5 c5b475f2-76b3-4a8e-… A Partial P… <chr [1]>    <chr [1]>        <chr [1]>    
##  6 60ea42e1-af49-42f5-… A Protocol … <chr [1]>    <chr [1]>        <chr [1]>    
##  7 ef1e3497-515e-4bbe-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [3]>    
##  8 9ac53858-606a-4b89-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  9 258c5e15-d125-4f2d-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
## 10 894ae6ac-5b48-41a8-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
## # ℹ 190 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
Use filters() to restrict the projects to just those that contain at
least one ‘loom’ file.
project_filter <- filters(fileFormat = list(is = "loom"))
project_tibble <- projects(project_filter)
project_tibble
## # A tibble: 79 × 14
##    projectId            projectTitle genusSpecies sampleEntityType specimenOrgan
##    <chr>                <chr>        <list>       <list>           <list>       
##  1 53c53cd4-8127-4e12-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  2 7027adc6-c9c9-46f3-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  3 c1810dbc-16d2-45c3-… A cell atla… <chr [2]>    <chr [1]>        <chr [2]>    
##  4 a9301beb-e9fa-42fe-… A human cel… <chr [1]>    <chr [1]>        <chr [14]>   
##  5 996120f9-e84f-409f-… A human sin… <chr [1]>    <chr [1]>        <chr [1]>    
##  6 842605c7-375a-47c5-… A single ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  7 cc95ff89-2e68-4a08-… A single ce… <chr [1]>    <chr [1]>        <chr [3]>    
##  8 a004b150-1c36-4af6-… A single-ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  9 4a95101c-9ffc-4f30-… A single-ce… <chr [1]>    <chr [1]>        <chr [4]>    
## 10 1cd1f41f-f81a-486b-… A single-ce… <chr [1]>    <chr [1]>        <chr [1]>    
## # ℹ 69 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
Use standard R commands to further filter projects to the one we are
interested in, with title starting with “Single…”. Extract the
unique projectId for the first project with this title.
project_tibble |>
    filter(startsWith(projectTitle, "Single")) |>
    head(1) |>
    t()
##                             [,1]                                               
## projectId                   "0c3b7785-f74d-4091-8616-a68757e4c2a8"             
## projectTitle                "Single cell RNA sequencing of multiple myeloma II"
## genusSpecies                "Homo sapiens"                                     
## sampleEntityType            "specimens"                                        
## specimenOrgan               "hematopoietic system"                             
## specimenOrganPart           character,2                                        
## selectedCellType            "plasma cell"                                      
## libraryConstructionApproach character,2                                        
## nucleicAcidSource           "single cell"                                      
## pairedEnd                   logical,2                                          
## workflow                    character,4                                        
## specimenDisease             "plasma cell myeloma"                              
## donorDisease                "plasma cell myeloma"                              
## developmentStage            "human adult stage"
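The extraction step that follows takes the first matching identifier. As a minimal sketch on made-up data (the table toy_projects and its values are assumptions for illustration, not HCA results), the same prefix match can be written in base R with a guard for the case where no title matches:

```r
## toy stand-in for the tibble returned by projects(); all values are invented
toy_projects <- data.frame(
    projectId = c("id-a", "id-b", "id-c"),
    projectTitle = c(
        "A Cellular Atlas",
        "Single cell study one",
        "Single cell study two"
    ),
    stringsAsFactors = FALSE
)

## keep rows whose title starts with the prefix of interest
is_match <- startsWith(toy_projects$projectTitle, "Single")
toy_matches <- toy_projects[is_match, ]

## fail early if nothing matched, rather than silently getting NA
stopifnot(nrow(toy_matches) >= 1L)

## several titles may share the prefix; take the first identifier
toy_first_id <- toy_matches$projectId[1]
toy_first_id
```

Because several titles can share a prefix, checking the number of matches before indexing with [1] makes the choice explicit.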
projectIds <-
    project_tibble |>
    filter(startsWith(projectTitle, "Single")) |>
    dplyr::pull(projectId)
projectId <- projectIds[1]
A project id can be used to discover the title or additional project information.
project_title(projectId)
## [1] "Single cell RNA sequencing of multiple myeloma II"
project_information(projectId)
## Title
##   Single cell RNA sequencing of multiple myeloma II
## Contributors (unknown order; any role)
##   Daeun Ryu, Hernandez Karina, Nguy Beagan, Parisa Nejad
## Description
##   To investigate the relationship between genetic and transcriptional
##   heterogeneity in a context of cancer progression, we devised a
##   computational approach called HoneyBADGER to identify copy number
##   variation and loss-of-heterozygosity in individual cells from
##   single-cell RNA-sequencing data. By combining allele frequency and
##   expression magnitude deviations, HoneyBADGER is able to infer the
##   presence of subclone-specific alterations in individual cells and
##   reconstruct subclonal architecture. Also HoneyBADGER to analyze
##   single cells from a progressive multiple myeloma (MM) patient to
##   identify major genetic subclones that exhibit distinct
##   transcriptional signatures relevant to cancer progression.
## DOI
##   10.1101/gr.228080.117 10.1158/1078-0432.CCR-19-0694
## URL
##   https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6071640/
##   https://clincancerres.aacrjournals.org/content/26/4/935.long#sec-6
## Project
##   https://data.humancellatlas.org/explore/projects/0c3b7785-f74d-4091-8616-a68757e4c2a8
files() retrieves (the first 1000) files from the Human Cell Atlas
data portal. Construct a filter to restrict the files to loom files
from the project we are interested in.
file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "loom")
)
# only the two smallest files
file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc")
file_tibble
## # A tibble: 2 × 8
##   fileId            name  fileFormat   size version projectTitle projectId url  
##   <chr>             <chr> <chr>       <int> <chr>   <chr>        <chr>     <chr>
## 1 fe214fea-cc68-56… bone… loom       4.04e7 2021-0… Single cell… 0c3b7785… http…
## 2 3014ec47-1399-57… Bone… loom       6.35e7 2021-1… Single cell… 0c3b7785… http…
files_download() will download one or more files (one for each row)
in file_tibble. The download is more complicated than simply
following the url column of file_tibble, so it is not possible to
simply copy the url into a browser. We’ll download the files and then
immediately import the first one into R.
file_locations <- file_tibble |> files_download()
LoomExperiment::import(unname(file_locations[1]),
                       type = "SingleCellLoomExperiment")
## class: SingleCellLoomExperiment 
## dim: 58347 3762 
## metadata(15): last_modified CreationDate ...
##   project.provenance.document_id specimen_from_organism.organ
## assays(1): matrix
## rownames: NULL
## rowData names(29): Gene antisense_reads ... reads_per_molecule
##   spliced_reads
## colnames: NULL
## colData names(43): CellID antisense_reads ... reads_unmapped
##   spliced_reads
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(0): NULL
Note that files_download() uses BiocFileCache (https://bioconductor.org/packages/BiocFileCache),
so individual files are only downloaded once.
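The idea of downloading each file only once can be sketched with a toy cache in base R. This is not how BiocFileCache is implemented; cached_fetch() and fake_download() are hypothetical helpers invented for this illustration, and the ‘download’ just writes a line of text.

```r
## toy cache directory, unique to this run
toy_cache <- tempfile("toy_hca_cache_")
dir.create(toy_cache)

## counts how often the (pretend) download actually runs
n_downloads <- 0L

## hypothetical stand-in for a real download
fake_download <- function() {
    n_downloads <<- n_downloads + 1L
    "pretend file contents"
}

## hypothetical helper: fetch only when the cached file is missing
cached_fetch <- function(key, fetch, dir = toy_cache) {
    path <- file.path(dir, key)
    if (!file.exists(path)) {
        writeLines(fetch(), path)
    }
    path
}

path_1 <- cached_fetch("example.loom", fake_download)
path_2 <- cached_fetch("example.loom", fake_download)  # served from the cache
n_downloads
```

Both calls return the same local path, and the pretend download runs only for the first one.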
h5ad files
This example walks through the process of file discovery and retrieval
in a little more detail, using h5ad files created by the Python
AnnData analysis software and available for some experiments in the
default catalog.
The first challenge is to understand what file formats are available from the HCA. Obtain a tibble describing the ‘facets’ of the data, the number of terms used in each facet, and the number of distinct values used to describe projects.
projects_facets()
## # A tibble: 38 × 3
##    facet              n_terms n_values
##    <chr>                <int>    <int>
##  1 accessible               1      450
##  2 assayType                2      450
##  3 biologicalSex            5      781
##  4 bionetworkName           7      451
##  5 cellLineType             6      466
##  6 contactName           5632     6988
##  7 contentDescription      70     1828
##  8 developmentStage       171      997
##  9 donorDisease           465     1155
## 10 effectiveOrgan         164      802
## # ℹ 28 more rows
Note the fileFormat facet, and repeat projects_facets() to
discover detail about available file formats.
projects_facets("fileFormat")
## # A tibble: 78 × 3
##    facet      term     count
##    <chr>      <chr>    <int>
##  1 fileFormat fastq.gz   330
##  2 fileFormat xlsx       322
##  3 fileFormat tsv.gz      91
##  4 fileFormat tar         89
##  5 fileFormat loom        79
##  6 fileFormat bam         77
##  7 fileFormat mtx.gz      77
##  8 fileFormat csv.gz      72
##  9 fileFormat txt.gz      67
## 10 fileFormat csv         45
## # ℹ 68 more rows
Note that the h5ad file format is among the available terms, although it does not appear in the ten rows printed above. Use this as a
filter to discover relevant projects.
filters <- filters(fileFormat = list(is = "h5ad"))
projects(filters)
## # A tibble: 36 × 14
##    projectId            projectTitle genusSpecies sampleEntityType specimenOrgan
##    <chr>                <chr>        <list>       <list>           <list>       
##  1 cdabcf0b-7602-4abf-… A blood atl… <chr [1]>    <chr [1]>        <chr [1]>    
##  2 c1810dbc-16d2-45c3-… A cell atla… <chr [2]>    <chr [1]>        <chr [2]>    
##  3 c0518445-3b3b-49c6-… A cellular … <chr [1]>    <chr [1]>        <chr [2]>    
##  4 2fe3c60b-ac1a-4c61-… A human fet… <chr [1]>    <chr [2]>        <chr [2]>    
##  5 73769e0a-5fcd-41f4-… A proximal-… <chr [1]>    <chr [1]>        <chr [2]>    
##  6 cc95ff89-2e68-4a08-… A single ce… <chr [1]>    <chr [1]>        <chr [3]>    
##  7 957261f7-2bd6-4358-… A spatially… <chr [1]>    <chr [1]>        <chr [1]>    
##  8 1dddae6e-3753-48af-… Cell Types … <chr [1]>    <chr [2]>        <chr [2]>    
##  9 ad98d3cd-26fb-4ee3-… Cells of th… <chr [1]>    <chr [1]>        <chr [1]>    
## 10 fde199d2-a841-4ed1-… Cells of th… <chr [1]>    <chr [1]>        <chr [4]>    
## # ℹ 26 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
The default tibble produced by projects() contains only some of the
information available; the full response is much richer.
To obtain a tibble with an expanded set of columns, set the as parameter to "tibble_expanded".
# an expanded set of columns for the first 4 projects
projects(as = 'tibble_expanded', size = 4)
## # A tibble: 4 × 126
##   projectId  cellSuspensions.orga…¹ cellSuspensions.organ cellSuspensions.sele…²
##   <chr>      <list>                 <chr>                 <list>                
## 1 74b6d569-… <chr [1]>              brain                 <chr [1]>             
## 2 53c53cd4-… <chr [2]>              prostate gland        <chr [7]>             
## 3 7027adc6-… <chr [0]>              heart                 <chr [0]>             
## 4 94e4ee09-… <chr [0]>              liver                 <chr [0]>             
## # ℹ abbreviated names: ¹cellSuspensions.organPart,
## #   ²cellSuspensions.selectedCellType
## # ℹ 122 more variables: cellSuspensions.totalCells <int>,
## #   cellSuspensions.totalCellsRedundant <int>,
## #   dates.aggregateLastModifiedDate <chr>, dates.aggregateSubmissionDate <chr>,
## #   dates.aggregateUpdateDate <chr>, dates.lastModifiedDate <chr>,
## #   dates.submissionDate <chr>, dates.updateDate <chr>, …
In the next sections, we’ll cover other options for the as parameter, and the data formats
they return.
projects() as an R list
Instead of retrieving the result of projects() as a tibble, retrieve
it as a ‘list-of-lists’.
projects_list <- projects(size = 200, as = "list")
This is a complicated structure. We will use lengths(), names(),
and standard R list selection operations to navigate this a bit. At
the top level there are three elements.
lengths(projects_list)
## pagination termFacets       hits 
##          8         39        200
hits represents each project as a list, e.g.,
lengths(projects_list$hits[[1]])
##         protocols           entryId           sources          projects 
##                 2                 1                 1                 1 
##           samples         specimens         cellLines    donorOrganisms 
##                 1                 1                 0                 1 
##         organoids   cellSuspensions             dates fileTypeSummaries 
##                 0                 1                 1                 2
shows that there are 12 different ways in which the first project is described. Each component is itself a list-of-lists, e.g.,
lengths(projects_list$hits[[1]]$projects[[1]])
##            projectId         projectTitle     projectShortname 
##                    1                    1                    1 
##           laboratory   estimatedCellCount isTissueAtlasProject 
##                    1                    1                    1 
##          tissueAtlas       bionetworkName   projectDescription 
##                    0                    1                    1 
##         contributors         publications   supplementaryLinks 
##                    6                    1                    1 
##             matrices  contributedAnalyses           accessions 
##                    0                    1                    3 
##           accessible 
##                    1
projects_list$hits[[1]]$projects[[1]]$projectTitle
## [1] "1.3 Million Brain Cells from E18 Mice"
One can use standard R commands to navigate this data structure, and
to, e.g., extract the projectTitle of each project.
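As a minimal sketch of that kind of navigation, the chunk below builds a small made-up list with the same hits / projects / projectTitle nesting seen above and pulls out one title per hit with vapply(). The object toy_list and its values are assumptions for illustration; the real projects_list has many more fields.

```r
## toy list-of-lists mimicking the nesting shown above; values are invented
toy_list <- list(
    hits = list(
        list(
            entryId = "id-1",
            projects = list(list(projectTitle = "Title one"))
        ),
        list(
            entryId = "id-2",
            projects = list(list(projectTitle = "Title two"))
        )
    )
)

## visit each hit and extract the title of its first 'projects' entry;
## character(1) asserts that every hit yields exactly one string
toy_titles <- vapply(
    toy_list$hits,
    function(hit) hit$projects[[1]]$projectTitle,
    character(1)
)
toy_titles
```

vapply() is used rather than sapply() so that a hit with a missing or non-scalar title raises an error instead of silently changing the shape of the result.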
projects() as an lol
Use as = "lol" to create a more convenient way to select, filter and
extract elements from the list-of-lists returned by projects().
lol <- projects(size = 200, as = "lol")
lol
## # class: lol_hca lol
## # number of distinct paths: 24035
## # total number of elements: 174617
## # number of leaf paths: 18601
## # number of leaf elements: 138362
## # lol_path():
## # A tibble: 24,035 × 3
##    path                                     n is_leaf
##    <chr>                                <int> <lgl>  
##  1 hits                                     1 FALSE  
##  2 hits[*]                                200 FALSE  
##  3 hits[*].cellLines                      200 FALSE  
##  4 hits[*].cellLines[*]                    31 FALSE  
##  5 hits[*].cellLines[*].cellLineType       31 FALSE  
##  6 hits[*].cellLines[*].cellLineType[*]    41 TRUE   
##  7 hits[*].cellLines[*].id                 31 FALSE  
##  8 hits[*].cellLines[*].id[*]             132 TRUE   
##  9 hits[*].cellLines[*].modelOrgan         31 FALSE  
## 10 hits[*].cellLines[*].modelOrgan[*]      43 TRUE   
## # ℹ 24,025 more rows
Use lol_select() to restrict the lol to particular paths, and
lol_filter() to filter results to paths that are leaves, or with
specific numbers of entries.
lol_select(lol, "hits[*].projects[*]")
## # class: lol_hca lol
## # number of distinct paths: 23910
## # total number of elements: 121728
## # number of leaf paths: 18540
## # number of leaf elements: 100045
## # lol_path():
## # A tibble: 23,910 × 3
##    path                                                         n is_leaf
##    <chr>                                                    <int> <lgl>  
##  1 hits[*].projects[*]                                        200 FALSE  
##  2 hits[*].projects[*].accessible                             200 TRUE   
##  3 hits[*].projects[*].accessions                             200 FALSE  
##  4 hits[*].projects[*].accessions[*]                          578 FALSE  
##  5 hits[*].projects[*].accessions[*].accession                578 TRUE   
##  6 hits[*].projects[*].accessions[*].namespace                578 TRUE   
##  7 hits[*].projects[*].bionetworkName                         200 FALSE  
##  8 hits[*].projects[*].bionetworkName[*]                      200 TRUE   
##  9 hits[*].projects[*].contributedAnalyses                    200 FALSE  
## 10 hits[*].projects[*].contributedAnalyses.developmentStage     2 FALSE  
## # ℹ 23,900 more rows
lol_select(lol, "hits[*].projects[*]") |>
    lol_filter(n == 44, is_leaf)
## # class: lol_hca lol
## # number of distinct paths: 0
## # total number of elements: 0
## # number of leaf paths: 0
## # number of leaf elements: 0
## # lol_path():
## # A tibble: 0 × 3
## # ℹ 3 variables: path <chr>, n <int>, is_leaf <lgl>
lol_pull() extracts a path from the lol as a vector; lol_lpull()
extracts paths as lists.
titles <- lol_pull(lol, "hits[*].projects[*].projectTitle")
length(titles)
## [1] 200
head(titles, 2)
## [1] "1.3 Million Brain Cells from E18 Mice"                                      
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
projects() tibbles with specific columns
The path or its abbreviation can be used to specify the columns of
the tibble to be returned by the projects() query.
Here we retrieve additional details of donor count and total cells by adding appropriate path abbreviations to a named character vector. Names on the character vector can be used to rename the path more concisely, but the paths must uniquely identify elements in the list-of-lists.
columns <- c(
    projectId = "hits[*].entryId",
    projectTitle = "hits[*].projects[*].projectTitle",
    genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]",
    donorCount = "hits[*].donorOrganisms[*].donorCount",
    cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]",
    totalCells = "hits[*].cellSuspensions[*].totalCells"
)
projects <- projects(filters, columns = columns)
projects
## # A tibble: 36 × 6
##    projectId          projectTitle genusSpecies donorCount cellSuspensions.organ
##    <chr>              <chr>        <list>            <int> <list>               
##  1 cdabcf0b-7602-4ab… A blood atl… <chr [1]>           124 <chr [1]>            
##  2 c1810dbc-16d2-45c… A cell atla… <chr [2]>            24 <chr [2]>            
##  3 c0518445-3b3b-49c… A cellular … <chr [1]>            17 <chr [2]>            
##  4 2fe3c60b-ac1a-4c6… A human fet… <chr [1]>            38 <chr [2]>            
##  5 73769e0a-5fcd-41f… A proximal-… <chr [1]>             3 <chr [2]>            
##  6 cc95ff89-2e68-4a0… A single ce… <chr [1]>            28 <chr [3]>            
##  7 957261f7-2bd6-435… A spatially… <chr [1]>            13 <chr [1]>            
##  8 1dddae6e-3753-48a… Cell Types … <chr [1]>             6 <chr [1]>            
##  9 ad98d3cd-26fb-4ee… Cells of th… <chr [1]>            14 <chr [1]>            
## 10 fde199d2-a841-4ed… Cells of th… <chr [1]>            12 <chr [4]>            
## # ℹ 26 more rows
## # ℹ 1 more variable: totalCells <list>
Note that the cellSuspensions.organ and totalCells columns can have more than
one entry per project.
projects |>
   select(projectId, cellSuspensions.organ, totalCells)
## # A tibble: 36 × 3
##    projectId                            cellSuspensions.organ totalCells
##    <chr>                                <list>                <list>    
##  1 cdabcf0b-7602-4abf-9afb-3b410e545703 <chr [1]>             <int [0]> 
##  2 c1810dbc-16d2-45c3-b45e-3e675f88d87b <chr [2]>             <int [2]> 
##  3 c0518445-3b3b-49c6-b8fc-c41daa4eacba <chr [2]>             <int [2]> 
##  4 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 <chr [2]>             <int [1]> 
##  5 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 <chr [2]>             <int [0]> 
##  6 cc95ff89-2e68-4a08-a234-480eca21ce79 <chr [3]>             <int [3]> 
##  7 957261f7-2bd6-4358-a6ed-24ee080d5cfc <chr [1]>             <int [0]> 
##  8 1dddae6e-3753-48af-b20e-fa22abad125d <chr [1]>             <int [0]> 
##  9 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 <chr [1]>             <int [1]> 
## 10 fde199d2-a841-4ed1-aa65-b9e0af8969b1 <chr [4]>             <int [0]> 
## # ℹ 26 more rows
In this case, the mapping between cellSuspensions.organ and totalCells
is clear, but in general more refined navigation of the lol structure may be
necessary.
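Why matching lengths matter can be seen on made-up data. The sketch below uses base R rather than tidyr; toy_tbl and its values are assumptions for illustration only. Rows whose two list-columns differ in length are dropped, and the remaining rows are expanded to one row per organ.

```r
## toy table with two list-columns; all values are invented
toy_tbl <- data.frame(
    projectId = c("p1", "p2", "p3"),
    stringsAsFactors = FALSE
)
toy_tbl$organ <- list(c("thymus", "colon"), c("lung", "nose"), "heart")
toy_tbl$totalCells <- list(c(10L, 20L), 30L, integer(0))

## organs and cell counts can only be paired when the lengths agree
keep <- lengths(toy_tbl$organ) == lengths(toy_tbl$totalCells)
paired <- toy_tbl[keep, ]

## expand to one row per (project, organ) pair
toy_long <- data.frame(
    projectId = rep(paired$projectId, lengths(paired$organ)),
    organ = unlist(paired$organ),
    totalCells = unlist(paired$totalCells),
    stringsAsFactors = FALSE
)
toy_long
```

Only the first toy project survives the length check; for the other two there is no unambiguous way to decide which count belongs to which organ.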
projects |>
    select(projectId, cellSuspensions.organ, totalCells) |>
    filter(
        ## 2023-06-06 two projects have different 'organ' and
        ## 'totalCells' lengths, causing problems with `unnest()`
        lengths(cellSuspensions.organ) == lengths(totalCells)
    ) |>
    tidyr::unnest(c("cellSuspensions.organ", "totalCells"))
## # A tibble: 25 × 3
##    projectId                            cellSuspensions.organ totalCells
##    <chr>                                <chr>                      <int>
##  1 c1810dbc-16d2-45c3-b45e-3e675f88d87b thymus                    456000
##  2 c1810dbc-16d2-45c3-b45e-3e675f88d87b colon                      16000
##  3 c0518445-3b3b-49c6-b8fc-c41daa4eacba lung                       40200
##  4 c0518445-3b3b-49c6-b8fc-c41daa4eacba nose                        7087
##  5 cc95ff89-2e68-4a08-a234-480eca21ce79 immune system             274182
##  6 cc95ff89-2e68-4a08-a234-480eca21ce79 blood                    1615910
##  7 cc95ff89-2e68-4a08-a234-480eca21ce79 bone marrow               600000
##  8 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 heart                     791000
##  9 83f5188e-3bf7-4956-9544-cea4f8997756 immune organ                 381
## 10 83f5188e-3bf7-4956-9544-cea4f8997756 large intestine             1141
## # ℹ 15 more rows
Select the following entry, augment the filter, and query available files.
projects |>
    filter(startsWith(projectTitle, "Reconstruct")) |>
    glimpse()
## Rows: 1
## Columns: 6
## $ projectId             <chr> "f83165c5-e2ea-4d15-a5cf-33f3550bffde"
## $ projectTitle          <chr> "Reconstructing the human first trimester fetal-…
## $ genusSpecies          <list> "Homo sapiens"
## $ donorCount            <int> 16
## $ cellSuspensions.organ <list> <"blood", "decidua", "placenta">
## $ totalCells            <list> <>
This approach can be used to customize the tibbles returned by the
other main functions in the package, files(), samples(), and
bundles().
The relevant file can be selected and downloaded using the technique in the first example.
filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) |>
    head(1)            # only first file, for demonstration
files |> t()
##              [,1]                                                                                                                                                      
## fileId       "6d4fedcf-857d-5fbb-9928-8b9605500a69"                                                                                                                    
## name         "vento18_ss2.processed.h5ad"                                                                                                                              
## fileFormat   "h5ad"                                                                                                                                                    
## size         "82121633"                                                                                                                                                
## version      "2021-02-10T16:56:40.419579Z"                                                                                                                             
## projectTitle "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics"                                                     
## projectId    "f83165c5-e2ea-4d15-a5cf-33f3550bffde"                                                                                                                    
## url          "https://service.azul.data.humancellatlas.org/repository/files/6d4fedcf-857d-5fbb-9928-8b9605500a69?catalog=dcp37&version=2021-02-10T16%3A56%3A40.419579Z"
file_path <- files_download(files)
"h5ad" files can be read as SingleCellExperiment objects using the
zellkonverter package.
## don't want large amount of data read from disk
sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
sce
project_filter <- filters(fileFormat = list(is = "csv"))
project_tibble <- projects(project_filter)
project_tibble |>
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    )
## # A tibble: 1 × 14
##   projectId             projectTitle genusSpecies sampleEntityType specimenOrgan
##   <chr>                 <chr>        <list>       <list>           <list>       
## 1 f83165c5-e2ea-4d15-a… Reconstruct… <chr [1]>    <chr [1]>        <chr [3]>    
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
projectId <-
    project_tibble |>
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    ) |>
    pull(projectId)
file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "csv")
)
## first 4 files will be returned
file_tibble <- files(file_filter, size = 4)
file_tibble |>
    files_download()
## 7f9a181e-24c5-5462-b308-7fef5b1bda2a-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/1ae384482950ad_1ae384482950ad.csv" 
## d04c6e3c-b740-5586-8420-4480a1b5706c-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/1ae38442f8e5c7_1ae38442f8e5c7.csv" 
## d30ffc0b-7d6e-5b85-aff9-21ec69663a81-2021-02-10T16:56:40.419579Z 
##   "/home/biocbuild/.cache/R/hca/1ae384164cd3d_1ae384164cd3d.csv" 
## e1517725-01b0-5346-9788-afca63e9993a-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/1ae3843e64e746_1ae3843e64e746.csv"
The files(), bundles(), and samples() functions can all return many thousands
of results. It is necessary to ‘page’ through these to see all of
them. We illustrate pagination with projects(), retrieving only 30 projects at a time.
Pagination works for the default tibble output:
page_1_tbl <- projects(size = 30)
page_1_tbl
## # A tibble: 30 × 14
##    projectId            projectTitle genusSpecies sampleEntityType specimenOrgan
##    <chr>                <chr>        <list>       <list>           <list>       
##  1 74b6d569-3b11-42ef-… 1.3 Million… <chr [1]>    <chr [1]>        <chr [1]>    
##  2 53c53cd4-8127-4e12-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  3 7027adc6-c9c9-46f3-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  4 94e4ee09-9b4b-410a-… A Human Liv… <chr [1]>    <chr [2]>        <chr [1]>    
##  5 c5b475f2-76b3-4a8e-… A Partial P… <chr [1]>    <chr [1]>        <chr [1]>    
##  6 60ea42e1-af49-42f5-… A Protocol … <chr [1]>    <chr [1]>        <chr [1]>    
##  7 ef1e3497-515e-4bbe-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [3]>    
##  8 9ac53858-606a-4b89-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  9 258c5e15-d125-4f2d-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
## 10 894ae6ac-5b48-41a8-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
## # ℹ 20 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
page_2_tbl <- page_1_tbl |> hca_next()
page_2_tbl
## # A tibble: 30 × 14
##    projectId            projectTitle genusSpecies sampleEntityType specimenOrgan
##    <chr>                <chr>        <list>       <list>           <list>       
##  1 9f17ed7d-9325-4723-… A single ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  2 842605c7-375a-47c5-… A single ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  3 cc95ff89-2e68-4a08-… A single ce… <chr [1]>    <chr [1]>        <chr [3]>    
##  4 a62dae2e-cd69-4d5c-… A single-ce… <chr [2]>    <chr [1]>        <chr [6]>    
##  5 6663070f-fd8b-41a9-… A single-ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  6 c31fa434-c9ed-4263-… A single-ce… <chr [1]>    <chr [1]>        <chr [18]>   
##  7 dcc28fb3-7bab-48ce-… A single-ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  8 a004b150-1c36-4af6-… A single-ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  9 1defdada-a365-44ad-… A single-ce… <chr [1]>    <chr [1]>        <chr [1]>    
## 10 4a95101c-9ffc-4f30-… A single-ce… <chr [1]>    <chr [1]>        <chr [4]>    
## # ℹ 20 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
## should be identical to page_1_tbl
page_2_tbl |> hca_prev()
## # A tibble: 30 × 14
##    projectId            projectTitle genusSpecies sampleEntityType specimenOrgan
##    <chr>                <chr>        <list>       <list>           <list>       
##  1 74b6d569-3b11-42ef-… 1.3 Million… <chr [1]>    <chr [1]>        <chr [1]>    
##  2 53c53cd4-8127-4e12-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  3 7027adc6-c9c9-46f3-… A Cellular … <chr [1]>    <chr [1]>        <chr [1]>    
##  4 94e4ee09-9b4b-410a-… A Human Liv… <chr [1]>    <chr [2]>        <chr [1]>    
##  5 c5b475f2-76b3-4a8e-… A Partial P… <chr [1]>    <chr [1]>        <chr [1]>    
##  6 60ea42e1-af49-42f5-… A Protocol … <chr [1]>    <chr [1]>        <chr [1]>    
##  7 ef1e3497-515e-4bbe-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [3]>    
##  8 9ac53858-606a-4b89-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
##  9 258c5e15-d125-4f2d-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
## 10 894ae6ac-5b48-41a8-… A Single-Ce… <chr [1]>    <chr [1]>        <chr [1]>    
## # ℹ 20 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <list>, workflow <list>, specimenDisease <list>,
## #   donorDisease <list>, developmentStage <list>
Pagination also works for the lol objects.
page_1_lol <- projects(size = 5, as = "lol")
page_1_lol |>
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"                                        
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"  
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."                     
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"    
## [5] "A Partial Picture of the Single-Cell Transcriptomics of Human IgA Nephropathy"
page_2_lol <-
    page_1_lol |>
    hca_next()
page_2_lol |>
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "A Protocol for Revealing Oral Neutrophil Heterogeneity by Single-Cell Immune Profiling in Human Saliva"                                  
## [2] "A Single-Cell Atlas of the Human Healthy Airways"                                                                                        
## [3] "A Single-Cell Characterization of Human Post-implantation Embryos Cultured In Vitro Delineates Morphogenesis in Primary Syncytialization"
## [4] "A Single-Cell Transcriptome Atlas of Glia Diversity in the Human Hippocampus across the Lifespan and in Alzheimer’s Disease"             
## [5] "A Single-Cell Transcriptome Atlas of the Human Pancreas."
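The same calls can be repeated to walk through several pages. The following is a minimal sketch rather than a recommended workflow: the number of pages (`n_pages`) and the variable names are illustrative choices, and the code assumes the catalog holds at least three pages of five projects.

```r
library(hca)

## path to project titles, as used with lol_pull() above
title_path <- "hits[*].projects[*].projectTitle"

## visit the first three pages, five projects per page
n_pages <- 3L
page <- projects(size = 5, as = "lol")
titles <- lol_pull(page, title_path)
for (i in seq_len(n_pages - 1L)) {
    ## hca_next() returns the following page as another lol object
    page <- hca_next(page)
    titles <- c(titles, lol_pull(page, title_path))
}
titles
```

Each iteration issues one request to the HCA service, so a small page count keeps the example quick.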
## should be identical to page_1_lol
page_2_lol |>
    hca_prev() |>
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"                                        
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"  
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."                     
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"    
## [5] "A Partial Picture of the Single-Cell Transcriptomics of Human IgA Nephropathy"
Much like projects() and files(), samples() and bundles() accept a
filter object and additional criteria, and return data organized by
sample and by bundle, respectively.
heart_filters <- filters(organ = list(is = "heart"))
heart_samples <- samples(filters = heart_filters, size = 4)
heart_samples
## # A tibble: 4 × 6
##   entryId                         projectTitle genusSpecies disease format count
##   <chr>                           <chr>        <chr>        <chr>   <list> <lis>
## 1 012c52ff-4770-4c0c-8c2e-c348da… A Cellular … Mus musculus normal  <chr>  <int>
## 2 035db5b9-a219-4df8-bfc9-117cd0… A Cellular … Mus musculus normal  <chr>  <int>
## 3 09e425f7-22d7-487e-b78b-78b449… A Cellular … Mus musculus normal  <chr>  <int>
## 4 2273e44d-9fbc-4c13-8cb3-3caf8a… A Cellular … Mus musculus normal  <chr>  <int>
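Filters can combine more than one facet, which is one way to move from the mouse samples shown above to human samples. This is a minimal sketch, assuming that `genusSpecies` is accepted as a filter facet in the catalog in use and that "Homo sapiens" is one of its terms; both can be checked with facets(). The names `human_heart_filters` and `human_heart_samples` are illustrative.

```r
library(hca)

## heart samples restricted to human donors; facet names and terms are
## assumptions to verify against facets() for the catalog in use
human_heart_filters <- filters(
    organ = list(is = "heart"),
    genusSpecies = list(is = "Homo sapiens")
)
human_heart_samples <- samples(filters = human_heart_filters, size = 4)
human_heart_samples
```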
heart_bundles <- bundles(filters = heart_filters, size = 4)
heart_bundles
## # A tibble: 4 × 6
##   projectTitle               genusSpecies samples files bundleUuid bundleVersion
##   <chr>                      <chr>        <list>  <lis> <chr>      <chr>        
## 1 A Cellular Atlas of Pitx2… Mus musculus <chr>   <chr> 0d391bd1-… 2021-02-26T0…
## 2 A Cellular Atlas of Pitx2… Mus musculus <chr>   <chr> 165a2df1-… 2021-02-26T0…
## 3 A Cellular Atlas of Pitx2… Mus musculus <chr>   <chr> 166c1b1a-… 2023-07-19T1…
## 4 A Cellular Atlas of Pitx2… Mus musculus <chr>   <chr> 18bad6b1-… 2021-02-26T0…
HCA experiments are organized into catalogs, each of which can be summarized
with the hca::summary() function.
heart_filters <- filters(organ = list(is = "heart"))
hca::summary(filters = heart_filters, type = "fileTypeSummaries")
## # A tibble: 33 × 3
##    format   count totalSize
##    <chr>    <int>     <dbl>
##  1 fastq.gz 30086   2.66e13
##  2 fastq      316   6.53e11
##  3 png        270   8.96e 6
##  4 tsv.gz     188   6.08e10
##  5 h5         180   1.75e10
##  6 loom       169   3.56e11
##  7 bam        164   3.28e12
##  8 zip        148   9.46e 9
##  9 csv         89   1.15e 8
## 10 mtx.gz      67   1.61e10
## # ℹ 23 more rows
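The totalSize column is easier to read after rescaling. The sketch below copies the first three rows of the printed summary into a small tibble so that it runs without a network connection. It assumes totalSize is reported in bytes, and the column names `totalSizeTB` and `meanSizeMB` are illustrative.

```r
library(dplyr)

## first three rows copied from the printed summary above
file_types <- tibble::tibble(
    format = c("fastq.gz", "fastq", "png"),
    count = c(30086L, 316L, 270L),
    totalSize = c(2.66e13, 6.53e11, 8.96e6)
)

sizes <-
    file_types |>
    mutate(
        ## assumes totalSize is in bytes
        totalSizeTB = totalSize / 1e12,
        ## average size of one file of this format, in megabytes
        meanSizeMB = totalSize / count / 1e6
    )
sizes
```

The same mutate() call can be applied directly to the tibble returned by hca::summary().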
first_catalog <- catalogs()[1]
hca::summary(type = "overview", catalog = first_catalog)
## # A tibble: 7 × 2
##   name            value
##   <chr>           <dbl>
## 1 projectCount  4.5 e 2
## 2 specimenCount 2.17e 4
## 3 speciesCount  3   e 0
## 4 fileCount     5.20e 5
## 5 totalFileSize 3.06e14
## 6 donorCount    8.63e 3
## 7 labCount      7.49e 2
Each project, file, sample, and bundle has its own unique ID which, in conjunction with its catalog, can be used to identify it uniquely.
heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)
heart_projects
## # A tibble: 4 × 14
##   projectId             projectTitle genusSpecies sampleEntityType specimenOrgan
##   <chr>                 <chr>        <chr>        <list>           <list>       
## 1 7027adc6-c9c9-46f3-8… A Cellular … Mus musculus <chr [1]>        <chr [1]>    
## 2 a9301beb-e9fa-42fe-b… A human cel… Homo sapiens <chr [1]>        <chr [14]>   
## 3 902dc043-7091-445c-9… A human cel… Homo sapiens <chr [1]>        <chr [1]>    
## 4 2fe3c60b-ac1a-4c61-9… A human fet… Homo sapiens <chr [2]>        <chr [2]>    
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <lgl>,
## #   libraryConstructionApproach <list>, nucleicAcidSource <list>,
## #   pairedEnd <lgl>, workflow <list>, specimenDisease <chr>,
## #   donorDisease <chr>, developmentStage <list>
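A detail lookup needs exactly one ID, so it can help to check the selection before querying. A title filter returns no rows when the project of interest is not among the rows retrieved. This is a minimal sketch: the title prefix is the one used in this vignette, while falling back to the first retrieved project and the names `candidate_ids` and `project_id` are illustrative choices, not part of the package.

```r
library(hca)
library(dplyr)

heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)

## IDs of retrieved projects whose title matches the prefix; this is
## empty when the project is not among the rows retrieved
candidate_ids <-
    heart_projects |>
    filter(startsWith(projectTitle, "Cells of the adult human")) |>
    pull(projectId)

## fall back to the first retrieved project when nothing matches
## (illustrative choice), so that exactly one ID is used
project_id <-
    if (length(candidate_ids) >= 1L) {
        candidate_ids[[1]]
    } else {
        heart_projects |> slice(1) |> pull(projectId)
    }
stopifnot(is.character(project_id), length(project_id) == 1L)
```

Increasing `size` in the call to projects() is another way to make a title match more likely.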
projectId <-
    heart_projects |>
    filter(
        startsWith(
            projectTitle,
            "Cells of the adult human"
        )
    ) |>
    dplyr::pull(projectId)
result <- projects_detail(uuid = projectId)
The result is a list containing three elements: pagination, with
information for navigating to the next or previous project (in
alphabetical order, by default); termFacets, the filters available;
and hits, the details of the project.
names(result)
## [1] "pagination" "termFacets" "hits"
As mentioned above, the hits are a complicated list-of-lists
structure. A very convenient way to explore this structure visually
is with listviewer::jsonedit(result). Selecting individual elements is
possible using the lol interface; an alternative is
cellxgenedp::jmespath().
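The lol interface can be sketched as follows. To keep the example self-contained it queries five projects as a lol object instead of reusing `result`; the path is the one used with lol_pull() earlier in this vignette, and the name `project_lol` is illustrative.

```r
library(hca)

## five projects from the default catalog as a list-of-lists
project_lol <- projects(size = 5, as = "lol")

## summarize only the leaf paths, which hold the actual values
project_lol |>
    lol_filter(is_leaf) |>
    lol_path()

## extract every element matching one path as a vector
titles <-
    project_lol |>
    lol_pull("hits[*].projects[*].projectTitle")
titles
```

The tibble returned by lol_path() is a practical place to look up the exact path string to pass to lol_pull().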
lol(result)
## # class: lol
## # number of distinct paths: 679
## # total number of elements: 45803
## # number of leaf paths: 400
## # number of leaf elements: 30348
## # lol_path():
## # A tibble: 679 × 3
##    path                                            n is_leaf
##    <chr>                                       <int> <lgl>  
##  1 hits                                            1 FALSE  
##  2 hits[*]                                        10 FALSE  
##  3 hits[*].cellLines                              10 FALSE  
##  4 hits[*].cellSuspensions                        10 FALSE  
##  5 hits[*].cellSuspensions[*]                     12 FALSE  
##  6 hits[*].cellSuspensions[*].organ               12 FALSE  
##  7 hits[*].cellSuspensions[*].organPart           12 FALSE  
##  8 hits[*].cellSuspensions[*].organPart[*]        14 TRUE   
##  9 hits[*].cellSuspensions[*].organ[*]            12 TRUE   
## 10 hits[*].cellSuspensions[*].selectedCellType    12 FALSE  
## # ℹ 669 more rows
See the accompanying “Human Cell Atlas Manifests” vignette for details
on using the manifest endpoint and on further
annotation of .loom files.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] httr_1.4.7                  hca_1.12.0                 
##  [3] LoomExperiment_1.22.0       BiocIO_1.14.0              
##  [5] rhdf5_2.48.0                SingleCellExperiment_1.26.0
##  [7] SummarizedExperiment_1.34.0 Biobase_2.64.0             
##  [9] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
## [11] IRanges_2.38.0              S4Vectors_0.42.0           
## [13] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
## [15] matrixStats_1.3.0           dplyr_1.1.4                
## [17] BiocStyle_2.32.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        blob_1.2.4              filelock_1.0.3         
##  [4] fastmap_1.1.1           BiocFileCache_2.12.0    promises_1.3.0         
##  [7] digest_0.6.35           mime_0.12               lifecycle_1.0.4        
## [10] RSQLite_2.3.6           magrittr_2.0.3          compiler_4.4.0         
## [13] rlang_1.1.3             sass_0.4.9              tools_4.4.0            
## [16] utf8_1.2.4              yaml_2.3.8              knitr_1.46             
## [19] S4Arrays_1.4.0          htmlwidgets_1.6.4       bit_4.0.5              
## [22] curl_5.2.1              DelayedArray_0.30.0     abind_1.4-5            
## [25] miniUI_0.1.1.1          HDF5Array_1.32.0        withr_3.0.0            
## [28] purrr_1.0.2             grid_4.4.0              fansi_1.0.6            
## [31] xtable_1.8-4            Rhdf5lib_1.26.0         cli_3.6.2              
## [34] rmarkdown_2.26          crayon_1.5.2            generics_0.1.3         
## [37] tzdb_0.4.0              DBI_1.2.2               cachem_1.0.8           
## [40] stringr_1.5.1           zlibbioc_1.50.0         parallel_4.4.0         
## [43] BiocManager_1.30.22     XVector_0.44.0          vctrs_0.6.5            
## [46] Matrix_1.7-0            jsonlite_1.8.8          bookdown_0.39          
## [49] hms_1.1.3               bit64_4.0.5             archive_1.1.8          
## [52] jquerylib_0.1.4         tidyr_1.3.1             glue_1.7.0             
## [55] DT_0.33                 stringi_1.8.3           later_1.3.2            
## [58] UCSC.utils_1.0.0        tibble_3.2.1            pillar_1.9.0           
## [61] htmltools_0.5.8.1       rhdf5filters_1.16.0     GenomeInfoDbData_1.2.12
## [64] R6_2.5.1                dbplyr_2.5.0            vroom_1.6.5            
## [67] evaluate_0.23           shiny_1.8.1.1           lattice_0.22-6         
## [70] readr_2.1.5             memoise_2.0.1           httpuv_1.6.15          
## [73] bslib_0.7.0             Rcpp_1.0.12             SparseArray_1.4.0      
## [76] xfun_0.43               pkgconfig_2.0.3