---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE # Set to FALSE since API calls require credentials
)
```

`rsynthbio` is an R package that provides a convenient interface to the [Synthesize Bio](https://www.synthesize.bio/) API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities, including bulk RNA-seq and single-cell RNA-seq. Alternatively, you can generate AI datasets from our [web platform](https://app.synthesize.bio/datasets/).

## How to install

You can install `rsynthbio` from CRAN:

```{r installation, eval=FALSE}
install.packages("rsynthbio")
```

If you want the development version, you can install it from GitHub using the `remotes` package:

```{r github-installation, eval=FALSE}
if (!("remotes" %in% installed.packages())) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")
```

Once installed, load the package:

```{r}
library(rsynthbio)
```

## Authentication

Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:

```{r auth-secure, eval=FALSE}
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()

# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)
```

To load your stored API token in a later session:
```{r, eval=FALSE}
# In future sessions, load the stored token
load_synthesize_token_from_keyring()

# Check if a token is already set
has_synthesize_token()
```

You can obtain an API token by registering at [Synthesize Bio](https://app.synthesize.bio).

### Security Best Practices

For security reasons, remember to clear your token when you're done:

```{r clear-token, eval = FALSE}
# Clear token from current session
clear_synthesize_token()

# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)
```

Never hard-code your token in scripts that will be shared or committed to version control.

## Basic Usage

### Available Modalities

The package supports multiple data modalities:

```{r modalities}
# Check available modalities
get_valid_modalities()
```

Currently supported modalities:

- **`bulk`**: Bulk RNA-seq data
- **`single-cell`**: Single-cell RNA-seq data

### Creating a Query

The first step in generating gene expression data is to create a query. The package provides sample queries for each modality:

```{r query}
# Get a sample query for bulk RNA-seq
query <- get_valid_query(modality = "bulk")

# Get a sample query for single-cell RNA-seq
query_sc <- get_valid_query(modality = "single-cell")

# Inspect the query structure
str(query)
```

The query consists of:

1. `modality`: The type of gene expression data to generate ("bulk" or "single-cell")
2. `mode`: The prediction mode (e.g., "mean estimation" or "sample generation")
3. `inputs`: A list of biological conditions to generate data for

We train our models with diverse multi-omics datasets. There are two model modes available today:

+ **Mean estimation**: These models create a distribution capturing the biological heterogeneity consistent with the supplied metadata. This distribution is then sampled to predict a gene expression distribution that captures measurement error.
The mean of that distribution serves as the prediction.
+ **Sample generation**: This model works identically to the mean estimation approach, except that the final gene expression distribution is also sampled, generating realistic-looking synthetic data that captures the error associated with measurement.

### Making a Prediction

Once your query is ready, you can send it to the API to generate gene expression data:

```{r predict, eval=FALSE}
result <- predict_query(query, as_counts = TRUE)
```

The result is a list of two data frames: `metadata` and `expression`.

### Understanding the Async API

Behind the scenes, the API uses an **asynchronous model** to handle queries efficiently:

1. Your query is submitted to the API, which returns a query ID.
2. The function automatically polls the status endpoint (default: every 2 seconds).
3. When the query completes, results are downloaded from a signed URL.
4. The data are parsed and returned as R data frames.

All of this happens automatically when you call `predict_query()`.

### Controlling Async Behavior

You can customize the polling behavior if needed:

```{r async-options, eval=FALSE}
# Increase timeout for large queries (default: 900 seconds = 15 minutes)
result <- predict_query(
  query,
  poll_timeout_seconds = 1800, # 30 minutes
  poll_interval_seconds = 5    # Check every 5 seconds instead of 2
)
```

### Modifying a Query

You can customize the query to fit your specific research needs:

```{r modify-query}
# Adjust the number of samples
query$inputs[[1]]$num_samples <- 10

# Add a new condition
query$inputs[[3]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue"
  ),
  num_samples = 3
)
```

The input metadata is a list of lists.
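For example, a single input condition is itself a list containing a `metadata` list and a `num_samples` count. The sketch below builds one from scratch; the field values (including the UBERON ID) are illustrative placeholders, and you should substitute terms appropriate for your own experiment:

```{r custom-metadata, eval=FALSE}
# Build a custom input condition (field values here are illustrative)
query <- get_valid_query(modality = "bulk")
query$inputs <- list(
  list(
    metadata = list(
      tissue_ontology_id = "UBERON:0002107", # placeholder UBERON term
      sex = "female",
      sample_type = "primary tissue",
      age_years = 45
    ),
    num_samples = 5
  )
)
```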
Here are the available metadata fields:

__Biological:__

- ``age_years``
- ``cell_line_ontology_id``
- ``cell_type_ontology_id``
- ``developmental_stage``
- ``disease_ontology_id``
- ``ethnicity``
- ``genotype``
- ``race``
- ``sample_type`` ("cell line", "organoid", "other", "primary cells", "primary tissue", "xenograft")
- ``sex`` ("male", "female")
- ``tissue_ontology_id``

__Perturbational:__

- ``perturbation_dose``
- ``perturbation_ontology_id``
- ``perturbation_time``
- ``perturbation_type`` ("coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna")

__Technical:__

- ``study`` (Bioproject ID)
- ``library_selection`` (e.g., "cDNA", "polyA", "Oligo-dT" — see the [permitted values for library selection](https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection))
- ``library_layout`` ("PAIRED", "SINGLE")
- ``platform`` ("illumina")

### Acceptable Metadata Values

The following are the valid values or expected formats for selected metadata keys:

| Metadata Field             | Requirement / Example |
|----------------------------|-----------------------|
| `cell_line_ontology_id`    | Requires a [Cellosaurus ID](https://www.cellosaurus.org/). |
| `cell_type_ontology_id`    | Requires a [CL ID](https://www.ebi.ac.uk/ols4/ontologies/cl). |
| `disease_ontology_id`      | Requires a [MONDO ID](https://www.ebi.ac.uk/ols4/ontologies/mondo). |
| `perturbation_ontology_id` | Must be a valid Ensembl gene ID (e.g., `ENSG00000156127`), [ChEBI ID](https://www.ebi.ac.uk/chebi/) (e.g., `CHEBI:16681`), [ChEMBL ID](https://www.ebi.ac.uk/chembl/) (e.g., `CHEMBL1234567`), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) (e.g., `9606`). |
| `tissue_ontology_id`       | Requires a [UBERON ID](https://www.ebi.ac.uk/ols4/ontologies/uberon). |
To look up ontology terms, we recommend using the [EMBL-EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols4/).

Models accept a limited range of metadata input values. If you provide a value outside the acceptable range, the API will return an error.

### Additional Prediction Options

You can also request log-transformed CPM instead of raw counts:

```{r predict-2, eval=FALSE}
# Request log-transformed CPM instead of raw counts
result_log <- predict_query(query, as_counts = FALSE)
```

### Working with Results

```{r analyze, eval=FALSE}
# Access the metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View a sample of the metadata
head(metadata)
```

You may want to process the data in chunks or save it for later use:

```{r large-data, eval=FALSE}
# Save results to an RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
```

### Custom Validation

You can validate your queries before sending them to the API:

```{r validation}
# Validate the query structure
validate_query(query)

# Validate the modality
validate_modality(query)
```

## Session info

```{r session-info}
sessionInfo()
```

## Additional Resources

- [Package Source Code](https://github.com/synthesizebio/rsynthbio)
- [File Bug Reports](https://github.com/synthesizebio/rsynthbio/issues)