--- title: "Simple Workflow with tq_apply" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simple Workflow with tq_apply} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview `tq_apply()` provides a simplified workflow for running parallel tasks on HPC clusters. It combines multiple steps (project creation, resource assignment, task addition, and worker scheduling) into a single function call, similar to base R's `lapply()` or `sapply()`. This is the easiest way to get started with `taskqueue` if you: - Have a simple function to run multiple times - Don't need complex project management - Want to quickly parallelize work on an HPC cluster Before using `taskqueue`, ensure you have: 1. PostgreSQL installed and configured (see [PostgreSQL Setup](postgresql-setup.html) vignette) 2. SSH access configured for remote resources (see [SSH Setup](ssh-setup.html) vignette) 3. Database initialized: ```r library(taskqueue) db_init() ``` 4. A resource already defined: ```r resource_add( name = "hpc", type = "slurm", host = "hpc.example.com", nodename = "hpc", workers = 500, log_folder = "/home/user/log_folder/" ) ``` ## Basic Usage The simplest use of `tq_apply()` requires just a few arguments: ```r library(taskqueue) # Define your function my_simulation <- function(i) { # Your computation here result <- i^2 Sys.sleep(1) # Simulate some work return(result) } # Run 100 tasks in parallel tq_apply( n = 100, fun = my_simulation, project = "my_project", resource = "hpc" ) ``` This will: 1. Create or update the project "my_project" 2. Add the resource "hpc" to the project 3. Create 100 tasks 4. Schedule workers on the SLURM cluster 5. Execute `my_simulation(1)`, `my_simulation(2)`, ..., `my_simulation(100)` in parallel ## Function Arguments ### Required Arguments - **`n`**: Number of tasks to run (integer) - **`fun`**: The function to execute for each task - **`project`**: Project name (string) - **`resource`**: Resource name (string, must already exist) ### Optional Arguments - **`memory`**: Memory per task in GB (default: 10) - **`hour`**: Maximum runtime in hours (default: 24) - **`account`**: Account name for cluster billing (optional) - **`working_dir`**: Working directory on cluster (default: `getwd()`) - **`...`**: Additional arguments passed to your function ## Passing Arguments to Your Function You can pass additional arguments to your function using `...`: ```r my_function <- function(i, multiplier, offset = 0) { result <- i * multiplier + offset return(result) } tq_apply( n = 50, fun = my_function, project = "test_args", resource = "hpc", multiplier = 10, # Passed to my_function offset = 5 # Passed to my_function ) ``` Each task will call: - Task 1: `my_function(1, multiplier = 10, offset = 5)` - Task 2: `my_function(2, multiplier = 10, offset = 5)` - And so on... 
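Because each worker simply calls your function with a task index plus these extra arguments, you can sanity-check the call pattern locally before submitting anything to the cluster. A minimal sketch using only base R (no `taskqueue` involved):

```r
# Locally mimic what the first few workers will execute
my_function <- function(i, multiplier, offset = 0) {
  i * multiplier + offset
}

sapply(1:5, my_function, multiplier = 10, offset = 5)
#> [1] 15 25 35 45 55
```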
## Complete Example

Here's a practical example running a Monte Carlo simulation:

```r
library(taskqueue)

# Define simulation function
run_monte_carlo <- function(task_id, n_samples = 10000, seed_base = 12345) {
  # Set a unique seed for each task
  set.seed(seed_base + task_id)

  # Run simulation
  samples <- rnorm(n_samples)
  result <- list(
    task_id = task_id,
    mean = mean(samples),
    sd = sd(samples),
    quantiles = quantile(samples, probs = c(0.025, 0.5, 0.975))
  )

  # Save results
  out_file <- sprintf("results/simulation_%04d.Rds", task_id)
  dir.create("results", showWarnings = FALSE)
  saveRDS(result, out_file)

  return(invisible(NULL))
}

# Run 1000 simulations in parallel
tq_apply(
  n = 1000,
  fun = run_monte_carlo,
  project = "monte_carlo_study",
  resource = "hpc",
  memory = 8,         # 8 GB per task
  hour = 2,           # 2 hour time limit
  working_dir = "/home/user/monte_carlo",
  n_samples = 50000,  # Argument for run_monte_carlo
  seed_base = 99999   # Argument for run_monte_carlo
)
```

## Monitoring Progress

After calling `tq_apply()`, monitor your tasks:

```r
# Check task status
task_status("monte_carlo_study")

# Check overall project status
project_status("monte_carlo_study")
```

## Collecting Results

After all tasks complete, collect your results:

```r
# Read all result files
result_files <- list.files(
  "results",
  pattern = "simulation_.*\\.Rds$",
  full.names = TRUE
)

# Combine results
all_results <- lapply(result_files, readRDS)

# Analyze
means <- sapply(all_results, function(x) x$mean)
hist(means, main = "Distribution of Means")
```

## Best Practices

### 1. Save Results to Files

Your function should save its results to the file system:

```r
my_task <- function(i) {
  out_file <- sprintf("output/result_%04d.Rds", i)

  # Skip if already done
  if (file.exists(out_file)) {
    return(invisible(NULL))
  }

  # Do computation
  result <- expensive_computation(i)

  # Save result
  saveRDS(result, out_file)
}
```

### 2. Make Functions Idempotent

Check whether the output already exists so that completed tasks are not re-run:

```r
my_task <- function(i) {
  out_file <- sprintf("output/task_%d.Rds", i)
  if (file.exists(out_file)) return(invisible(NULL))

  # ... computation and save
}
```

### 3. Specify Working Directory

Ensure your working directory on the cluster is correct:

```r
tq_apply(
  n = 100,
  fun = my_function,
  project = "my_project",
  resource = "hpc",
  working_dir = "/home/user/project_folder"
)
```
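Relative paths inside your task function (such as `output/` in the examples above) resolve against this working directory on the worker. A defensive sketch that creates the folder at task time instead of assuming it already exists; the computation here is a placeholder:

```r
my_task <- function(i) {
  # "output" resolves relative to working_dir on the worker
  dir.create("output", showWarnings = FALSE, recursive = TRUE)

  out_file <- file.path("output", sprintf("result_%04d.Rds", i))
  if (file.exists(out_file)) return(invisible(NULL))

  result <- i^2  # Placeholder for the real computation
  saveRDS(result, out_file)
  invisible(NULL)
}
```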
### 4. Set Appropriate Resources

Configure memory and time limits based on your task requirements:

```r
tq_apply(
  n = 100,
  fun = memory_intensive_task,
  project = "big_analysis",
  resource = "hpc",
  memory = 64,  # 64 GB for large tasks
  hour = 48     # 48 hour time limit
)
```

## Comparison with Manual Workflow

`tq_apply()` simplifies the workflow by combining these steps:

**Manual approach:**

```r
# Multiple steps
project_add("test", memory = 10)
project_resource_add("test", "hpc", working_dir = "/path", hours = 24)
task_add("test", num = 100, clean = TRUE)
project_reset("test")
worker_slurm("test", "hpc", fun = my_function)
```

**With tq_apply():**

```r
# Single step
tq_apply(
  n = 100,
  fun = my_function,
  project = "test",
  resource = "hpc",
  working_dir = "/path",
  hour = 24
)
```

## Troubleshooting

**Tasks fail immediately:**

- Check the log folder specified in your resource configuration
- Verify that your function works locally first
- Ensure the working directory exists on the cluster

**Tasks remain in "idle" status:**

- Check that the project is started: `project_start("my_project")`
- Verify the resource is correctly configured
- Check the SLURM queue: `squeue -u $USER`

**"Resource not found" error:**

- The resource must be created before using `tq_apply()`
- Use `resource_list()` to see available resources
- Create the resource with `resource_add()`

## When to Use tq_apply()

**Use `tq_apply()` when:**

- You have a simple parallel task
- You want to quickly run many iterations of a function
- You don't need fine-grained control over project settings

**Use the manual workflow when:**

- You need to manage multiple projects simultaneously
- You want to reuse a project for different task sets (see the sketch below)
- You need more control over resource scheduling
- You're running different types of tasks in the same project
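As a sketch of the reuse scenario, one project can stay in place while successive task sets run through it. This assumes the manual-workflow functions behave as in the comparison above, and that `clean = TRUE` replaces the previous task set; `step_one()` and `step_two()` are hypothetical task functions:

```r
# Hypothetical sketch: reuse one project for two batches of work
project_add("study", memory = 10)
project_resource_add("study", "hpc", working_dir = "/path", hours = 24)

# First batch
task_add("study", num = 100, clean = TRUE)
worker_slurm("study", "hpc", fun = step_one)

# Second batch reuses the same project and resource
task_add("study", num = 50, clean = TRUE)
worker_slurm("study", "hpc", fun = step_two)
```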