tq_apply() provides a simplified workflow for running
parallel tasks on HPC clusters. It combines multiple steps (project
creation, resource assignment, task addition, and worker scheduling)
into a single function call, similar to base R’s lapply()
or sapply().
This is the easiest way to get started with taskqueue if you simply want to
run many iterations of a function in parallel without fine-grained control
over project settings.
Before using taskqueue, ensure you have:

- PostgreSQL installed and configured (see the PostgreSQL Setup vignette)
- SSH access configured for remote resources (see the SSH Setup vignette)
- The database initialized
- A resource already defined (see the quick check below)
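If you are not sure whether the resource exists yet, you can check from R. A minimal sketch, assuming resource_list() can be called without arguments:

```r
library(taskqueue)

# The resource name you plan to pass to tq_apply() (e.g. "hpc") should
# appear in this listing; if it does not, create it first with resource_add().
resource_list()
```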
The simplest use of tq_apply() requires just a few
arguments:
```r
library(taskqueue)

# Define your function
my_simulation <- function(i) {
  # Your computation here
  result <- i^2
  Sys.sleep(1) # Simulate some work
  return(result)
}

# Run 100 tasks in parallel
tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc"
)
```
This will run my_simulation(1), my_simulation(2), …, my_simulation(100) in
parallel on the "hpc" resource.

The arguments are:

- n: Number of tasks to run (integer)
- fun: The function to execute for each task
- project: Project name (string)
- resource: Resource name (string, must already exist)
- memory: Memory per task in GB (default: 10)
- hour: Maximum runtime in hours (default: 24)
- account: Account name for cluster billing (optional)
- working_dir: Working directory on the cluster (default: getwd())
- ...: Additional arguments passed to your function

You can pass additional arguments to your function using ...:
```r
my_function <- function(i, multiplier, offset = 0) {
  result <- i * multiplier + offset
  return(result)
}

tq_apply(
  n = 50,
  fun = my_function,
  project = "test_args",
  resource = "hpc",
  multiplier = 10, # Passed to my_function
  offset = 5       # Passed to my_function
)
```
Each task will call:

- Task 1: my_function(1, multiplier = 10, offset = 5)
- Task 2: my_function(2, multiplier = 10, offset = 5)
- And so on…
Here’s a practical example running a Monte Carlo simulation:
```r
library(taskqueue)

# Define simulation function
run_monte_carlo <- function(task_id, n_samples = 10000, seed_base = 12345) {
  # Set unique seed for each task
  set.seed(seed_base + task_id)

  # Run simulation
  samples <- rnorm(n_samples)

  result <- list(
    task_id = task_id,
    mean = mean(samples),
    sd = sd(samples),
    quantiles = quantile(samples, probs = c(0.025, 0.5, 0.975))
  )

  # Save results
  out_file <- sprintf("results/simulation_%04d.Rds", task_id)
  dir.create("results", showWarnings = FALSE)
  saveRDS(result, out_file)

  return(invisible(NULL))
}

# Run 1000 simulations in parallel
tq_apply(
  n = 1000,
  fun = run_monte_carlo,
  project = "monte_carlo_study",
  resource = "hpc",
  memory = 8,        # 8 GB per task
  hour = 2,          # 2 hour time limit
  working_dir = "/home/user/monte_carlo",
  n_samples = 50000, # Argument for run_monte_carlo
  seed_base = 99999  # Argument for run_monte_carlo
)
```

After calling tq_apply(), monitor your tasks.
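A crude but workable check, run on the cluster and assuming the results/ folder written by run_monte_carlo() above, is to watch the SLURM queue and count the output files that have appeared so far:

```r
# Are workers still queued or running?
system("squeue -u $USER")

# How many tasks have written their result file so far?
length(list.files("results", pattern = "\\.Rds$"))
```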
After all tasks complete, collect your results:
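For the Monte Carlo example above, collecting results means reading the per-task .Rds files back into R. A sketch using the results/ folder from that example:

```r
# Read every per-task result written by run_monte_carlo()
files <- list.files("results", pattern = "\\.Rds$", full.names = TRUE)
results <- lapply(files, readRDS)

# Combine the per-task summaries into a data frame
summary_df <- data.frame(
  task_id = sapply(results, `[[`, "task_id"),
  mean    = sapply(results, `[[`, "mean"),
  sd      = sapply(results, `[[`, "sd")
)
```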
Your function should save results to the file system:
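For example, following the same pattern as the Monte Carlo function above (the output/ folder and file names here are illustrative):

```r
my_task <- function(i) {
  # Write one output file per task, named by task id
  dir.create("output", showWarnings = FALSE)
  result <- i^2 # your computation here
  saveRDS(result, sprintf("output/task_%04d.Rds", i))
  invisible(NULL)
}
```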
Check if output already exists to avoid re-running completed tasks:
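A sketch of that pattern, again with an illustrative output/ folder:

```r
my_task <- function(i) {
  out_file <- sprintf("output/task_%04d.Rds", i)
  if (file.exists(out_file)) {
    return(invisible(NULL)) # result already written, skip this task
  }
  dir.create("output", showWarnings = FALSE)
  result <- i^2 # your computation here
  saveRDS(result, out_file)
  invisible(NULL)
}
```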
Ensure your working directory on the cluster is correct:
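For example, point working_dir at a directory that actually exists on the cluster (the path below is illustrative):

```r
tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc",
  working_dir = "/home/user/my_project" # must exist on the cluster
)
```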
tq_apply() simplifies the workflow by combining these
steps:
Manual approach:
```r
# Multiple steps
project_add("test", memory = 10)
project_resource_add("test", "hpc", working_dir = "/path", hours = 24)
task_add("test", num = 100, clean = TRUE)
project_reset("test")
worker_slurm("test", "hpc", fun = my_function)
```

With tq_apply():
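A sketch of the equivalent single call, mapping the manual arguments above onto the tq_apply() parameters listed earlier:

```r
# One step
tq_apply(
  n = 100,
  fun = my_function,
  project = "test",
  resource = "hpc",
  memory = 10,
  hour = 24,
  working_dir = "/path"
)
```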
Tasks fail immediately:

- Check the log folder specified in your resource configuration
- Verify your function works locally first
- Ensure the working directory exists on the cluster
Tasks remain in “idle” status:

- Check that the project is started: project_start("my_project")
- Verify the resource is correctly configured
- Check the SLURM queue: squeue -u $USER (see the sketch below)
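Both of these checks can be run from R, for example:

```r
# Start the project, then inspect the SLURM queue for your workers
project_start("my_project")
system("squeue -u $USER")
```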
“Resource not found” error:

- The resource must be created before using tq_apply()
- Use resource_list() to see available resources
- Create the resource with resource_add()
Use tq_apply() when:

- You have a simple parallel task
- You want to quickly run many iterations of a function
- You don’t need fine-grained control over project settings
Use the manual workflow when:

- You need to manage multiple projects simultaneously
- You want to reuse a project for different task sets
- You need more control over resource scheduling
- You’re running different types of tasks in the same project