tq_apply() provides a simplified workflow for running
parallel tasks on HPC clusters. It combines multiple steps (project
creation, resource assignment, task addition, and worker scheduling)
into a single function call, similar to base R’s lapply()
or sapply().
This is the easiest way to get started with taskqueue if you simply want to
run many iterations of a function in parallel without fine-grained control
over project settings.
Before using taskqueue, ensure you have:

- PostgreSQL installed and configured (see the PostgreSQL Setup vignette)
- SSH access configured for remote resources (see the SSH Setup vignette)
- The database initialized
- A resource already defined (see the quick check below)
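If you are not sure whether the resource exists yet, you can check from R. A minimal sketch, assuming resource_list() can be called without arguments:

```r
library(taskqueue)

# The resource name you plan to pass to tq_apply() (e.g. "hpc") should
# appear in this listing; if it does not, create it first with resource_add().
resource_list()
```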
The simplest use of tq_apply() requires just a few
arguments:
```r
library(taskqueue)

# Define your function
my_simulation <- function(i) {
  # Your computation here
  result <- i^2
  Sys.sleep(1) # Simulate some work
  return(result)
}

# Run 100 tasks in parallel
tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc"
)
```
This will run my_simulation(1), my_simulation(2), …, my_simulation(100) in
parallel on the "hpc" resource.

The arguments are:

- n: Number of tasks to run (integer)
- fun: The function to execute for each task
- project: Project name (string)
- resource: Resource name (string, must already exist)
- memory: Memory per task in GB (default: 10)
- hour: Maximum runtime in hours (default: 24)
- account: Account name for cluster billing (optional)
- working_dir: Working directory on the cluster (default: getwd())
- ...: Additional arguments passed to your function

You can pass additional arguments to your function using ...:
```r
my_function <- function(i, multiplier, offset = 0) {
  result <- i * multiplier + offset
  return(result)
}

tq_apply(
  n = 50,
  fun = my_function,
  project = "test_args",
  resource = "hpc",
  multiplier = 10, # Passed to my_function
  offset = 5       # Passed to my_function
)
```
Each task will call:

- Task 1: my_function(1, multiplier = 10, offset = 5)
- Task 2: my_function(2, multiplier = 10, offset = 5)
- And so on…
Here’s a practical example running a Monte Carlo simulation:
```r
library(taskqueue)

# Define simulation function
run_monte_carlo <- function(task_id, n_samples = 10000, seed_base = 12345) {
  # Set unique seed for each task
  set.seed(seed_base + task_id)

  # Run simulation
  samples <- rnorm(n_samples)

  result <- list(
    task_id = task_id,
    mean = mean(samples),
    sd = sd(samples),
    quantiles = quantile(samples, probs = c(0.025, 0.5, 0.975))
  )

  # Save results
  out_file <- sprintf("results/simulation_%04d.Rds", task_id)
  dir.create("results", showWarnings = FALSE)
  saveRDS(result, out_file)

  return(invisible(NULL))
}

# Run 1000 simulations in parallel
tq_apply(
  n = 1000,
  fun = run_monte_carlo,
  project = "monte_carlo_study",
  resource = "hpc",
  memory = 8,        # 8 GB per task
  hour = 2,          # 2 hour time limit
  working_dir = "/home/user/monte_carlo",
  n_samples = 50000, # Argument for run_monte_carlo
  seed_base = 99999  # Argument for run_monte_carlo
)
```

After calling tq_apply(), monitor your tasks.
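A crude but workable check, run on the cluster and assuming the results/ folder written by run_monte_carlo() above, is to watch the SLURM queue and count the output files that have appeared so far:

```r
# Are workers still queued or running?
system("squeue -u $USER")

# How many tasks have written their result file so far?
length(list.files("results", pattern = "\\.Rds$"))
```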
After all tasks complete, collect your results:
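For the Monte Carlo example above, collecting results means reading the per-task .Rds files back into R. A sketch using the results/ folder from that example:

```r
# Read every per-task result written by run_monte_carlo()
files <- list.files("results", pattern = "\\.Rds$", full.names = TRUE)
results <- lapply(files, readRDS)

# Combine the per-task summaries into a data frame
summary_df <- data.frame(
  task_id = sapply(results, `[[`, "task_id"),
  mean    = sapply(results, `[[`, "mean"),
  sd      = sapply(results, `[[`, "sd")
)
```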
Your function should save results to the file system:
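For example, following the same pattern as the Monte Carlo function above (the output/ folder and file names here are illustrative):

```r
my_task <- function(i) {
  # Write one output file per task, named by task id
  dir.create("output", showWarnings = FALSE)
  result <- i^2 # your computation here
  saveRDS(result, sprintf("output/task_%04d.Rds", i))
  invisible(NULL)
}
```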
Check if output already exists to avoid re-running completed tasks:
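A sketch of that pattern, again with an illustrative output/ folder:

```r
my_task <- function(i) {
  out_file <- sprintf("output/task_%04d.Rds", i)
  if (file.exists(out_file)) {
    return(invisible(NULL)) # result already written, skip this task
  }
  dir.create("output", showWarnings = FALSE)
  result <- i^2 # your computation here
  saveRDS(result, out_file)
  invisible(NULL)
}
```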
Ensure your working directory on the cluster is correct:
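For example, point working_dir at a directory that actually exists on the cluster (the path below is illustrative):

```r
tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc",
  working_dir = "/home/user/my_project" # must exist on the cluster
)
```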
tq_apply() simplifies the workflow by combining these
steps:
Manual approach:
```r
# Multiple steps
project_add("test", memory = 10)
project_resource_add("test", "hpc", working_dir = "/path", hours = 24)
task_add("test", num = 100, clean = TRUE)
project_reset("test")
worker_slurm("test", "hpc", fun = my_function)
```

With tq_apply():
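A sketch of the equivalent single call, mapping the manual arguments above onto the tq_apply() parameters listed earlier:

```r
# One step
tq_apply(
  n = 100,
  fun = my_function,
  project = "test",
  resource = "hpc",
  memory = 10,
  hour = 24,
  working_dir = "/path"
)
```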
Tasks fail immediately:

- Check the log folder specified in your resource configuration
- Verify your function works locally first
- Ensure the working directory exists on the cluster
Tasks remain in “idle” status:

- Check that the project is started: project_start("my_project")
- Verify the resource is correctly configured
- Check the SLURM queue: squeue -u $USER (see the sketch below)
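Both of these checks can be run from R, for example:

```r
# Start the project, then inspect the SLURM queue for your workers
project_start("my_project")
system("squeue -u $USER")
```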
“Resource not found” error:

- The resource must be created before using tq_apply()
- Use resource_list() to see available resources
- Create the resource with resource_add()
Use tq_apply() when:

- You have a simple parallel task
- You want to quickly run many iterations of a function
- You don’t need fine-grained control over project settings
Use the manual workflow when:

- You need to manage multiple projects simultaneously
- You want to reuse a project for different task sets
- You need more control over resource scheduling
- You’re running different types of tasks in the same project