---
title: "Troubleshooting staRburst"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Troubleshooting staRburst}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Troubleshooting staRburst

This guide helps you diagnose and fix common issues with staRburst.

## Accessing Logs

### CloudWatch Logs Structure

staRburst automatically sends worker logs to CloudWatch Logs:

- **Log Group:** `/aws/ecs/starburst-worker`
- **Log Stream Pattern:** `starburst/`
- **Retention:** 7 days (configurable)

### Viewing Logs in R

```r
# For ephemeral mode
library(starburst)
plan(starburst, workers = 10)

# Check logs for a specific task
# (get task ID from error messages or futures)

# For detached sessions
session <- starburst_session_attach("session-id")
status <- session$status()

# View failed task logs using AWS CLI or console
```

### Viewing Logs in AWS Console

1. Navigate to **CloudWatch → Log Groups**
2. Find `/aws/ecs/starburst-worker`
3. Search for task ID in stream names
4. Use CloudWatch Insights for advanced queries:

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
```

## Common Issues

### Issue 1: Tasks Stuck in "Pending"

**Symptoms:**

- `session$status()` shows tasks never start
- Workers = 0 in status
- Tasks remain in pending state for >5 minutes

**Diagnosis:**

```r
# Check Fargate quota
config <- get_starburst_config()
sts <- paws.security.identity::sts()
account <- sts$get_caller_identity()

# Check service quotas manually in AWS Console:
# Service Quotas → AWS Fargate → Fargate vCPUs
```

**Common Causes:**

1. **Insufficient vCPU quota** - Most common issue
   - Default Fargate quota: 6 vCPUs in us-east-1
   - Each worker uses configured CPU (default: 4 vCPUs)
   - With 10 workers × 4 vCPUs = 40 vCPUs needed
2.
   **Invalid task definition** - Wrong CPU/memory combination
   - Fargate has strict CPU/memory pairings
   - Example: 4 vCPUs support 8-30 GB of memory
3. **Network/subnet issues** - VPC configuration problems
   - Subnets must have available IP addresses
   - Security groups must allow outbound traffic
4. **IAM permission errors** - Missing ECS task execution role permissions
   - Must have ECR, S3, CloudWatch Logs access

**Solutions:**

```r
# Solution 1: Request quota increase
# Go to AWS Console → Service Quotas → AWS Fargate
# Request vCPUs quota increase to 100+

# Solution 2: Reduce workers
plan(starburst, workers = 1)  # Use only 1 worker (4 vCPUs)

# Solution 3: Reduce CPU per worker
plan(starburst, workers = 10, cpu = 0.25, memory = "512MB")

# Solution 4: Check IAM permissions
# Ensure ECS task execution role has:
# - AmazonECSTaskExecutionRolePolicy
# - S3 read/write access to starburst bucket
# - CloudWatch Logs write access
```

### Issue 2: Workers Crash Immediately

**Symptoms:**

- Tasks start but stop within 30 seconds
- Status shows workers = 0 after initial launch
- CloudWatch logs show error before exit

**Diagnosis:**

```r
# View CloudWatch logs for the failed task
# Look for error messages in the logs

# Common error patterns:
# - "Error: Cannot connect to S3" → S3 permissions
# - "Error loading package" → Package installation failed
# - "Cannot allocate memory" → Memory limit too low
# - "exec format error" → Architecture mismatch
```

**Common Causes:**

1. **S3 permission errors** - Task role can't access bucket
2. **Package installation failures** - Missing system dependencies
3. **Out of memory** - Memory limit too low for workload
4.
   **Architecture mismatch** - ARM64 vs X86_64 image/instance mismatch

**Solutions:**

```r
# Solution 1: Verify S3 permissions
# Check task role has S3 access:
# IAM → Roles → starburstECSTaskRole → Permissions
# Should have S3 GetObject/PutObject on bucket

# Solution 2: Increase memory
plan(starburst, workers = 5, cpu = 4, memory = "16GB")

# Solution 3: Check Docker build logs
# Re-run starburst setup to rebuild image
# Watch for package installation errors

# Solution 4: For EC2 mode, verify architecture matches
plan(starburst,
     launch_type = "EC2",
     instance_type = "c7g.xlarge")  # Graviton (ARM64)
# Ensure Docker image built for matching architecture
```

### Issue 3: "Access Denied" Errors

**Symptoms:**

- Error messages containing "AccessDenied" or "Forbidden"
- Can't create tasks, access S3, or push Docker images

**Diagnosis:**

```r
# Check which operation is failing:
# 1. Docker push → ECR permissions
# 2. S3 operations → S3 permissions
# 3. Task launch → ECS permissions

# Verify credentials
library(paws.security.identity)
sts <- paws.security.identity::sts()
identity <- sts$get_caller_identity()
print(identity)  # Should show your AWS account
```

**Common Causes:**

1. **No AWS credentials configured**
2. **IAM user lacks required permissions**
3. **S3 bucket policy blocks access**
4. **ECR repository doesn't exist or blocks access**

**Solutions:**

```r
# Solution 1: Configure AWS credentials
# Option A: Environment variables
Sys.setenv(
  AWS_ACCESS_KEY_ID = "YOUR_KEY",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET",
  AWS_DEFAULT_REGION = "us-east-1"
)

# Option B: AWS CLI profile
Sys.setenv(AWS_PROFILE = "your-profile")

# Option C: IAM role (when running on EC2/ECS)
# No configuration needed - automatic

# Solution 2: Add required IAM permissions
# Your IAM user/role needs:
# - ECS: RunTask, DescribeTasks, StopTask
# - ECR: GetAuthorizationToken, BatchCheckLayerAvailability,
#        GetDownloadUrlForLayer, PutImage, InitiateLayerUpload, etc.
# - S3: GetObject, PutObject, ListBucket on your bucket
# - IAM: PassRole (to pass ECS task role)

# Solution 3: Run starburst_setup() to create all resources
library(starburst)
starburst_setup(bucket = "my-starburst-bucket")
```

### Issue 4: High Costs / Runaway Workers

**Symptoms:**

- AWS bill higher than expected
- Many tasks running when you expected them to stop
- Old sessions still have active workers

**Diagnosis:**

```r
# List all active sessions
library(starburst)
sessions <- starburst_list_sessions()
print(sessions)

# Check for old sessions with running tasks
```

**Common Causes:**

1. **Forgot to clean up session** - Workers keep running
2. **Requested too many workers** - Cost adds up quickly
3. **Long-running tasks** - Tasks running for hours/days

**Solutions:**

```r
# Solution 1: Clean up all sessions
sessions <- starburst_list_sessions()
for (session_id in sessions$session_id) {
  session <- starburst_session_attach(session_id)
  session$cleanup(stop_workers = TRUE, force = TRUE)
}

# Solution 2: Set budget alerts in AWS
# AWS Billing Console → Budgets → Create budget
# Set alert at $100, $500 thresholds

# Solution 3: Use worker validation to prevent mistakes
# staRburst now enforces max 500 workers
# Previously you could accidentally request 10,000+

# Solution 4: Set absolute timeout on sessions
session <- starburst_session(
  workers = 10,
  absolute_timeout = 3600  # Auto-terminate after 1 hour
)
```

### Issue 5: Session Cleanup Not Working

**Symptoms:**

- Called `session$cleanup()` but workers still running
- S3 files not deleted
- Tasks still appearing in ECS console

**Diagnosis:**

```r
# Check if cleanup was called with correct parameters
session$cleanup(stop_workers = TRUE, force = TRUE)

# Verify tasks actually stopped (may take 30-60 seconds)
Sys.sleep(60)

# Check ECS tasks manually
library(paws.compute)
ecs <- paws.compute::ecs(config = list(region = "us-east-1"))
tasks <- ecs$list_tasks(cluster = "starburst-cluster")
print(tasks$taskArns)  # Should be
# empty or not include your tasks
```

**Common Causes:**

1. **Cleanup called without stop_workers** - Workers not stopped
2. **Cleanup called without force** - S3 files preserved
3. **Tasks in different cluster** - Cleanup looking in wrong place
4. **ECS eventual consistency** - Tasks take time to stop

**Solutions:**

```r
# Solution 1: Always use both flags for full cleanup
session$cleanup(stop_workers = TRUE, force = TRUE)

# Solution 2: Wait for ECS to process stop requests
session$cleanup(stop_workers = TRUE)
Sys.sleep(60)  # Wait 1 minute
# Then verify in AWS console

# Solution 3: Manual cleanup if needed
library(paws.compute)
library(paws.storage)

ecs <- paws.compute::ecs(config = list(region = "us-east-1"))
s3 <- paws.storage::s3(config = list(region = "us-east-1"))

# Stop all tasks in cluster
tasks <- ecs$list_tasks(cluster = "starburst-cluster", desiredStatus = "RUNNING")
for (task_arn in tasks$taskArns) {
  ecs$stop_task(cluster = "starburst-cluster", task = task_arn)
}

# Delete all session S3 files
# (list_objects_v2 returns at most 1,000 keys per call; repeat if more remain)
result <- s3$list_objects_v2(Bucket = "your-bucket", Prefix = "sessions/")
for (obj in result$Contents) {
  s3$delete_object(Bucket = "your-bucket", Key = obj$Key)
}
```

### Issue 6: Results Not Appearing

**Symptoms:**

- `session$collect()` returns empty list
- Tasks show as "completed" but no results
- S3 doesn't contain result files

**Diagnosis:**

```r
# Check session status
status <- session$status()
print(status)

# Verify tasks were actually submitted
# Check S3 for task files
library(paws.storage)
s3 <- paws.storage::s3(config = list(region = "us-east-1"))
result <- s3$list_objects_v2(
  Bucket = "your-bucket",
  Prefix = sprintf("sessions/%s/results/", session$session_id)
)
print(result$Contents)  # Should show .qs files
```

**Common Causes:**

1. **Tasks failed before producing results** - Check for errors
2. **Workers can't write to S3** - Permission issue
3. **Looking at wrong session ID** - Attached to wrong session
4.
   **Results already collected** - Results can only be collected once

**Solutions:**

```r
# Solution 1: Check task status for errors
status <- session$status()
if (status$failed_tasks > 0) {
  # Check CloudWatch logs for failed task IDs
  # Look for error messages
}

# Solution 2: Verify S3 write permissions
# Task role must have S3 PutObject permission

# Solution 3: Verify session ID
print(session$session_id)
# Make sure this matches the session you created

# Solution 4: Results can only be collected once
# If you already called collect(), results are removed from S3
# You should store results after collection:
results <- session$collect(wait = TRUE)
saveRDS(results, "my_results.rds")  # Save locally
```

### Issue 7: Detached Session Reattach Fails

**Symptoms:**

- `starburst_session_attach()` throws error
- "Session not found" message
- Can't reconnect after closing R

**Diagnosis:**

```r
# List all sessions to find your session ID
sessions <- starburst_list_sessions()
print(sessions)

# Try to attach with exact session ID
session_id <- "session-abc123..."
session <- starburst_session_attach(session_id)
```

**Common Causes:**

1. **Wrong session ID** - Typo or wrong ID
2. **Session expired** - Exceeded absolute_timeout
3. **S3 manifest deleted** - Someone deleted session files
4.
   **Wrong region** - Session created in different region

**Solutions:**

```r
# Solution 1: List and copy exact session ID
sessions <- starburst_list_sessions()
session_id <- sessions$session_id[1]  # Use exact ID
session <- starburst_session_attach(session_id)

# Solution 2: Save session ID immediately after creation
session <- starburst_session(workers = 10)
session_id <- session$session_id
write(session_id, "my_session_id.txt")  # Save to file

# Later:
session_id <- readLines("my_session_id.txt")
session <- starburst_session_attach(session_id)

# Solution 3: Check correct region
session <- starburst_session_attach(session_id, region = "us-west-2")
```

### Issue 8: Package Installation Failures

**Symptoms:**

- Docker build fails during `renv::restore()`
- Error messages about missing system dependencies
- Specific packages fail to install

**Diagnosis:**

Look at Docker build output when running staRburst. Common error patterns:

```
Error: installation of package 'X' had non-zero exit status
Error: compilation failed for package 'X'
Error: unable to load shared library
```

**Common Causes:**

1. **Missing system dependencies** - Package needs system libraries
2. **Package not on CRAN** - Private or development package
3.
   **Version conflicts** - renv.lock specifies unavailable version

**Solutions:**

```r
# Solution 1: Add system dependencies to Dockerfile.base
# Edit starburst package Dockerfile.base template:
# Add RUN apt-get install -y libcurl4-openssl-dev

# Solution 2: Use renv snapshot to capture dependencies
renv::snapshot()  # Updates renv.lock

# Solution 3: Install from GitHub for dev packages
renv::install("user/package")
renv::snapshot()

# Solution 4: Check package availability
install.packages("package")  # Test locally first
```

## Advanced Diagnostics

### Checking ECS Task Status

```r
library(paws.compute)
ecs <- paws.compute::ecs(config = list(region = "us-east-1"))

# List running tasks
# (use desiredStatus = "STOPPED" to inspect tasks that have already exited)
tasks <- ecs$list_tasks(
  cluster = "starburst-cluster",
  desiredStatus = "RUNNING"
)

# Describe specific task
task_detail <- ecs$describe_tasks(
  cluster = "starburst-cluster",
  tasks = tasks$taskArns[1]
)

# Check exit code and reason (only populated once the task has stopped)
print(task_detail$tasks[[1]]$containers[[1]]$exitCode)
print(task_detail$tasks[[1]]$stoppedReason)
```

### Monitoring S3 Storage

```r
library(paws.storage)
s3 <- paws.storage::s3(config = list(region = "us-east-1"))

# List all session files (a single call returns at most 1,000 objects)
result <- s3$list_objects_v2(
  Bucket = "your-starburst-bucket",
  Prefix = "sessions/"
)

# Calculate total storage; vapply() also handles an empty listing
total_bytes <- sum(vapply(result$Contents, function(x) as.numeric(x$Size), numeric(1)))
total_mb <- total_bytes / 1024^2
cat(sprintf("Total storage: %.2f MB\n", total_mb))
```

### Estimating Costs

```r
# Fargate pricing (us-east-1, 2026):
# - vCPU: $0.04048 per hour
# - Memory: $0.004445 per GB-hour

vcpu_price <- 0.04048
memory_price <- 0.004445

workers <- 10
cpu <- 4
memory_gb <- 8
runtime_hours <- 1

# Hourly cost of one worker, then scaled by worker count and runtime
cost_per_worker <- (cpu * vcpu_price) + (memory_gb * memory_price)
total_cost <- workers * cost_per_worker * runtime_hours

# %g also accepts fractional runtimes such as 1.5 hours
cat(sprintf("Estimated cost: $%.2f for %g hour(s)\n", total_cost, runtime_hours))
```

## Getting Help

If you encounter issues not covered here:

1. **Check CloudWatch Logs** - Most issues have error messages in logs
2.
   **Review AWS Console** - Check ECS, S3, ECR for resource status
3. **File GitHub Issue** - Include error messages and logs
4. **AWS Support** - For quota increases or AWS-specific issues

**Information to Include in Bug Reports:**

- staRburst version: `packageVersion("starburst")`
- R version: `R.version.string`
- AWS region
- Launch type (Fargate vs EC2)
- Error messages from R and CloudWatch logs
- Session ID (for detached sessions)
- Output of `session$status()` (if applicable)

## See Also

- [Security Best Practices](security.html) - Securing your staRburst deployments
- [staRburst README](https://github.com/scttfrdmn/starburst) - Getting started guide
- [AWS Fargate Documentation](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html)
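
Part of the bug-report checklist under Getting Help can be gathered with a few lines of base R. The sketch below is only an illustration: `collect_bug_report_info()` is a hypothetical helper, not part of staRburst, and it assumes the region is available in the `AWS_DEFAULT_REGION` environment variable unless you pass it explicitly.

```r
# Hypothetical helper (not a staRburst function): collects the environment
# details from the bug-report checklist using base R only.
collect_bug_report_info <- function(region = Sys.getenv("AWS_DEFAULT_REGION", unset = NA)) {
  # packageVersion() errors when the package is not installed, so fall back to NA
  pkg_version <- tryCatch(
    as.character(utils::packageVersion("starburst")),
    error = function(e) NA_character_
  )
  list(
    starburst_version = pkg_version,
    r_version = R.version.string,
    aws_region = region,
    platform = R.version$platform
  )
}

info <- collect_bug_report_info()
str(info)  # Paste this output into the issue
```

Launch type, session ID, error messages, and `session$status()` output still need to be added by hand, since they depend on the session you were running.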