--- title: "Complete Workflow: Combining Genetic and Non-Genetic Evidence" author: "Franco L. Marsico" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Complete Workflow: Combining Genetic and Non-Genetic Evidence} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction This vignette demonstrates a complete workflow for missing person identification combining genetic DNA evidence with non-genetic preliminary investigation evidence (sex, age, hair color). ### The Case Scenario A family is searching for their missing relative with the following known characteristics: - **Sex**: Female - **Age at disappearance**: 25 years old - **Hair color**: Dark brown (category 2) A person of interest (POI) has been found with: - **Observed sex**: Female - **Estimated age**: 27 years (within expected range) - **Hair color**: Dark brown We will evaluate this match using both: 1. Non-genetic preliminary evidence 2. Genetic DNA comparison with family reference ## Setup ```{r message=FALSE, warning=FALSE} library(mispitools) library(forrel) # For genetic simulations library(pedtools) # For pedigree operations ``` ## Part 1: Non-Genetic Evidence ### Step 1.1: Define Error Rates Error rates (epsilon) account for uncertainty in observations: ```{r} # Probability of misrecording biological sex eps_sex <- 0.02 # Probability of age estimate being outside true range eps_age <- 0.05 # Hair color observation error matrix eps_color <- error_matrix_hair(ep = 0.05) # 5% error rate ``` ### Step 1.2: Calculate Individual LRs **Sex Evidence:** ```{r} # MP is female (H=1), POI observed as female lr_sex_result <- lr_sex( LR = TRUE, H = 1, # True hypothesis (MP is female) nsims = 1, eps = eps_sex, erRs = 0.01 # Database recording error ) cat("LR for sex evidence:", lr_sex_result$LRs, "\n") ``` **Age Evidence:** ```{r} # MP age = 25, tolerance = 5 years (so 20-30 is acceptable) # POI age = 27 (falls within range) lr_age_result <- lr_age( LR = TRUE, H = 1, # True hypothesis nsims = 1, epa = eps_age, erRa = 0.01, MPa = 25, # MP age MPr = 5 # Range tolerance ) cat("LR for age evidence:", lr_age_result$LRa, "\n") ``` **Hair Color Evidence:** ```{r} # MP has color 2 (dark brown), POI observed as color 2 lr_color_result <- lr_hair_color( LR = TRUE, H = 1, # True hypothesis nsims = 1, MPc = 2, # MP hair color epc = eps_color, erRc = eps_color ) cat("LR for hair color:", lr_color_result$LRc, "\n") ``` ### Step 1.3: Combined Non-Genetic LR ```{r} # Combine all non-genetic evidence lr_nongenetic <- lr_sex_result$LRs * lr_age_result$LRa * lr_color_result$LRc cat("Combined non-genetic LR:", round(lr_nongenetic, 2), "\n") cat("Log10(LR):", round(log10(lr_nongenetic), 2), "\n") ``` ### Step 1.4: Visualize the CPTs ```{r fig.height=4} # Population CPT (H2) cpt_h2 <- cpt_population( propS = c(0.5, 0.5), MPa = 25, MPr = 5, propC = c(0.15, 0.35, 0.25, 0.15, 0.10) # Realistic color distribution ) # MP CPT (H1) cpt_h1 <- cpt_missing_person( MPs = 1, # Female MPc = 2, # Dark brown eps = eps_sex, epa = eps_age, epc = eps_color ) # Visualize both CPTs and LR heatmap plot_cpt(cpt_h2, cpt_h1) ``` ## Part 2: Genetic Evidence (Simulation) For illustration, we'll show how to simulate genetic LRs using a parent-child relationship pedigree. Note: This code is provided for reference but not executed in this vignette to avoid dependency on specific pedigree structures. ### Step 2.1: Create Pedigree ```{r eval=FALSE} # Create a simple pedigree: parent-child relationship # The missing person (ID 5) is child of individual 2 # Using linearPed to create grandparent-parent-child ped <- linearPed(2) # 5 individuals # Add genetic markers from Norwegian population ped <- setMarkers(ped, locusAttributes = NorwegianFrequencies[1:10]) # Simulate a profile for the reference person set.seed(123) ped <- profileSim(ped, N = 1, ids = 2)[[1]] ``` ### Step 2.2: Simulate Genetic LRs ```{r eval=FALSE} # Simulate genetic LRs genetic_sims <- sim_lr_genetic( reference = ped, missing = 5, numsims = 100, seed = 456 ) # Convert to dataframe genetic_df <- lr_to_dataframe(genetic_sims) # Visualize plot_lr_distribution(genetic_df) ``` For this demonstration, we'll use pre-computed example values: ```{r} # Example genetic LR values (pre-computed) # These represent typical values from parent-child testing set.seed(42) genetic_df <- data.frame( Related = 10^rnorm(100, mean = 3, sd = 1.5), Unrelated = 10^rnorm(100, mean = -0.5, sd = 1) ) cat("Summary of log10(LR) under H1 (Related):\n") summary(log10(genetic_df$Related)) cat("\nSummary of log10(LR) under H2 (Unrelated):\n") summary(log10(genetic_df$Unrelated)) ``` ### Step 2.3: Visualize Genetic LR Distributions ```{r fig.height=4} # Plot the LR distributions plot_lr_distribution(genetic_df) ``` ## Part 3: Combining All Evidence ### Step 3.1: Prior Probability We need to specify a prior probability that a random POI is the MP. This depends on the size of the candidate population: ```{r} # If there are ~10,000 potential candidates prior_prob <- 1/10000 # Convert to prior odds prior_odds <- prior_prob / (1 - prior_prob) cat("Prior probability:", prior_prob, "\n") cat("Prior odds:", prior_odds, "\n") ``` ### Step 3.2: Posterior Odds Calculation The posterior odds combine all evidence: $$\text{Posterior Odds} = \text{Prior Odds} \times LR_{genetic} \times LR_{nongenetic}$$ ```{r} # For the simulations under H1 (true match scenario) posterior_h1 <- prior_odds * genetic_df$Related * lr_nongenetic # For simulations under H2 (no match scenario) posterior_h2 <- prior_odds * genetic_df$Unrelated * lr_nongenetic # Summary cat("Posterior odds under H1 (median):", round(median(posterior_h1), 4), "\n") cat("Posterior odds under H2 (median):", round(median(posterior_h2), 6), "\n") ``` ### Step 3.3: Decision Threshold Analysis ```{r fig.height=4} # Find optimal threshold with weight 10 (FP 10x worse than FN) threshold_result <- decision_threshold( datasim = genetic_df, weight = 10 ) # Calculate error rates at the optimal threshold rates <- threshold_rates( datasim = genetic_df, threshold = threshold_result ) # Check rates at different thresholds cat("\nError rates at different thresholds:\n") for (t in c(1, 10, 100, 1000)) { r <- threshold_rates(genetic_df, threshold = t) cat(sprintf("LR > %5d: FPR=%.3f, FNR=%.3f, MCC=%.3f\n", t, r$FPR, r$FNR, r$MCC)) } # Plot decision curve plot_decision_curve( datasim = genetic_df, LRmax = 10000 ) ``` ## Part 4: Interpretation ### Strength of Evidence Categories The LR quantifies how many times more likely the evidence is under H1 vs H2: | Log10(LR) | LR Range | Interpretation | |-----------|----------|----------------| | < 0 | < 1 | Supports H2 (not the MP) | | 0-1 | 1-10 | Weak support for H1 | | 1-2 | 10-100 | Support for H1 | | 2-4 | 100-10,000 | Strong support for H1 | | > 4 | > 10,000 | Very strong support for H1 | ### Our Case Results ```{r} # Median genetic LR under H1 median_genetic_lr <- median(genetic_df$Related) # Total LR total_lr <- median_genetic_lr * lr_nongenetic log10_total <- log10(total_lr) cat("Genetic LR (median under H1):", round(median_genetic_lr, 0), "\n") cat("Non-genetic LR:", round(lr_nongenetic, 2), "\n") cat("Total combined LR:", round(total_lr, 0), "\n") cat("Log10(Total LR):", round(log10_total, 2), "\n") ``` ## Part 5: Interactive Exploration For interactive exploration of parameters and their effects, use the Shiny applications: ```{r eval=FALSE} # Basic CPT and LR visualization app_mispitools() # Advanced analysis with ROC curves app_lr_comparison() ``` ## Conclusion This workflow demonstrated how to: 1. Calculate LRs for non-genetic evidence (sex, age, hair color) 2. Simulate genetic LRs from pedigree-based relationships 3. Combine evidence using the Bayesian framework 4. Evaluate decision performance and error rates The key advantage of combining evidence types is increased discrimination power, particularly useful in cases where genetic evidence alone may be inconclusive due to: - Distant family relationships - Limited number of markers - Degraded samples ## Session Information ```{r} sessionInfo() ``` ## References Marsico FL, Vigeland MD, Egeland T, Herrera Pinero F (2021). "Making decisions in missing person identification cases with low statistical power." *Forensic Science International: Genetics*, 52, 102519. Marsico FL, et al. (2023). "Likelihood ratios for non-genetic evidence in missing person cases." *Forensic Science International: Genetics*, 66, 102891.