--- title: "causalDisco" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{causalDisco} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(causalDisco) ``` This vignette provides an overview of the causalDisco package, which offers tools for causal discovery from observational data. It covers the main features of the package, including various causal discovery algorithms, knowledge incorporation, and result visualization. # Running causal discovery algorithms We will for this section use the `num_data` dataset included in the package for demonstrating how to run causal discovery algorithms. It contains 5 numerical variables, `X1, X2, X3, Z`, and `Y`. ```{r load data} data(num_data) head(num_data) ``` To make the different causal graphs easier to interpret, we define a custom fixed layout for plotting the results: ```{r} plot_layout <- data.frame( name = c("Z", "X3", "X1", "X2", "Y"), x = c(0.00, 0.50, 0.00, 0.50, 0.25), y = c(0.0, 0.0, 0.5, 0.5, 1.0) ) ``` To run a causal discovery algorithm, we first define the algorithm using the corresponding function, and then pass it to the `disco()` function along with the data. Below we demonstrate this process using the Peter-Clark (PC) algorithm from bnlearn with Fisher's Z test: ```{r pc algorithm fisher z bnlearn} pc_bnlearn <- pc( engine = "bnlearn", # Use the bnlearn implementation test = "fisher_z", # Use Fisher's Z test for conditional independence alpha = 0.05 # Significance level for the test ) pc_result_bnlearn <- disco(data = num_data, method = pc_bnlearn) ``` We can visualize the results using `plot()`: ```{r} plot(pc_result_bnlearn, layout = plot_layout, main = "PC Fisher Z (bnlearn)") ``` The first notable feature of this plot is that some edges are directed, while others are undirected. For example, the edge from `X1` to `Y` is directed, indicating a causal effect of `X1` on `Y`. In contrast, the edge between `X1` and `X3` is undirected, indicating that the data alone do not provide sufficient information to determine the causal direction. Both orientations `X1 %-->% X3` and `X3 %-->% X1` are compatible with the observed conditional independencies. To view all engines available for a specific algorithm, you can see the documentation using `?pc`, where all options are listed under the `engine` argument. Instead of using bnlearn, we can also use the PC implementation from the pcalg package with the same test: ```{r pc algorithm fisher z pcalg} pc_pcalg <- pc( engine = "pcalg", # Use the pcalg implementation test = "fisher_z", # Use Fisher's Z test for conditional independence alpha = 0.05 # Significance level for the test ) pc_result_pcalg <- disco(data = num_data, method = pc_pcalg) plot(pc_result_pcalg, layout = plot_layout, main = "PC Fisher Z (pcalg)") ``` We see that the results using the PC algorithm implemented in bnlearn and pcalg gives the same output on this dataset. You can also use a different algorithm altogether, such as the GES algorithm. It follows the same pattern, however GES is a score-based algorithm, so instead of a test and an alpha level, we need to specify a score. Below we will use the Extended Bayesian Information Criterion (EBIC) score from Tetrad.
Tetrad Setup Instructions causalDisco provides an interface to the Java library [Tetrad](https://github.com/cmu-phil/tetrad) for causal discovery algorithms. To use algorithms from Tetrad you need to install a Java Development Kit (JDK) >= 21. We recommend Eclipse Temurin (OpenJDK), available at https://adoptium.net/en-GB/temurin/releases. When using the installer from the Temurin website, make sure to select the option to set the `JAVA_HOME` environment variable during installation, so rJava correctly detects the Java installation. For a simpler setup, we recommend using the rJavaEnv package, which provides a convenient function to install Java and configure the environment automatically for rJava. You can install Java using the `rJavaEnv::java_quick_install()` function: ```{r install java, eval=FALSE} # Use the development version of rJavaEnv from GitHub # pak::pak("e-kotov/rJavaEnv") rJavaEnv::java_quick_install(version = 25, distribution = "Temurin") ``` Once you have Java JDK set up correctly, the current supported version of Tetrad can then be installed by calling ```{r install tetrad, eval=FALSE} install_tetrad() ``` To verify everything is set up correctly you can run `verify_tetrad()`: ```{r check tetrad install} verify_tetrad() ``` Now you should be able to use Tetrad as an engine for the GES algorithm as shown in the code chunk below.
```{r ges algorithm ebic tetrad, eval = FALSE} if (verify_tetrad()$installed && verify_tetrad()$java_ok) { ges_tetrad <- ges( engine = "tetrad", # Use the Tetrad implementation score = "ebic" # Use the EBIC score ) ges_result_tetrad <- disco(data = num_data, method = ges_tetrad) plot(ges_result_tetrad, layout = plot_layout, main = "GES EBIC (Tetrad)") } ``` ```{r ges algorithm ebic tetrad plot code, echo=FALSE, eval = FALSE} # Code to generate the plot image on pkgdown and CRAN. # For pkgdown (comment out to avoid CRAN NOTE about unstated dependency used) use the function fig_settings # ragg::agg_png( # filename = "ges-ebic-tetrad-pkgdown.png", # width = fig_settings$fig.width, # height = fig_settings$fig.height, # units = "in", # res = fig_settings$dpi * fig_settings$fig.retina # ) # ges_result <- disco(data = num_data, method = ges(engine = "tetrad", score = "ebic")) # plot(ges_result, layout = plot_layout, main = "GES EBIC (Tetrad)") # dev.off() # For CRAN # Hardcoded since knitr::opts_current$get() is different when running in console. ragg::agg_png( filename = "ges-ebic-tetrad-cran.png", width = 3, height = 3, units = "in", res = 92 * 1 ) ges_result <- disco(data = num_data, method = ges(engine = "tetrad", score = "ebic")) plot(ges_result, layout = plot_layout, main = "GES EBIC (Tetrad)") dev.off() ``` ```{r ges-ebic-tetrad-include, echo=FALSE} img_file <- if (identical(Sys.getenv("IN_PKGDOWN"), "true")) { "ges-ebic-tetrad-pkgdown.png" } else { "ges-ebic-tetrad-cran.png" } knitr::include_graphics(img_file) ``` If you want to customize the plot appearance further, you can pass additional arguments to `plot()`. For example, to change the appearance of the nodes, you can use the `node_style` argument: ```{r custom plot} plot( pc_result_bnlearn, layout = plot_layout, main = "Customized plot", node_style = list( fill = "lightblue", # Fill color col = "darkblue", # Border color lwd = 2, # Border width padding = 4, # Text padding (mm) size = 1.2 # Size multiplier ) ) ``` For more details on customizing plots and generating TikZ code for LaTeX documents, see the [visualization vignette](visualization.html). Instead of using `plot()`, another way to view and analyze the results is to use the `print()` or `summary()` functions: ```{r view results} print(pc_result_bnlearn) summary(pc_result_bnlearn) ``` # Incorporating knowledge We will for this section use the dataset `tpc_example`, which contains variables, which are measured at three different life stages: childhood, youth, and old age. ```{r} data(tpc_example) head(tpc_example) ``` Since we know the temporal ordering of the variables, we can incorporate this background knowledge into the causal discovery algorithm. Specifically, we know that variables measured in childhood cannot be caused by variables measured in youth or old age, and variables measured in youth cannot be caused by variables measured in old age. Knowledge is encoded by creating a `knowledge` object via the `knowledge()` function. The first argument (optional, but recommended for name matching) specifies the dataset. Tiered knowledge can then be defined using the `tier()` function. Here, we illustrate this by creating a tiered knowledge structure based on life stages: ```{r prior knowledge} kn <- knowledge( tpc_example, tier( child ~ c("child_x1", "child_x2"), youth ~ starts_with("youth"), # tidyselect helper; equivalent to c("youth_x3", "youth_x4") oldage ~ starts_with("oldage") ) ) ``` For more details on how to define knowledge, see the [knowledge vignette](knowledge.html). You can view the `Knowledge` object using `print()`, `summary()` or `plot()`: ```{r view knowledge} print(kn) summary(kn) plot(kn, main = "Temporal Knowledge") ``` The plot displays vertical tiers, each enclosed in a shaded rectangle and labeled with the corresponding tier name at the top. We can then incorporate this knowledge into any algorithm like above. To do so, you need to pass the `knowledge` object as an argument to the `disco()` function. Here we use the Temporal Peter-Clark (tpc) algorithm from causalDisco with the regression-based information loss test: ```{r tpc algorithm with knowledge} tpc_method <- tpc( engine = "causalDisco", # Use the causalDisco implementation test = "reg" # Use the regression-based information loss test ) tpc_result <- disco(data = tpc_example, method = tpc_method, knowledge = kn) ``` Similarly, we can view the results using `print()`, `summary()` or `plot()`: ```{r view tpc results} print(tpc_result) summary(tpc_result) plot(tpc_result, main = "TPC reg_test with Temporal Knowledge (causalDisco)") ``` Like before, the tiered knowledge is reflected in the plot layout, with variables grouped by life stage. Additionally, you can customize the plot appearance further by passing additional arguments to `plot()`. # Next steps For more information about how to incorporate knowledge, see the [knowledge vignette](knowledge.html). For more information about causal discovery, see the [causal discovery vignette](causal-discovery.html). For more information about visualization options, see the [visualization vignette](visualization.html).