---
title: "Knowledge"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Knowledge}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(causalDisco)
```

This vignette demonstrates how to use the `knowledge()` function to incorporate prior knowledge into causal discovery algorithms.
The supported knowledge types are explained below, along with examples of how to create `Knowledge` objects and use them with causal discovery methods.

All knowledge types can be freely combined. Multiple calls or operators are additive: each one adds new edge constraints to the `Knowledge` object.
For example, if we require an edge from `A` to `B` and then require an edge from `A` to `C`, the resulting `Knowledge` object requires both the edge from `A` to `B` and the edge from `A` to `C`.

At a conceptual level, all knowledge is represented as constraints on edges, specifying which edges are required and which are forbidden.
Some knowledge types provide higher-level abstractions for expressing common modeling assumptions more conveniently.

# Required and forbidden knowledge

At the most basic level, prior knowledge is expressed as required or forbidden edges between variables.
These constraints apply to directed edges in the causal graph.

- Required edges specify that a directed edge must exist between two variables.
- Forbidden edges specify that a directed edge is not allowed between two variables.

These constraints are specified using the `%-->%` (required) and `%!-->%` (forbidden) operators, with the exclamation mark (`!`) indicating negation of the edge, i.e. the absence of the edge.
Conceptually, this could be written as `%!(-->)%`, but we find this syntax too verbose.

## Specifying required and forbidden edges

Suppose we want to require edges from A to B and from A to C, and forbid an edge from B to C:

```{r required and forbidden knowledge}
kn_1 <- knowledge(
  A %-->% c(B, C), # Require edges from A to B and A to C
  B %!-->% C # Forbid edge from B to C
)
```

This `Knowledge` object can be visualized:

```{r plot required and forbidden knowledge}
plot(kn_1)
```

The blue edges represent the required edges from A to B and from A to C, while the red edge represents the forbidden edge from B to C.

To remove edge knowledge (either required or forbidden) from an existing `Knowledge` object, use the `remove_edge()` function.
For example, to remove the required edge from A to B:

```{r remove required edge}
kn_1_removed <- remove_edge(kn_1, from = A, to = B)
plot(kn_1_removed)
```

## Specifying required and forbidden edges in a dataset

We will use the `tpc_example` dataset from the causalDisco package for the following examples:

```{r dataset required and forbidden knowledge}
data(tpc_example)
head(tpc_example)
```

We can pass the dataset to `knowledge()` as the first argument, which also checks that the specified variables exist in the dataset:

```{r required and forbidden knowledge with data}
kn_2 <- knowledge(
  tpc_example,
  child_x1 %-->% youth_x3, # Require edge from child_x1 to youth_x3
  child_x2 %!-->% oldage_x5 # Forbid edge from child_x2 to oldage_x5
)
```

This `Knowledge` object can also be visualized (we manually adjust the layout for better appearance):

```{r plot required and forbidden knowledge with data}
cg <- knowledge_to_caugi(kn_2)$caugi
layout <- caugi::caugi_layout_sugiyama(cg)
layout[6, 2] <- layout[4, 2]
plot(kn_2, layout = layout)
```

The plot shows all variables in the dataset, with required edges in blue and forbidden edges in red.

### Using tidyselect helpers

To make specifying variables easier, you can use tidyselect helpers such as `starts_with()`:

```{r required and forbidden knowledge with tidyselect}
kn_3 <- knowledge(
  tpc_example,
  starts_with("child") %-->% starts_with("youth"),
  starts_with("oldage") %!-->% starts_with("youth")
)
```

This means that all variables starting with "child" are required to have edges to all variables starting with "youth", and no variable starting with "oldage" can have an edge to any variable starting with "youth".

We can visualize this:

```{r plot required and forbidden knowledge with tidyselect}
plot(kn_3)
```

For a list of all available tidyselect helpers, see the [tidyselect reference documentation](https://tidyselect.r-lib.org/reference/index.html).

# Tiered knowledge

Tiered knowledge provides a higher-level abstraction for expressing systematic ordering assumptions, such as temporal or logical precedence.
Internally, tiered knowledge is translated into a collection of forbidden edges, but it is exposed separately because it provides a concise and structured way to express common ordering assumptions.

For example, consider a dataset with three groups of variables: child, youth, and old.
We may wish to enforce that child variables precede youth variables, which in turn precede old variables.
This can be expressed using tiered knowledge.

Tiered knowledge enforces that edges may only point from earlier tiers to later tiers.
Edges within the same tier are unrestricted unless additional knowledge is supplied.

## Creating a tiered `Knowledge` object

Suppose we observe variables over time: first the `A`'s, then the `B`'s, and finally the `C`'s.
This ordering implies that causal direction cannot go backward in time (e.g., `B`'s cannot cause `A`'s).
A tiered `Knowledge` object captures this temporal structure by specifying tiers and their associated variables.

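The rule that edges may only point from earlier tiers to later tiers can be illustrated with a small base R sketch. This is not causalDisco's implementation: `tiers_to_forbidden_sketch()` is a hypothetical helper written only for this illustration, and it assumes the tiers are given as a named list of character vectors, ordered from earliest to latest.

```r
# Hypothetical helper (not part of causalDisco): list every directed edge that
# a tier ordering rules out, i.e. every edge from a later tier to an earlier one.
tiers_to_forbidden_sketch <- function(tiers) {
  vars <- unlist(tiers, use.names = FALSE)
  # Position of the tier each variable belongs to (1 = earliest)
  tier_of <- rep(seq_along(tiers), lengths(tiers))
  # All ordered pairs of variable indices (from, to)
  idx <- expand.grid(from = seq_along(vars), to = seq_along(vars))
  # Keep only the pairs that point backward in the ordering
  backward <- idx[tier_of[idx$from] > tier_of[idx$to], ]
  data.frame(
    from = vars[backward$from],
    to = vars[backward$to],
    stringsAsFactors = FALSE
  )
}

tiers <- list(A = c("A1", "A2"), B = c("B1", "B2"), C = c("C1", "C2"))
forbidden <- tiers_to_forbidden_sketch(tiers)
nrow(forbidden) # 12 backward edges: 4 (B to A) + 4 (C to A) + 4 (C to B)
```

Edges within a tier (for example `A1` to `A2`) and forward edges (for example `A1` to `B1`) do not appear in the result, which matches the statement that they are left unrestricted.
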
If numeric tiers are used, lower numbers indicate earlier tiers; otherwise, tiers are ordered by their appearance.
The following specifications encode the same tier structure:

```{r tier knowledge}
kn <- knowledge(
  tier(
    1 ~ c(A1, A2),
    2 ~ c(B1, B2),
    3 ~ c(C1, C2)
  )
)

# Same object, since tiers are ordered numerically
kn_same <- knowledge(
  tier(
    1 ~ c(A1, A2),
    3 ~ c(C1, C2),
    2 ~ c(B1, B2)
  )
)

# Functionally equivalent, though not identical
kn_almost <- knowledge(
  tier(
    10 ~ c(A1, A2),
    30 ~ c(C1, C2),
    20 ~ c(B1, B2)
  )
)

# Again functionally equivalent
kn_also_almost <- knowledge(
  tier(
    A ~ c(A1, A2),
    B ~ c(B1, B2),
    C ~ c(C1, C2)
  )
)

# One label is not numeric, so tiers are ordered by appearance;
# thus functionally equivalent
kn_mixed <- knowledge(
  tier(
    3 ~ c(A1, A2),
    B ~ c(B1, B2),
    1 ~ c(C1, C2)
  )
)
```

We can visualize the tiers:

```{r plot tier knowledge}
plot(kn)
```

The plot shows the tiers as layers, with the earliest tier on the left and the latest on the right.

We can convert the tiered knowledge into explicit forbidden edges using `convert_tiers_to_forbidden()`:

```{r convert tiers to forbidden}
kn_converted <- convert_tiers_to_forbidden(kn)
print(kn_converted)
plot(kn_converted)
```

Tidyselect helpers such as `starts_with()` can also be used to define tiers concisely, just as with required and forbidden edges.
Different tidyselect helpers can be freely combined within a tier definition using `+`.
For example, the following tiered `Knowledge` object defines two tiers, "young" and "old", by combining tidyselect helpers:

```{r tier knowledge with tidyselect}
kn_tier_tidyselect <- knowledge(
  tpc_example,
  tier(
    young ~ starts_with("child") + ends_with(c("3", "4")),
    old ~ starts_with("old")
  )
)
plot(kn_tier_tidyselect)
```

# Exogenous variables knowledge

Exogenous variables are those that have no incoming edges in the causal graph.
That is, they can be causes of other variables but are not themselves affected by other variables.

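Read as edge constraints, "no incoming edges" amounts to forbidding every edge that points into an exogenous variable. The base R sketch below illustrates that reading only. It is not how causalDisco stores exogenous knowledge; `exogenous_to_forbidden_sketch()` and the variable names `X`, `Y`, and `Z` are hypothetical.

```r
# Hypothetical helper (not part of causalDisco): list the incoming edges that
# declaring `exo_vars` as exogenous would rule out.
exogenous_to_forbidden_sketch <- function(all_vars, exo_vars) {
  # Every ordered pair whose target is an exogenous variable
  grid <- expand.grid(from = all_vars, to = exo_vars, stringsAsFactors = FALSE)
  # Drop self-pairs, which are not edges between two distinct variables
  grid <- grid[grid$from != grid$to, ]
  rownames(grid) <- NULL
  grid
}

forbidden_in <- exogenous_to_forbidden_sketch(
  all_vars = c("X", "Y", "Z"),
  exo_vars = "X"
)
forbidden_in # two rows: Y to X and Z to X
```

Edges leaving `X` are not listed, so `X` remains free to cause `Y` and `Z`.
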
Exogenous variables can be specified using the `exogenous()` function within `knowledge()`.

## Specifying exogenous variables

The most natural usage is to supply the dataset so that the variables are checked for existence and selected correctly:

```{r exogenous knowledge}
kn_exo_1 <- knowledge(
  tpc_example,
  exogenous("child_x1")
)
```

Instead of `exogenous()`, you can also use the shorthand function `exo()`.

This `Knowledge` object can be visualized:

```{r plot exogenous knowledge}
plot(kn_exo_1)
```

Below we add both `child_x1` and `child_x2` as exogenous variables using tidyselect helpers:

```{r exogenous knowledge with tidyselect}
kn_exo_2 <- knowledge(
  tpc_example,
  exo(starts_with("child"))
)
plot(kn_exo_2, layout = "bipartite", orientation = "columns")
```

# Combining different knowledge types

Different knowledge types can be freely combined in a single `Knowledge` object.
For example, we can combine tiered knowledge with required and forbidden edges:

```{r combined knowledge}
kn_combined <- knowledge(
  tpc_example,
  tier(
    1 ~ starts_with("child"),
    2 ~ starts_with("youth"),
    3 ~ starts_with("oldage")
  ),
  child_x1 %-->% youth_x3,
  child_x1 %!-->% child_x2
)
plot(kn_combined)
```

# Using knowledge with causal discovery

Once prior knowledge has been specified, it can be supplied to causal discovery algorithms by passing the `Knowledge` object to the `disco()` function via the `knowledge` argument.

For example, we can use the Temporal GES algorithm `tges()` with the "causalDisco" engine and the temporal BIC score ("tbic"):

```{r causal discovery with tier knowledge}
kn <- knowledge(
  tpc_example,
  tier(
    1 ~ starts_with("child"),
    2 ~ starts_with("youth"),
    3 ~ starts_with("oldage")
  )
)

cd_tges <- tges(engine = "causalDisco", score = "tbic")
disco_cd_tges <- disco(data = tpc_example, method = cd_tges, knowledge = kn)
```

The causal discovery algorithm respects the provided knowledge.

We can plot the resulting causal graph:

```{r plot causal discovery with tier knowledge}
plot(disco_cd_tges)
```

The black edges are those inferred from the data.

# Engine-specific information about knowledge

By engine we mean the underlying implementation of the causal discovery algorithm, i.e. the engine you specify to an algorithm such as `pc(engine = "bnlearn")` or `tges(engine = "causalDisco")`.

## bnlearn

All knowledge types are supported with the bnlearn engine.

When required knowledge is provided, bnlearn may emit a warning during structure learning.
This occurs when the algorithm identifies a candidate v-structure (collider) from the data whose orientation conflicts with edges already oriented due to background knowledge.

```{r bnlearn}
data(tpc_example)
kn <- knowledge(
  tpc_example,
  child_x1 %-->% youth_x3
)

bnlearn_pc <- pc(engine = "bnlearn", test = "fisher_z", alpha = 0.05)
output <- disco(data = tpc_example, method = bnlearn_pc, knowledge = kn)
```

The resulting causal graph will still respect the provided knowledge.

```{r plot bnlearn}
plot(output)
```

## causalDisco

The causalDisco engine supports only tiered and forbidden knowledge.
If required knowledge is provided, the engine gives a warning and ignores the required knowledge.

## pcalg

The pcalg engine supports only symmetric forbidden knowledge, that is, edges that are forbidden in both directions.
Thus, the only type of knowledge that can be used with pcalg is knowledge created using forbidden edges (`%!-->%`), with each edge forbidden in both directions and without any required or tier knowledge.
An example is shown below:

```{r pcalg}
data(tpc_example)
kn <- knowledge(
  tpc_example,
  child_x1 %!-->% youth_x3,
  youth_x3 %!-->% child_x1
)

pc_pcalg <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
output <- disco(data = tpc_example, method = pc_pcalg, knowledge = kn)
```

## Tetrad

All knowledge types are supported with the Tetrad engine.