---
title: "Knowledge"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Knowledge}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(causalDisco)
```

This vignette demonstrates how to use the `knowledge()` function to incorporate prior knowledge into causal discovery
algorithms.
The different supported knowledge types are explained below, along with examples of how to create `Knowledge` objects and
use them with causal discovery methods. All knowledge types can be freely combined. Multiple calls or operators are
additive: each call adds new edges to the `Knowledge` object. For example if we require an edge from `A` to `B` and then
require an edge from A to C, the resulting `Knowledge` object will require both edges from `A` to `B` and from `A` to `C`.

At a conceptual level, all knowledge is represented as constraints on edges, specifying which edges are required and
forbidden. Some knowledge types provide higher-level abstractions for expressing common modeling
assumptions more conveniently.

# Required and forbidden knowledge

At the most basic level, prior knowledge is expressed as required or forbidden edges between variables.
These constraints apply to directed edges in the causal graph.

- Required edges specify that a directed edge must exist between two variables.
- Forbidden edges specify that a directed edge is not allowed between two variables.

These constraints are specified using the `%-->%` (required) and `%!-->%` (forbidden) operators, with the exclamation
mark (`!`) indicating negation of the edge, i.e. the absence of the edge. Conceptually, this could be written as
`%!(-->)%`, but we find this syntax too verbose.

## Specifying required and forbidden edges

Suppose we want to require an edge from A to B, from A to C, and forbid an edge from B to C:

```{r required and forbidden knowledge}
kn_1 <- knowledge(
  A %-->% c(B, C), # Require edges from A to B and A to C
  B %!-->% C # Forbid edge from B to C
)
```

This `Knowledge` object can be visualized:

```{r plot required and forbidden knowledge}
plot(kn_1)
```

The blue edge represents the required edge from A to B, while the red edge represents the forbidden edge
from B to C.

If one wishes to remove some edges (either required or forbidden) knowledge from an existing `Knowledge` object, the
`remove_edge()` function can be used. For example, to remove the required edge from A to B:

```{r remove required edge}
kn_1_removed <- remove_edge(kn_1, from = A, to = B)
plot(kn_1_removed)
```

## Specifying required and forbidden edges in a dataset

We will use the `tpc_example` dataset from the causalDisco package for the following examples:

```{r dataset required and forbidden knowledge}
data(tpc_example)
head(tpc_example)
```

We can pass the dataset to `knowledge` as the first argument, which also checks that the specified variables exist
in the dataset:

```{r required and forbidden knowledge with data}
kn_2 <- knowledge(
  tpc_example,
  child_x1 %-->% youth_x3, # Require edge from child_x1 to youth_x3
  child_x2 %!-->% oldage_x5 # Forbid edge from child_x2 to oldage_x5
)
```

This `Knowledge` object can also be visualized (we manually adjust the layout for better appearance):

```{r plot required and forbidden knowledge with data}
cg <- knowledge_to_caugi(kn_2)$caugi
layout <- caugi::caugi_layout_sugiyama(cg)
layout[6, 2] <- layout[4, 2]

plot(kn_2, layout = layout)
```

The plot then plots all variables in the dataset, with the required edges as blue edges and forbidden edges as
red edges.

### Using tidyselect helpers

To make specifying variables easier, you can use tidyselect helpers such as `starts_with`:

```{r required and forbidden knowledge with tidyselect}
kn_3 <- knowledge(
  tpc_example,
  starts_with("child") %-->% starts_with("youth"),
  starts_with("oldage") %!-->% starts_with("youth")
)
```

This means, that all variables starting with "child" are required to have edges to all variables starting with "youth",
and no variables starting with "oldage" can have edges to any variables starting with "youth". We can visualize this:

```{r plot required and forbidden knowledge with tidyselect}
plot(kn_3)
```

For a list of all available tidyselect helpers we refer to the [tidyselect reference documentation](https://tidyselect.r-lib.org/reference/index.html).

# Tiered knowledge

Tiered knowledge provides a higher-level abstraction for expressing systematic ordering assumptions, such as temporal
or logical precedence. Internally, tiered knowledge is translated into a collection of forbidden edges, but it is
exposed separately because it provides a concise and structured way to express common ordering assumptions.

For example, consider a dataset with three groups of variables: child, youth, and old. We may wish to enforce that
child variables precede youth variables, which in turn precede old variables. This can be expressed using tiered
knowledge.

Tiered knowledge enforces that edges may only point from earlier tiers to later tiers. Edges within the same tier are unrestricted unless additional knowledge is supplied.

## Creating a tiered `Knowledge` object

Suppose we observe variables over time: first the `A`'s, then the `B`'s, and finally the `C`'s.
This ordering implies that causal direction cannot go backward in time (e.g., `B`'s cannot cause `A`'s). 
A tiered `Knowledge` object captures this temporal structure by specifying tiers and their associated variables. 
If numeric tiers are used, lower numbers indicate earlier tiers; otherwise, tiers are ordered by their appearance.

The following specifications encode the same tier structure:

```{r tier knowledge}
kn <- knowledge(
  tier(
    1 ~ c(A1, A2),
    2 ~ c(B1, B2),
    3 ~ c(C1, C2)
  )
)

# Same object, since tiers are ordered numerically
kn_same <- knowledge(
  tier(
    1 ~ c(A1, A2),
    3 ~ c(C1, C2),
    2 ~ c(B1, B2)
  )
)

# Functionally equivalent, though not identical
kn_almost <- knowledge(
  tier(
    10 ~ c(A1, A2),
    30 ~ c(C1, C2),
    20 ~ c(B1, B2)
  )
)

# Again functionally equivalent
kn_also_almost <- knowledge(
  tier(
    A ~ c(A1, A2),
    B ~ c(B1, B2),
    C ~ c(C1, C2)
  )
)

# Has a letter, so tiers are ordered by appearance, thus functionally equivalent
kn_mixed <- knowledge(
  tier(
    3   ~ c(A1, A2),
    B   ~ c(B1, B2),
    1   ~ c(C1, C2)
  )
)
```

We can visualize the tiers:

```{r plot tier knowledge}
plot(kn)
```

The plot then shows the tiers as layers, with the earliest tiers to the left and latest to the right.

We can convert the meaning of the tiered knowledge into explicit forbidden edges using `convert_tiers_to_forbidden()`:

```{r convert tiers to forbidden}
kn_converted <- convert_tiers_to_forbidden(kn)
print(kn_converted)
plot(kn_converted)
```

Tidyselect helpers such as `starts_with` can also be used to define tiers in a concise way, just as with required and 
forbidden edges. Different tidyselect helpers can be freely combined within a tier definition using `+`. For example,
the following tiered `Knowledge` object defines two tiers, "young" and "old", by combining tidyselect helpers:

```{r tier knowledge with tidyselect}
kn_tier_tidyselect <- knowledge(
  tpc_example,
  tier(
    young ~ starts_with("child") + ends_with(c("3", "4")),
    old   ~ starts_with("old")
  )
)
plot(kn_tier_tidyselect)
```

# Exogenous variables knowledge

Exogenous variables are those that have no incoming edges in the causal graph.
That is, variables which are known causes but are not affected by other variables.
Exogenous variables can be specified using the `exogenous()` function within `knowledge()`.

## Specifying exogenous variables

The most natural usage is to supply the dataset so that the variables are checked for existence and selected correctly:

```{r exogenous knowledge}
kn_exo_1 <- knowledge(
  tpc_example,
  exogenous("child_x1")
)
```

Instead of `exogenous`, you can also use the shorthand function `exo()`. This `knowledge` object can be visualized:

```{r plot exogenous knowledge}
plot(kn_exo_1)
```

Below we add both `child_x1` and `child_x2` as exogenous variables using tidyselect helpers:

```{r exogenous knowledge with tidyselect}
kn_exo_2 <- knowledge(
  tpc_example,
  exo(starts_with("child"))
)
plot(kn_exo_2, layout = "bipartite", orientation = "columns")
```

# Combining different knowledge types

Different knowledge types can be freely combined in a single `knowledge` object. For example, we can combine tiered
knowledge with required and forbidden edges:

```{r combined knowledge}
kn_combined <- knowledge(
  tpc_example,
  tier(
    1 ~ starts_with("child"),
    2 ~ starts_with("youth"),
    3 ~ starts_with("oldage")
  ),
  child_x1 %-->% youth_x3,
  child_x1 %!-->% child_x2
)

plot(kn_combined)
```

# Using knowledge with causal discovery

Once prior knowledge has been specified, it can be supplied to causal discovery algorithms by passing the `knowledge`
object to the `disco()` function via the `knowledge` parameter. For example, we can use the Temporal GES algorithm
`tges()` with engine "causalDisco" and temporal BIC ("tbic"):

```{r causal discovery with tier knowledge}
kn <- knowledge(
  tpc_example,
  tier(
    1 ~ starts_with("child"),
    2 ~ starts_with("youth"),
    3 ~ starts_with("oldage")
  )
)

cd_tges <- tges(engine = "causalDisco", score = "tbic")
disco_cd_tges <- disco(data = tpc_example, method = cd_tges, knowledge = kn)
```

The causal discovery algorithms respects the provided knowledge. We can plot the resulting causal graph:

```{r plot causal discovery with tier knowledge}
plot(disco_cd_tges)
```

The black edges are those inferred from the data.

# Engine specific information about knowledge

By engine we mean the underlying implementation of the causal discovery algorithm, i.e. the engine you specify
to an algorithm such as `pc(engine = "bnlearn")` or `tges(engine = "causalDisco")`.

## bnlearn

All knowledge types are supported with bnlearn engine. When required knowledge is provided,
bnlearn may emit a warning during structure learning. This occurs when the algorithm identifies a candidate v-structure
(collider) from the data whose orientation conflicts with edges already oriented due to background knowledge.

```{r bnlearn}
data(tpc_example)

kn <- knowledge(
  tpc_example,
  child_x1 %-->% youth_x3
)

bnlearn_pc <- pc(engine = "bnlearn", test = "fisher_z", alpha = 0.05)
output <- disco(data = tpc_example, method = bnlearn_pc, knowledge = kn)
```

The resulting causal graph will still respect the provided knowledge.

```{r plot bnlearn}
plot(output)
```

## causalDisco

causalDisco engine only supports tiered and forbidden knowledge. If required knowledge is provided, it will give
a warning and ignore the required knowledge.

## pcalg

Only forbidden symmetric knowledge is supported for pcalg engine. That is, edges that are forbidden in both directions.
Thus, the only type of knowledge that can be used with pcalg is knowledge created using forbidden edges (`%!-->%`)
without any required or tier knowledge. An example is shown below:

```{r pcalg}
data(tpc_example)
kn <- knowledge(
  tpc_example,
  child_x1 %!-->% youth_x3,
  youth_x3 %!-->% child_x1
)
pc_pcalg <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
output <- disco(data = tpc_example, method = pc_pcalg, knowledge = kn)
```

## Tetrad

All knowledge types are supported with Tetrad engine.