---
title: "Getting Started with bibnets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with bibnets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(bibnets)
```

## What bibnets Builds

Bibliometric data usually arrives as a table of papers. Each paper has fields
such as authors, references, keywords, countries, institutions, source title,
and year. Most bibliometric networks are projections of those fields.

The core idea is simple:

1. Build a sparse `papers x entities` matrix.
2. Weight the matrix with a counting method.
3. Multiply the matrix to obtain entity-entity or paper-paper links.
4. Return a standard edge list.
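
The four steps above can be sketched in base R on a toy corpus. This is only an illustration of the idea, not the package's code: the `toy_*` names and the data are invented for the example, a dense matrix stands in for the sparse one, and full counting is assumed so that step 2 changes nothing.

```{r}
# Toy corpus: three papers with a keyword list-column (illustrative data only).
toy_papers <- data.frame(id = c("P1", "P2", "P3"))
toy_papers$keywords <- list(
  c("NETWORKS", "METHODS"),
  c("NETWORKS", "R"),
  c("NETWORKS", "METHODS", "R")
)

# Step 1: binary papers x entities incidence matrix.
toy_entities <- sort(unique(unlist(toy_papers$keywords)))
toy_B <- t(vapply(
  toy_papers$keywords,
  function(k) as.numeric(toy_entities %in% k),
  numeric(length(toy_entities))
))
dimnames(toy_B) <- list(toy_papers$id, toy_entities)

# Step 2: full counting leaves the binary matrix unchanged.
toy_W <- toy_B

# Step 3: t(W) %*% W counts the papers shared by each pair of entities.
toy_C <- crossprod(toy_W)

# Step 4: keep the upper triangle as an undirected edge list.
toy_idx <- which(upper.tri(toy_C) & toy_C > 0, arr.ind = TRUE)
toy_edges <- data.frame(
  from = rownames(toy_C)[toy_idx[, "row"]],
  to = colnames(toy_C)[toy_idx[, "col"]],
  weight = toy_C[toy_idx]
)
toy_edges
```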

Internally, this is the package pipeline:

```
build_bipartite()
  -> apply_counting()
  -> multiply_bipartite()
  -> as_bibnets_network()
```

The exported builders wrap that pipeline for common bibliometric questions:

| Function | Nodes | Link means |
|---|---|---|
| `author_network()` | authors | co-authorship, author coupling, or author co-citation |
| `reference_network()` | cited references | references are cited together |
| `document_network()` | documents | shared references, shared citers, or direct citation |
| `keyword_network()` | keywords | keywords appear together in papers |
| `source_network()` | journals/sources | sources share references or are co-cited |
| `country_network()` | countries | countries collaborate or share references |
| `institution_network()` | institutions | institutions collaborate or share references |
| `conetwork()` | any field | entities co-occur or share values of another field |
| `local_citations()` | documents | local citation counts inside the corpus |
| `historiograph()` | documents | directed citation history among locally cited papers |
| `temporal_network()` | any builder's nodes | the same network repeated over time windows |

Every network builder returns a `bibnets_network`: a data frame with columns
`from`, `to`, `weight`, and `count`.

`count` is the raw co-occurrence count, taken from the unweighted (binary)
matrix. `weight` is the analytical weight after counting and optional
similarity normalization.

## Data Used in This Vignette

The package includes small and medium example datasets:

```{r}
data(biblio_data)
data(scopus_quantum_cloud)
data(open_alex_gold_open_access_learning_analytics)

small <- biblio_data
sc <- scopus_quantum_cloud
oa <- open_alex_gold_open_access_learning_analytics

nrow(small)
nrow(sc)
nrow(oa)
```

`biblio_data` is a tiny synthetic dataset. `scopus_quantum_cloud` contains 499
Scopus records. `open_alex_gold_open_access_learning_analytics` contains 1,508
OpenAlex records with authors, countries, institutions, and primary topics.

## Reading Your Own Data

For files, use `read_biblio()`:

```{r, eval = FALSE}
data <- read_biblio("export.csv")
data <- read_biblio("folder_with_exports/")
data <- read_biblio(c("part_1.csv", "part_2.csv"))
```

`read_biblio()` detects common formats from file content. You can also call a
reader directly:

```{r, eval = FALSE}
read_scopus("scopus.csv")
read_wos("savedrecs.txt")
read_openalex_csv("openalex_works.csv")
read_dimensions("dimensions.csv")
read_lens("lens.csv")
read_bibtex("library.bib")
read_ris("library.ris")
```

For a custom CSV, specify the identifier and the columns that should be split
into list-columns:

```{r, eval = FALSE}
data <- read_biblio(
  "custom.csv",
  format = "generic",
  id = "paper_id",
  actors = c("Authors", "Keywords"),
  sep = ";"
)
```

Readers return a common schema where possible:

```{r}
names(sc)[1:12]
```

The most important columns for network construction are:

- `id`: the document identifier.
- `authors`: a list-column of author names.
- `references`: a list-column of cited references or cited work IDs.
- `keywords`: a list-column of author/index/topic keywords.
- `year`: the time variable used by `temporal_network()`.

Source-specific columns such as `countries`, `affiliations`, `index_keywords`,
and `keywords_plus` are preserved when available.

## Author Collaboration

The simplest author network links two authors when they appear on the same
paper:

```{r}
authors_full <- author_network(oa, type = "collaboration")
head(authors_full, 5)
```

The result has the standard schema. `summary()` gives a compact overview:

```{r}
summary(authors_full)
```

Use `min_occur` to remove very rare authors before projection:

```{r}
authors_core <- author_network(oa, "collaboration", min_occur = 2)
nrow(authors_full)
nrow(authors_core)
```

## Counting Methods

Counting determines how much a paper contributes to edge weights.

Full counting gives every observed co-occurrence a weight of 1:

```{r}
head(author_network(small, "collaboration", counting = "full"), 5)
```

Fractional counting reduces the influence of long author lists:

```{r}
head(author_network(small, "collaboration", counting = "fractional"), 5)
```

Harmonic counting gives more credit to earlier byline positions while keeping
the paper's total credit normalized:

```{r}
head(author_network(small, "collaboration", counting = "harmonic"), 5)
```

First-last counting is useful only when the field's authorship conventions make
both first and last positions meaningful:

```{r}
head(author_network(small, "collaboration", counting = "first_last"), 5)
```

The correct method depends on the claim being made. Use `full` when the
question is about observed collaboration events. Use `fractional` when papers
with many entities should not dominate. Use position-dependent methods only
when author order is analytically meaningful.
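
To make the difference concrete, the sketch below computes the credit each byline position receives on a single four-author paper under the textbook definitions of full, fractional, and harmonic counting. The `credit_*` helpers are hypothetical and written for this example; they show per-author shares only, and the way bibnets combines such shares into edge weights may differ.

```{r}
# Per-author credit for one paper with n authors (textbook definitions).
credit_full <- function(n) rep(1, n)
credit_fractional <- function(n) rep(1 / n, n)
credit_harmonic <- function(n) {
  raw <- 1 / seq_len(n)  # 1, 1/2, 1/3, ... by byline position
  raw / sum(raw)         # rescale so the paper's credit sums to 1
}

credit_table <- round(rbind(
  full = credit_full(4),
  fractional = credit_fractional(4),
  harmonic = credit_harmonic(4)
), 3)
credit_table
```

Fractional and harmonic shares both sum to 1 per paper, so a long author list cannot contribute more total credit than a short one. Only the harmonic row depends on position.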

## Attention-Style Position Weights

The `attention` argument applies a smooth position profile. It is separate from
`counting` and is available for author, keyword, country, and institution
networks.

```{r}
lead <- author_network(small, attention = "lead")
last <- author_network(small, attention = "last")

head(lead, 5)
head(last, 5)
```

The four profiles are:

| `attention` | Highest weight |
|---|---|
| `"lead"` | first position |
| `"last"` | last position |
| `"proximity"` | middle positions |
| `"circular"` | first and last positions |

Use attention weighting when the analysis needs a transparent positional
assumption rather than a named bibliometric counting convention.

## Reference Co-citation

Co-citation links two cited references when they are cited together by at least
one paper:

```{r}
refs <- reference_network(sc, min_occur = 2)
head(refs, 5)
```

Co-citation is a column-mode projection of the `papers x references` matrix.
The nodes are references; the links come from papers that cite both references.
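
As a small illustration with an invented citation matrix (the `cite_M` name and values are assumptions for this example), the column-mode projection is `crossprod()` in base R. The row-mode projection of the same matrix, `tcrossprod()`, counts the references that two papers share.

```{r}
# Illustrative papers x references matrix: 1 means the paper cites the reference.
cite_M <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("R1", "R2", "R3"))
)

# Column-mode projection: references x references.
# Off-diagonal cells count the papers that cite both references.
crossprod(cite_M)

# Row-mode projection of the same matrix: papers x papers.
# Off-diagonal cells count the references two papers share.
tcrossprod(cite_M)
```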

Similarity normalization can reduce the advantage of very frequently cited
references:

```{r}
refs_cos <- reference_network(sc, min_occur = 2, similarity = "cosine")
head(refs_cos, 5)
```

## Document Coupling and Citation

Bibliographic coupling links two documents when they share cited references:

```{r}
coupled_docs <- document_network(sc, type = "coupling", similarity = "cosine")
head(coupled_docs, 5)
```

Direct citation is different. It keeps direction: `from` is the citing document
and `to` is the cited document, but only when both documents are inside the same
corpus.

```{r}
direct_docs <- document_network(sc, type = "citation")
head(direct_docs, 5)
```

Many exported datasets cite external works that are not themselves rows in the
dataset. Those external citations support co-citation and coupling, but they do
not become direct-citation edges unless the cited work is also present in `id`.
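
The filtering rule can be sketched in base R. The `dc_*` objects below are invented for illustration and assume that reference values are directly comparable to `id`; the matching done inside `document_network()` may be more involved.

```{r}
# Illustrative corpus: P3 cites P1 (in the corpus) and X9 (external).
dc_papers <- data.frame(id = c("P1", "P2", "P3"))
dc_papers$references <- list(character(0), "P1", c("P1", "X9"))

# One row per citing -> cited pair.
dc_pairs <- data.frame(
  from = rep(dc_papers$id, lengths(dc_papers$references)),
  to = unlist(dc_papers$references)
)

# Only pairs whose target is also a row of the corpus become edges.
dc_edges <- dc_pairs[dc_pairs$to %in% dc_papers$id, ]
dc_edges
```

The citation from `P3` to `X9` is dropped because `X9` is not a document in the corpus.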

## Keyword Co-occurrence

Keyword networks are often the quickest way to inspect a corpus thematically:

```{r}
kw <- keyword_network(sc, min_occur = 2)
head(kw, 5)
```

Entity labels are trimmed and uppercased during matrix construction. This means
that `machine learning`, `Machine Learning`, and ` MACHINE LEARNING ` resolve
to the same node.
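
In base R terms, trimming and uppercasing corresponds to `toupper(trimws(x))`. The snippet below only illustrates that rule on the three variants mentioned above; any further cleaning the package may perform is not shown.

```{r}
label_variants <- c("machine learning", "Machine Learning", " MACHINE LEARNING ")
label_clean <- unique(toupper(trimws(label_variants)))
label_clean
```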

Association strength is commonly useful for co-occurrence maps because it
downweights pairs that are common only because both keywords are individually
frequent:

```{r}
kw_assoc <- keyword_network(sc, min_occur = 2, similarity = "association")
head(kw_assoc, 5)
```

## Countries, Institutions, and Sources

OpenAlex-style data often contains country and institution list-columns:

```{r}
country_edges <- country_network(oa, counting = "fractional")
head(country_edges, 5)

inst_edges <- institution_network(oa, counting = "fractional", min_occur = 2)
head(inst_edges, 5)
```

Source networks use `journal` as the entity field. Coupling links sources that
cite the same references:

```{r}
source_edges <- source_network(sc, type = "coupling", min_occur = 2)
head(source_edges, 5)
```

For source, country, institution, and author coupling, `min_occur` is applied
to the aggregated entity before building the coupling network.

## Generic Co-networks

Use `conetwork()` when you want a projection not covered by a dedicated helper.

One-field use:

```{r}
head(conetwork(sc, "keywords", min_occur = 2), 5)
```

Two-field use:

```{r}
head(conetwork(sc, "authors", by = "keywords", min_occur = 2), 5)
```

The second example links authors through shared keywords. This is not a
co-authorship network; it is a thematic-similarity network between authors.

Delimited character columns are split automatically:

```{r}
toy <- data.frame(
  id = c("P1", "P2", "P3"),
  tags = c("methods; networks", "networks; R", "methods; R")
)

conetwork(toy, "tags")
```

## Normalization

The same raw counts can support different similarity scores:

```{r}
none <- keyword_network(sc, min_occur = 2, similarity = "none")
cos  <- keyword_network(sc, min_occur = 2, similarity = "cosine")

head(none[, c("from", "to", "weight", "count")], 3)
head(cos[, c("from", "to", "weight", "count")], 3)
```

Notice that `count` is unchanged. The `weight` column changes because
normalization is applied after raw co-occurrence has been counted.

`normalize()` applies the same methods directly to a matrix:

```{r}
normalize(to_matrix(keyword_network(small)), "cosine")
```

In practice:

- Use `similarity = "none"` for raw weighted counts.
- Use `similarity = "cosine"` for interpretable overlap scaled by marginal
  size.
- Use `similarity = "association"` when the goal is to emphasize pairs that
  co-occur more than expected from their individual frequencies.
- Use `jaccard`, `inclusion`, or `equivalence` when those coefficients match a
  downstream method or established reporting convention.
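
For orientation, the sketch below writes out the usual textbook forms of these coefficients on an invented co-occurrence matrix whose diagonal holds each entity's occurrence count. The `sim_*` names are assumptions for this example, and the scaling used by `normalize()` may differ (association strength, for instance, is sometimes multiplied by a constant).

```{r}
# Illustrative co-occurrence matrix: diagonal = occurrences,
# off-diagonal = co-occurrences.
sim_C <- matrix(
  c(4, 2, 1,
    2, 3, 1,
    1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)
sim_s <- diag(sim_C)  # occurrences of each entity

sim_cosine <- sim_C / sqrt(outer(sim_s, sim_s))
sim_association <- sim_C / outer(sim_s, sim_s)
sim_jaccard <- sim_C / (outer(sim_s, sim_s, "+") - sim_C)
sim_inclusion <- sim_C / outer(sim_s, sim_s, pmin)
sim_equivalence <- sim_C^2 / outer(sim_s, sim_s)

round(sim_cosine, 3)
```

In every case the raw matrix `sim_C` is left untouched, which mirrors how `count` stays fixed while `weight` changes.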

## Reducing Large Networks

Dense co-occurrence networks can be hard to inspect. `bibnets` provides several
reduction strategies: a global weight threshold, per-node top-n pruning,
top-node filtering, and backbone extraction.

```{r}
edges <- author_network(oa, "collaboration")

nrow(edges)
nrow(prune(edges, threshold = 2))
nrow(prune(edges, top_n = 5))
nrow(filter_top(edges, n = 50))
```

`prune(threshold = x)` keeps edges with weight at least `x`.
`prune(top_n = k)` keeps locally strong edges for each endpoint.
`filter_top(n = k)` first selects the most connected nodes, then keeps edges
among them.

`backbone()` applies the disparity filter for multiscale weighted networks:

```{r}
bb <- backbone(edges, alpha = 0.05)
nrow(bb)
head(bb, 5)
```

The disparity filter asks whether an edge is unusually strong relative to at
least one endpoint's local strength distribution. This is different from a
global weight cutoff and can preserve meaningful edges attached to smaller
nodes.
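
A minimal base R sketch of the disparity test follows, using the standard formula `(1 - w / s)^(k - 1)` for an edge of weight `w` seen from an endpoint with strength `s` and degree `k`. The edge list and the `df_*` and `disparity_p` names are invented for illustration, and `backbone()` may differ in details such as how degree-1 nodes are treated.

```{r}
# Illustrative undirected weighted edge list: a hub with one strong tie.
df_edges <- data.frame(
  from = c("A", "A", "A", "A", "B"),
  to = c("B", "C", "D", "E", "C"),
  weight = c(20, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Degree and strength of every node, counting both endpoints.
df_ends <- c(df_edges$from, df_edges$to)
df_w2 <- rep(df_edges$weight, 2)
df_degree <- tapply(df_w2, df_ends, length)
df_strength <- tapply(df_w2, df_ends, sum)

# p-value of an edge seen from one endpoint: (1 - w / s)^(k - 1).
disparity_p <- function(node, w) {
  k <- as.vector(df_degree[node])
  s <- as.vector(df_strength[node])
  (1 - w / s)^(k - 1)
}

df_edges$p_from <- disparity_p(df_edges$from, df_edges$weight)
df_edges$p_to <- disparity_p(df_edges$to, df_edges$weight)

# Keep an edge when it is significant for at least one endpoint.
df_keep <- pmin(df_edges$p_from, df_edges$p_to) < 0.05
df_backbone <- df_edges[df_keep, c("from", "to", "weight")]
df_backbone
```

Only the `A`-`B` edge survives here, because it carries most of the strength of both endpoints. The weight-1 edges are unremarkable from either side.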

## Temporal Networks

`temporal_network()` runs any network builder over time windows:

```{r}
tn <- temporal_network(oa, author_network, "collaboration", window = 3)
names(tn)
```

Fixed windows are non-overlapping. Sliding windows overlap:

```{r}
tn_slide <- temporal_network(
  oa,
  author_network,
  "collaboration",
  window = 3,
  step = 1,
  strategy = "sliding"
)

names(tn_slide)
```

Cumulative windows always start at the first observed year and grow forward:

```{r}
tn_cum <- temporal_network(
  oa,
  author_network,
  "collaboration",
  window = 3,
  strategy = "cumulative"
)

names(tn_cum)
```

Each returned edge list has a `window` column. Windows with fewer than two
records or no surviving edges are omitted. If a builder errors inside a window,
`temporal_network()` reports a warning with the window label.
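
The three strategies differ only in how year ranges are generated. The sketch below computes ranges for an invented span of years with `window = 3`. The `win_*` names are assumptions, and the exact boundaries and labels produced by `temporal_network()` may differ.

```{r}
win_years <- 2015:2023
win_size <- 3

# Fixed: consecutive, non-overlapping blocks.
win_fixed_start <- seq(min(win_years), max(win_years), by = win_size)
win_fixed_end <- pmin(win_fixed_start + win_size - 1, max(win_years))
win_fixed <- paste(win_fixed_start, win_fixed_end, sep = "-")
win_fixed

# Sliding with step 1: every start year that still fits a full window.
win_slide_start <- seq(min(win_years), max(win_years) - win_size + 1, by = 1)
win_slide <- paste(win_slide_start, win_slide_start + win_size - 1, sep = "-")
win_slide

# Cumulative: the start stays fixed while the end grows by one window.
win_cum <- paste(min(win_years), win_fixed_end, sep = "-")
win_cum
```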

## Local Citations and Historiographs

`local_citations()` counts how often each document is cited by other documents
inside the same dataset:

```{r}
lcs <- local_citations(sc)
head(lcs, 5)
```

`historiograph()` builds a directed citation graph among the top locally cited
documents:

```{r}
h <- historiograph(sc, n = 10)
h$nodes
head(h$edges, 5)
```

This requires reference strings or IDs to match document IDs in the same data
frame. If the cited works are external to the corpus, local citation counts will
be low or zero even when global citation counts are high.

## Exporting Results

The default edge list is already useful for many tools:

```{r}
edges <- keyword_network(sc, min_occur = 2)
head(edges, 5)
```

Convert to a sparse matrix:

```{r}
m <- to_matrix(edges)
m[1:4, 1:4]
```

Prepare Gephi tables:

```{r}
gephi <- to_gephi(edges)
head(gephi$nodes, 3)
head(gephi$edges, 3)
```

Generate GraphML text without adding an XML dependency:

```{r}
xml <- to_graphml(edges)
cat(substr(xml, 1, 300))
```

Optional graph objects are available when the suggested packages are installed:

```{r, eval = FALSE}
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- to_igraph(edges)
}

if (requireNamespace("tidygraph", quietly = TRUE)) {
  tg <- to_tbl_graph(edges)
}

if (requireNamespace("cograph", quietly = TRUE)) {
  cg <- to_cograph(edges)
}
```

## Interpreting a `bibnets_network`

The object stores construction metadata as attributes:

```{r}
edges <- author_network(oa, "collaboration", counting = "harmonic")

attr(edges, "network_type")
attr(edges, "counting")
attr(edges, "similarity")
```

The `print()` method reports the network type, node count, edge count, counting
method, and similarity method. `summary()` reports basic network and weight
summaries:

```{r}
summary(edges)
```

These attributes are meant to make downstream output easier to audit. A saved
edge list should still say how it was produced.
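
One practical consequence follows from general R behaviour rather than anything specific to bibnets: attributes survive `saveRDS()` but not a CSV round trip. The sketch uses an invented plain data frame (`meta_edges`), not a real `bibnets_network`.

```{r}
# Illustrative edge list with a metadata attribute (not a real bibnets object).
meta_edges <- data.frame(from = "A", to = "B", weight = 1)
attr(meta_edges, "counting") <- "harmonic"

# RDS keeps attributes.
meta_rds <- tempfile(fileext = ".rds")
saveRDS(meta_edges, meta_rds)
meta_from_rds <- readRDS(meta_rds)
attr(meta_from_rds, "counting")

# CSV keeps only the columns, so the attribute is gone after a round trip.
meta_csv <- tempfile(fileext = ".csv")
write.csv(meta_edges, meta_csv, row.names = FALSE)
meta_from_csv <- read.csv(meta_csv)
is.null(attr(meta_from_csv, "counting"))
```

When exporting to CSV or another plain format, record the counting and similarity settings separately.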