---
title: "Getting Started with bibnets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with bibnets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(bibnets)
```

## What bibnets Builds

Bibliometric data usually arrives as a table of papers. Each paper has fields such as authors, references, keywords, countries, institutions, source title, and year. Most bibliometric networks are projections of those fields.

The core idea is simple:

1. Build a sparse `papers x entities` matrix.
2. Weight the matrix with a counting method.
3. Multiply the matrix to obtain entity-entity or paper-paper links.
4. Return a standard edge list.

Internally, this is the package pipeline:

```
build_bipartite() -> apply_counting() -> multiply_bipartite() -> as_bibnets_network()
```

The exported builders wrap that pipeline for common bibliometric questions:

| Function | Nodes | Link means |
|---|---|---|
| `author_network()` | authors | co-authorship, author coupling, or author co-citation |
| `reference_network()` | cited references | references are cited together |
| `document_network()` | documents | shared references, shared citers, or direct citation |
| `keyword_network()` | keywords | keywords appear together in papers |
| `source_network()` | journals/sources | sources share references or are co-cited |
| `country_network()` | countries | countries collaborate or share references |
| `institution_network()` | institutions | institutions collaborate or share references |
| `conetwork()` | any field | entities co-occur or share values of another field |
| `local_citations()` | documents | local citation counts inside the corpus |
| `historiograph()` | documents | directed citation history among locally cited papers |
| `temporal_network()` | any builder's nodes | the same network repeated over time windows |

Every network builder
returns a `bibnets_network`: a data frame with columns `from`, `to`, `weight`, and `count`. `count` is the raw binary co-occurrence. `weight` is the analytical weight after counting and optional similarity normalization.

## Data Used in This Vignette

The package includes small and medium example datasets:

```{r}
data(biblio_data)
data(scopus_quantum_cloud)
data(open_alex_gold_open_access_learning_analytics)

small <- biblio_data
sc <- scopus_quantum_cloud
oa <- open_alex_gold_open_access_learning_analytics

nrow(small)
nrow(sc)
nrow(oa)
```

`biblio_data` is a tiny synthetic dataset. `scopus_quantum_cloud` contains 499 Scopus records. `open_alex_gold_open_access_learning_analytics` contains 1,508 OpenAlex records with authors, countries, institutions, and primary topics.

## Reading Your Own Data

For files, use `read_biblio()`:

```{r, eval = FALSE}
data <- read_biblio("export.csv")
data <- read_biblio("folder_with_exports/")
data <- read_biblio(c("part_1.csv", "part_2.csv"))
```

`read_biblio()` detects common formats from file content. You can also call a reader directly:

```{r, eval = FALSE}
read_scopus("scopus.csv")
read_wos("savedrecs.txt")
read_openalex_csv("openalex_works.csv")
read_dimensions("dimensions.csv")
read_lens("lens.csv")
read_bibtex("library.bib")
read_ris("library.ris")
```

For a custom CSV, specify the identifier and the columns that should be split into list-columns:

```{r, eval = FALSE}
data <- read_biblio(
  "custom.csv",
  format = "generic",
  id = "paper_id",
  actors = c("Authors", "Keywords"),
  sep = ";"
)
```

Readers return a common schema where possible:

```{r}
names(sc)[1:12]
```

The most important columns for network construction are:

- `id`: the document identifier.
- `authors`: a list-column of author names.
- `references`: a list-column of cited references or cited work IDs.
- `keywords`: a list-column of author/index/topic keywords.
- `year`: the time variable used by `temporal_network()`.

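
To make the schema concrete, the sketch below hand-builds a three-paper data frame with list-columns like those listed above, then turns the `authors` column into a binary `papers x authors` incidence matrix using base R only. All values and object names (`schema_demo`, `author_pairs`, `inc_demo`, `co_demo`) are invented for illustration. This is not the package's internal code; it only shows the shape of data the builders expect and why list-columns map naturally onto a co-occurrence projection.

```{r}
# Hypothetical three-paper corpus; every value here is made up.
schema_demo <- data.frame(
  id = c("P1", "P2", "P3"),
  year = c(2020L, 2021L, 2021L)
)
schema_demo$authors <- list(c("A", "B"), c("B", "C"), c("A", "B", "C"))
schema_demo$references <- list(c("R1", "R2"), "R2", c("R1", "P1"))

# One row per (paper, author) pair.
author_pairs <- data.frame(
  paper = rep(schema_demo$id, lengths(schema_demo$authors)),
  author = unlist(schema_demo$authors)
)

# Binary incidence matrix: rows are papers, columns are authors.
inc_demo <- unclass(table(author_pairs$paper, author_pairs$author))

# Unweighted co-occurrence: t(inc) %*% inc.
# Off-diagonal cells count shared papers; the diagonal counts papers per author.
co_demo <- crossprod(inc_demo)
co_demo
```

In this toy corpus, A and B share two papers (P1 and P3), so the `A`/`B` cell is 2, while A and C share only P3.
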

Source-specific columns such as `countries`, `affiliations`, `index_keywords`, and `keywords_plus` are preserved when available.

## Author Collaboration

The simplest author network links two authors when they appear on the same paper:

```{r}
authors_full <- author_network(oa, type = "collaboration")
head(authors_full, 5)
```

The result has the standard schema, and `summary()` gives a compact overview of it:

```{r}
summary(authors_full)
```

Use `min_occur` to remove very rare authors before projection:

```{r}
authors_core <- author_network(oa, "collaboration", min_occur = 2)
nrow(authors_full)
nrow(authors_core)
```

## Counting Methods

Counting determines how much a paper contributes to edge weights.

Full counting gives every observed co-occurrence a weight of 1:

```{r}
head(author_network(small, "collaboration", counting = "full"), 5)
```

Fractional counting reduces the influence of long author lists:

```{r}
head(author_network(small, "collaboration", counting = "fractional"), 5)
```

Harmonic counting gives more credit to earlier byline positions while keeping the paper's total credit normalized:

```{r}
head(author_network(small, "collaboration", counting = "harmonic"), 5)
```

First-last counting is useful only when the field's authorship conventions make both first and last positions meaningful:

```{r}
head(author_network(small, "collaboration", counting = "first_last"), 5)
```

The correct method depends on the claim being made. Use `full` when the question is about observed collaboration events. Use `fractional` when papers with many entities should not dominate. Use position-dependent methods only when author order is analytically meaningful.

## Attention-Style Position Weights

The `attention` argument applies a smooth position profile. It is separate from `counting` and is available for author, keyword, country, and institution networks.

```{r}
lead <- author_network(small, attention = "lead")
last <- author_network(small, attention = "last")

head(lead, 5)
head(last, 5)
```

The four profiles are:

| `attention` | Highest weight |
|---|---|
| `"lead"` | first position |
| `"last"` | last position |
| `"proximity"` | middle positions |
| `"circular"` | first and last positions |

Use attention weighting when the analysis needs a transparent positional assumption rather than a named bibliometric counting convention.

## Reference Co-citation

Co-citation links two cited references when they are cited together by at least one paper:

```{r}
refs <- reference_network(sc, min_occur = 2)
head(refs, 5)
```

Co-citation is a column-mode projection of the `papers x references` matrix. The nodes are references; the links come from papers that cite both references.

Similarity normalization can reduce the advantage of very frequently cited references:

```{r}
refs_cos <- reference_network(sc, min_occur = 2, similarity = "cosine")
head(refs_cos, 5)
```

## Document Coupling and Citation

Bibliographic coupling links two documents when they share cited references:

```{r}
coupled_docs <- document_network(sc, type = "coupling", similarity = "cosine")
head(coupled_docs, 5)
```

Direct citation is different. It keeps direction: `from` is the citing document and `to` is the cited document, but only when both documents are inside the same corpus.

```{r}
direct_docs <- document_network(sc, type = "citation")
head(direct_docs, 5)
```

Many exported datasets cite external works that are not themselves rows in the dataset. Those external citations support co-citation and coupling, but they do not become direct-citation edges unless the cited work is also present in `id`.

## Keyword Co-occurrence

Keyword networks are often the quickest way to inspect a corpus thematically:

```{r}
kw <- keyword_network(sc, min_occur = 2)
head(kw, 5)
```

Entity labels are trimmed and uppercased during matrix construction.
This means that `machine learning`, `Machine Learning`, and ` MACHINE LEARNING ` resolve to the same node.

Association strength is commonly useful for co-occurrence maps because it downweights pairs that are common only because both keywords are individually frequent:

```{r}
kw_assoc <- keyword_network(sc, min_occur = 2, similarity = "association")
head(kw_assoc, 5)
```

## Countries, Institutions, and Sources

OpenAlex-style data often contains country and institution list-columns:

```{r}
country_edges <- country_network(oa, counting = "fractional")
head(country_edges, 5)

inst_edges <- institution_network(oa, counting = "fractional", min_occur = 2)
head(inst_edges, 5)
```

Source networks use `journal` as the entity field. Coupling links sources that cite the same references:

```{r}
source_edges <- source_network(sc, type = "coupling", min_occur = 2)
head(source_edges, 5)
```

For source, country, institution, and author coupling, `min_occur` is applied to the aggregated entity before building the coupling network.

## Generic Co-networks

Use `conetwork()` when you want a projection not covered by a dedicated helper.

One-field use:

```{r}
head(conetwork(sc, "keywords", min_occur = 2), 5)
```

Two-field use:

```{r}
head(conetwork(sc, "authors", by = "keywords", min_occur = 2), 5)
```

The second example links authors through shared keywords. This is not a co-authorship network; it is a thematic-similarity network between authors.

Delimited character columns are split automatically:

```{r}
toy <- data.frame(
  id = c("P1", "P2", "P3"),
  tags = c("methods; networks", "networks; R", "methods; R")
)

conetwork(toy, "tags")
```

## Normalization

The same raw counts can support different similarity scores:

```{r}
none <- keyword_network(sc, min_occur = 2, similarity = "none")
cos <- keyword_network(sc, min_occur = 2, similarity = "cosine")

head(none[, c("from", "to", "weight", "count")], 3)
head(cos[, c("from", "to", "weight", "count")], 3)
```

Notice that `count` is unchanged.
The `weight` column changes because normalization is applied after raw co-occurrence has been counted.

`normalize()` applies the same methods directly to a matrix:

```{r}
normalize(to_matrix(keyword_network(small)), "cosine")
```

In practice:

- Use `similarity = "none"` for raw weighted counts.
- Use `similarity = "cosine"` for interpretable overlap scaled by marginal size.
- Use `similarity = "association"` when the goal is to emphasize pairs that co-occur more than expected from their individual frequencies.
- Use `jaccard`, `inclusion`, or `equivalence` when those coefficients match a downstream method or established reporting convention.

## Reducing Large Networks

Dense co-occurrence networks can be hard to inspect. `bibnets` provides several reduction strategies: a weight threshold, a per-node top-`n` rule, a top-nodes filter, and a statistical backbone.

```{r}
edges <- author_network(oa, "collaboration")

nrow(edges)
nrow(prune(edges, threshold = 2))
nrow(prune(edges, top_n = 5))
nrow(filter_top(edges, n = 50))
```

`prune(threshold = x)` keeps edges with weight at least `x`. `prune(top_n = k)` keeps locally strong edges for each endpoint. `filter_top(n = k)` first selects the most connected nodes, then keeps edges among them.

`backbone()` applies the disparity filter for multiscale weighted networks:

```{r}
bb <- backbone(edges, alpha = 0.05)
nrow(bb)
head(bb, 5)
```

The disparity filter asks whether an edge is unusually strong relative to at least one endpoint's local strength distribution. This is different from a global weight cutoff and can preserve meaningful edges attached to smaller nodes.

## Temporal Networks

`temporal_network()` runs any network builder over time windows:

```{r}
tn <- temporal_network(oa, author_network, "collaboration", window = 3)
names(tn)
```

Fixed windows are non-overlapping.
Sliding windows overlap:

```{r}
tn_slide <- temporal_network(
  oa,
  author_network,
  "collaboration",
  window = 3,
  step = 1,
  strategy = "sliding"
)
names(tn_slide)
```

Cumulative windows always start at the first observed year and grow forward:

```{r}
tn_cum <- temporal_network(
  oa,
  author_network,
  "collaboration",
  window = 3,
  strategy = "cumulative"
)
names(tn_cum)
```

Each returned edge list has a `window` column. Windows with fewer than two records or no surviving edges are omitted. If a builder errors inside a window, `temporal_network()` reports a warning with the window label.

## Local Citations and Historiographs

`local_citations()` counts how often each document is cited by other documents inside the same dataset:

```{r}
lcs <- local_citations(sc)
head(lcs, 5)
```

`historiograph()` builds a directed citation graph among the top locally cited documents:

```{r}
h <- historiograph(sc, n = 10)
h$nodes
head(h$edges, 5)
```

This requires reference strings or IDs to match document IDs in the same data frame. If the cited works are external to the corpus, local citation counts will be low or zero even when global citation counts are high.

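
The matching requirement can be illustrated without the package. The sketch below counts local citations by hand on an invented four-paper corpus, where a reference only counts if it equals an `id` in the same data frame. The object names (`lcs_demo`, `local_hits`, `local_counts`) are hypothetical, and `local_citations()` itself may match or normalize identifiers differently.

```{r}
# Hypothetical corpus: "P*" are corpus documents, "X*" are external works.
lcs_demo <- data.frame(id = c("P1", "P2", "P3", "P4"))
lcs_demo$references <- list(
  c("X1", "X2"),        # P1 cites only external works
  c("P1", "X1"),        # P2 cites P1
  c("P1", "P2", "X3"),  # P3 cites P1 and P2
  c("P3", "X1")         # P4 cites P3
)

# De-duplicate within each paper, then keep only references that are corpus ids.
cited <- unlist(lapply(lcs_demo$references, unique))
local_hits <- cited[cited %in% lcs_demo$id]

# Count per document; factor levels keep uncited papers with a zero count.
local_counts <- table(factor(local_hits, levels = lcs_demo$id))
local_counts
```

Here `X1` is cited by three of the four papers, yet it never appears in the result because it is not a row in the corpus.
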

## Exporting Results

The default edge list is already useful for many tools:

```{r}
edges <- keyword_network(sc, min_occur = 2)
head(edges, 5)
```

Convert to a sparse matrix:

```{r}
m <- to_matrix(edges)
m[1:4, 1:4]
```

Prepare Gephi tables:

```{r}
gephi <- to_gephi(edges)
head(gephi$nodes, 3)
head(gephi$edges, 3)
```

Write GraphML without adding an XML dependency:

```{r}
xml <- to_graphml(edges)
cat(substr(xml, 1, 300))
```

Optional graph objects are available when the suggested packages are installed:

```{r, eval = FALSE}
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- to_igraph(edges)
}

if (requireNamespace("tidygraph", quietly = TRUE)) {
  tg <- to_tbl_graph(edges)
}

if (requireNamespace("cograph", quietly = TRUE)) {
  cg <- to_cograph(edges)
}
```

## Interpreting a `bibnets_network`

The object stores construction metadata as attributes:

```{r}
edges <- author_network(oa, "collaboration", counting = "harmonic")
attr(edges, "network_type")
attr(edges, "counting")
attr(edges, "similarity")
```

The `print()` method reports the network type, node count, edge count, counting method, and similarity method. `summary()` reports basic network and weight summaries:

```{r}
summary(edges)
```

These attributes are meant to make downstream output easier to audit. A saved edge list should still say how it was produced.
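
One way to keep that provenance when saving, sketched with base R only: `saveRDS()` stores a data frame together with its attributes, whereas a plain CSV keeps only the columns. The stand-in object below mimics the documented columns and attribute names. It is not produced by `bibnets`, and its values and the names `prov_demo`, `from_rds`, and `from_csv` are invented for illustration.

```{r}
# Stand-in edge list with the documented columns; values are made up.
prov_demo <- data.frame(
  from = c("A", "A"),
  to = c("B", "C"),
  weight = c(1.5, 0.5),
  count = c(2L, 1L)
)
attr(prov_demo, "network_type") <- "collaboration"
attr(prov_demo, "counting") <- "harmonic"

# Save the same object in two formats.
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(prov_demo, rds_path)
write.csv(prov_demo, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds, "counting")  # the attribute survives the RDS round trip
attr(from_csv, "counting")  # NULL: the CSV stores columns only
```

If a CSV is required for another tool, recording the attribute values in a sidecar file or in the file name serves the same auditing purpose.
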