---
title: Dates and missing dating data in `"sdam"`
date: "August 2022"
author:
- name: Antonio Rivero Ostoic
  affiliation: Aarhus University
  email: jaro@cas.au.dk
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dates and missing dating data in `"sdam"`}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(echo=TRUE,error=TRUE)
knitr::opts_chunk$set(comment = "")
library("sdam")
```

```{r set-options, echo=FALSE, cache=FALSE}
options(width = 96)
```

### Preliminaries

Install and load a version of the `"sdam"` package.

```{r, echo=TRUE, eval=FALSE}
install.packages("sdam") # from CRAN
devtools::install_github("sdam-au/sdam") # development version
devtools::install_github("mplex/cedhar", subdir="pkg/sdam") # legacy version R 3.6.x
```

```{r}
# load and check versions
library(sdam)
packageVersion("sdam")
```

## Dating data

Temporal data is significant when analysing the history of archaeological artefacts such as written markers from the Ancient Mediterranean. In the `EDH` dataset, for example, dates for inscriptions are plausible timespans of existence with the endpoints in variables `not_before` and `not_after` that, from the perspective of the timespan, are the *terminus ante quem* (TAQ) and *terminus post quem* (TPQ) of the time segment. However, not all inscriptions have these two variables filled in by domain experts, and replacing missing dating data constitutes a challenge.

Besides `EDH`, other datasets and related functions in the `"sdam"` package involve dating data in the ancient Mediterranean, for instance by displaying dates and time segments in a plot, by organising dates within Roman provinces, and by performing imputation techniques for missing dating data.
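
To make the idea of missing dating data concrete, the following base R sketch classifies records by the completeness of their two endpoints. The records and the `id` values are made up for illustration; only the column names `not_before` and `not_after` come from the text above, and the label "censored" for a record with a single endpoint follows the wording used later for the Armenia example.

```r
# Hypothetical records: only the column names follow the EDH variables,
# the ids and the years are made up for illustration
insc <- data.frame(
  id         = c("A", "B", "C", "D"),
  not_before = c(101, NA, 71, NA),
  not_after  = c(200, 150, NA, NA)
)

# classify each record by the completeness of its dating data
status <- with(insc,
  ifelse(!is.na(not_before) & !is.na(not_after), "complete",
  ifelse(is.na(not_before) & is.na(not_after), "missing", "censored")))

# number of records per class
table(status)
```

Records in the "censored" and "missing" classes are the ones that an imputation procedure would have to handle.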

### Plotting temporal data

#### Shipwrecks dataset dating data

An example of plotting dates is with the Shipwrecks external dataset, which is a semicolon-separated file of different variables.

References for shipwrecks data are in

- Vignette [Datasets in `"sdam"` package](../doc/Intro.html)

When reading the shipwrecks external dataset with `read.csv()`, make sure to use the right separator in `sep` and to leave the names of the variables untouched.

```{r}
# load shipwrecks external dataset
sw <- system.file("extdata","StraussShipwrecks.csv",package="sdam") |> 
  read.csv(sep=";", check.names=FALSE)
```

```{r}
# variables in shipwrecks dataset
colnames(sw)
```

Plot the time segments with function `plot.dates()` and a customized `'id'`, where variables 15 to 16 in `sw` have timespans of existence as `'taq'` and `'tpq'`.

```{r, echo=TRUE, eval=TRUE, fig.width=4, fig.height=4, fig.align="center", fig.cap="Range of timespans in Shipwrecks dataset"}
# shipwrecks dates with Wreck ID
plot.dates(sw, id="Wreck ID", type="rg", taq="Earliest date", tpq="Latest date", col=4)
```
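
The reason for `check.names=FALSE` can be seen with a small stand-in for the file. This is a sketch in base R: the two data rows are made up, only the three column names are taken from the calls above, and the `text` argument of `read.csv()` replaces the file path.

```r
# Hypothetical two-row extract: the column names appear in the calls above,
# the values are made up
txt <- "Wreck ID;Earliest date;Latest date\n1;-100;50\n2;200;300"

# default check.names=TRUE rewrites the names into syntactic ones
nm_checked <- names(read.csv(text = txt, sep = ";"))
nm_checked

# check.names=FALSE keeps the names exactly as they are in the file
nm_raw <- names(read.csv(text = txt, sep = ";", check.names = FALSE))
nm_raw
```

With the default setting, a name such as `Earliest date` becomes `Earliest.date`, so arguments like `taq="Earliest date"` would no longer match a column.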

#### Mid points and range of timespan

The mid points and range of shipwrecks data are explicitly computed by function `prex()` with the `mp` option in the `'type'` argument. `'vars'` stands for the variables, which in this case are TAQ and TPQ, and the `'keep'` option retains the rest of the variables in the output, which for `prex()` with mid points is a data frame.

```{r}
# add mid points and range to shipwrecks data
prex(sw[c(1,7,15:16)], type="mp", vars=c("Earliest date", "Latest date"), keep=TRUE) |> 
  tail()
```
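
As a plain arithmetic illustration of the two quantities, here is a base R sketch on made-up timespans. The column names `taq`, `tpq`, `mid`, and `range` are choices of this sketch, and the range is taken here as the length of the timespan; the names and exact definitions in the output of `prex()` may differ.

```r
# Hypothetical timespans in years (negative values stand for BC in this sketch)
ts <- data.frame(taq = c(-100, 200, 350), tpq = c(50, 300, 350))

ts$mid   <- (ts$taq + ts$tpq) / 2  # mid point of each timespan
ts$range <- ts$tpq - ts$taq        # length of each timespan
ts
```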

The default `'type'` option and chronological phase in `prex()` are the aoristic sum with a five periods bin or `bin5`.

```{r}
# aoristic sum shipwrecks
prex(sw[c(1,7,15:16)], vars=c("Earliest date", "Latest date"))
```

For an eight chronological periods bin in the shipwrecks dataset:

```{r}
# aoristic sum shipwrecks 8 bin
prex(sw[c(1,7,15:16)], vars=c("Earliest date", "Latest date"), cp="bin8")
```

For aoristic sum algorithm, cf. [Temporal Uncertainty](https://mplex.github.io/cedhar/Uncertainty.html).
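
For readers unfamiliar with the idea, the following is a minimal base R sketch of an aoristic sum in its common textbook form: each timespan contributes to a period the share of its length that falls inside that period. The helper `aoristic_sum()`, the break points, and the dates are hypothetical; this is not the code of `prex()`, and the periods are not those of `bin5` or `bin8`.

```r
# Sketch of an aoristic sum; assumes tpq > taq for every timespan
aoristic_sum <- function(taq, tpq, breaks) {
  lo <- head(breaks, -1)  # start of each period
  hi <- tail(breaks, -1)  # end of each period
  w <- vapply(seq_along(taq), function(i) {
    # overlap of timespan i with every period, as a share of its length
    ov <- pmax(0, pmin(tpq[i], hi) - pmax(taq[i], lo))
    ov / (tpq[i] - taq[i])
  }, numeric(length(lo)))
  # one row per period, one column per timespan: sum over timespans
  setNames(rowSums(w), paste(lo, hi, sep = " to "))
}

# three hypothetical timespans over four hypothetical periods
res <- aoristic_sum(taq = c(-50, 100, 0), tpq = c(50, 200, 300),
                    breaks = c(-100, 0, 100, 200, 300))
res
```

When every timespan lies entirely within the periods, each one distributes a total weight of one, so the values in `res` add up to the number of timespans.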

## Dating data in the Roman world

Many functions and datasets in `"sdam"` are related to temporal information of the Roman world, particularly from the Roman Empire during the classical ancient period.

Function `plot.map()` depicts cartographical maps per Roman province or region, and it has a `'date'` argument to display dates within the caption. Dates in this case are one or two years, either for the consolidation of the Italian peninsula or for the affiliation of the region to the Roman Empire.

```{r, echo=TRUE, eval=FALSE}
# silhouette of Italian peninsula
plot.map(x="Ita", date=TRUE)
## not run
```

* The built-in dataset `rpmcd` has the shapes and colours used in the cartographical maps with `plot.map()`, and some dates related to provinces as well.

```{r}
# 59 provinces dates, colors, and shapes
data("rpmcd")

# province acronyms as in EDH
names(rpmcd)
```

### Roman provinces establishment dates

The establishment dates of Roman provinces used in the cartographical map captions are in the second component of `rpmcd`.

```{r}
# pipe dataset for dates in second component
rpmcd |> 
  lapply(function (x) x[[2]]) |> 
  head()
```

A vector of establishment dates in years from the `"rpmcd"` dataset is stored in object `est`, which allows making a chronology of the Roman provinces.

```{r, echo=-5}
# second component in dataset
est <- rpmcd |> 
  lapply(function (x) x[[2]]) |> 
  unlist(use.names=FALSE)
est
```

### Formatting dates

The establishment dates of Roman provinces and regions are in vector `est`, and these dates can be standardised with the function `cln()` for further processing. This is a cleaning function where, for instance, level `9` removes all content after the first parenthesis in the input, while the other levels are for specific needs.

```{r}
# clean levels are 0-9
cln(est, level=9)
```

After this transformation of the data in `est`, it is possible to format dates as numerical data with function `dts()`, which takes the first value when there are two competing dates in the input, unless the opposite is specified in the `'last'` argument.

```{r}
# update object with establishment dates
est <- est |> 
  cln(level=9) |> 
  dts()
```

```{r}
est
```
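
The two steps can be imitated in base R on made-up strings. This is only a sketch of the general idea: the input strings, the `BC`/`AD` suffixes, and the convention of negative years for BC are assumptions of this example, not a description of the values in `rpmcd` or of the internals of `cln()` and `dts()`.

```r
# Hypothetical date strings, not taken from 'rpmcd'
raw <- c("27 BC (hypothetical note)", "117 AD", "241 BC")

# cleaning step: drop everything from the first opening parenthesis onwards
clean <- trimws(sub("\\(.*$", "", raw))
clean

# formatting step: signed years, BC as negative (a convention of this sketch)
yr <- as.numeric(sub("\\s*(BC|AD)$", "", clean))
yr <- ifelse(grepl("BC$", clean), -yr, yr)
yr
```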

### Chronology of Roman provinces

Object `est` has a chronology for the establishment dates of Mediterranean regions and territories as Roman provinces that corresponds to the provinces in the `"rpmcd"` dataset. The union of the names of provinces and their dates of establishment as a Roman province is a data frame object `rpde`, which displays better without the row names.

```{r}
# Roman province dates of establishment (strings still strings)
rpde <- cbind(names(rpmcd),dts(est)) |> 
  as.data.frame(stringsAsFactors=FALSE)
```

```{r}
rownames(rpde) <- NULL
head(rpde)
```

Because the dates have a numerical format from function `dts()`, the data frame allows producing a chronology of affiliation dates for the provinces and regions to the Roman Empire by ordering the second variable in `rpde`.

```{r}
# order of affiliation of provinces
rpde[order(as.numeric(rpde$V2)),1]
```
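
The conversion with `as.numeric()` matters because `cbind()` on names and dates yields character columns, and character values sort as text rather than as years. A small base R sketch with hypothetical province labels and signed years:

```r
# Hypothetical labels and signed years stored as character, as after cbind()
df <- data.frame(V1 = c("P1", "P2", "P3", "P4"),
                 V2 = c("-27", "117", "-241", "43"),
                 stringsAsFactors = FALSE)

# sorting the strings follows text collation, which is not chronological
df[order(df$V2), 1]

# sorting the numeric values gives the chronological order
chron <- df[order(as.numeric(df$V2)), 1]
chron
```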

The regions in the Italian peninsula have the earliest affiliation dates, and Mesopotamia has the latest affiliation date to the Roman Empire.

### Roman influence periods

* Dataset `"rpcp"` has influence periods of the Roman Empire.

```{r, echo=TRUE, eval=TRUE}
# list with 45 early and late influence dates provinces
data("rpcp")
```

```{r}
# look at data internal structure
str(rpcp)
```

#### Early period of Roman influence

Visualize time intervals of early Roman influence in provinces and regions.

```{r, echo=TRUE, eval=FALSE}
# early influence dates are in first list of 'rpcp'
plot.dates(x=rpcp[[1]], taq="EarInf", tpq="OffPrv", main="Early period", ylab="province")
```

```{r, echo=FALSE, eval=TRUE, fig.width=4, fig.height=4, fig.align="center"}
plot.dates(x=rpcp[[1]], taq="EarInf", tpq="OffPrv", main="Early period", ylab="province", yaxt="n")
```

#### Late period and fall from the Roman Empire

Time intervals of late Roman influence in provinces and regions are depicted with mid points, and with the range of the interval if it is longer than one.

```{r, echo=TRUE, eval=FALSE}
# late influence dates are in second list of 'rpcp'
plot.dates(x=rpcp[[2]], type="mp", taq="LateInf", tpq="Fall", lwd=5, col="red", 
           main="Late period", ylab="province")
```

```{r, echo=FALSE, eval=TRUE, fig.width=4, fig.height=4, fig.align="center"}
plot.dates(x=rpcp[[2]], type="mp", taq="LateInf", tpq="Fall", lwd=5, col="red", 
           main="Late period", ylab="province", yaxt="n")
```

## Restricted imputation of missing dating data

* Dataset `rpd` has time intervals for `"not_before"` and `"not_after"` that correspond to the dating data in the `EDH` dataset.

```{r, echo=TRUE, eval=TRUE}
# Roman provinces dates from EDH
data("rpd")
```

```{r}
# Rome
summary(rpd$Rom)
```

```{r}
# Aegyptus
summary(rpd$Aeg)
```

These intervals are the basis for a restricted imputation of missing dating data in `EDH`.

### Imputation of dates by province

Function `edhwpd()` constructs, for a chosen province, a list of data frames with the components made of its inscriptions related by attribute co-occurrences. The replacement of missing dates occurs in this setting with function `rmids()`, which stands for *restricted multiple imputation on data subsets*.

An example of restricted multiple imputation is the province of **Armenia**, which has the fewest inscriptions in the `EDH` dataset. Dataset `rpd` is a list where each component corresponds to a province and where the component class provides the `HD` `ids` of inscriptions.

```{r}
# Armenia
rpd$Arm
```

#### Imputation of inscriptions by similarity

Imputation from similarities of attribute variables per province and dates is organised with wrapper function `edhwpd()` having different argument options.

```{r}
# list with arguments
formals(edhwpd)
```

By default, the input data for this function is the `EDH` dataset and the organisation is based on characteristics of the artefacts in `vars`.

```{r}
# characteristics of inscriptions
vars = c("findspot_ancient", "type_of_inscription", "type_of_monument", "language")
```

Function `rmids()` performs the multiple imputation of missing dating data in `EDH` by default or in another dataset as input. In the case of `Arm`, record `HD015521` has censored data in dates, while the other two records have completely missing dating data.

```{r, echo=TRUE, eval=TRUE}
# Armenia: restricted imputation of dates
edhwpd(vars=vars, province="Arm") |> 
  rmids()
```

The warnings tell us that the imputation values are taken from the respective province in the `rpd` dataset where `avg len TS` stands for *average length of timespan*, `min TAQ` is the minimum value of `not_before`, and `max TPQ` is the maximum value of `not_after`.
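
The three quantities named in the warnings are simple summaries of the dated records of a province. The base R sketch below computes them on made-up dates and then applies one plausible rule for a censored record. The data, the object names `prov`, `restr`, and `nb`, and the filling rule are assumptions of this example and not necessarily what `rmids()` does.

```r
# Hypothetical dated inscriptions of one province
prov <- data.frame(not_before = c(1, 50, 101, 151),
                   not_after  = c(100, 150, 200, 300))

# province-level restrictions, labelled as in the warnings
restr <- c("avg len TS" = mean(prov$not_after - prov$not_before),
           "min TAQ"    = min(prov$not_before),
           "max TPQ"    = max(prov$not_after))
restr

# assumed rule for a censored record with only 'not_before' known:
# add the average timespan and cap the result at max TPQ
nb <- 250
na_filled <- min(nb + restr[["avg len TS"]], restr[["max TPQ"]])
na_filled
```

The cap keeps the imputed endpoint inside the interval observed for the province, which is the sense in which the imputation is restricted in this sketch.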

### Pooling results

Since there are multiple imputations of missing dating data, a next step is to combine the *m* results from function `rmids()` by pooling rules into final point estimates plus standard error. Pooling options for time intervals are:

* average time-span with `avg len TS`
* `min TAQ` and `max TPQ`
* `max TAQ` and `min TPQ`

With these options, there is a single imputed value per variable with implied consequences.
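
To compare the three options, the following base R sketch pools *m* = 3 made-up imputed intervals for a single inscription. The data and the object names are hypothetical, and the first option is read here as the mean of both endpoints, which is an assumption of this sketch rather than the definition used in the package.

```r
# Hypothetical m = 3 imputed intervals for one inscription
imp <- data.frame(taq = c(100, 120, 90), tpq = c(200, 210, 230))

pooled <- rbind(
  "avg len TS"       = c(mean(imp$taq), mean(imp$tpq)),  # assumed reading
  "min TAQ, max TPQ" = c(min(imp$taq), max(imp$tpq)),    # widest interval
  "max TAQ, min TPQ" = c(max(imp$taq), min(imp$tpq))     # narrowest interval
)
colnames(pooled) <- c("taq", "tpq")
pooled
```

The second option returns the union of the imputed intervals and the third their intersection, so the third can give an empty interval, with `taq` greater than `tpq`, when the imputed intervals do not overlap.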

### See also

#### Vignettes

* [Datasets in `"sdam"` package](../doc/Intro.html)
* [Re-encoding `people` in the `EDH` dataset](../doc/Encoding.html)
* [Cartographical maps and networks](../doc/Maps.html)

#### Reference Manual

* [sdam: Digital Tools for the SDAM Project at Aarhus University](../html/sdam-package.html)
* [`"sdam"` manual](https://github.com/mplex/cedhar/blob/master/typesetting/reports/sdam.pdf)

#### Project

* [Release candidate version](https://github.com/sdam-au/sdam)
* [Code snippets using `"sdam"`](https://github.com/sdam-au/R_code)
* [Social Dynamics and complexity in the Ancient Mediterranean project](https://sdam-au.github.io/sdam-au/)