---
title:
  Dates and missing dating data in `"sdam"`
date: "August 2022"
author:
- name: Antonio Rivero Ostoic
  affiliation: Aarhus University
  email: jaro@cas.au.dk
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dates and missing dating data in `"sdam"`}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
```{r setup, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(echo=TRUE,error=TRUE)
knitr::opts_chunk$set(comment = "")
library("sdam")
```
```{r set-options, echo=FALSE, cache=FALSE}
options(width = 96)
```
### Preliminaries
Install and load a version of the `"sdam"` package.
```{r, echo=TRUE, eval=FALSE}
install.packages("sdam") # from CRAN
devtools::install_github("sdam-au/sdam") # development version
devtools::install_github("mplex/cedhar", subdir="pkg/sdam") # legacy version R 3.6.x
```
```{r}
# load and check versions
library(sdam)
packageVersion("sdam")
```
## Dating data
Temporal data is significant when it comes to analysing the history of archaeological
artefacts like written markers from the Ancient Mediterranean.
In the `EDH` dataset, for example, dates for inscriptions are plausible timespans of existence
with the endpoints in variables `not_before` and `not_after` that, from the perspective of the timespan,
are the *terminus ante quem* (TAQ) and *terminus post quem* (TPQ) of the time segment.
However, not all inscriptions have these two variables filled in by domain experts, and replacing missing dating data constitutes a challenge.
Besides `EDH`, other datasets in the `"sdam"` package involve dating data from the ancient Mediterranean, and related functions work with
this data by displaying dates and time segments in a plot, by organising dates within Roman provinces,
and by performing imputation techniques for missing dating data.
### Plotting temporal data
#### Shipwrecks dataset dating data
An example of plotting dates is with the Shipwrecks external dataset, which is a semicolon-separated file of different variables.
References for shipwrecks data are in
- Vignette [Datasets in `"sdam"` package](../doc/Intro.html)
When reading the shipwrecks external dataset with `read.csv()`, make sure to use the right separator in `sep` and to leave the names of the variables untouched with `check.names=FALSE`.
```{r}
# load shipwrecks external dataset
sw <- system.file("extdata","StraussShipwrecks.csv",package="sdam") |>
read.csv(sep=";", check.names=FALSE)
```
```{r}
# variables in shipwrecks dataset
colnames(sw)
```
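The effect of `check.names` can be seen on a small input. The following is a minimal sketch in base R, where the two-column text is a hypothetical stand-in and not the shipwrecks file.

```r
# Hypothetical semicolon-separated input (not the shipwrecks file)
txt <- "Wreck ID;Earliest date\n1;-100\n2;50"

# default: names are made syntactically valid, so spaces become dots
names_default <- names(read.csv(text = txt, sep = ";"))

# check.names=FALSE: names are kept as written in the file
names_kept <- names(read.csv(text = txt, sep = ";", check.names = FALSE))

names_default
names_kept
```

Keeping the original names matters here because later calls refer to variables such as `"Earliest date"` by their exact name.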
Plot the time segments with function `plot.dates()` and a customized `'id'`, where variables 15 and 16 in `sw` hold the endpoints of the timespans of existence that are passed as `'taq'` and `'tpq'`.
```{r, echo=TRUE, eval=TRUE, fig.width=4, fig.height=4, fig.align="center", fig.cap="Range of timespans in Shipwrecks dataset"}
# shipwrecks dates with Wreck ID
plot.dates(sw, id="Wreck ID", type="rg", taq="Earliest date", tpq="Latest date", col=4)
```
#### Mid points and range of timespan
The mid points and range of the shipwrecks data are computed explicitly by function `prex()` with the `mp` option in the `'type'` argument.
Argument `'vars'` names the variables to use, which in this case are TAQ and TPQ, and the `'keep'` option retains the rest of the variables
in the output, which for `prex()` with mid points is a data frame.
```{r}
# add mid points and range to shipwrecks data
prex(sw[c(1,7,15:16)], type="mp", vars=c("Earliest date", "Latest date"), keep=TRUE) |>
tail()
```
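As a rough illustration of what mid points and range mean for a timespan, here is a minimal base R sketch. The data frame `spans` and its values are hypothetical, and the definitions used (centre and length of the interval) are an assumption that may differ in detail from what `prex()` returns.

```r
# Hypothetical timespans in years (negative values are BC)
spans <- data.frame(id = c("a", "b", "c"),
                    taq = c(-100, 50, 200),
                    tpq = c(-50, 150, 200))

# mid point: centre of the timespan; range: its length (assumed definitions)
spans$mid <- (spans$taq + spans$tpq) / 2
spans$range <- spans$tpq - spans$taq
spans
```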
By default, `prex()` computes the aoristic sum, which is the default `'type'` option, with a chronological phase of five periods or `bin5`.
```{r}
# aoristic sum shipwrecks
prex(sw[c(1,7,15:16)], vars=c("Earliest date", "Latest date"))
```
For a bin of eight chronological periods in the shipwrecks dataset, set the `'cp'` argument to `bin8`.
```{r}
# aoristic sum shipwrecks 8 bin
prex(sw[c(1,7,15:16)], vars=c("Earliest date", "Latest date"), cp="bin8")
```
For the aoristic sum algorithm, cf. [Temporal Uncertainty](https://mplex.github.io/cedhar/Uncertainty.html).
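The general idea of an aoristic sum can be sketched in base R: each timespan contributes a total weight of one, spread over the bins in proportion to its overlap with them. The helper `aoristic_sum()`, the bin breaks, and the input values below are hypothetical; this is not the implementation in `prex()`, whose bins and handling of edge cases may differ.

```r
# Hypothetical helper, not part of "sdam": aoristic sum over fixed bins
aoristic_sum <- function(taq, tpq, breaks) {
  lower <- head(breaks, -1)
  upper <- tail(breaks, -1)
  total <- numeric(length(lower))
  for (i in seq_along(taq)) {
    # overlap in years of timespan i with each bin
    overlap <- pmax(0, pmin(tpq[i], upper) - pmax(taq[i], lower))
    # timespans of zero length or outside all bins are skipped in this sketch
    if (sum(overlap) > 0) {
      total <- total + overlap / sum(overlap)
    }
  }
  names(total) <- paste(lower, upper, sep = " to ")
  total
}

# two hypothetical timespans over three 100-year bins
aoristic_sum(taq = c(0, 50), tpq = c(100, 250), breaks = c(0, 100, 200, 300))
```

The first timespan falls entirely in the first bin, while the second is split over the three bins, so the weights add up to the number of timespans.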
## Dating data in the Roman world
Many functions and datasets in `"sdam"` are related to temporal information of the Roman world,
particularly from the Roman Empire during the classical ancient period.
Function `plot.map()` depicts cartographical maps per Roman province or region, and it has a `'date'` argument to display dates within the caption. Dates in this case are one or two years, either for the consolidation of the Italian peninsula or for the affiliation of the region to the Roman Empire.
```{r, echo=TRUE, eval=FALSE}
# silhouette of Italian peninsula
plot.map(x="Ita", date=TRUE)
## not run
```
* The built-in dataset `rpmcd` has the shapes and colours used in the cartographical maps with `plot.map()`, and some
dates related to provinces as well.
```{r}
# 59 provinces dates, colors, and shapes
data("rpmcd")
# province acronyms as in EDH
names(rpmcd)
```
### Roman provinces establishment dates
The establishment dates of Roman provinces used in the cartographical map captions are in the second
component of `rpmcd`.
```{r}
# pipe dataset for dates in second component
rpmcd |>
lapply(function (x) x[[2]]) |>
head()
```
A vector of establishment dates in years from the `"rpmcd"` dataset is recorded in object `est`, which
allows making a chronology of the Roman provinces.
```{r, echo=-5}
# second component in dataset
est <- rpmcd |>
lapply(function (x) x[[2]]) |>
unlist(use.names=FALSE)
est
```
### Formatting dates
The establishment dates of Roman provinces and regions are in vector `est`, and these dates can become
more standard with the function `cln()` for further processing.
This is a cleaning function where, for instance, level `9` removes all content after the first parenthesis
in the input while the other levels are for specific needs.
```{r}
# clean levels are 0-9
cln(est, level=9)
```
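The kind of cleaning described for level `9` can be approximated with a regular expression in base R. This is only a sketch under assumptions: the input strings are hypothetical and not taken from `rpmcd`, the parenthesis itself and the white space before it are assumed to be dropped as well, and `cln()` may treat other cases differently.

```r
# Hypothetical date strings (not the values in `rpmcd`)
x <- c("100 BC (note)", "AD 50", "20 BC (first) (second)")

# drop everything from the first opening parenthesis onwards,
# together with the white space before it
cleaned <- sub("\\s*\\(.*$", "", x)
cleaned
```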
After this transformation of the data in `est`, it is possible to format dates
as numerical data with function `dts()`, which takes the first
value when there are two competing dates in the input, unless the opposite is specified
in the `'last'` argument.
```{r}
# update object with establishment dates
est <- est |>
cln(level=9) |>
dts()
```
```{r}
est
```
### Chronology of Roman provinces
Object `est` has a chronology for the establishment dates of Mediterranean regions and territories as
Roman provinces that corresponds to the provinces in `"rpmcd"` dataset.
The union of the names of provinces and dates of establishment as a Roman province is a data frame object
`rpde` that better displays without the row names.
```{r}
# Roman province dates of establishment (strings still strings)
rpde <- cbind(names(rpmcd),dts(est)) |>
as.data.frame(stringsAsFactors=FALSE)
```
```{r}
rownames(rpde) <- NULL
head(rpde)
```
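Because the dates have a numerical format from function `dts()`, the data frame allows producing a chronology of the affiliation dates of provinces and regions to the Roman Empire by ordering the second variable in `rpde`. Since `cbind()` stores the dates as character strings next to the province names, they are converted with `as.numeric()` before ordering.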
```{r}
# order of affiliation of provinces
rpde[order(as.numeric(rpde$V2)),1]
```
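The need for `as.numeric()` in the ordering can be shown with a minimal base R sketch, where the province labels and years are hypothetical and not the values in `rpmcd`.

```r
# Hypothetical province labels and years (not the values in `rpmcd`)
prov <- c("P1", "P2", "P3", "P4")
yrs <- c(-27, 117, -146, 43)

# cbind() of character and numeric vectors yields a character matrix
df <- as.data.frame(cbind(prov, yrs), stringsAsFactors = FALSE)
class(df$yrs)

# convert back to numbers before ordering to get the chronology
chrono <- df[order(as.numeric(df$yrs)), "prov"]
chrono
```

Ordering the character column directly would sort the years as text, which does not give a chronological order when negative and positive years are mixed.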
The regions in the Italian peninsula have the earliest affiliation dates, and Mesopotamia has the latest affiliation date to the Roman Empire.
### Roman influence periods
* Dataset `"rpcp"` has influence periods of the Roman Empire.
```{r, echo=TRUE, eval=TRUE}
# list with 45 early and late influence dates provinces
data("rpcp")
```
```{r}
# look at data internal structure
str(rpcp)
```
#### Early period of Roman influence
Visualize time intervals of early Roman influence in provinces and regions.
```{r, echo=TRUE, eval=FALSE}
# early influence dates are in first list of 'rpcp'
plot.dates(x=rpcp[[1]], taq="EarInf", tpq="OffPrv", main="Early period", ylab="province")
```
```{r, echo=FALSE, eval=TRUE, fig.width=4, fig.height=4, fig.align="center"}
plot.dates(x=rpcp[[1]], taq="EarInf", tpq="OffPrv", main="Early period", ylab="province", yaxt="n")
```
#### Late period and fall from the Roman Empire
Time intervals of late Roman influence in provinces and regions are depicted with mid points and
with the range of the interval if it is longer than one.
```{r, echo=TRUE, eval=FALSE}
# late influence dates are in second list of 'rpcp'
plot.dates(x=rpcp[[2]], type="mp", taq="LateInf", tpq="Fall", lwd=5, col="red",
main="Late period", ylab="province")
```
```{r, echo=FALSE, eval=TRUE, fig.width=4, fig.height=4, fig.align="center"}
plot.dates(x=rpcp[[2]], type="mp", taq="LateInf", tpq="Fall", lwd=5, col="red", main="Late period", ylab="province", yaxt="n")
```
## Restricted imputation of missing dating data
* Dataset `rpd` has time intervals for `"not_before"` and `"not_after"` that correspond to the dating data
in the `EDH` dataset.
```{r, echo=TRUE, eval=TRUE}
# Roman provinces dates from EDH
data("rpd")
```
```{r}
# Rome
summary(rpd$Rom)
```
```{r}
# Aegyptus
summary(rpd$Aeg)
```
These intervals are the basis for a restricted imputation of missing dating data in `EDH`.
### Imputation of dates by province
Function `edhwpd()` constructs, for a chosen province, a list of data frames with the
components made of its inscriptions related by attribute co-occurrences.
The replacement of missing dates occurs in this setting with function `rmids()`, which stands for
*restricted multiple imputation on data subsets*.
An example of restricted multiple imputations is the province of **Armenia**, which has the fewest inscriptions in the `EDH` dataset. Dataset `rpd` is a list where each component corresponds to a province and where the component class provides the `HD` `ids` of inscriptions.
```{r}
# Armenia
rpd$Arm
```
#### Imputation of inscriptions by similarity
Imputation from similarities of attribute variables per province and dates is organised with wrapper function `edhwpd()` having different argument options.
```{r}
# list with arguments
formals(edhwpd)
```
By default, the input data for this function is the `EDH` dataset and the organisation is based on characteristics of the
artefacts in `vars`.
```{r}
# characteristics of inscriptions
vars <- c("findspot_ancient", "type_of_inscription", "type_of_monument", "language")
```
Function `rmids()` performs the multiple imputation of missing dating data in `EDH` by default or in another dataset as input.
In the case of `Arm`, record `HD015521` has censored dating data, while the other two records have completely missing dating data.
```{r, echo=TRUE, eval=TRUE}
# Armenia: restricted imputation of dates
edhwpd(vars=vars, province="Arm") |>
rmids()
```
The warnings tell us that the imputation values are taken from the respective province in the `rpd` dataset
where `avg len TS` stands for *average length of timespan*, `min TAQ` is the minimum value of `not_before`, and
`max TPQ` is the maximum value of `not_after`.
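The idea of restricting imputed dates to province bounds can be illustrated with a small base R sketch. Everything in it is hypothetical: the bounds, the records, the helper `impute_once()`, and the drawing scheme itself. It does not reproduce the method in `rmids()`; it only shows how imputed values can be kept within `min TAQ` and `max TPQ` while using an average length of timespan.

```r
# Hypothetical province bounds, standing in for values derived from `rpd`
min_taq <- -50   # assumed minimum of not_before
max_tpq <- 300   # assumed maximum of not_after
avg_len <- 100   # assumed average length of timespan

# Hypothetical records: r1 is censored (start known), r2 has no dates
recs <- data.frame(id = c("r1", "r2"),
                   not_before = c(101, NA),
                   not_after = c(NA, NA))

# One imputation: draw missing starts within the bounds, then close
# missing ends with the average length, capped at the province maximum
impute_once <- function(d) {
  from <- d$not_before
  to <- d$not_after
  no_from <- is.na(from)
  from[no_from] <- round(runif(sum(no_from), min_taq, max_tpq - avg_len))
  no_to <- is.na(to)
  to[no_to] <- pmin(from[no_to] + avg_len, max_tpq)
  data.frame(id = d$id, not_before = from, not_after = to)
}

set.seed(1)  # reproducible draws
imps <- replicate(3, impute_once(recs), simplify = FALSE)  # m = 3
imps[[1]]
```

The censored record keeps its known start in every imputation, while the record without dates receives a different interval in each of the *m* results.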
### Pooling results
Since there are multiple imputations of missing dating data, a next step is to combine the *m* results from function `rmids()` by pooling rules into final point estimates plus standard error.
Pooling options for time intervals are to take:
* average time-span with `avg len TS`
* `min TAQ` and `max TPQ`
* `max TAQ` and `min TPQ`
With these options, there is a single imputed value per variable with implied consequences.
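The three pooling options can be sketched in base R on a hypothetical set of *m* = 3 imputed intervals for one record. The object names and values are assumptions for illustration and are not output from `rmids()`.

```r
# Hypothetical m = 3 imputed time intervals for one record
imp <- data.frame(taq = c(100, 120, 90),
                  tpq = c(200, 260, 150))

# average time-span
avg_len_ts <- mean(imp$tpq - imp$taq)

# widest interval: min TAQ and max TPQ
widest <- c(taq = min(imp$taq), tpq = max(imp$tpq))

# narrowest interval: max TAQ and min TPQ
narrowest <- c(taq = max(imp$taq), tpq = min(imp$tpq))

avg_len_ts
widest
narrowest
```

One consequence to keep in mind is that the last option can yield an inverted interval when the maximum TAQ is later than the minimum TPQ across imputations.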
### See also
#### Vignettes
* [Datasets in `"sdam"` package](../doc/Intro.html)
* [Re-encoding `people` in the `EDH` dataset](../doc/Encoding.html)
* [Cartographical maps and networks](../doc/Maps.html)
#### Reference Manual
* [sdam: Digital Tools for the SDAM Project at Aarhus University](../html/sdam-package.html)
* [`"sdam"` manual](https://github.com/mplex/cedhar/blob/master/typesetting/reports/sdam.pdf)
#### Project
* [Release candidate version](https://github.com/sdam-au/sdam)
* [Code snippets using `"sdam"`](https://github.com/sdam-au/R_code)
* [Social Dynamics and complexity in the Ancient Mediterranean project](https://sdam-au.github.io/sdam-au/)