--- title: "Data Cleaning: From KOSIS Raw Data to Tidy Format" output: rmarkdown::html_vignette date: "`r format(Sys.Date(), '%B %d, %Y')`" vignette: > %\VignetteIndexEntry{Data Cleaning: From KOSIS Raw Data to Tidy Format} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, eval = FALSE, # Set to FALSE since this requires API keys and large data downloads out.width = "100%" ) ``` ```r library(tidycensuskr) library(dplyr) library(tidyr) library(plyr) library(kosis) ``` ## Introduction This vignette demonstrates how to clean and transform raw Korean census data from the Korean Statistical Information Service (KOSIS) API into a tidy, analysis-ready format. While there are existing R packages for accessing Korean statistical data (such as the `kosis` package for general KOSIS data access), **there are currently no readily available software solutions specifically designed for retrieving and processing Korean census data in a tidy, analysis-friendly format** for demographic and socioeconomic research. ### The Challenge Korean census and administrative data presents several challenges for researchers: 1. **Complex API URLs**: KOSIS APIs require detailed parameter specifications with cryptic codes 2. **Multiple data sources**: Different datasets (population, tax, mortality) come from different API endpoints 3. **Inconsistent formatting**: Raw data often comes in wide format with Korean-language column names 4. **Administrative code mapping**: Different datasets use different administrative code systems 5. **Data integration**: Combining multiple datasets requires careful matching of administrative units This vignette shows how the `tidycensuskr` package addresses these challenges by providing pre-processed, tidy datasets and demonstrating the underlying data cleaning workflow. ## Raw Data Sources The data cleaning process involves retrieving data from multiple KOSIS API endpoints: You need to know following information to retrieve data from KOSIS API: - **API Key**: You need to register for an API key from KOSIS. - **API Parameters**: Each API endpoint requires specific parameters such as `itmId`, `objL1`, `objL2`, etc. These parameters define the data you want to retrieve. - **Output Format**: The API supports various output formats, but we will use JSON for easier processing in R. ### API call parameters | Parameter | Type | Description | Required? | |-----------|------|-------------|-----| | `apiKey` | String | Your KOSIS API key | Yes | | `orgId` | String | Organization identifier | Yes | | `tblId` | String | Table identifier | Yes | | `objL1` | String | First class code | Yes | | `objL2`-`objL8` | String | Second to eighth class code | Optional | | `itmId` | String | Item identifier(s) | Yes | | `prdSe` | String | Update cycle | Yes | | `format` | String | Output format (e.g., JSON) | Yes | ### Data variables To note, an output can include the following variables: | Field name (all caps) | Description | Data type | Remarks | |-----|--------------|-------|------| | ORG_ID | Organization code | VARCHAR2(40) | Yes | | TBL_ID | Table identifier | VARCHAR2(40) | | | TBL_NM | Table name | VARCHAR2(300) | | | C1 - C8 | Class identifier (1-8) | VARCHAR2(40) | 2-8 may be omitted if nonexisting | | C1_OBJ_NM - C8_OBJ_NM | Class code (1-8) | VARCHAR2(3000) | | | C1_OBJ_NM_ENG - C8_OBJ_NM_ENG | Class code in English (1-8) | VARCHAR2(3000) | | | C1_NM - C8_NM | Class name (1-8) | VARCHAR2(3000) | | | C1_NM_ENG - C8_NM_ENG | Class name in English (1-8) | VARCHAR2(3000) | | | ITM_ID | Item identifier | VARCHAR2(40) | | | ITM_NM | Item name | VARCHAR2(3000) | | | ITM_NM_ENG | Item name in English | VARCHAR2(3000) | | | UNIT_ID | Unit identifier | VARCHAR2(40) | | | UNIT_NM | Unit name | VARCHAR2(1000) | | | UNIT_NM_ENG | Unit name in English | VARCHAR2(1000) | | | PRD_SE | Data update cycle | VARCHAR2(20) | | | PRD_DE | Data period | VARCHAR2(8) | | | DT | Data value | VARCHAR2(100) | | | LST_CHN_DE | Date of change | VARCHAR2(8) | | It is worth noting that all fields are character, and numeric values are stored as strings. This means that you will need to convert them to numeric types after loading the data. ### API Key To use the KOSIS API, you need to register for an API key. You can obtain one by visiting the [KOSIS API registration page](https://kosis.kr/openapi/serviceUse/serviceUseUnityReg_01Detail.do) and following the instructions. The webpage is only available in Korean and may not be available outside of Korea. ### Example URL The URL below is an example of how to retrieve general tax data from KOSIS. You will need to replace `인증키없음` with your actual API key. Dissecting this URL can help you understand how to construct your own API calls. ```{r tax-url} url_tax_general <- "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T001+&objL1=A0201+A0202+A0203+A0204+A0205+A0206+A0207+A0208+A0209+A0210+A0211+A0212+A0213+A0214+A0215+A0216+A0217+A0218+A0219+A0220+A0221+A0222+A0223+A0224+A0225+A0301+A0302+A0303+A0304+A0305+A0306+A0307+A0308+A0309+A0310+A0401+A0402+A0403+A0404+A0405+A0406+A0407+A0408+A0409+A0410+A0411+A0412+A0413+A0414+A0415+A0416+A0417+A0418+A0419+A0420+A0421+A0422+A0423+A0424+A0425+A0426+A0427+A0428+A0429+A0430+A0431+A0501+A0502+A0503+A0504+A0505+A0506+A0507+A0508+A0509+A0510+A0511+A0512+A0513+A0514+A0515+A0516+A0517+A0518+A0601+A0602+A0603+A0604+A0605+A0701+A0702+A0703+A0704+A0705+A0706+A0707+A0708+A0709+A0710+A0711+A0801+A0802+A0803+A0804+A0805+A0806+A0807+A0808+A0809+A0810+A0811+A0812+A0813+A0814+A0815+A09+A1001+A1002+A1003+A1004+A1005+A1101+A1102+A1103+A1104+A1105+A1106+A1107+A1108+A1109+A1110+A1111+A1112+A1113+A1114+A1201+A1202+A1203+A1204+A1205+A1206+A1207+A1208+A1209+A1210+A1211+A1212+A1213+A1214+A1215+A1216+A1217+A1218+A1219+A1220+A1221+A1222+A1309+A1301+A1302+A1303+A1304+A1305+A1306+A1307+A1308+A1401+A1402+A1403+A1404+A1405+A1406+A1407+A1408+A1409+A1410+A1411+A1412+A1413+A1414+A1415+A1416+A1417+A1418+A1419+A1420+A1421+A1422+A1423+A1501+A1502+A1503+A1504+A1505+A1506+A1507+A1508+A1509+A1510+A1511+A1512+A1513+A1514+A1515+A1516+A1601+A1602+A1603+A1604+A1605+A1701+A1702+A1703+A1704+A1705+A1706+A1707+A1708+A1709+A1710+A1711+A1712+A1713+A1714+A1715+A1716+A1717+A1718+A1802+A1801+&objL2=15133SGH0M+&objL3=&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=133&tblId=DT_133N_A3212" ``` | Parameter | Value | |-----------|-------------------| | `apiKey` | "인증키없음" (meaning "no authentication") as it is for demonstration. | | `orgId` | "133" | | `tblId` | "DT_133N_A3212" | | `objL1` | "A0201", "A0202", ... , "A1801" | | `objL2` | "15133SGH0M" | | `itmId` | "T001" | | `prdSe` | "Y" | | `newEstPrdCnt` | "1" | | `outputFields` | "TBL_ID", "TBL_NM", ... , "UNIT_NM_ENG" | Adding more context, multiple values for `objL1` and `itmId` can be specified by separating them with a plus sign (`+`). For example, `itmId=T001+T002+T003` would retrieve data for multiple items. Try interpreting the following URL to understand how to construct your own API calls: ```{r pop-url} url_pop <- "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T00+T60+&objL1=11010+11020+11030+11040+11050+11060+11070+11080+11090+11100+11110+11120+11130+11140+11150+11160+11170+11180+11190+11200+11210+11220+11230+11240+11250+21010+21020+21030+21040+21050+21060+21070+21080+21090+21100+21110+21120+21130+21140+21150+21510+22010+22020+22030+22040+22050+22060+22070+22510+22520+23010+23020+23030+23040+23050+23060+23070+23080+23090+23510+23520+24010+24020+24030+24040+24050+25010+25020+25030+25040+25050+26010+26020+26030+26040+26510+29010+31010+31011+31012+31013+31014+31020+31021+31022+31023+31030+31040+31041+31042+31050+31051+31052+31053+31060+31070+31080+31090+31091+31092+31100+31101+31103+31104+31110+31120+31130+31140+31150+31160+31170+31180+31190+31191+31192+31193+31200+31210+31220+31230+31240+31250+31260+31270+31280+31550+31570+31580+32010+32020+32030+32040+32050+32060+32070+32510+32520+32530+32540+32550+32560+32570+32580+32590+32600+32610+33020+33030+33040+33041+33042+33043+33044+33520+33530+33540+33550+33560+33570+33580+33590+34010+34011+34012+34020+34030+34040+34050+34060+34070+34080+34510+34530+34540+34550+34560+34570+34580+35010+35011+35012+35020+35030+35040+35050+35060+35510+35520+35530+35540+35550+35560+35570+35580+36010+36020+36030+36040+36060+36510+36520+36530+36550+36560+36570+36580+36590+36600+36610+36620+36630+36640+36650+36660+36670+36680+37010+37011+37012+37020+37030+37040+37050+37060+37070+37080+37090+37100+37510+37520+37530+37540+37550+37560+37570+37580+37590+37600+37610+37620+37630+38030+38050+38060+38070+38080+38090+38100+38110+38111+38112+38113+38114+38115+38510+38520+38530+38540+38550+38560+38570+38580+38590+38600+39010+39020+&objL2=ALL&objL3=000+&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=101&tblId=DT_1IN1509" ``` ## Data Retrieval ### Setting up API Access ```r library(tidycensuskr) library(dplyr) library(tidyr) library(kosis) # Set KOSIS API key (stored in a secure file) kosiskey <- readLines("~/.kosiskey")[1] tidycensuskr::set_kosis_key("~/.kosiskey") ``` ### Downloading Raw Data ```r # Download raw datasets from KOSIS API df_tax <- kosis::getStatDataFromURL(url_tax_general) # Load administrative district lookup table sgg_lookup <- read.csv( system.file( file.path("extdata", "lookup_district_code.csv"), package = "tidycensuskr" ), fileEncoding = "EUC-KR" ) ``` ## Data Cleaning Workflow ### 1. Administrative Code Mapping The first challenge is creating a consistent mapping between different administrative code systems used across datasets: ```r # Create administrative code mapping for provinces (sido) sidocd_range <- tibble::tribble( ~sido_kr, ~sido_cd, ~sido_txcd, "서울특별시", "11", "02", "부산광역시", "21", "15", "대구광역시", "22", "13", "인천광역시", "23", "03", "광주광역시", "24", "10", "대전광역시", "25", "06", "울산광역시", "26", "16", "세종특별자치시", "29", "09", "경기도", "31", "04", "강원특별자치도", "32", "05", "충청북도", "33", "07", "충청남도", "34", "08", "전라북도", "35", "11", "전라남도", "36", "12", "경상북도", "37", "14", "경상남도", "38", "17", "제주특별자치도", "39", "18" ) ``` ### 2. Tax Data Processing Transform raw tax data into a standardized format: ```r df_tax_compact <- df_tax |> dplyr::transmute( adm2_code = C1, # Administrative code value = DT # Tax value in million KRW ) |> dplyr::inner_join( sgg_lookup[, c("sgg_tax_global", "sido_en", "sigungu_1_en", "adm2_code")], multiple = "first" ) ``` ### 3. Population Data Processing Clean and reshape population data with gender disaggregation: ```r df_pop2 <- df_pop |> dplyr::mutate( sex = plyr::mapvalues(C2, c(0, 1, 2), c("total", "male", "female")), type = plyr::mapvalues(ITM_ID, c("T00", "T60"), c("population_total", "population_nonrelative")) ) |> dplyr::select(C1, C1_NM, sex, type, DT) |> tidyr::pivot_wider( names_from = c(type, sex), values_from = DT ) |> dplyr::rename( sigungu_cd = C1, sigungu_kr = C1_NM ) |> dplyr::mutate( adm2_code = as.integer(adm2_code) ) |> dplyr::inner_join( sgg_lookup[, c("adm2_code", "sido_en", "sigungu_1_en")], by = "adm2_code", multiple = "first" ) ``` ## Transforming to Tidy Format The key feature of `tidycensuskr` is converting wide-format, multiple-dataset structure into a single, long-format tidy dataset: ### 1. Convert Each Dataset to Long Format ```r # Tax data to long format df_tax_long <- df_tax_compact |> dplyr::select(2:5) |> tidyr::pivot_longer( cols = "value" ) |> dplyr::mutate( type = "tax", class1 = "global", class2 = "total", unit = "million KRW" ) |> dplyr::select(-name) # Population data to long format df_pop_long <- df_pop2 |> dplyr::select(-2) |> tidyr::pivot_longer( cols = 2:7 ) |> tidyr::separate(col = "name", into = c("type", "class1", "class2"), sep = "_") ``` ### 2. Combine into Single Tidy Dataset ```r # Bind all datasets into one comprehensive long-format dataset censuskor <- dplyr::bind_rows( df_tax_long, df_pop_long ) |> dplyr::rename( adm1 = sido_en, adm2 = sigungu_1_en ) |> dplyr::mutate( year = 2020 ) |> dplyr::select( year, adm1, adm2, adm2_other, adm2_code, type, class1, class2, unit, value ) ``` ## The Result: A Tidy Dataset The final `censuskor` dataset has a consistent structure: - **year**: Census year (2020) - **adm1**: Province-level administrative name (in English) - **adm2**: District-level administrative name (in English) - **adm2_code**: Numeric administrative code - **type**: Data type ("population", "tax", "mortality") - **class1**: First-level classification (e.g., "global", "income", "All causes") - **class2**: Second-level classification (e.g., "total", "male", "female") - **unit**: Unit of measurement - **value**: Numeric value ### Example Usage ```r library(tidycensuskr) library(dplyr) # Load the cleaned dataset data(censuskor) # View the structure head(censuskor) # Filter for Seoul population data seoul_pop <- censuskor |> filter(adm1 == "Seoul", type == "population", class1 == "population") |> select(adm2, class2, value) |> pivot_wider(names_from = class2, values_from = value) head(seoul_pop) ``` These data can be joined with `sf` objects bundled in the package to make maps and perform spatial analysis ## Notes on `adm2_code` changes Districts (_si-gun-gu_) are the second-level administrative units in South Korea. Each district is assigned a unique code called `adm2_code`. However, these codes can change over time due to administrative boundary changes, mergers, or reclassifications. Notable changes were made in all rural districts (_gun_) in 2022; the third digit of `adm2_code` got addition of 2. For example, _Ulleung-gun_ changed from `37430` to `37630`. This change may affect longitudinal analyses that span multiple years and pose challenges in updating `censuskor` dataset by developers. Some datasets in KOSIS are **retrospectively** updated to reflect the new codes, while most KOSIS datasets retain the old codes. Therefore, developers should be advised to check the `adm2_code` for each dataset and year with this code: ```r # imported data some_census_table # check adm2_code any(substr(some_census_table$adm2_code, 3, 3) %in% c("5", "6")) ``` If `TRUE`, it indicates that the dataset uses the **new** `adm2_code` system (post-2022) regardless of the raw data year. If `FALSE`, it uses the **old** system (pre-2022). **Note**: This vignette shows the data cleaning process for developers. The actual `censuskor` dataset in the package is the result of this workflow and is ready for immediate use.