---
title: "Using rcldf to work with Cross-Linguistic Data Format datasets"
author: "Simon J. Greenhill"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using rcldf to work with Cross-Linguistic Data Format datasets}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteTangle{false}
editor_options:
  markdown:
    wrap: 72
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(rcldf)
library(dplyr)
library(ggplot2)
library(patchwork)

theme_set(theme_classic())
```

# Introduction

The `rcldf` package provides tools to read and interact with `CLDF` datasets in R. This vignette demonstrates how to install the package, load these datasets, and use them.

Cross-Linguistic Data Formats (CLDF, [Forkel et al. 2018](https://doi.org/10.1038/sdata.2018.205)) is a standardized data format designed to handle cross-linguistic and cross-cultural datasets. CLDF provides a consistent [specification and package format](https://cldf.clld.org/) for common types of linguistic and cultural data, e.g. word lists, grammatical features, and cultural traits. The aim of the format is to provide a simple, reliable data format that facilitates the sharing and re-use of these data, making new analyses possible.

There are currently more than [250 CLDF datasets available](https://zenodo.org/search?q=CLDF&f=resource_type%3Adataset&l=list&p=1&s=10&sort=bestmatch) containing data from the world's languages and cultures, including everything from catalogues of linguistic metadata (e.g. [Glottolog](https://glottolog.org) and [EndangeredLanguages.com](https://www.endangeredlanguages.com)), to word lists of lexical data (e.g. [Lexibank](https://lexibank.clld.org/)), grammatical features (e.g. [WALS](https://wals.info) and [Grambank](https://grambank.clld.org)), phonetic information (e.g. [Phoible](https://phoible.org/)), geographic information (e.g.
[Wurm & Hattori 1981/1983](https://doi.org/10.5281/zenodo.15183865)), and religious and cultural databases (e.g. [D-PLACE](https://d-place.org/) and [Pulotu](https://pulotu.com)).

# Installation

You can install `rcldf` directly from GitHub using the `devtools` package:

```{r eval = FALSE}
library(devtools)
install_github("SimonGreenhill/rcldf", dependencies = TRUE)
library(rcldf)
```

# Loading a CLDF Dataset

Once installed, load a CLDF dataset by creating a `cldf` object from a dataset. You can point to the dataset using a local path or a URL:

```{r eval = FALSE}
library(rcldf)

# Load from a local directory:
ds <- cldf('/path/to/dir/wals_1a_cldf')

# or load from a specific metadata file:
ds <- cldf('/path/to/dir/wals_1a_cldf/StructureDataset-metadata.json')

# or load from Zenodo:
df <- cldf("https://zenodo.org/record/7844558/files/grambank/grambank-v1.0.3.zip?download=1")

# or load from GitHub:
df <- cldf('https://github.com/lexibank/abvd')
```

# Exploring a CLDF Dataset

Once loaded, a `cldf` object contains various pieces of information. To show this I'll use a small dataset that comes packaged with `rcldf`. This dataset contains a wordlist from a number of Huon Peninsula languages from New Guinea, originally described in @McElhanon1967:

- McElhanon, K. A. (1967). Preliminary Observations on Huon Peninsula Languages. *Oceanic Linguistics*, *6*(1), 1–45.

```{r}
library(rcldf)
ds <- cldf(system.file("extdata/huon", package="rcldf"))

# this dataset has 4 tables:
ds

# more details:
summary(ds)
```

So here we have a dataset with tables of languages, parameters (=words), forms (=the lexical items), and cognates (=cognacy information showing how the lexical items are related). There is also other information here, e.g. the `citation` for where the dataset came from, the `path` where the dataset is stored, and which `Type` of CLDF specification this dataset adheres to.

## Accessing the data
Each table is attached to the `ds$tables` list, so you can access a table as `ds$tables$<TableName>`. These are simply dataframes (or tibbles), so you can then do anything you want with them:

```{r}
names(ds$tables)

# let's look at the languages --
head(ds$tables$LanguageTable)

# and the parameters - in this case the words in the wordlist
head(ds$tables$ParameterTable)

# and finally, the lexical items themselves:
ds$tables$FormTable
```

## Load all the source information

CLDF datasets have sources stored in BibTeX format. We don't load them by default, as it can take a long time to parse the BibTeX file correctly. You can load and access them like this, and the sources are then available in `ds$sources` in `bib2df` format:

```{r}
ds <- cldf(system.file("extdata/huon", package="rcldf"), load_bib=TRUE)

# or if you loaded the CLDF without sources the first time you can add them now:
ds <- read_bib(ds)

ds$sources
```

## Construct a 'wide' table with all foreign key entries filled in:

Sometimes people want to have all the data from a CLDF dataset as one dataframe. Use `as.cldf.wide` to do this, passing it the name of a table to act as the base. This will take the base table and resolve all foreign keys (usually `*_ID`) into their own columns. Name clashes between the two tables are resolved by appending the table name (e.g. the column `Name` in the original `CodeTable` will become `Name.CodeTable`).

For example, this dataset has a `FormTable` which connects to the `ParameterTable` via `Parameter_ID`, and to the `LanguageTable` via `Language_ID`. Using `as.cldf.wide` we can combine all the information from `ParameterTable` and `LanguageTable` into the `FormTable`:

```{r}
as.cldf.wide(ds, 'FormTable')
```

## Load just one table:

Sometimes you just want to get one table.
To do this, call `get_table_from` with the table type and dataset path:

```{r}
get_table_from('LanguageTable', system.file("extdata/huon", package="rcldf"))
```

### Get the citation for a dataset:

```{r}
print(ds$citation)
```

### Quickly get information on an unloaded dataset:

Perhaps you want to know which version of a dataset you have without loading the whole dataset. Use `get_details`:

```{r}
get_details(system.file("extdata/huon", package="rcldf"))
```

## Easily get reference catalogs (Glottolog, Concepticon):

`rcldf` has a couple of utility functions to get the latest versions of the Glottolog and Concepticon CLDF reference catalogues:

```{r, eval=FALSE}
glott <- load_glottolog()
conc <- load_concepticon()
```

## Cache Information

When you load a dataset from a URL, `rcldf` downloads the dataset and unpacks it into a cache directory. By default this is a temporary directory, which will be deleted when you close R. However, by specifying a directory, or by using `tools::R_user_dir("rcldf", which = "cache")`, you can re-use the dataset later.

To see where downloads will be saved:

```{r}
get_cache_dir()
```

You can set the cache directory for the current session with:

```{r, eval=FALSE}
set_cache_dir('/path/somewhere')
```

If you want to set this permanently, edit your R environ file to add the line `RCLDF_CACHE_DIR=/path/somewhere`:

```{r, eval=FALSE}
usethis::edit_r_environ()
# now add
# RCLDF_CACHE_DIR=
```

To see what datasets you've downloaded:

```{r}
list_cache_files()
```

You can re-use datasets in your cache:

```{r}
cldf(list_cache_files()[1, 'Path'])
```

Or just save them to a particular directory:

```{r, eval=FALSE}
df <- cldf("https://zenodo.org/record/7844558/files/grambank/grambank-v1.0.3.zip?download=1", cache_dir="~/data/grambank")
```

# Using rcldf to analyse datasets:

To show you how to use `rcldf` to analyse datasets, we're going to test whether languages that distinguish inclusive pronouns from exclusive pronouns (i.e.
clusivity) also tend to have high rigidity in social structure. To do this, we will use the Grambank [Feature GB028: Is there a distinction between inclusive and exclusive?](https://grambank.clld.org/parameters/GB028) and the D-PLACE [Trait EA113: Degree of rigidity in social structures](https://d-place.org/parameters/EA113).

First, let's get the published version of Grambank off Zenodo. It's always a good idea to use the published version, as this is a citable and versioned product, which makes replicating your analysis easier for other researchers. To do this, [go to the Zenodo page for Grambank](https://zenodo.org/records/7844558), find the download link under the 'Files' section, and copy it.

![Screenshot of Zenodo page for Grambank showing the download link](zenodo.png)

It will look something like `https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1`. Give that to `rcldf`:

```{r}
grambank <- cldf("https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1")
grambank
```

Now, let's use `dplyr` to get the variable we want from Grambank. We use `summary` to see what the dataset looks like. It is a CLDF 'Structure' dataset with six tables:

```{r}
summary(grambank)
```

Let's start by extracting a list of languages in this dataset:

```{r}
languages <- grambank$tables$LanguageTable |>
    select(ID, Name, Macroarea, Latitude, Longitude)
# only selecting some columns above to make it easier to see
languages
```

We can also look at the `ParameterTable` to see information on our parameter of interest: "GB028: Is there a distinction between inclusive and exclusive?". The `ID` of this feature is "GB028", so let's just see that one:

```{r}
grambank$tables$ParameterTable |> filter(ID=='GB028')
```

Ok, now let's get all the values for this parameter in `ValueTable`.
Following the CLDF standard, the parameter is linked to the `ValueTable` via the `Parameter_ID` column, so let's select all the rows in `ValueTable` that match a Parameter_ID of "GB028":

```{r}
values <- grambank$tables$ValueTable |>
    filter(Parameter_ID=='GB028') |>
    select(ID, Language_ID, Parameter_ID, Value, Source)
```

Now we need to get the data from D-PLACE. We'll use the GitHub repository to get these data to show you how that works, but you should probably use the published version on Zenodo for proper work. Get the GitHub repository link and give that to `rcldf` too:

```{r}
dplace <- cldf("https://github.com/D-PLACE/dplace-cldf")
summary(dplace)
```

Great, let's get the values for the parameter we want. D-PLACE indexes the variables in a column `Var_ID`:

```{r}
dplace$tables$ValueTable |> filter(Var_ID=='EA113')
```

We want to merge this with our Grambank data, but we need a way to do this. D-PLACE stores `glottocode` information that maps each society to a language. This column is also in the `LanguageTable` in Grambank. So, let's get the trait values for EA113 and add the glottocodes from the `LanguageTable` using a `join`:

```{r}
# get languages from D-PLACE
dplanguages <- dplace$tables$LanguageTable |>
    select(ID, Glottocode)

# get values for EA113 and merge with language information
ea113 <- dplace$tables$ValueTable |>
    filter(Var_ID=='EA113') |>
    select(Soc_ID, Value) |>
    left_join(dplanguages, join_by(Soc_ID==ID))

# rename `Value` to EA113
ea113 <- ea113 |> mutate(EA113=Value) |> select(Glottocode, EA113)

head(ea113)
```

Now get the Grambank data into the same format:

```{r}
gb028 <- values |>
    mutate(Glottocode=Language_ID, GB028=Value) |>
    select(Glottocode, GB028)

head(gb028)
```

...and finally join them up using the shared column 'Glottocode'. I'll use an inner join here to only keep the languages/societies that are in both datasets.
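If the difference between join types is unfamiliar, here is a toy sketch (using made-up glottocodes, not drawn from either dataset) of why an inner join is the right choice: `inner_join` keeps only the rows whose `Glottocode` occurs in both tables, whereas `left_join` would also keep unmatched rows, padding them with `NA`:

```{r}
library(dplyr)

# two made-up tables sharing a Glottocode key; only "aaaa1234" is in both
toy_gb <- tibble(Glottocode = c("aaaa1234", "bbbb5678"), GB028 = c("1", "0"))
toy_ea <- tibble(Glottocode = c("aaaa1234", "cccc9012"), EA113 = c("3", "1"))

inner_join(toy_gb, toy_ea, by = "Glottocode")  # 1 row: the shared Glottocode only
left_join(toy_gb, toy_ea, by = "Glottocode")   # 2 rows: "bbbb5678" gets NA for EA113
```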
And we'll use `na.omit` to keep only rows that have data for both variables:

```{r}
df <- gb028 |>
    inner_join(ea113, by = "Glottocode") |>
    na.omit()
head(df)
```

Ok, it looks like we only have 17 rows for this pairing. That's a little small, but let's plot the data:

```{r}
p1 <- ggplot(df, aes(x=GB028)) + geom_bar()
p2 <- ggplot(df, aes(x=EA113)) + geom_bar()
p1 / p2  # patchwork
```

Hmm. It looks like societies that do not mark this distinction between inclusive and exclusive tend to be those with flexible social structure. Looking good for our hypothesis, but let's test it formally to make sure we're not just seeing a chance pattern:

```{r}
tab <- table(df$GB028, df$EA113)
tab
chisq.test(tab)
```

...and the probability of getting a pattern like this by chance is essentially 1.0. So yes, this is just a chance result, and there is no evidence in these data that languages which distinguish inclusive from exclusive also tend to be those with rigid social structures. There goes our big paper in Nature/Science.

## References: