--- title: "Introduction to theft" author: "Trent Henderson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 4 vignette: > %\VignetteIndexEntry{Introduction to theft} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 7, fig.width = 7, warning = FALSE, fig.align = "center" ) ``` ```{r setup, message = FALSE, warning = FALSE} library(theft) ``` ## Purpose `theft` enables the standardised calculation of time-series features from six existing feature sets in both R and Python as well as any user-supplied features. ## Core functionality All time-series datasets passed into `theft` must be a `tbl_ts` generated through the [`tsibble`](https://tsibble.tidyverts.org/) R package. This ensures consistency with the broader [`tidyverts`](https://tidyverts.org/) collection of packages. To explore package functionality, we are going to use a dataset that comes with `theft` called `simData`. This dataset contains some simulated time series processes, including Gaussian noise, AR(1), ARMA(1,1), MA(1), noisy sinusoid, and a random walk. The dataset can be accessed via: ```{r, message = FALSE, warning = FALSE, eval = FALSE} theft::simData ``` Note that `simData` is a `tsibble` with two `key` variables: `id` and `process` which identify each time series by its unique ID and group, and an `index` variable of `timepoint` which denotes time indices. The data follows the following structure: ```{r, message = FALSE, warning = FALSE} head(simData) ``` ### Calculating time-series features The core function in `theft` is `calculate_features`. You can choose which subset of features to calculate with the `feature_set` argument. The choices are currently `"catch22"`, `"feasts"`, `"tsfeatures"`, `"tsfresh"`, `"tsfel"`, and `"kats"`. In addition, two basic feature sets `"quantiles"` (a set of 100 quantiles) and `"moments"` (the first four moments of the distribution: mean, variance, skewness, and kurtosis) are also available to be specified for users seeking to compute simple baselines against which to compare the more sophisticated feature sets (see [this recent paper](https://arxiv.org/abs/2303.17809) for more discussion on this idea). Note that `kats`, `tsfresh` and `tsfel` are Python packages. The R package `reticulate` is used to call Python code that uses these packages and applies it within the broader *tidy* data philosophy embodied by `theft`. `theft` currently provides access to $>1100$ features from these six sets alone. However, as discussed in the functionality demonstrations below, you can also supply your own list of features too! #### Installing Python feature sets Prior to using `theft` (only if you want to use the `Kats`, `tsfresh` or `TSFEL` feature sets; the R-based sets will run fine) you should have a working Python 3.9 installation and run the function `install_python_pkgs(venv, python)` after first installing `theft`, where the `venv` argument is the name of the virtual environment you want to create and `python` is the path to the Python interpreter you want to use. Note that the filepath to the Python interpreter will differ depending on your operating system. For example, on Windows it might be a path similar to `"C:/Users/YourName/AppData/Local/Programs/Python/Python310/python.exe"`, whereas on on Linux or MacOS (which was used for this vignette) it could be a path similar to `"/usr/local/bin/python3.10"` or `"/usr/bin/python3"`. For example, if you wanted to install the Python libraries to the default virtual environment folder used by `reticulate`, you would run the following after first having installed `theft` (here I am just creating a new virtual environment called `"theft-package"`---you can call it whatever you like!): ```{r, eval = FALSE} install_python_pkgs(venv = "theft-package", python = "/usr/local/bin/python3.10") ``` You can then run the following to activate the virtual environment: ```{r, eval = FALSE} init_theft("theft-package") ``` You are now ready to commit theft! **NOTE 1: You only need to call ** `init_theft` **once per session.** **NOTE 2: There are also separate installation functions for each Python feature set, such as** `install_tsfresh` **if you only need one of the libraries and want to keep your dependencies light.** #### Calculating features The core function in `theft` is `calculate_features` which takes the following arguments: * `data`---a `tbl_ts` containing the time series data * `feature_set`---character or vector of characters denoting the set of time-series features to calculate. Can be one or more of `"catch22"`, `"feasts"`, `"tsfeatures"`, `"tsfresh"`, `"tsfel"`, `"kats"`, `"quantiles"`, or `"moments"` * `features`---a named list containing a set of user-supplied functions to calculate on `data`. Each function should take a single argument which is the time series. Defaults to `NULL` for no manually-specified features. Each list entry must have a name as `calculate_features` looks for these to name the features. If you don't want to use the existing feature sets and only compute those passed to `features`, set `feature_set = NULL` * `catch24`---Boolean specifying whether to compute `catch24` in addition to `catch22` if `catch22` is one of the feature sets selected. Defaults to `FALSE` * `tsfresh_cleanup`---Boolean specifying whether to use the in-built `tsfresh` relevant feature filter or not. Defaults to `FALSE` * `use_compengine`---Boolean specifying whether to use the `"compengine"` features in `tsfeatures`. Defaults to `FALSE` to provide immense computational efficiency benefits * `seed`---integer denoting a fixed number for R's random number generator to ensure reproducibility. Defaults to `123` * `z_score`---Boolean specifying whether to *z*-score the time-series before computing features. Defaults to `FALSE` * `n_jobs`---integer denoting the number of parallel processes to use if `"tsfresh"` or `"tsfel"` are specified in `feature_set`. Defaults to `0` for no parallelisation * `warn`---Boolean specifying whether to produce warnings from feature set packages. Defaults to `TRUE` Here is an example with the `catch22` set: ```{r, message = FALSE, warning = FALSE} feature_matrix <- calculate_features(data = simData, feature_set = "catch22") head(feature_matrix) ``` Note that `data` must be a `tsibble::tbl_ts` object, which has specified `key` (i.e., identifying) and `index` (i.e., time) variables. `theft` treats the first variable in the `key` as the ID variable and the second as the grouping variable (if there is one). Any other key variables will be ignored by `theft`. You can also supply your own named list of functions to compute as time-series features. Below is an example with mean and standard deviation. Note that the list *must* be named as `theft` uses the list element names to label the time-series features internally. Note that if you don't want to use any of the existing feature sets in `theft` and only calculate the features you supply to `features`, just set `feature_set = NULL`. ```{r, message = FALSE, warning = FALSE} feature_matrix2 <- calculate_features(data = simData, feature_set = NULL, features = list("mean" = mean, "sd" = sd)) head(feature_matrix2) ``` ### Comparison of feature sets For a detailed comparison of the six feature sets, see [this paper](https://ieeexplore.ieee.org/document/9679937) for a detailed review^[T. Henderson and B. D. Fulcher, "An Empirical Evaluation of Time-Series Feature Sets," 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.]. ## Reading and processing hctsa-formatted files As `theft` is based on the foundations laid by [`hctsa`](https://github.com/benfulcher/hctsa), there is also functionality for reading in `hctsa`-formatted Matlab files and automatically processing them into tidy dataframes ready for feature extraction in `theft`. The `process_hctsa_file` function takes a string filepath to the Matlab file and does all the work for you, returning a dataframe with naming conventions consistent with other `theft` functionality. As per `hctsa` specifications for [Input File Format 1](https://time-series-features.gitbook.io/hctsa-manual/installing-and-using-hctsa/calculating/input_files#input-file-format-1-.mat-file), this file should have 3 variables with the following exact names: `timeSeriesData`, `labels`, and `keywords`. The filepath can be a local drive path or a URL. ## Analysing, interpreting, and visualising time-series features Please see the companion package [`theftdlc`](https://github.com/hendersontrent/theftdlc) ('`theft` downloadable content') for a large suite of functions that are designed to work on top of `theft`.