---
title: "trendtestr-intro-eustockmarkets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{trendtestr-intro-eustockmarkets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

This vignette walks through a **recommended automated workflow** in **`trendtestR`** using the built-in **`EuStockMarkets`** dataset. Rather than demonstrating every function, it focuses on a **small, practical subset** that covers most day-to-day use cases with minimal manual tuning.

## What you’ll do (automated path)

1. **Prepare & reshape data** (wide → long for grouped analysis)
2. **Verify time-window continuity** to ensure valid comparisons
3. **Compare cross-year periods** at two granularities (weekly vs daily)
4. **Run auto-selected group tests** via `run_group_tests()` with assumptions and effect sizes
5. **Fit trends with one call** via `explore_trend_auto()` (automatic model family & smoothing)
6. **(Optional) ARIMA readiness check** to decide if time-series modeling is warranted

## Selected functions used

- `filter_by_groupcol()`: subset/structure data by groups and date column
- `check_continuity_by_window()`: confirm continuous windows across years/months
- `compare_monthly_cases()`: cross-year comparison with chosen granularity & aggregation
- `compare_distribution_by_granularity()`: sanity-check distributional changes (day vs week)
- `run_group_tests()`: automated test selection + assumptions + effect sizes
- `explore_trend_auto()`: automatic trend modeling (Gaussian/Gamma, splines)
- `check_rate_diff_arima_ready()` *(optional)*: stationarity, seasonality & differencing hints

## Dataset

**`EuStockMarkets`** contains 1,860 daily closing prices for four European stock indices (**DAX**, **SMI**, **CAC**, **FTSE**). The original series is sampled on business days between 1991 and 1998. To keep the example simple, this vignette assigns the observations a consecutive-day calendar starting on 1991-01-01, which therefore ends on 1996-02-03; the dates used below are illustrative rather than the true trading dates. For clarity and speed, the vignette focuses on **DAX** and **CAC** and a **two-year cross-year window**.

# Workflow

## 1. Installation and Setup

This section loads the necessary packages and prepares the built-in dataset **`EuStockMarkets`** for analysis. We will:

- Convert the built-in time-series object to a **data.frame** with an explicit **date** column.
- Reshape it from **wide format** (one column per market) to **long format** (one column for market names, one for index values), which is easier to group, filter, and visualize in **`trendtestR`** workflows.

## 1.1 Load required packages

```r
library(trendtestR)
library(dplyr)
library(tidyr)
library(lubridate)
```

## 1.2 Data Preparation

The built-in dataset contains daily closing prices of four European stock market indices: DAX (Germany), SMI (Switzerland), CAC (France), and FTSE (UK). With the simplified consecutive-day calendar built below, the rows span 1991-01-01 to 1996-02-03.

```r
# Load the built-in dataset
data("EuStockMarkets")

# Create a data frame with a (simplified, consecutive-day) date column
# and the four stock indices
eu_df <- data.frame(
  date = seq(as.Date("1991-01-01"), by = "day", length.out = nrow(EuStockMarkets)),
  as.data.frame(EuStockMarkets)
)

# Preview the last few rows
tail(eu_df)

# Reshape the dataset to long format for easier grouping and filtering
eu_long <- eu_df %>%
  pivot_longer(
    cols = c(DAX, SMI, CAC, FTSE),
    names_to = "market",
    values_to = "index"
  ) %>%
  mutate(market = factor(market))

# Preview the first few rows
head(eu_long)
```

## 2. Data Filtering

We keep only **DAX** (Germany) and **CAC** (France) for a smaller, faster analysis. **`filter_by_groupcol()`** lets us select specific groups while keeping the data in long format.

```r
# Keep only the DAX and CAC series, still in long format
ecoDaxCac <- filter_by_groupcol(
  eu_long,
  group_col = "market",   # grouping variable
  value_col = "index",    # values to analyze
  datum_col = "date",     # date variable
  keep_levels = c("DAX", "CAC"),
  to_wide = FALSE,
  keep_other_cols = TRUE
)

# Preview the first few rows
head(ecoDaxCac)
```

## 3. Data Continuity Check

We use **`check_continuity_by_window()`** to verify there are no date gaps in the selected period, ensuring data quality before running further functions.

```r
checkconti <- check_continuity_by_window(
  date_vec = ecoDaxCac$date,
  years = c(1991, 1993),
  months = c(10, 9),
  window_unit = "day",
  use_isoweek = TRUE,
  allow_leading_gap = TRUE
)

# Display continuity results
cat("Data is continuous:", checkconti$continuous, "\n")
cat("Data range:", as.character(checkconti$range), "\n")
# Output: Data is continuous: TRUE
# Output: Data range: 1991-10-01 1993-09-30
```

## 4. Cross-Year Data Comparison

We use **`compare_monthly_cases()`** to compare values between years over a cross-year period, allowing flexible month selection and time aggregation.

## 4.1 Weekly Granularity Comparison

We first compare 1992–1993 data aggregated weekly:

```r
# Compare 1992-1993 data with weekly granularity
reseuro <- compare_monthly_cases(
  ecoDaxCac,
  datum_col = "date",
  value_col = "index",
  group_col = "market",
  years = c(1992, 1993),
  months = c(10:12, 1:9),       # Oct–Dec + Jan–Sep (cross-year)
  granularity = "week",
  agg_fun = "mean",
  shift_month = "mth_to_next"   # alternatives: "mth_to_prev", "none"
)
# Note: Function automatically excludes groups with no data (1991, 1995)
# Shows standardization info and data characteristics
```

## 4.2 Daily Granularity Comparison

We repeat the analysis at daily granularity to compare results:

```r
# Compare same period with daily granularity
reseurod <- compare_monthly_cases(
  ecoDaxCac,
  datum_col = "date",
  value_col = "index",
  group_col = "market",
  years = c(1992, 1993),
  months = c(10:12, 1:9),
  granularity = "day",
  agg_fun = "median",
  shift_month = "mth_to_next"
)

# View statistical test results
print(reseuro$tests)
# Results show Kruskal-Wallis test with large effect size (eta² ≈ 0.31)
# Includes assumption checks and post-hoc Dunn tests
```

## 4.3 Distribution Comparison Across Granularities

We then compare distributions between
granularities to guide aggregation choice:

```r
# Compare distributions using Q-Q plots
compare_distribution_by_granularity(reseuro, reseurod)
# Shows normality tests and variance tests for different granularities
# Helps determine optimal time aggregation level
```

This helps determine the most suitable time aggregation level for subsequent statistical analyses.

## 5. Automated Statistical Testing

We use **`run_group_tests()`** to automatically select and perform the most appropriate statistical test based on data characteristics, including assumption checks and effect size calculation.

```r
# Run automated group comparison tests
test_results <- run_group_tests(
  reseuro$data,
  value_col = "index",
  group_col = "market",
  effect_size = TRUE,
  report_assumptions = TRUE
)
print(test_results)
# Function automatically excludes groups with no data (FTSE, SMI)
# Recommends Mann-Whitney U-Test due to violated normality assumptions
```

## 6. Trend Modeling

We start with **automatic model selection** using **`explore_trend_auto()`**, which evaluates multiple candidate families (e.g., Gaussian, Gamma, Poisson, ZINB) and chooses the most suitable one based on AIC and model diagnostics. This step provides a quick, data-driven baseline model before fine-tuning parameters such as spline degrees of freedom in the next section.

## 6.1 Automated Trend Exploration

```r
# Automatically select the most appropriate trend model
trend_auto <- explore_trend_auto(
  reseuro$data,
  datum_col = "date",
  value_col = ".value",
  group_col = "market",
  family = "auto",
  kdf = 5
)
print(trend_auto$summary)
# Function compares Gaussian vs Gamma GLM and selects optimal model
# Shows AIC comparison and model selection rationale
```

## 6.2 Spline Degrees of Freedom Optimization

While **`explore_trend_auto()`** already selects a reasonable default, users may wish to **manually fine-tune model complexity** for deeper exploration.
Here we illustrate one such approach: selecting spline degrees of freedom based on the **largest AIC drop** compared to the previous candidate, rather than simply picking the absolute AIC minimum. This captures the point of **maximum improvement before diminishing returns**.

> This is just **one possible workflow**; any of the `explore_*_trend()` functions can be used interactively to test different model families, spline settings, or grouping structures for more tailored analysis.

```r
# Create AIC comparison data frame
aic_df <- data.frame(
  df_spline = integer(),
  AIC = numeric()
)

# Loop through different degrees of freedom
for (df in 4:7) {
  tmp <- explore_continuous_trend(
    reseuro$data,
    datum_col = "date",
    value_col = ".value",
    group_col = "market",
    family = "gaussian",
    df_spline = df
  )
  aic_df <- rbind(aic_df, data.frame(df_spline = df, AIC = AIC(tmp$model)))
}

# Find optimal degrees of freedom
aic_drop <- diff(aic_df$AIC)   # element i is AIC[i + 1] - AIC[i]
# Take the candidate reached by the largest decrease in AIC
optimal_df <- aic_df$df_spline[which.max(-aic_drop) + 1]
cat("optimal spline degrees of freedom:", optimal_df, "\n")
```

## 6.3 Modeling with Optimal Parameters

We refit the trend model using the **optimal spline degrees of freedom** found above, ensuring the model complexity is justified by the largest improvement in fit.

```r
euexp <- explore_continuous_trend(
  reseuro$data,
  datum_col = "date",
  value_col = ".value",
  group_col = "market",
  family = "gaussian",
  df_spline = optimal_df   # value selected in section 6.2 (df = 5 in the author's run)
)

# View model summary
summary(euexp$model)
```

## 7. Model Diagnostics

After fitting the trend model, we run **`diagnose_model_trend()`** to check whether model assumptions are met. This step validates residual behavior, tests for normality and variance homogeneity, and helps decide if further model adjustments are necessary.

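To make these checks concrete, here is a minimal base-R sketch of residual normality and variance-homogeneity tests. It is not how `diagnose_model_trend()` is implemented: the model is a stand-in `lm()` spline fit on the raw DAX series (an assumption for illustration, not the object returned by `explore_continuous_trend()`), the four time blocks are an arbitrary grouping, and Fligner-Killeen is swapped in for Levene's test because it ships with base R.

```r
library(splines)  # ships with R; provides ns()

# Stand-in model (illustrative only): Gaussian spline trend for DAX via lm()
dax <- as.numeric(EuStockMarkets[, "DAX"])
t_index <- seq_along(dax)
fit <- lm(dax ~ ns(t_index, df = 5))
res <- residuals(fit)

# Normality of residuals
sw <- shapiro.test(res)                          # Shapiro-Wilk (3 to 5000 values)
ks <- ks.test(as.numeric(scale(res)), "pnorm")   # KS against a standard normal

# Variance homogeneity across four consecutive time blocks
# (Fligner-Killeen used as a base-R substitute for Levene's test)
block <- factor(cut(t_index, breaks = 4, labels = FALSE))
fk <- fligner.test(res, block)

# Collect the p-values for a quick overview
round(c(shapiro = sw$p.value, ks = ks$p.value, fligner = fk$p.value), 4)
```

Because the KS test here uses a mean and standard deviation estimated from the same residuals, its p-value is only approximate. The package call that follows bundles its own versions of these checks together with residual plots.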

```r
# Perform model diagnostics
diagnosis <- diagnose_model_trend(euexp$model)
# Provides residual plots, normality tests, and homogeneity checks
# Includes Kolmogorov-Smirnov, Shapiro-Wilk, and Levene tests
```

## 8. ARIMA Modeling Preparation

Before applying ARIMA, we run **`check_rate_diff_arima_ready()`** to assess if the data meets key time-series assumptions. This step checks for outliers, trend and seasonality patterns, and suggests whether differencing is required, ensuring a more stable ARIMA fit.

```r
# Pre-ARIMA modeling checks
arima_check <- check_rate_diff_arima_ready(
  rate_diff_vec = eu_df$DAX,
  date_vec = eu_df$date,
  frequency = 52,
  plot_acf = TRUE,
  do_stl = TRUE
)
# Shows comprehensive analysis: outliers, stationarity tests,
# seasonal decomposition, and differencing recommendations
```

## Other Functions (at a glance)

We also provide utilities beyond this recommended workflow.

### Epidemiology-style weekly visualization

- `plot_weekly_cases()`: weekly aggregation and visualization for epi data.
  - Aggregates by ISO week with user-defined retrospective windows (single range or custom start–end).
  - Generates three plot types: trend (bar+line), histogram with density, and boxplot.
  - Supports flexible aggregation functions (`sum`, `mean`, etc.) and optional plot selection.
  - Calculates and reports 95% confidence intervals for weekly means.
  - Allows saving plots to file, making it suitable for seasonality checks, outbreak monitoring, and reporting-ready outputs.

### Additional statistical testing

- `run_group_tests()`: (used above) auto-selects tests + assumptions + effect sizes.
- `run_paired_tests()`: paired or unpaired comparisons with normality checks and nonparametric fallback.
- `run_multi_group_tests()`: k-group comparisons (ANOVA / Kruskal–Wallis) with optional post-hoc (Tukey / Dunn).
- `run_count_two_group_tests()`: compares count data between two groups and automatically chooses Poisson or Negative Binomial regression **based on overdispersion**.
- `run_count_multi_group_tests()`: compares count data across ≥3 groups, automatically chooses Poisson vs Negative Binomial **based on overdispersion**, and reports an overall (ANOVA-like) p-value and, if significant, **post-hoc** pairwise results; optional effect size (McFadden pseudo-R²) and basic assumption diagnostics.

### Additional trend modeling

- `explore_continuous_trend()`: GLM-style trends for continuous outcomes (Gaussian/Gamma), with spline control.
- `explore_poisson_trend()`: GAM-style trends for count data (Poisson / Negative Binomial), with spline control.
- `explore_zinb_trend()`: zero-inflated counts (ZIP vs ZINB) with AIC/Vuong comparison.
- `explore_trend_auto()`: (used above) single-entry auto dispatcher that chooses a suitable family and fitting function.

### Time-series readiness

- `check_rate_diff_arima_ready()`: (used above) stationarity, STL seasonality, differencing and ACF diagnostics before ARIMA.

> These functions can be combined with the same data-prep pattern (wide → long, filtered groups, verified continuity). Pick what you need: quick epi weekly plots, richer hypothesis tests, or specialized trend families for counts and zero-inflated data.

# Summary

This vignette walked through a **streamlined, automated workflow** in `trendtestR` using the built-in `EuStockMarkets` dataset. We started with **data preparation and continuity checks**, moved through **cross-year comparisons**, **auto-selected statistical testing**, and **automatic trend modeling**, and optionally ran **ARIMA readiness checks** for time-series forecasting. Beyond this workflow, `trendtestR` provides **modular functions** for epidemiology-style weekly plots, overdispersion-aware count-data testing, and specialized trend models for continuous, count, or zero-inflated data.
You can adopt the full automated path for rapid insights, or **mix and match components** for deeper, more customized analyses, all while keeping a consistent data-preparation pattern and diagnostic rigor.

For detailed functionality, please refer to the help documentation of individual functions.

**Example use cases include**:

- Financial time series analysis: stock prices, market indices, trading volumes
- Economic data analysis: GDP growth, inflation rates, employment figures
- Epidemiological studies: disease incidence rates, vaccination coverage
- Environmental monitoring: temperature trends, pollution levels, rainfall patterns
- Business analytics: sales trends, customer metrics, operational KPIs
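
As a closing illustration, the largest-AIC-drop rule from section 6.2 can be tried without any `trendtestR` functions. The sketch below is only a sketch under stated assumptions: `pick_df_by_aic_drop()` is a hypothetical helper (not part of the package), the toy AIC values are invented, and the second half uses a plain `lm()` spline fit on the raw DAX series rather than `explore_continuous_trend()`.

```r
library(splines)  # ships with R; provides ns()

# Hypothetical helper (not part of trendtestR): return the candidate whose AIC
# improved most relative to the previous candidate.
pick_df_by_aic_drop <- function(df_candidates, aic_values) {
  stopifnot(length(df_candidates) == length(aic_values),
            length(df_candidates) >= 2)
  drops <- -diff(aic_values)            # positive value = AIC went down
  df_candidates[which.max(drops) + 1]   # candidate reached by the largest drop
}

# Toy AIC values, invented purely for illustration
candidates <- 4:7
toy_aic <- c(1000, 990, 950, 948)
picked_toy <- pick_df_by_aic_drop(candidates, toy_aic)
picked_toy   # 6: the 5 -> 6 step drops AIC by 40, although df = 7 has the lowest AIC

# Same rule applied to a plain lm() spline fit of the DAX series
dax <- as.numeric(EuStockMarkets[, "DAX"])
t_index <- seq_along(dax)
aic_real <- sapply(candidates, function(k) AIC(lm(dax ~ ns(t_index, df = k))))
picked_real <- pick_df_by_aic_drop(candidates, aic_real)
```

The toy case shows how this rule differs from picking the absolute minimum: it stops at the candidate where the improvement was largest, which tends to favor simpler models once further gains become marginal.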