--- title: "Getting Started with leakr" author: "Cheryl Isabella Lim" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with leakr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ```{r setup} library(leakr) ``` ## Introduction Data leakage is one of the most insidious problems in machine learning, where information from the future or target variable inadvertently influences model training. The **leakr** package provides a comprehensive toolkit for detecting common leakage patterns that can compromise model validity and reproducibility. This vignette demonstrates the basic functionality of leakr using standard datasets, showing how to identify and diagnose potential leakage issues in your machine learning workflows. ## Basic Usage: The leakr_audit() Function The primary interface for leakage detection is `leakr_audit()`, which runs multiple detectors on your dataset and generates a comprehensive report. ### Simple Example with iris Dataset ```{r basic_example} # Load the iris dataset data(iris) # Run a basic audit report <- leakr_audit(iris, target = "Species") # View the summary print(report) ``` ### Understanding the Output The audit report contains several key components: - **Summary statistics** about your dataset - **Detected issues** organised by severity level - **Recommendations** for addressing potential leakage - **Diagnostic information** for each detector ```{r examine_report} # Get a detailed summary summary_report <- leakr_summarise(report, top_n = 5, show_config = TRUE) print(summary_report) ``` ## Working with Train/Test Splits One of the most common sources of leakage occurs when information from the test set influences training. 
Let's create a more realistic example:

```{r train_test_example}
# Create a dataset with potential train/test leakage
set.seed(123)
n <- 1000

# Simulate a dataset
data <- data.frame(
  feature1 = rnorm(n),
  feature2 = rnorm(n),
  feature3 = rnorm(n),
  target = factor(sample(c("A", "B"), n, replace = TRUE))
)

# Create a train/test split
train_indices <- sample(1:n, 0.7 * n)
split_vector <- rep("test", n)
split_vector[train_indices] <- "train"

# Run audit with split information
report_with_split <- leakr_audit(
  data = data,
  target = "target",
  split = split_vector
)

print(report_with_split)
```

## Detecting Specific Leakage Patterns

### Target Leakage Detection

Target leakage occurs when features contain information that would not be available at prediction time:

```{r target_leakage}
# Create data with obvious target leakage
leaky_data <- data.frame(
  legitimate_feature = rnorm(100),
  target = factor(sample(c("yes", "no"), 100, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Add a leaky feature (perfectly correlated with the target)
leaky_data$leaky_feature <- ifelse(leaky_data$target == "yes", 1, 0)

# Audit for target leakage
leakage_report <- leakr_audit(leaky_data, target = "target")
print(leakage_report)
```

### Duplication Detection

Duplicate records, particularly those shared between training and test data, can lead to optimistic performance estimates:

```{r duplication}
# Create data with duplicates
original_data <- mtcars[1:20, ]
duplicated_data <- rbind(original_data, original_data[1:5, ])

# Add row identifiers
duplicated_data$id <- 1:nrow(duplicated_data)

# Run duplication audit
dup_report <- leakr_audit(
  data = duplicated_data,
  target = "mpg",
  id = "id"
)

print(dup_report)
```

## Configuration and Customisation

The `leakr_audit()` function accepts various configuration options to customise the detection process:

```{r configuration}
# Custom configuration
custom_config <- list(
  sample_size = 10000,          # Limit analysis to 10k rows for large datasets
  correlation_threshold = 0.9,  # Sensitivity of correlation-based detectors
  duplicate_threshold = 0.95    # Threshold for near-duplicate detection
)

# Run audit with custom configuration
configured_report <- leakr_audit(
  data = iris,
  target = "Species",
  config = custom_config
)

print(configured_report)
```

## Visualising Results

Generate diagnostic plots to better understand detected issues:

```{r visualisation, eval=FALSE}
# Generate diagnostic plots
plots <- generate_diagnostic_plots(report)

# Display plots (if available)
if (!is.null(plots)) {
  plot(plots)
}
```

## Working with Large Datasets

For large datasets, leakr automatically samples rows to keep the audit fast while preserving detection accuracy:

```{r large_dataset_simulation}
# Simulate a large dataset
set.seed(42)
large_n <- 50000

large_data <- data.frame(
  feature1 = rnorm(large_n),
  feature2 = rnorm(large_n),
  feature3 = sample(letters[1:5], large_n, replace = TRUE),
  target = factor(sample(c("positive", "negative"), large_n, replace = TRUE))
)

# leakr will automatically sample this dataset
large_report <- leakr_audit(large_data, target = "target")
print(large_report)
```
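If you would rather set the row budget yourself than rely on the automatic behaviour, the `sample_size` option from the configuration section above can be reused here. A minimal sketch, assuming `sample_size` behaves as in that earlier example:

```{r explicit_sampling, eval=FALSE}
# Cap the audit at 5,000 rows using the sample_size option
# shown in the configuration section above
capped_report <- leakr_audit(
  large_data,
  target = "target",
  config = list(sample_size = 5000)
)
print(capped_report)
```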
## Next Steps

This vignette covered the basics of using leakr for data leakage detection. For more advanced usage, including:

- Integration with popular ML frameworks (caret, mlr3, tidymodels)
- Custom detector development
- Advanced configuration options
- Handling specific data types and domains

see the other vignettes in this package:

- **Advanced Leakage Detection**: Deep dive into specific detector types and customisation
- **Framework Integration**: Using leakr with caret, mlr3, and tidymodels workflows

## Summary

The leakr package provides a systematic approach to detecting data leakage in machine learning workflows. Key takeaways:

1. Use `leakr_audit()` as your primary entry point for leakage detection
2. Always specify your target variable and train/test splits when available
3. Review the generated reports carefully and follow the recommendations
4. Configure detection thresholds based on your specific use case
5. Integrate leakage detection early in your ML pipeline to catch issues before they impact model performance (see the sketch after this list)

Regular use of leakr can help ensure the integrity and reproducibility of your machine learning models.
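As a concrete illustration of the last takeaway, here is a minimal sketch of running an audit as a gating step before model fitting. Only `leakr_audit()` and `leakr_summarise()` from this vignette are used; `fit_model()` is a hypothetical placeholder for your own training step:

```{r pipeline_gate, eval=FALSE}
# Audit the assembled training data before fitting anything
audit <- leakr_audit(data, target = "target", split = split_vector)
print(leakr_summarise(audit, top_n = 5))

# Review the summary and address any reported issues before
# proceeding to the (placeholder) model-fitting step
# model <- fit_model(data[split_vector == "train", ])
```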