{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "10fed78c",
   "metadata": {},
   "source": [
    "# 🏃 Anomaly Detection in Sports: Athlete Performance & Injury Risk Monitoring\n",
    "\n",
    "## Course: DATA 110 — Introduction to Data Science\n",
    "### School of Data Science and Society, UNC Chapel Hill\n",
    "\n",
    "---\n",
    "\n",
    "**Scenario:** You are a sports data analyst for a collegiate athletics program. Coaches rely on you to monitor 20 athletes across 5 sports over a 30-day training period. Your goal is to detect:\n",
    "\n",
    "- **Overtraining** — athletes pushing too hard, risking burnout\n",
    "- **Injury risk** — declining performance + poor recovery signals  \n",
    "- **Peak performance** — unusually high output (positive anomaly!)\n",
    "\n",
    "You will use both **supervised** and **unsupervised** anomaly detection to flag athletes who need attention.\n",
    "\n",
    "---\n",
    "\n",
    "### 📋 What You'll Learn\n",
    "- Exploring sports performance data with visualizations\n",
    "- **Supervised:** Random Forest to classify known anomaly patterns\n",
    "- **Unsupervised:** Isolation Forest and Local Outlier Factor (LOF) to discover anomalies without labels\n",
    "- Comparing methods and understanding trade-offs\n",
    "- Real-world applications in sports analytics\n",
    "\n",
    "### 📦 Dataset\n",
    "- **600 daily records** (20 athletes × 30 days)\n",
    "- 8 performance/wellness metrics per record\n",
    "- **540 normal** records and **60 anomalies** (overtraining, injury risk, peak performance)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84fa77fd",
   "metadata": {},
   "source": [
    "## Step 1: Setup — Import Libraries and Load Data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c91fef6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# IMPORT LIBRARIES\n",
    "# ============================================================\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Machine Learning\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.ensemble import RandomForestClassifier, IsolationForest\n",
    "from sklearn.neighbors import LocalOutlierFactor\n",
    "from sklearn.metrics import (classification_report, confusion_matrix,\n",
    "                             accuracy_score, ConfusionMatrixDisplay,\n",
    "                             precision_score, recall_score, f1_score)\n",
    "\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "sns.set_palette(\"husl\")\n",
    "plt.rcParams['figure.figsize'] = (10, 6)\n",
    "plt.rcParams['font.size'] = 12\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print(\"✅ All libraries loaded successfully!\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c64331ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# LOAD THE DATASET\n",
    "# ============================================================\n",
    "# Update the path to wherever you saved the Excel file\n",
    "df = pd.read_excel('sports_athlete_anomaly_data.xlsx', sheet_name='Athlete_Performance')\n",
    "\n",
    "print(f\"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns\")\n",
    "print(f\"\\nFirst 5 rows:\")\n",
    "df.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99d75ac7",
   "metadata": {},
   "source": [
    "## Step 2: Exploratory Data Analysis (EDA)\n",
    "\n",
    "### 2.1 Understand the Data Structure\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abfe5490",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# DATASET OVERVIEW\n",
    "# ============================================================\n",
    "print(\"=\" * 60)\n",
    "print(\"DATASET OVERVIEW\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "print(f\"\\nTotal records: {len(df)}\")\n",
    "print(f\"Unique athletes: {df['Athlete_ID'].nunique()}\")\n",
    "print(f\"Sports: {', '.join(df['Sport'].unique())}\")\n",
    "print(f\"Date range: {df['Date'].min().date()} to {df['Date'].max().date()}\")\n",
    "\n",
    "print(f\"\\nLabel distribution:\")\n",
    "print(df['Anomaly_Label'].value_counts())\n",
    "print(f\"\\nAnomaly percentage: {(df['Anomaly_Label'] != 'Normal').mean() * 100:.1f}%\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64f85baa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our feature columns (the performance metrics)\n",
    "feature_cols = ['Training_Load_AU', 'Sprint_Speed_km_h', 'Resting_Heart_Rate_bpm',\n",
    "                'Recovery_Heart_Rate_bpm', 'Sleep_Hours', 'Hydration_Level_pct',\n",
    "                'Perceived_Exertion_1_10', 'Performance_Score_0_100']\n",
    "\n",
    "df[feature_cols].describe().round(2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1182f0f",
   "metadata": {},
   "source": [
    "### 2.2 Visualize Performance Metrics\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a367c153",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# DISTRIBUTION PLOTS: Compare Normal vs Anomaly types\n",
    "# ============================================================\n",
    "fig, axes = plt.subplots(2, 4, figsize=(22, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for i, col in enumerate(feature_cols):\n",
    "    for label in df['Anomaly_Label'].unique():\n",
    "        subset = df[df['Anomaly_Label'] == label][col]\n",
    "        axes[i].hist(subset, bins=20, alpha=0.5, label=label, density=True)\n",
    "    axes[i].set_title(col.replace('_', ' '), fontsize=10, fontweight='bold')\n",
    "    axes[i].set_ylabel('Density')\n",
    "\n",
    "# Single legend for all subplots\n",
    "handles, labels = axes[0].get_legend_handles_labels()\n",
    "fig.legend(handles, labels, loc='lower center', ncol=4, fontsize=10, \n",
    "           bbox_to_anchor=(0.5, -0.02))\n",
    "plt.suptitle('Performance Metric Distributions by Status', fontsize=16, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 OBSERVE: How do overtraining and injury risk patterns differ from normal?\")\n",
    "print(\"   Notice that peak performance anomalies are POSITIVE outliers!\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ce3c47e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# SCATTER: Training Load vs Performance Score\n",
    "# ============================================================\n",
    "plt.figure(figsize=(10, 7))\n",
    "\n",
    "colors = {'Normal': '#4CAF50', 'Anomaly - Overtraining': '#F44336',\n",
    "          'Anomaly - Injury Risk': '#FF9800', 'Anomaly - Peak Performance': '#2196F3'}\n",
    "\n",
    "for label, color in colors.items():\n",
    "    mask = df['Anomaly_Label'] == label\n",
    "    plt.scatter(df.loc[mask, 'Training_Load_AU'],\n",
    "                df.loc[mask, 'Performance_Score_0_100'],\n",
    "                c=color, label=label, alpha=0.6, s=60, edgecolors='white')\n",
    "\n",
    "plt.xlabel('Training Load (AU)', fontsize=12)\n",
    "plt.ylabel('Performance Score (0-100)', fontsize=12)\n",
    "plt.title('Training Load vs. Performance Score', fontsize=14, fontweight='bold')\n",
    "plt.legend(fontsize=10, loc='upper left')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 KEY INSIGHT:\")\n",
    "print(\"   - Overtraining: HIGH load, LOW performance (burning out)\")\n",
    "print(\"   - Injury Risk:  HIGH load, LOW performance + poor sprint speed\")\n",
    "print(\"   - Peak Performance: MODERATE load, VERY HIGH performance (in the zone!)\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4744154",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# RADAR CHART: Average profile for each group\n",
    "# ============================================================\n",
    "from math import pi\n",
    "\n",
    "# Calculate mean for each group\n",
    "groups = df.groupby('Anomaly_Label')[feature_cols].mean()\n",
    "\n",
    "# Normalize to 0-1 scale for radar chart\n",
    "groups_norm = (groups - groups.min()) / (groups.max() - groups.min())\n",
    "\n",
    "categories = [c.replace('_', '\\n') for c in feature_cols]\n",
    "N = len(categories)\n",
    "angles = [n / float(N) * 2 * pi for n in range(N)]\n",
    "angles += angles[:1]\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))\n",
    "\n",
    "for label, color in colors.items():\n",
    "    values = groups_norm.loc[label].values.tolist()\n",
    "    values += values[:1]\n",
    "    ax.plot(angles, values, 'o-', linewidth=2, label=label, color=color)\n",
    "    ax.fill(angles, values, alpha=0.1, color=color)\n",
    "\n",
    "ax.set_xticks(angles[:-1])\n",
    "ax.set_xticklabels(categories, fontsize=9)\n",
    "ax.set_ylim(0, 1)\n",
    "ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1), fontsize=10)\n",
    "plt.title('Average Profile by Group (Normalized)', fontsize=14, fontweight='bold', y=1.08)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 The radar chart shows each group's 'fingerprint' across all metrics.\")\n",
    "print(\"   Different anomaly types have distinctly different shapes!\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29e6e62d",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 3: SUPERVISED Anomaly Detection — Random Forest\n",
    "\n",
    "Since we have labeled data, we can train a classifier. We'll start with a **binary** model (normal vs. anomaly), then train a **multi-class** model that distinguishes the three anomaly types.\n",
    "\n",
    "### 3.1 Prepare Data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b7b17ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# PREPARE DATA\n",
    "# ============================================================\n",
    "X = df[feature_cols].copy()\n",
    "\n",
    "# Binary labels for overall anomaly detection\n",
    "y_binary = (df['Anomaly_Label'] != 'Normal').astype(int)\n",
    "\n",
    "# Multi-class labels for specific anomaly type detection\n",
    "y_multi = df['Anomaly_Label'].copy()\n",
    "\n",
    "# Split data. Stratify on the multi-class label so every anomaly type\n",
    "# appears in both the training and test sets (this also keeps the\n",
    "# normal/anomaly ratio the same in both).\n",
    "X_train, X_test, yb_train, yb_test, ym_train, ym_test = train_test_split(\n",
    "    X, y_binary, y_multi, test_size=0.30, random_state=42, stratify=y_multi\n",
    ")\n",
    "\n",
    "# Scale features\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "print(f\"Training set: {X_train.shape[0]} records\")\n",
    "print(f\"Test set:     {X_test.shape[0]} records\")\n"
   ]
  },
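  {
   "cell_type": "markdown",
   "id": "sketch-pipeline-md",
   "metadata": {},
   "source": [
    "**Optional sketch: keeping the scaler tied to the training rows.** The cell above fits the scaler on `X_train` only, which avoids leaking test-set statistics. One way to make that hard to get wrong is to bundle the scaler and the classifier in a scikit-learn `Pipeline`. The example below is a minimal illustration on synthetic data from `make_classification`, not the athlete dataset; every `demo_` name is an assumption made for this sketch, chosen so nothing defined earlier is overwritten.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-pipeline-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: synthetic data, illustrative demo_ names (assumptions).\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Imbalanced toy problem: roughly 90% class 0 and 10% class 1\n",
    "demo_X, demo_y = make_classification(\n",
    "    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42\n",
    ")\n",
    "demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(\n",
    "    demo_X, demo_y, test_size=0.30, random_state=42, stratify=demo_y\n",
    ")\n",
    "\n",
    "# The pipeline fits the scaler and the forest together on the training rows\n",
    "demo_pipe = Pipeline([\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('rf', RandomForestClassifier(n_estimators=100, random_state=42,\n",
    "                                  class_weight='balanced')),\n",
    "])\n",
    "demo_pipe.fit(demo_X_train, demo_y_train)\n",
    "\n",
    "# At prediction time the stored training statistics are reused automatically\n",
    "print(f\"Held-out accuracy: {demo_pipe.score(demo_X_test, demo_y_test):.2f}\")\n"
   ]
  },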
  {
   "cell_type": "markdown",
   "id": "3c3c8552",
   "metadata": {},
   "source": [
    "### 3.2 Train and Evaluate — Binary Classification (Normal vs Anomaly)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f017e72",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# BINARY RANDOM FOREST: Normal vs Anomaly\n",
    "# ============================================================\n",
    "rf_binary = RandomForestClassifier(\n",
    "    n_estimators=100, max_depth=10, random_state=42, class_weight='balanced'\n",
    ")\n",
    "rf_binary.fit(X_train_scaled, yb_train)\n",
    "yb_pred = rf_binary.predict(X_test_scaled)\n",
    "\n",
    "print(\"BINARY CLASSIFICATION: Normal vs. Anomaly\")\n",
    "print(\"=\" * 50)\n",
    "print(classification_report(yb_test, yb_pred, target_names=['Normal', 'Anomaly']))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "ConfusionMatrixDisplay(confusion_matrix(yb_test, yb_pred),\n",
    "                       display_labels=['Normal', 'Anomaly']).plot(cmap='Blues', ax=ax)\n",
    "plt.title('Binary Classification — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7987354c",
   "metadata": {},
   "source": [
    "### 3.3 Train and Evaluate — Multi-Class (Specific Anomaly Types)\n",
    "\n",
    "Can the model tell us **what kind** of anomaly it is?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a613a374",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# MULTI-CLASS RANDOM FOREST: Predict specific anomaly type\n",
    "# ============================================================\n",
    "rf_multi = RandomForestClassifier(\n",
    "    n_estimators=100, max_depth=12, random_state=42, class_weight='balanced'\n",
    ")\n",
    "rf_multi.fit(X_train_scaled, ym_train)\n",
    "ym_pred = rf_multi.predict(X_test_scaled)\n",
    "\n",
    "print(\"MULTI-CLASS CLASSIFICATION: Specific Anomaly Types\")\n",
    "print(\"=\" * 60)\n",
    "print(classification_report(ym_test, ym_pred))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 8))\n",
    "cm_multi = confusion_matrix(ym_test, ym_pred, labels=rf_multi.classes_)\n",
    "disp = ConfusionMatrixDisplay(cm_multi, display_labels=rf_multi.classes_)\n",
    "disp.plot(cmap='Blues', ax=ax, xticks_rotation=30)\n",
    "plt.title('Multi-Class Classification — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 The multi-class model can identify the TYPE of anomaly,\")\n",
    "print(\"   which helps coaches take the right corrective action!\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c893e68",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# FEATURE IMPORTANCE\n",
    "# ============================================================\n",
    "importance = pd.Series(rf_binary.feature_importances_, index=feature_cols)\n",
    "importance = importance.sort_values(ascending=True)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "colors_bar = ['#4CAF50' if v < importance.median() else '#1976D2' for v in importance.values]\n",
    "importance.plot(kind='barh', color=colors_bar, edgecolor='white')\n",
    "plt.xlabel('Importance Score', fontsize=12)\n",
    "plt.title('Which Metrics Are Most Important for Detecting Anomalies?', \n",
    "          fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"\\n💡 Top 3 most important features: {', '.join(importance.index[-3:][::-1])}\")\n",
    "print(\"   Coaches should pay closest attention to these metrics!\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5bcbdd0",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 4: UNSUPERVISED Anomaly Detection\n",
    "\n",
    "Now let's **pretend we don't have labels**. Can we still find anomalies?\n",
    "\n",
    "We'll use:\n",
    "1. **Isolation Forest** — isolates anomalies as points that are easy to separate\n",
    "2. **Local Outlier Factor (LOF)** — compares each point's density to its neighbors\n",
    "\n",
    "---\n",
    "\n",
    "### 4.1 Isolation Forest\n"
   ]
  },
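  {
   "cell_type": "markdown",
   "id": "sketch-isoforest-md",
   "metadata": {},
   "source": [
    "**Toy illustration (not the athlete data):** To make \"easy to separate\" concrete, the sketch below builds a dense cloud of made-up 2-D points plus one distant point, then compares their Isolation Forest scores. `score_samples` returns lower values for points that random splits isolate in fewer steps. All `toy_` names and values are assumptions for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-isoforest-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: made-up points, illustrative toy_ names (assumptions).\n",
    "import numpy as np\n",
    "from sklearn.ensemble import IsolationForest\n",
    "\n",
    "toy_rng = np.random.RandomState(0)\n",
    "toy_cluster = toy_rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # dense cloud\n",
    "toy_far = np.array([[8.0, 8.0]])                                   # one isolated point\n",
    "toy_points = np.vstack([toy_cluster, toy_far])\n",
    "\n",
    "toy_iso = IsolationForest(n_estimators=100, random_state=0).fit(toy_points)\n",
    "toy_scores = toy_iso.score_samples(toy_points)   # lower score = more anomalous\n",
    "\n",
    "print(f\"Mean score of cluster points: {toy_scores[:-1].mean():.3f}\")\n",
    "print(f\"Score of the isolated point:  {toy_scores[-1]:.3f}\")\n"
   ]
  },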
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6829697a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# ISOLATION FOREST\n",
    "# ============================================================\n",
    "# Use a separate scaler here so the train-fitted `scaler` from Step 3 is not overwritten\n",
    "X_all_scaled = StandardScaler().fit_transform(X)\n",
    "\n",
    "iso_forest = IsolationForest(\n",
    "    n_estimators=100,\n",
    "    contamination=0.10,   # Expect ~10% anomalies\n",
    "    random_state=42\n",
    ")\n",
    "\n",
    "iso_predictions = iso_forest.fit_predict(X_all_scaled)\n",
    "iso_labels = (iso_predictions == -1).astype(int)\n",
    "\n",
    "print(\"ISOLATION FOREST RESULTS\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Predicted Normal:  {(iso_labels == 0).sum()}\")\n",
    "print(f\"Predicted Anomaly: {(iso_labels == 1).sum()}\")\n",
    "print()\n",
    "print(classification_report(y_binary, iso_labels, target_names=['Normal', 'Anomaly']))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "ConfusionMatrixDisplay(confusion_matrix(y_binary, iso_labels),\n",
    "                       display_labels=['Normal', 'Anomaly']).plot(cmap='Oranges', ax=ax)\n",
    "plt.title('Isolation Forest (Unsupervised) — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
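  {
   "cell_type": "markdown",
   "id": "sketch-contamination-md",
   "metadata": {},
   "source": [
    "**Optional sketch: what `contamination` controls.** The `contamination=0.10` setting above is a guess about the share of anomalies, and it sets the score threshold that decides how many records get flagged. The sketch below sweeps a few values on synthetic data; the `sweep_` names, the cluster positions, and the values tried are assumptions for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-contamination-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: synthetic data, illustrative sweep_ names (assumptions).\n",
    "import numpy as np\n",
    "from sklearn.ensemble import IsolationForest\n",
    "\n",
    "sweep_rng = np.random.RandomState(1)\n",
    "sweep_normal = sweep_rng.normal(0.0, 1.0, size=(540, 3))  # bulk of the data\n",
    "sweep_odd = sweep_rng.normal(5.0, 1.0, size=(60, 3))      # shifted minority\n",
    "sweep_X = np.vstack([sweep_normal, sweep_odd])\n",
    "\n",
    "# Refit with several contamination guesses and count the flagged records\n",
    "for sweep_c in [0.02, 0.10, 0.25]:\n",
    "    sweep_model = IsolationForest(n_estimators=100, contamination=sweep_c,\n",
    "                                  random_state=1)\n",
    "    sweep_flags = sweep_model.fit_predict(sweep_X) == -1\n",
    "    print(f\"contamination={sweep_c:.2f} -> {sweep_flags.sum()} of {len(sweep_X)} records flagged\")\n"
   ]
  },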
  {
   "cell_type": "markdown",
   "id": "dd051830",
   "metadata": {},
   "source": [
    "### 4.2 Local Outlier Factor (LOF)\n",
    "\n",
    "**How it works:** LOF compares the local density around each data point with the density around its nearest neighbors. A point whose density is much lower than that of its neighbors is considered an outlier.\n",
    "\n",
    "**Sports analogy:** If most athletes at a similar training load have similar performance scores, but one athlete's performance is dramatically different, LOF will flag that athlete.\n"
   ]
  },
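  {
   "cell_type": "markdown",
   "id": "sketch-lof-md",
   "metadata": {},
   "source": [
    "**Toy illustration (not the athlete data):** The density comparison can be sketched on made-up points: a tight cluster plus one point in a sparse region. After fitting, `negative_outlier_factor_` holds the negated LOF score. Values near -1 mean a point is about as dense as its neighbors; much more negative values indicate an outlier. All `lofdemo_` names and values are assumptions for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-lof-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: made-up points, illustrative lofdemo_ names (assumptions).\n",
    "import numpy as np\n",
    "from sklearn.neighbors import LocalOutlierFactor\n",
    "\n",
    "lofdemo_rng = np.random.RandomState(0)\n",
    "lofdemo_cluster = lofdemo_rng.normal(0.0, 0.5, size=(100, 2))  # tightly packed points\n",
    "lofdemo_far = np.array([[4.0, 4.0]])                            # point in a sparse region\n",
    "lofdemo_X = np.vstack([lofdemo_cluster, lofdemo_far])\n",
    "\n",
    "lofdemo_model = LocalOutlierFactor(n_neighbors=20)\n",
    "lofdemo_pred = lofdemo_model.fit_predict(lofdemo_X)       # -1 = outlier, 1 = inlier\n",
    "lofdemo_scores = lofdemo_model.negative_outlier_factor_   # negated LOF per point\n",
    "\n",
    "print(f\"Median score in the cluster: {np.median(lofdemo_scores[:-1]):.2f}\")\n",
    "print(f\"Score of the distant point:  {lofdemo_scores[-1]:.2f}\")\n",
    "print(f\"Distant point flagged: {lofdemo_pred[-1] == -1}\")\n"
   ]
  },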
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e774bb29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# LOCAL OUTLIER FACTOR\n",
    "# ============================================================\n",
    "lof = LocalOutlierFactor(\n",
    "    n_neighbors=20,        # Number of neighbors to consider\n",
    "    contamination=0.10     # Expected proportion of outliers\n",
    ")\n",
    "\n",
    "lof_predictions = lof.fit_predict(X_all_scaled)\n",
    "lof_labels = (lof_predictions == -1).astype(int)\n",
    "\n",
    "print(\"LOCAL OUTLIER FACTOR RESULTS\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Predicted Normal:  {(lof_labels == 0).sum()}\")\n",
    "print(f\"Predicted Anomaly: {(lof_labels == 1).sum()}\")\n",
    "print()\n",
    "print(classification_report(y_binary, lof_labels, target_names=['Normal', 'Anomaly']))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "ConfusionMatrixDisplay(confusion_matrix(y_binary, lof_labels),\n",
    "                       display_labels=['Normal', 'Anomaly']).plot(cmap='Purples', ax=ax)\n",
    "plt.title('Local Outlier Factor (Unsupervised) — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03a4b774",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 5: Compare All Three Methods\n"
   ]
  },
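  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36e31ca5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# SIDE-BY-SIDE COMPARISON\n",
    "# ============================================================\n",
    "# Scale with statistics from the training rows only, the scale rf_binary was trained on.\n",
    "# Note: X includes the rows the Random Forest was trained on, so its scores here are optimistic.\n",
    "rf_full_pred = rf_binary.predict(StandardScaler().fit(X_train).transform(X))\n",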
    "\n",
    "methods = ['Random Forest\\n(Supervised)', 'Isolation Forest\\n(Unsupervised)', 'LOF\\n(Unsupervised)']\n",
    "predictions = [rf_full_pred, iso_labels, lof_labels]\n",
    "\n",
    "metrics = {'Accuracy': [], 'Precision': [], 'Recall': [], 'F1 Score': []}\n",
    "for pred in predictions:\n",
    "    metrics['Accuracy'].append(accuracy_score(y_binary, pred))\n",
    "    metrics['Precision'].append(precision_score(y_binary, pred, zero_division=0))\n",
    "    metrics['Recall'].append(recall_score(y_binary, pred, zero_division=0))\n",
    "    metrics['F1 Score'].append(f1_score(y_binary, pred, zero_division=0))\n",
    "\n",
    "fig, axes = plt.subplots(1, 4, figsize=(20, 5))\n",
    "bar_colors = ['#1976D2', '#F57C00', '#7B1FA2']\n",
    "\n",
    "for i, (metric_name, values) in enumerate(metrics.items()):\n",
    "    bars = axes[i].bar(methods, values, color=bar_colors, edgecolor='white', linewidth=1.5)\n",
    "    axes[i].set_title(metric_name, fontsize=14, fontweight='bold')\n",
    "    axes[i].set_ylim(0, 1.05)\n",
    "    axes[i].axhline(y=0.5, color='gray', linestyle='--', alpha=0.3)\n",
    "    for bar, val in zip(bars, values):\n",
    "        axes[i].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,\n",
    "                     f'{val:.2f}', ha='center', fontsize=12, fontweight='bold')\n",
    "\n",
    "plt.suptitle('Model Comparison: Supervised vs. Unsupervised Anomaly Detection',\n",
    "             fontsize=16, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a55af6e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# BONUS: Which specific anomaly types did each method catch?\n",
    "# ============================================================\n",
    "results_df = df[['Athlete_Name', 'Sport', 'Anomaly_Label']].copy()\n",
    "results_df['RF_Detected'] = rf_full_pred\n",
    "results_df['IsoForest_Detected'] = iso_labels\n",
    "results_df['LOF_Detected'] = lof_labels\n",
    "\n",
    "# Look at actual anomalies — how many did each method catch?\n",
    "actual_anomalies = results_df[results_df['Anomaly_Label'] != 'Normal'].copy()\n",
    "\n",
    "print(\"DETECTION RATE BY ANOMALY TYPE\")\n",
    "print(\"=\" * 70)\n",
    "for anomaly_type in actual_anomalies['Anomaly_Label'].unique():\n",
    "    subset = actual_anomalies[actual_anomalies['Anomaly_Label'] == anomaly_type]\n",
    "    n = len(subset)\n",
    "    rf_caught = subset['RF_Detected'].sum()\n",
    "    iso_caught = subset['IsoForest_Detected'].sum()\n",
    "    lof_caught = subset['LOF_Detected'].sum()\n",
    "    print(f\"\\n{anomaly_type} ({n} cases):\")\n",
    "    print(f\"  Random Forest:    {rf_caught}/{n} caught ({rf_caught/n*100:.0f}%)\")\n",
    "    print(f\"  Isolation Forest: {iso_caught}/{n} caught ({iso_caught/n*100:.0f}%)\")\n",
    "    print(f\"  LOF:              {lof_caught}/{n} caught ({lof_caught/n*100:.0f}%)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "492a855b",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 6: Key Takeaways\n",
    "\n",
    "### 📊 Method Comparison Summary\n",
    "\n",
    "| Method | Type | Strengths | Weaknesses | Best For |\n",
    "|--------|------|-----------|------------|----------|\n",
    "| **Random Forest** | Supervised | Usually the most accurate when reliable labels exist; identifies anomaly *type* | Needs labeled data; can't find new anomaly types | Known patterns (overtraining protocols) |\n",
    "| **Isolation Forest** | Unsupervised | No labels needed; fast; good with high-dimensional data | Can't tell you *why* something is anomalous | General monitoring; new datasets |\n",
    "| **LOF** | Unsupervised | Detects local outliers; good for clustered data | Sensitive to parameters; slower on large data | Comparing athletes within groups |\n",
    "\n",
    "### 🏃 Real-World Sports Applications\n",
    "- **Pre-season screening:** Use unsupervised methods on baseline data to establish each athlete's \"normal\" profile\n",
    "- **In-season monitoring:** Use supervised models trained on historical injury data to flag at-risk athletes daily  \n",
    "- **Recovery tracking:** Detect when recovery patterns deviate from an athlete's personal baseline\n",
    "- **Talent scouting:** Peak performance anomalies can identify standout performers\n",
    "\n",
    "### 🤔 Discussion Questions\n",
    "1. An athlete shows as \"anomalous\" but they feel fine. Should the coach still intervene? Why or why not?\n",
    "2. How might the **contamination** parameter in Isolation Forest change results? What if we set it too high or too low?\n",
    "3. In sports, some anomalies are *good* (peak performance). How should we handle positive vs. negative anomalies differently?\n",
    "4. If you could add one more metric to the dataset, what would it be and why?\n",
    "\n",
    "---\n",
    "*Notebook created for DATA 110 — Introduction to Data Science*  \n",
    "*School of Data Science and Society, UNC Chapel Hill*\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}