{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "063cd431",
   "metadata": {},
   "source": [
    "# 🏥 Anomaly Detection in Healthcare: Patient Vital Signs Monitoring\n",
    "\n",
    "## Course: DATA 110 — Introduction to Data Science\n",
    "### School of Data Science and Society, UNC Chapel Hill\n",
    "\n",
    "---\n",
    "\n",
    "**Scenario:** You are a data analyst at a hospital. Your team monitors patient vital signs (heart rate, blood pressure, oxygen levels, etc.) to detect early warning signs of critical events like **sepsis**, **cardiac emergencies**, and **hypothermia**. \n",
    "\n",
    "In this notebook, you will learn to build **anomaly detection models** using both:\n",
    "1. **Supervised methods** — when we have labeled data telling us what is \"normal\" vs. \"anomalous\"\n",
    "2. **Unsupervised methods** — when we have NO labels and must discover anomalies purely from data patterns\n",
    "\n",
    "**Why does this matter?** Early detection of abnormal vital signs can save lives. Hospitals increasingly use AI/ML to flag patients who need immediate attention.\n",
    "\n",
    "---\n",
    "\n",
    "### 📋 What You'll Learn\n",
    "- How to explore and visualize health data\n",
    "- **Supervised:** Random Forest classifier to predict anomaly types\n",
    "- **Unsupervised:** Isolation Forest and DBSCAN to discover anomalies without labels\n",
    "- How to evaluate and compare both approaches\n",
    "- When to use supervised vs. unsupervised anomaly detection\n",
    "\n",
    "### 📦 Dataset\n",
    "- **500 patient records** with 8 measurements (six vital signs plus two lab values: white blood cell count and glucose)\n",
    "- **440 normal** patients and **60 anomalies** (sepsis, cardiac, hypothermia)\n",
    "- See the `Data_Dictionary` sheet in the Excel file for column details\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3c7b473",
   "metadata": {},
   "source": [
    "## Step 1: Setup — Import Libraries and Load Data\n",
    "\n",
    "Let's start by importing the Python libraries we'll need and loading our healthcare dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc30a43d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# IMPORT LIBRARIES\n",
    "# ============================================================\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Machine Learning libraries\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler, LabelEncoder\n",
    "from sklearn.ensemble import RandomForestClassifier, IsolationForest\n",
    "from sklearn.cluster import DBSCAN\n",
    "from sklearn.metrics import (classification_report, confusion_matrix, \n",
    "                             accuracy_score, ConfusionMatrixDisplay)\n",
    "\n",
    "# Make plots look nice\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "sns.set_palette(\"husl\")\n",
    "plt.rcParams['figure.figsize'] = (10, 6)\n",
    "plt.rcParams['font.size'] = 12\n",
    "\n",
    "# Suppress warnings for cleaner output\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print(\"✅ All libraries loaded successfully!\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d1dc57d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# LOAD THE DATASET\n",
    "# ============================================================\n",
    "# Update the path below to wherever you saved the Excel file\n",
    "df = pd.read_excel('healthcare_vitals_anomaly_data.xlsx', sheet_name='Patient_Vitals')\n",
    "\n",
    "# First look at the data\n",
    "print(f\"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns\")\n",
    "print(f\"\\nFirst 5 rows:\")\n",
    "df.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1216664",
   "metadata": {},
   "source": [
    "## Step 2: Exploratory Data Analysis (EDA)\n",
    "\n",
    "Before building any model, we need to **understand our data**. This is the most important step!\n",
    "\n",
    "### 2.1 Basic Statistics\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78cd7fd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# BASIC STATISTICS\n",
    "# ============================================================\n",
    "print(\"=\" * 60)\n",
    "print(\"DATASET OVERVIEW\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "print(f\"\\nTotal patients: {len(df)}\")\n",
    "print(f\"\\nLabel distribution:\")\n",
    "print(df['Anomaly_Label'].value_counts())\n",
    "print(f\"\\nAnomaly percentage: {(df['Anomaly_Label'] != 'Normal').mean() * 100:.1f}%\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d28980e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for vital signs\n",
    "vital_cols = ['Heart_Rate_bpm', 'Systolic_BP_mmHg', 'Diastolic_BP_mmHg',\n",
    "              'Oxygen_Saturation_pct', 'Temperature_F', 'Respiratory_Rate_bpm',\n",
    "              'White_Blood_Cells_K', 'Glucose_mg_dL']\n",
    "\n",
    "df[vital_cols].describe().round(2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6665f555",
   "metadata": {},
   "source": [
    "### 2.2 Visualize the Data\n",
    "\n",
    "Let's see how normal vs. anomalous patients differ across vital signs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d84e566d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# BOX PLOTS: Normal vs Anomaly across vital signs\n",
    "# ============================================================\n",
    "fig, axes = plt.subplots(2, 4, figsize=(20, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for i, col in enumerate(vital_cols):\n",
    "    df.boxplot(column=col, by='Anomaly_Label', ax=axes[i], rot=45)\n",
    "    axes[i].set_title(col.replace('_', ' '), fontsize=11, fontweight='bold')\n",
    "    axes[i].set_xlabel('')\n",
    "\n",
    "plt.suptitle('Vital Signs Distribution by Patient Status', fontsize=16, fontweight='bold', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 OBSERVE: Which vital signs show the biggest differences between Normal and Anomaly groups?\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71268dc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# CORRELATION HEATMAP\n",
    "# ============================================================\n",
    "plt.figure(figsize=(10, 8))\n",
    "corr = df[vital_cols].corr()\n",
    "sns.heatmap(corr, annot=True, cmap='RdBu_r', center=0, fmt='.2f',\n",
    "            square=True, linewidths=0.5)\n",
    "plt.title('Correlation Between Vital Signs', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 OBSERVE: Which vital signs are correlated with each other?\")\n",
    "print(\"   Strong correlations might indicate related physiological responses.\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c3ce1c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# SCATTER PLOT: Heart Rate vs. Oxygen Saturation (colored by label)\n",
    "# ============================================================\n",
    "plt.figure(figsize=(10, 7))\n",
    "\n",
    "colors = {'Normal': '#2196F3', 'Anomaly - Sepsis': '#F44336', \n",
    "          'Anomaly - Cardiac': '#FF9800', 'Anomaly - Hypothermia': '#9C27B0'}\n",
    "\n",
    "for label, color in colors.items():\n",
    "    mask = df['Anomaly_Label'] == label\n",
    "    plt.scatter(df.loc[mask, 'Heart_Rate_bpm'], \n",
    "                df.loc[mask, 'Oxygen_Saturation_pct'],\n",
    "                c=color, label=label, alpha=0.6, s=50, edgecolors='white')\n",
    "\n",
    "plt.xlabel('Heart Rate (bpm)', fontsize=12)\n",
    "plt.ylabel('Oxygen Saturation (%)', fontsize=12)\n",
    "plt.title('Heart Rate vs. Oxygen Saturation by Patient Status', fontsize=14, fontweight='bold')\n",
    "plt.legend(fontsize=10)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 OBSERVE: Can you visually separate the anomaly groups from normal patients?\")\n",
    "print(\"   This is essentially what our ML models will try to do — but in 8 dimensions!\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5af45fe5",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 3: SUPERVISED Anomaly Detection — Random Forest Classifier\n",
    "\n",
    "### What is Supervised Anomaly Detection?\n",
    "When we have **labeled data** (we know which records are normal and which are anomalous), we can train a classifier to learn the patterns.\n",
    "\n",
    "**Think of it like this:** A doctor trains a medical student by showing them many examples of normal vitals and abnormal vitals. The student learns the patterns and can then diagnose new patients.\n",
    "\n",
    "### Why Random Forest?\n",
    "- Easy to understand and interpret\n",
    "- Handles multiple features well\n",
    "- Shows us which vital signs are most important for detecting anomalies\n",
    "- Works well with relatively small datasets\n",
    "\n",
    "### 3.1 Prepare the Data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2f5b5f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# PREPARE DATA FOR SUPERVISED LEARNING\n",
    "# ============================================================\n",
    "\n",
    "# Features = the vital signs we'll use to make predictions\n",
    "features = vital_cols  # The 8 vital sign columns\n",
    "\n",
    "X = df[features].copy()    # Feature matrix\n",
    "y = df['Anomaly_Label'].copy()  # Target labels\n",
    "\n",
    "# Create a simpler binary label: Normal (0) vs Anomaly (1)\n",
    "y_binary = (y != 'Normal').astype(int)\n",
    "\n",
    "print(\"Feature columns:\", features)\n",
    "print(f\"\\nFeature matrix shape: {X.shape}\")\n",
    "print(f\"\\nBinary label distribution:\")\n",
    "print(f\"  Normal (0):  {(y_binary == 0).sum()}\")\n",
    "print(f\"  Anomaly (1): {(y_binary == 1).sum()}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccaa0942",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# SPLIT INTO TRAINING AND TEST SETS\n",
    "# ============================================================\n",
    "# 70% for training, 30% for testing\n",
    "# stratify ensures both sets have the same proportion of anomalies\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y_binary, test_size=0.30, random_state=42, stratify=y_binary\n",
    ")\n",
    "\n",
    "print(f\"Training set: {X_train.shape[0]} patients\")\n",
    "print(f\"Test set:     {X_test.shape[0]} patients\")\n",
    "print(f\"\\nTraining anomaly rate: {y_train.mean()*100:.1f}%\")\n",
    "print(f\"Test anomaly rate:     {y_test.mean()*100:.1f}%\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec0783c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# SCALE THE FEATURES\n",
    "# ============================================================\n",
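    "# StandardScaler transforms each feature to have mean=0 and std=1\n",
    "# The measurements are on different scales (e.g., white blood cell counts are\n",
    "# roughly 4-11 K while glucose is often near 100 mg/dL).\n",
    "# Tree-based models such as Random Forest are insensitive to scaling, but\n",
    "# the distance-based DBSCAN used later depends on it.\n",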
    "\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)   # Fit on training data\n",
    "X_test_scaled = scaler.transform(X_test)          # Transform test data with same parameters\n",
    "\n",
    "print(\"✅ Features scaled using StandardScaler\")\n",
    "print(f\"\\nExample — first patient's raw values vs. scaled values:\")\n",
    "print(f\"  Raw:    {X_train.iloc[0].values.round(1)}\")\n",
    "print(f\"  Scaled: {X_train_scaled[0].round(2)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85cdd30d",
   "metadata": {},
   "source": [
    "### 3.2 Train the Random Forest Model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de5b39d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# TRAIN THE RANDOM FOREST CLASSIFIER\n",
    "# ============================================================\n",
    "# n_estimators = number of decision trees in the forest\n",
    "# random_state = ensures reproducible results\n",
    "\n",
    "rf_model = RandomForestClassifier(\n",
    "    n_estimators=100,       # 100 decision trees\n",
    "    max_depth=10,           # Limit tree depth to prevent overfitting\n",
    "    random_state=42,\n",
    "    class_weight='balanced' # Handle the imbalanced classes (many more Normal than Anomaly)\n",
    ")\n",
    "\n",
    "rf_model.fit(X_train_scaled, y_train)\n",
    "\n",
    "print(\"✅ Random Forest model trained!\")\n",
    "print(f\"   Number of trees: {rf_model.n_estimators}\")\n",
    "print(f\"   Max depth: {rf_model.max_depth}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a1d1831",
   "metadata": {},
   "source": [
    "### 3.3 Evaluate the Model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cb027eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# MAKE PREDICTIONS AND EVALUATE\n",
    "# ============================================================\n",
    "y_pred = rf_model.predict(X_test_scaled)\n",
    "\n",
    "# Overall accuracy\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "print(f\"Overall Accuracy: {accuracy*100:.1f}%\")\n",
    "print()\n",
    "\n",
    "# Detailed classification report\n",
    "print(\"CLASSIFICATION REPORT\")\n",
    "print(\"=\" * 50)\n",
    "print(classification_report(y_test, y_pred, \n",
    "                            target_names=['Normal', 'Anomaly']))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05fa7afb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# CONFUSION MATRIX — a visual way to see predictions vs. reality\n",
    "# ============================================================\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "disp = ConfusionMatrixDisplay(cm, display_labels=['Normal', 'Anomaly'])\n",
    "disp.plot(cmap='Blues', ax=ax)\n",
    "plt.title('Supervised Model — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"HOW TO READ THIS:\")\n",
    "print(\"  ✅ Top-left:     Normal patients correctly identified as Normal (True Negative)\")\n",
    "print(\"  ✅ Bottom-right:  Anomalies correctly identified as Anomalies (True Positive)\")\n",
    "print(\"  ❌ Top-right:     Normal patients incorrectly flagged as Anomalies (False Positive)\")\n",
    "print(\"  ❌ Bottom-left:   Anomalies MISSED — classified as Normal (False Negative) ⚠️\")\n"
   ]
  },
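  {
   "cell_type": "markdown",
   "id": "4be2a7c1",
   "metadata": {},
   "source": [
    "**Trading precision for recall (illustrative sketch).** The bottom-left cell of the confusion matrix (false negatives) is usually the costliest error in a clinical setting. One common way to reduce it is to lower the probability threshold at which a patient is flagged, accepting more false alarms in return. The cell below is a minimal sketch on synthetic data from `make_classification`, not the patient dataset; the `toy_*` names and the 0.2 threshold are hypothetical choices for illustration.\n",
    "\n",
    "Every patient flagged at 0.5 is still flagged at 0.2, so recall can only stay the same or rise, while precision typically falls.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f03d6e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only: synthetic, imbalanced data (NOT the patient dataset)\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import precision_score, recall_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "toy_X, toy_y = make_classification(\n",
    "    n_samples=1000, n_features=8, weights=[0.88, 0.12],\n",
    "    flip_y=0.05, random_state=0\n",
    ")\n",
    "toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(\n",
    "    toy_X, toy_y, test_size=0.30, random_state=0, stratify=toy_y\n",
    ")\n",
    "\n",
    "toy_model = RandomForestClassifier(n_estimators=100, random_state=0)\n",
    "toy_model.fit(toy_X_train, toy_y_train)\n",
    "\n",
    "# Predicted probability of the positive (anomaly) class for each test row\n",
    "toy_proba = toy_model.predict_proba(toy_X_test)[:, 1]\n",
    "\n",
    "toy_recalls = {}\n",
    "for threshold in (0.5, 0.2):   # 0.2 is an arbitrary, hypothetical lower cutoff\n",
    "    toy_flags = (toy_proba >= threshold).astype(int)\n",
    "    toy_recalls[threshold] = recall_score(toy_y_test, toy_flags, zero_division=0)\n",
    "    toy_precision = precision_score(toy_y_test, toy_flags, zero_division=0)\n",
    "    print(f'threshold={threshold:.1f}  recall={toy_recalls[threshold]:.2f}  precision={toy_precision:.2f}')\n"
   ]
  },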
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "839fecd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# FEATURE IMPORTANCE — Which vital signs matter most?\n",
    "# ============================================================\n",
    "importance = pd.Series(rf_model.feature_importances_, index=features)\n",
    "importance = importance.sort_values(ascending=True)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "importance.plot(kind='barh', color='#1976D2', edgecolor='white')\n",
    "plt.xlabel('Importance Score', fontsize=12)\n",
    "plt.title('Which Vital Signs Are Most Important for Detecting Anomalies?', \n",
    "          fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 INSIGHT: The model tells us which vital signs are most useful\")\n",
    "print(\"   for distinguishing normal patients from those with anomalies.\")\n",
    "print(f\"\\n   Top 3 most important: {', '.join(importance.index[-3:][::-1])}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60546999",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 4: UNSUPERVISED Anomaly Detection\n",
    "\n",
    "### What is Unsupervised Anomaly Detection?\n",
    "In the real world, we often **don't have labels**. We receive patient data and need to find unusual patterns without being told what's \"normal\" or \"anomalous.\"\n",
    "\n",
    "**Think of it like this:** A new doctor arrives at an unfamiliar hospital. Without any training files, they observe hundreds of patients and start noticing which vital sign patterns seem \"typical\" and which seem \"unusual.\"\n",
    "\n",
    "We'll use two unsupervised methods:\n",
    "1. **Isolation Forest** — Isolates anomalies by randomly partitioning data\n",
    "2. **DBSCAN** — Finds dense clusters; points that don't belong to any cluster are anomalies\n",
    "\n",
    "---\n",
    "\n",
    "### 4.1 Isolation Forest\n",
    "\n",
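    "**How it works:** Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between that feature's minimum and maximum. Repeating this recursively builds a tree. Anomalies are easier to isolate (fewer splits needed), so they end up with shorter path lengths from the root, averaged over many trees.\n",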
    "\n",
    "**Analogy:** Imagine playing \"20 Questions\" to identify a patient. Normal patients all look similar and take many questions to tell apart. Anomalous patients are so different that you can identify them with just a few questions.\n"
   ]
  },
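  {
   "cell_type": "markdown",
   "id": "c58b1a73",
   "metadata": {},
   "source": [
    "**Quick intuition check (illustrative sketch).** Before applying Isolation Forest to the patient data, the cell below runs it on a tiny synthetic example, which is an assumption made purely for illustration: 200 points clustered around the origin plus one hypothetical far-away point. `score_samples` returns lower values for points that are easier to isolate, so the far-away point should receive the lowest score.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d7e90f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only: synthetic 2-D points (NOT the patient dataset)\n",
    "import numpy as np\n",
    "from sklearn.ensemble import IsolationForest\n",
    "\n",
    "toy_rng = np.random.default_rng(0)\n",
    "toy_points = np.vstack([\n",
    "    toy_rng.normal(loc=0.0, scale=1.0, size=(200, 2)),  # 200 'typical' points\n",
    "    [[8.0, 8.0]],                                        # one obvious outlier (last row)\n",
    "])\n",
    "\n",
    "toy_iso = IsolationForest(n_estimators=100, random_state=0)\n",
    "toy_iso.fit(toy_points)\n",
    "\n",
    "# Lower score = shorter average path length = more anomalous\n",
    "toy_scores = toy_iso.score_samples(toy_points)\n",
    "toy_most_anomalous = int(np.argmin(toy_scores))\n",
    "\n",
    "print(f'Outlier score:        {toy_scores[-1]:.3f}')\n",
    "print(f'Median inlier score:  {np.median(toy_scores[:-1]):.3f}')\n",
    "print(f'Most anomalous index: {toy_most_anomalous}')\n"
   ]
  },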
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "495c8ace",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# ISOLATION FOREST\n",
    "# ============================================================\n",
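    "# We use ALL data (no train/test split needed for unsupervised methods)\n",
    "# Use a separate scaler so the one fitted on the training set stays unchanged\n",
    "scaler_all = StandardScaler()\n",
    "X_all_scaled = scaler_all.fit_transform(X)\n",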
    "\n",
    "# contamination = expected proportion of anomalies in the data\n",
    "# We know ~12% are anomalies, but in practice you'd estimate this\n",
    "iso_forest = IsolationForest(\n",
    "    n_estimators=100,\n",
    "    contamination=0.12,    # We expect ~12% anomalies\n",
    "    random_state=42\n",
    ")\n",
    "\n",
    "# Fit and predict: -1 = anomaly, 1 = normal\n",
    "iso_predictions = iso_forest.fit_predict(X_all_scaled)\n",
    "\n",
    "# Convert to match our labeling: 0 = normal, 1 = anomaly\n",
    "iso_labels = (iso_predictions == -1).astype(int)\n",
    "\n",
    "print(\"ISOLATION FOREST RESULTS\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Predicted Normal:  {(iso_labels == 0).sum()}\")\n",
    "print(f\"Predicted Anomaly: {(iso_labels == 1).sum()}\")\n",
    "print(f\"\\nActual Normal:     {(y_binary == 0).sum()}\")\n",
    "print(f\"Actual Anomaly:    {(y_binary == 1).sum()}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31076baa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# EVALUATE ISOLATION FOREST (compare with true labels)\n",
    "# ============================================================\n",
    "print(\"ISOLATION FOREST — Classification Report\")\n",
    "print(\"=\" * 50)\n",
    "print(classification_report(y_binary, iso_labels, \n",
    "                            target_names=['Normal', 'Anomaly']))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "cm_iso = confusion_matrix(y_binary, iso_labels)\n",
    "disp_iso = ConfusionMatrixDisplay(cm_iso, display_labels=['Normal', 'Anomaly'])\n",
    "disp_iso.plot(cmap='Oranges', ax=ax)\n",
    "plt.title('Isolation Forest (Unsupervised) — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fa70b2f",
   "metadata": {},
   "source": [
    "### 4.2 DBSCAN (Density-Based Spatial Clustering)\n",
    "\n",
    "**How it works:** DBSCAN groups together points that are closely packed (high density). Points in low-density regions that don't belong to any cluster are labeled as anomalies (noise).\n",
    "\n",
    "**Analogy:** Imagine patients plotted on a map based on their vitals. Normal patients form large crowds. Anomalous patients stand alone, far from the crowds.\n",
    "\n",
    "**Key parameters:**\n",
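    "- `eps`: Maximum distance between two points for them to be considered neighbors\n",
    "- `min_samples`: Minimum number of points (including the point itself) within `eps` for a point to count as a core point of a cluster\n"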
   ]
  },
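  {
   "cell_type": "markdown",
   "id": "a06c3b85",
   "metadata": {},
   "source": [
    "**Seeing the parameters in action (illustrative sketch).** The cell below runs DBSCAN on synthetic 2-D data, not the patient dataset: one dense crowd of 100 points plus three hypothetical isolated points. With `eps=1.0` and `min_samples=5` (arbitrary values chosen for this toy data), the isolated points have no neighbors within `eps`, so DBSCAN labels them `-1` (noise).\n",
    "\n",
    "Shrinking `eps` typically turns more of the crowd's edge points into noise, which is why the value has to be tuned to the scale of the data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72f4e8d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only: synthetic 2-D points (NOT the patient dataset)\n",
    "import numpy as np\n",
    "from sklearn.cluster import DBSCAN\n",
    "\n",
    "toy_rng = np.random.default_rng(1)\n",
    "toy_crowd = toy_rng.normal(loc=0.0, scale=0.3, size=(100, 2))      # dense 'normal' crowd\n",
    "toy_loners = np.array([[10.0, 10.0], [-10.0, 8.0], [9.0, -9.0]])   # isolated points\n",
    "toy_data = np.vstack([toy_crowd, toy_loners])\n",
    "\n",
    "# eps and min_samples are hypothetical values picked for this toy data\n",
    "toy_db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(toy_data)\n",
    "\n",
    "print('Labels for the three isolated points:', toy_db_labels[-3:])\n",
    "print('Total noise points:', int((toy_db_labels == -1).sum()))\n"
   ]
  },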
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddb1dd6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# DBSCAN CLUSTERING\n",
    "# ============================================================\n",
    "dbscan = DBSCAN(\n",
    "    eps=2.5,          # Neighborhood radius\n",
    "    min_samples=10    # Points within eps (incl. the point itself) needed for a core point\n",
    ")\n",
    "\n",
    "dbscan_labels = dbscan.fit_predict(X_all_scaled)\n",
    "\n",
    "# DBSCAN labels: -1 = noise/anomaly, 0,1,2... = cluster IDs\n",
    "dbscan_anomaly = (dbscan_labels == -1).astype(int)\n",
    "\n",
    "print(\"DBSCAN RESULTS\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Number of clusters found: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}\")\n",
    "print(f\"Noise points (anomalies): {(dbscan_labels == -1).sum()}\")\n",
    "print(f\"\\nCluster distribution:\")\n",
    "for label in sorted(set(dbscan_labels)):\n",
    "    count = (dbscan_labels == label).sum()\n",
    "    name = \"Noise/Anomaly\" if label == -1 else f\"Cluster {label}\"\n",
    "    print(f\"  {name}: {count} patients\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "655eb087",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# EVALUATE DBSCAN\n",
    "# ============================================================\n",
    "print(\"DBSCAN — Classification Report\")\n",
    "print(\"=\" * 50)\n",
    "print(classification_report(y_binary, dbscan_anomaly, \n",
    "                            target_names=['Normal', 'Anomaly']))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "cm_db = confusion_matrix(y_binary, dbscan_anomaly)\n",
    "disp_db = ConfusionMatrixDisplay(cm_db, display_labels=['Normal', 'Anomaly'])\n",
    "disp_db.plot(cmap='Greens', ax=ax)\n",
    "plt.title('DBSCAN (Unsupervised) — Confusion Matrix', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb0baa17",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 5: Compare All Three Methods\n",
    "\n",
    "Now let's put all three approaches side by side to see how they performed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "771d2de1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ============================================================\n",
    "# SIDE-BY-SIDE COMPARISON\n",
    "# ============================================================\n",
    "\n",
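    "# Evaluate all three methods on the full dataset.\n",
    "# Caveat: RF was trained on 70% of these rows, so its scores here are optimistic\n",
    "# relative to the unsupervised methods; Step 3.3 reports its held-out results.\n",
    "# Scale with training-set statistics only (refitting on X would leak test data).\n",
    "rf_full_pred = rf_model.predict(StandardScaler().fit(X_train).transform(X))\n",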
    "\n",
    "from sklearn.metrics import precision_score, recall_score, f1_score\n",
    "\n",
    "methods = ['Random Forest\\n(Supervised)', 'Isolation Forest\\n(Unsupervised)', 'DBSCAN\\n(Unsupervised)']\n",
    "predictions = [rf_full_pred, iso_labels, dbscan_anomaly]\n",
    "\n",
    "metrics = {'Accuracy': [], 'Precision': [], 'Recall': [], 'F1 Score': []}\n",
    "\n",
    "for pred in predictions:\n",
    "    metrics['Accuracy'].append(accuracy_score(y_binary, pred))\n",
    "    metrics['Precision'].append(precision_score(y_binary, pred, zero_division=0))\n",
    "    metrics['Recall'].append(recall_score(y_binary, pred, zero_division=0))\n",
    "    metrics['F1 Score'].append(f1_score(y_binary, pred, zero_division=0))\n",
    "\n",
    "fig, axes = plt.subplots(1, 4, figsize=(20, 5))\n",
    "colors = ['#1976D2', '#F57C00', '#388E3C']\n",
    "\n",
    "for i, (metric_name, values) in enumerate(metrics.items()):\n",
    "    bars = axes[i].bar(methods, values, color=colors, edgecolor='white', linewidth=1.5)\n",
    "    axes[i].set_title(metric_name, fontsize=14, fontweight='bold')\n",
    "    axes[i].set_ylim(0, 1.05)\n",
    "    axes[i].axhline(y=0.5, color='gray', linestyle='--', alpha=0.3)\n",
    "    for bar, val in zip(bars, values):\n",
    "        axes[i].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,\n",
    "                     f'{val:.2f}', ha='center', fontsize=12, fontweight='bold')\n",
    "\n",
    "plt.suptitle('Model Comparison: Supervised vs. Unsupervised Anomaly Detection', \n",
    "             fontsize=16, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb47be61",
   "metadata": {},
   "source": [
    "## Step 6: Key Takeaways\n",
    "\n",
    "### 📊 What We Learned\n",
    "\n",
    "| Aspect | Supervised (Random Forest) | Unsupervised (Isolation Forest / DBSCAN) |\n",
    "|--------|---------------------------|------------------------------------------|\n",
    "| **Requires labels?** | ✅ Yes — needs labeled training data | ❌ No — works with unlabeled data |\n",
    "| **Best when...** | You have historical labeled data | You're exploring new data or labels are unavailable |\n",
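    "| **Strengths** | Often more accurate when reliable labels exist; feature importances show which inputs drive predictions | Can discover unknown anomaly types |\n",
    "| **Weaknesses** | Can only detect anomaly types it was trained on | May produce more false positives; results depend on parameters such as `contamination` and `eps` |\n",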
    "| **Healthcare example** | Trained on past sepsis cases to catch future ones | Monitoring ICU for any unusual pattern |\n",
    "\n",
    "### 🏥 Real-World Healthcare Applications\n",
    "- **Supervised:** Hospital alarm systems trained on historical critical events\n",
    "- **Unsupervised:** Surveillance of new disease outbreaks (COVID-19 early detection)\n",
    "- **Hybrid:** Use unsupervised to discover, then supervised to automate\n",
    "\n",
    "### 🤔 Discussion Questions\n",
    "1. Why might **recall** (catching all anomalies) be more important than **precision** in healthcare?\n",
    "2. What would happen if a new type of anomaly appeared that wasn't in the training data?\n",
    "3. How would you decide between supervised and unsupervised in a real hospital setting?\n",
    "\n",
    "---\n",
    "*Notebook created for DATA 110 — Introduction to Data Science*  \n",
    "*School of Data Science and Society, UNC Chapel Hill*\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}