{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a6851986",
   "metadata": {},
   "source": [
    "# Nike Customer Segmentation — Clustering Demo\n",
    "## Data Analytics & Data Mining\n",
    "\n",
    "**Business Context:** Nike's direct-to-consumer team wants to segment their customer base to personalize marketing campaigns. Instead of treating all 1,200 customers the same, they want to identify natural groups and tailor outreach accordingly.\n",
    "\n",
    "**The Big Question:** Can we find meaningful customer segments that a marketing manager can *name*, *describe*, and *act on*?\n",
    "\n",
    "**What Makes This Different from Classification?**\n",
    "- In classification (like ShieldScore), we had labeled data — we *knew* the right answer\n",
    "- In clustering, there's no \"right answer\" column. We're discovering structure the data reveals on its own\n",
    "- The model's output isn't a prediction — it's a **segment assignment**\n",
    "- Deployment means: when a *new* customer shows up, which segment do they belong to?\n",
    "\n",
    "---\n",
    "### Pipeline Overview (PAIR Framework)\n",
    "| Step | What We Do | Business Connection |\n",
    "|------|-----------|---------------------|\n",
    "| **P**rediction | Assign each customer to a segment | \"Which group does this customer belong to?\" |\n",
    "| **A**ction | Trigger segment-specific campaigns | \"Send the right offer to the right person\" |\n",
    "| **I**mpact | Measure campaign lift by segment | \"Did personalization increase conversion?\" |\n",
    "| **R**isk | Monitor segment drift over time | \"Are our segments still valid next quarter?\" |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24a01a42",
   "metadata": {},
   "source": [
    "## Part 1: Get & Prepare the Data\n",
    "> **\"Feel the pain first\"** — Before we let the algorithm do its thing, we need to understand what we're working with. Clustering is *extremely* sensitive to data quality issues that supervised models might shrug off.\n",
    "\n",
    "### Why Data Prep Matters More for Clustering\n",
    "- Clustering uses **distance** to group customers. If `annual_spend` ranges from \\$20–\\$8,000 and `return_rate` ranges from 0–0.50, spending will dominate every distance calculation\n",
    "- Missing values? Most clustering algorithms can't handle them at all\n",
    "- Outliers? They'll create their own \"cluster of one\" and distort everything else\n"
   ]
  },
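  {
   "cell_type": "markdown",
   "id": "sk-scale-demo",
   "metadata": {},
   "source": [
    "To make the scale problem concrete, here is a minimal sketch using three made-up customers and only two features. The values and the helper name `euclid` are illustrative assumptions, not taken from the Nike dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Three made-up customers: [annual_spend, return_rate]\n",
    "a = np.array([500.0, 0.05])\n",
    "b = np.array([520.0, 0.45])   # similar spend, very different return behavior\n",
    "c = np.array([900.0, 0.05])   # different spend, identical return behavior\n",
    "\n",
    "def euclid(u, v):\n",
    "    # Plain Euclidean distance between two feature vectors\n",
    "    return float(np.sqrt(((u - v) ** 2).sum()))\n",
    "\n",
    "raw_ab, raw_ac = euclid(a, b), euclid(a, c)\n",
    "\n",
    "# Z-score each column using the mean and std of these three rows\n",
    "X_demo = np.vstack([a, b, c])\n",
    "Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)\n",
    "z_ab, z_ac = euclid(Z[0], Z[1]), euclid(Z[0], Z[2])\n",
    "\n",
    "print(f'raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}')\n",
    "print(f'scaled: d(a,b)={z_ab:.2f}  d(a,c)={z_ac:.2f}')\n",
    "```\n",
    "\n",
    "On the raw values, customer `c` looks far more different from `a` than `b` does, purely because spend is measured in dollars. After z-scoring, the gap in return behavior counts about as much as the gap in spend."
   ]
  },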
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8646ebba",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.metrics import silhouette_score, silhouette_samples\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Set plot style\n",
    "plt.rcParams['figure.figsize'] = (10, 6)\n",
    "plt.rcParams['font.size'] = 11\n",
    "sns.set_palette('Set2')\n",
    "\n",
    "# Load data\n",
    "df = pd.read_csv('nike_customers_student.csv')\n",
    "print(f\"Dataset: {df.shape[0]} customers, {df.shape[1]} features\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38e89fbd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick health check — what are we dealing with?\n",
    "print(\"=\" * 60)\n",
    "print(\"DATA QUALITY REPORT\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "print(f\"\\nTotal records: {len(df)}\")\n",
    "print(f\"\\nMissing Values:\")\n",
    "missing = df.isnull().sum()\n",
    "missing_pct = (missing / len(df) * 100).round(1)\n",
    "missing_report = pd.DataFrame({'Missing': missing, 'Pct': missing_pct})\n",
    "print(missing_report[missing_report['Missing'] > 0])\n",
    "\n",
    "print(f\"\\nNumeric Column Ranges:\")\n",
    "for col in df.select_dtypes(include=[np.number]).columns:\n",
    "    print(f\"  {col:30s}  min={df[col].min():>10.2f}  max={df[col].max():>10.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a40e5c22",
   "metadata": {},
   "source": [
    "### 🔍 What did you notice?\n",
    "**Three problems that will wreck our clustering if we ignore them:**\n",
    "1. **Missing values** (~8% in several columns) — K-Means can't compute distances with NaN\n",
    "2. **Scale differences** — `annual_spend` is in hundreds/thousands, `return_rate` is 0-0.50\n",
    "3. **Potential outliers** — check that spending range! Some customers spend 10-40x the average\n",
    "\n",
    "Let's fix each one, in order.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6d66432",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 1: Handle missing values\n",
    "# Strategy: median imputation (robust to outliers)\n",
    "# In production, you'd investigate WHY values are missing first!\n",
    "\n",
    "numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()\n",
    "\n",
    "print(\"Imputing missing values with median...\")\n",
    "for col in numeric_cols:\n",
    "    n_missing = df[col].isnull().sum()\n",
    "    if n_missing > 0:\n",
    "        median_val = df[col].median()\n",
    "        df[col].fillna(median_val, inplace=True)\n",
    "        print(f\"  {col}: filled {n_missing} missing values with median = {median_val:.2f}\")\n",
    "\n",
    "print(f\"\\nRemaining missing values: {df.isnull().sum().sum()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6248f6f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 2: Detect outliers\n",
    "# Let's see how extreme those spending whales are\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
    "\n",
    "for i, col in enumerate(['annual_spend', 'avg_order_value', 'app_sessions_monthly']):\n",
    "    axes[i].boxplot(df[col].dropna(), vert=True)\n",
    "    axes[i].set_title(col.replace('_', ' ').title())\n",
    "    axes[i].set_ylabel('Value')\n",
    "\n",
    "plt.suptitle('Outlier Check — Look for Extreme Values', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Count outliers using IQR method\n",
    "print(\"\\nOutlier count (IQR method, >3x IQR):\")\n",
    "for col in ['annual_spend', 'avg_order_value']:\n",
    "    Q1 = df[col].quantile(0.25)\n",
    "    Q3 = df[col].quantile(0.75)\n",
    "    IQR = Q3 - Q1\n",
    "    outliers = ((df[col] < Q1 - 3*IQR) | (df[col] > Q3 + 3*IQR)).sum()\n",
    "    print(f\"  {col}: {outliers} extreme outliers\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "654c4b48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 2b: Cap outliers at 99th percentile (Winsorization)\n",
    "# We don't remove them — we pull them back to a reasonable range\n",
    "# This preserves the \"high spender\" signal without letting them hijack distances\n",
    "\n",
    "for col in ['annual_spend', 'avg_order_value']:\n",
    "    cap = df[col].quantile(0.99)\n",
    "    n_capped = (df[col] > cap).sum()\n",
    "    df[col] = df[col].clip(upper=cap)\n",
    "    print(f\"Capped {col} at {cap:.2f} ({n_capped} values affected)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9bdc4c8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 3: Select features for clustering\n",
    "# We only use BEHAVIORAL features — things customers DO, not who they ARE\n",
    "# Gender and loyalty_tier are categorical — we'll use them AFTER clustering to profile segments\n",
    "\n",
    "clustering_features = [\n",
    "    'annual_spend',           # How much they spend\n",
    "    'purchase_frequency',     # How often they buy\n",
    "    'avg_order_value',        # How much per transaction\n",
    "    'app_sessions_monthly',   # Digital engagement\n",
    "    'pct_running',            # Product category mix\n",
    "    'pct_training',\n",
    "    'pct_lifestyle',\n",
    "    'return_rate',            # Satisfaction signal\n",
    "    'days_since_last_purchase',  # Recency\n",
    "    'channel_online_pct',     # Channel preference\n",
    "]\n",
    "\n",
    "X = df[clustering_features].copy()\n",
    "print(f\"Feature matrix: {X.shape[0]} customers × {X.shape[1]} features\")\n",
    "print(f\"\\nFeature ranges BEFORE scaling:\")\n",
    "for col in X.columns:\n",
    "    print(f\"  {col:30s}  range: {X[col].min():>10.2f} to {X[col].max():>10.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e98abde9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 4: Standardize (Z-score scaling)\n",
    "# THIS IS THE MOST CRITICAL STEP FOR CLUSTERING\n",
    "# Without scaling, annual_spend ($20-$2000) would completely dominate\n",
    "# return_rate (0.0-0.5) in every distance calculation\n",
    "\n",
    "scaler = StandardScaler()\n",
    "X_scaled = scaler.fit_transform(X)\n",
    "X_scaled_df = pd.DataFrame(X_scaled, columns=clustering_features)\n",
    "\n",
    "print(\"Feature ranges AFTER scaling (should all be roughly -3 to +3):\")\n",
    "for col in X_scaled_df.columns:\n",
    "    print(f\"  {col:30s}  range: {X_scaled_df[col].min():>6.2f} to {X_scaled_df[col].max():>6.2f}\")\n",
    "\n",
    "# SAVE the scaler parameters — we'll need these for deployment!\n",
    "scaler_params = pd.DataFrame({\n",
    "    'feature': clustering_features,\n",
    "    'mean': scaler.mean_,\n",
    "    'std': scaler.scale_\n",
    "})\n",
    "print(\"\\n✅ Scaler parameters saved — these are ESSENTIAL for deployment\")\n",
    "print(scaler_params.to_string(index=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22e4b696",
   "metadata": {},
   "source": [
    "## Part 2: Train the Clustering Model\n",
    "\n",
    "### How K-Means Works (The 30-Second Version)\n",
    "1. Pick K random starting points (centroids)\n",
    "2. Assign each customer to their nearest centroid\n",
    "3. Recalculate centroids as the average of their assigned customers\n",
    "4. Repeat steps 2-3 until assignments stop changing\n",
    "\n",
    "**The big question: How do we choose K?**\n",
    "- Too few clusters → segments are too broad to be actionable\n",
    "- Too many clusters → segments are too narrow for marketing to manage\n",
    "- We'll use two methods: the **Elbow Method** and the **Silhouette Score**\n"
   ]
  },
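  {
   "cell_type": "markdown",
   "id": "sk-kmeans-loop",
   "metadata": {},
   "source": [
    "The four steps above can be sketched in plain NumPy. This is a teaching sketch under simplifying assumptions: the function name `kmeans_sketch`, the toy data, and the purely random initialization are all illustrative, and scikit-learn's `KMeans` (used in this notebook) adds smarter initialization and multiple restarts.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def kmeans_sketch(X, k, n_iter=100, seed=0):\n",
    "    rng = np.random.default_rng(seed)\n",
    "    # Step 1: pick k distinct rows as starting centroids\n",
    "    centroids = X[rng.choice(len(X), size=k, replace=False)]\n",
    "    labels = np.full(len(X), -1)\n",
    "    for _ in range(n_iter):\n",
    "        # Step 2: assign each point to its nearest centroid\n",
    "        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)\n",
    "        new_labels = dists.argmin(axis=1)\n",
    "        # Step 4: stop once assignments no longer change\n",
    "        if np.array_equal(new_labels, labels):\n",
    "            break\n",
    "        labels = new_labels\n",
    "        # Step 3: move each centroid to the mean of its assigned points\n",
    "        for j in range(k):\n",
    "            if np.any(labels == j):   # leave an empty cluster where it is\n",
    "                centroids[j] = X[labels == j].mean(axis=0)\n",
    "    return labels, centroids\n",
    "\n",
    "# Two well-separated synthetic blobs (made-up data)\n",
    "rng = np.random.default_rng(1)\n",
    "X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])\n",
    "labels, centroids = kmeans_sketch(X_toy, k=2)\n",
    "print(np.round(centroids, 2))\n",
    "```\n",
    "\n",
    "Because the starting centroids are random, a single run can settle on a poor grouping for harder data. That is the reason the notebook passes `n_init=10` to `KMeans`, which keeps the best of ten runs."
   ]
  },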
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91f43596",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Elbow Method — find where adding more clusters stops helping\n",
    "inertias = []\n",
    "sil_scores = []\n",
    "K_range = range(2, 11)\n",
    "\n",
    "for k in K_range:\n",
    "    km = KMeans(n_clusters=k, random_state=42, n_init=10)\n",
    "    labels = km.fit_predict(X_scaled)\n",
    "    inertias.append(km.inertia_)\n",
    "    sil_scores.append(silhouette_score(X_scaled, labels))\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Elbow plot\n",
    "axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)\n",
    "axes[0].set_xlabel('Number of Clusters (K)')\n",
    "axes[0].set_ylabel('Inertia (Within-Cluster Sum of Squares)')\n",
    "axes[0].set_title('Elbow Method — Look for the \"Bend\"')\n",
    "axes[0].set_xticks(list(K_range))\n",
    "\n",
    "# Silhouette plot\n",
    "axes[1].plot(K_range, sil_scores, 'rs-', linewidth=2, markersize=8)\n",
    "axes[1].set_xlabel('Number of Clusters (K)')\n",
    "axes[1].set_ylabel('Silhouette Score')\n",
    "axes[1].set_title('Silhouette Score — Higher = Better Separation')\n",
    "axes[1].set_xticks(list(K_range))\n",
    "\n",
    "plt.suptitle('How Many Clusters Should We Use?', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\nSilhouette scores by K:\")\n",
    "for k, s in zip(K_range, sil_scores):\n",
    "    marker = \" ◄ BEST\" if s == max(sil_scores) else \"\"\n",
    "    print(f\"  K={k}: {s:.4f}{marker}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca0565d5",
   "metadata": {},
   "source": [
    "### 🤔 Choosing K — It's a Business Decision, Not Just Math\n",
    "\n",
    "The elbow plot and silhouette score suggest K=3 or K=4. But here's the thing:\n",
    "\n",
    "**The \"right\" number of clusters depends on what the business can actually DO with them.**\n",
    "\n",
    "- K=2: Too broad — \"big spenders\" vs \"everyone else\" isn't actionable enough\n",
    "- K=3: Clean separation, but might miss an important sub-group\n",
    "- K=4: Four distinct personas that a marketing team can realistically manage\n",
    "- K=8: Good math, but can the marketing team run 8 different campaigns? Probably not.\n",
    "\n",
    "**We'll go with K=4** — it balances statistical quality with practical usability.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e65cc76b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the final model with K=4\n",
    "K = 4\n",
    "kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)\n",
    "df['cluster'] = kmeans.fit_predict(X_scaled)\n",
    "\n",
    "print(f\"Final model: K={K}\")\n",
    "print(f\"Silhouette score: {silhouette_score(X_scaled, df['cluster']):.4f}\")\n",
    "print(f\"\\nCluster sizes:\")\n",
    "print(df['cluster'].value_counts().sort_index())\n",
    "\n",
    "# Store centroids (THESE are the deployed model)\n",
    "centroids_scaled = kmeans.cluster_centers_\n",
    "centroids_original = scaler.inverse_transform(centroids_scaled)\n",
    "centroids_df = pd.DataFrame(centroids_original, columns=clustering_features)\n",
    "centroids_df.index.name = 'cluster'\n",
    "print(\"\\n\" + \"=\" * 60)\n",
    "print(\"CLUSTER CENTROIDS (original scale) — This IS your deployed model\")\n",
    "print(\"=\" * 60)\n",
    "print(centroids_df.round(2).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfd47061",
   "metadata": {},
   "source": [
    "## Part 3: Validate the Model\n",
    "\n",
    "Validation for clustering is different from supervised learning. There's no accuracy score because there's no \"right answer.\" Instead, we ask:\n",
    "\n",
    "1. **Are the clusters well-separated?** (Silhouette analysis)\n",
    "2. **Are the clusters interpretable?** (Can a business person name them?)\n",
    "3. **Are the clusters stable?** (Would we get similar groups with different data?)\n",
    "4. **Are the clusters actionable?** (Can marketing DO something different for each one?)\n"
   ]
  },
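  {
   "cell_type": "markdown",
   "id": "sk-stability",
   "metadata": {},
   "source": [
    "Question 3 (stability) can be probed by refitting K-Means on bootstrap resamples and comparing the resulting labels with the adjusted Rand index (ARI), which ignores how clusters are numbered. The sketch below uses synthetic blobs from `make_blobs` as a stand-in for the scaled customer matrix; the data, the 10 resamples, and the variable names are assumptions for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.datasets import make_blobs\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "# Synthetic stand-in for the scaled feature matrix (made-up data)\n",
    "X_demo, _ = make_blobs(\n",
    "    n_samples=600,\n",
    "    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],\n",
    "    cluster_std=0.8,\n",
    "    random_state=0,\n",
    ")\n",
    "\n",
    "# Reference model fitted on all of the data\n",
    "base = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_demo)\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "ari_scores = []\n",
    "for _ in range(10):\n",
    "    # Refit on a bootstrap resample, then label the FULL data with that model\n",
    "    idx = rng.choice(len(X_demo), size=len(X_demo), replace=True)\n",
    "    km_boot = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_demo[idx])\n",
    "    ari_scores.append(adjusted_rand_score(base.labels_, km_boot.predict(X_demo)))\n",
    "\n",
    "print(f'Mean ARI over 10 resamples: {np.mean(ari_scores):.3f}')\n",
    "```\n",
    "\n",
    "An ARI near 1 means the resampled models reproduce the same grouping; values that swing widely between resamples suggest fragile segments. To apply this to the notebook's data, `X_scaled` would take the place of `X_demo`."
   ]
  },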
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbd2b6ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Validation 1: Silhouette Analysis — per-cluster quality\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "\n",
    "sample_silhouette_values = silhouette_samples(X_scaled, df['cluster'])\n",
    "y_lower = 10\n",
    "\n",
    "colors = ['#2ecc71', '#3498db', '#e74c3c', '#f39c12']\n",
    "for i in range(K):\n",
    "    cluster_sil = sample_silhouette_values[df['cluster'] == i]\n",
    "    cluster_sil.sort()\n",
    "    size_cluster = cluster_sil.shape[0]\n",
    "    y_upper = y_lower + size_cluster\n",
    "\n",
    "    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,\n",
    "                     facecolor=colors[i], edgecolor=colors[i], alpha=0.7)\n",
    "    ax.text(-0.05, y_lower + 0.5 * size_cluster, f'Cluster {i}')\n",
    "    y_lower = y_upper + 10\n",
    "\n",
    "avg_sil = silhouette_score(X_scaled, df['cluster'])\n",
    "ax.axvline(x=avg_sil, color='red', linestyle='--', label=f'Average: {avg_sil:.3f}')\n",
    "ax.set_xlabel('Silhouette Coefficient')\n",
    "ax.set_ylabel('Customers (sorted by cluster)')\n",
    "ax.set_title('Silhouette Plot — Are Clusters Well-Separated?')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"Interpretation guide:\")\n",
    "print(\"  Values near +1 = customer is well-matched to their cluster\")\n",
    "print(\"  Values near  0 = customer is on the border between clusters\")\n",
    "print(\"  Values < 0     = customer is probably in the WRONG cluster\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c29bb413",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Validation 2: Profile each cluster — can we NAME them?\n",
    "print(\"=\" * 80)\n",
    "print(\"CLUSTER PROFILES — The Business Validation Test\")\n",
    "print(\"If you can't name it and describe it in one sentence, it's not useful.\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "profile_cols = clustering_features + ['age']\n",
    "cluster_profiles = df.groupby('cluster')[profile_cols].mean().round(2)\n",
    "\n",
    "# Add size\n",
    "cluster_profiles['n_customers'] = df.groupby('cluster').size()\n",
    "cluster_profiles['pct_of_total'] = (cluster_profiles['n_customers'] / len(df) * 100).round(1)\n",
    "\n",
    "print(cluster_profiles.T.to_string())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3e1f358",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Validation 2b: Visual profiles — radar-style comparison\n",
    "fig, axes = plt.subplots(2, 3, figsize=(16, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "viz_cols = ['annual_spend', 'purchase_frequency', 'avg_order_value', \n",
    "            'app_sessions_monthly', 'pct_lifestyle', 'days_since_last_purchase']\n",
    "colors = ['#2ecc71', '#3498db', '#e74c3c', '#f39c12']\n",
    "\n",
    "for i, col in enumerate(viz_cols):\n",
    "    ax = axes[i]\n",
    "    for c in range(K):\n",
    "        data = df[df['cluster'] == c][col]\n",
    "        ax.hist(data, bins=20, alpha=0.5, label=f'Cluster {c}', color=colors[c])\n",
    "    ax.set_title(col.replace('_', ' ').title())\n",
    "    ax.legend(fontsize=8)\n",
    "\n",
    "plt.suptitle('Cluster Distributions — Do They Look Different?', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "841e960f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Validation 2c: Cross-tab with categorical variables\n",
    "# These weren't used in clustering — they're independent validation\n",
    "print(\"\\nLoyalty Tier Distribution by Cluster:\")\n",
    "print(pd.crosstab(df['cluster'], df['loyalty_tier'], normalize='index').round(2))\n",
    "\n",
    "print(\"\\nGender Distribution by Cluster:\")\n",
    "print(pd.crosstab(df['cluster'], df['gender'], normalize='index').round(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba79e85b",
   "metadata": {},
   "source": [
    "### 🏷️ Name Your Clusters!\n",
    "\n",
    "Based on the profiles above, here are suggested segment names. A good name should be:\n",
    "- **Memorable** — a marketing manager can use it in a meeting without explaining\n",
    "- **Descriptive** — it captures the defining behavior of the group\n",
    "- **Actionable** — it implies what kind of campaign to run\n",
    "\n",
    "| Cluster | Suggested Name | Key Traits | Marketing Action |\n",
    "|---------|---------------|------------|-----------------|\n",
    "| 0 | *(fill in after reviewing profiles)* | | |\n",
    "| 1 | | | |\n",
    "| 2 | | | |\n",
    "| 3 | | | |\n",
    "\n",
    "> **Note:** The ground truth segments are Performance Athletes, Weekend Warriors, Sneakerheads, and Casual Walkers. Cluster numbers won't match exactly — that's expected! The question is whether students discover *similar* groupings.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f45e6234",
   "metadata": {},
   "source": [
    "## Part 4: Deploy the Model\n",
    "\n",
    "**This is where most clustering projects die.** Students (and many practitioners!) build beautiful cluster visualizations, present them in a slide deck, and then... nothing happens. The model never scores a new customer.\n",
    "\n",
    "### The Deployment Question for Clustering\n",
    "> \"A new customer just made their first 3 purchases. Which segment do they belong to? And what campaign should they get?\"\n",
    "\n",
    "**The answer is surprisingly simple:** Calculate the distance from the new customer to each centroid, and assign them to the nearest one. That's it.\n",
    "\n",
    "### What Gets Deployed (Not the Algorithm — the Artifacts!)\n",
    "\n",
    "| Artifact | What It Contains | Why It Matters |\n",
    "|----------|-----------------|---------------|\n",
    "| **Scaler parameters** | Mean and std for each feature | To standardize new data the same way |\n",
    "| **Centroids** | Center point of each cluster (scaled) | To calculate distances |\n",
    "| **Segment action map** | Business rules per segment | So the assignment triggers action |\n",
    "\n",
    "This is the **\"Train in Python, Deploy in Excel\"** pattern.\n"
   ]
  },
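  {
   "cell_type": "markdown",
   "id": "sk-artifact-score",
   "metadata": {},
   "source": [
    "The nearest-centroid rule needs nothing beyond the artifacts in the table. Here is a minimal sketch with two features and two clusters; every number in it is a hypothetical placeholder, and the real values would come from the exported scaler and centroid tables.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical artifact values (NOT the notebook's fitted numbers)\n",
    "means = np.array([600.0, 0.12])          # per-feature training mean\n",
    "stds = np.array([400.0, 0.08])           # per-feature training std\n",
    "centroids_scaled = np.array([\n",
    "    [-0.5, 0.3],    # cluster 0\n",
    "    [1.5, -0.4],    # cluster 1\n",
    "])\n",
    "\n",
    "new_customer = np.array([1200.0, 0.10])  # same feature order as the artifacts\n",
    "\n",
    "# 1. Standardize with the TRAINING mean and std: z = (x - mean) / std\n",
    "z = (new_customer - means) / stds\n",
    "# 2. Euclidean distance from z to every centroid\n",
    "distances = np.sqrt(((centroids_scaled - z) ** 2).sum(axis=1))\n",
    "# 3. The nearest centroid is the segment\n",
    "segment = int(distances.argmin())\n",
    "\n",
    "print('z-scores :', np.round(z, 3))\n",
    "print('distances:', np.round(distances, 3))\n",
    "print('segment  :', segment)\n",
    "```\n",
    "\n",
    "The same three steps translate to a spreadsheet: one z-score formula per feature, one distance per centroid (for example `SQRT(SUMXMY2(...))` in Excel), and a lookup of the smallest distance. If the standard deviations are ever recomputed in Excel, `STDEV.P` is the function that matches `StandardScaler`, which uses the population standard deviation."
   ]
  },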
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93a66cf8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export deployment artifacts\n",
    "\n",
    "# 1. Scaler parameters\n",
    "scaler_export = pd.DataFrame({\n",
    "    'feature': clustering_features,\n",
    "    'mean': scaler.mean_,\n",
    "    'std': scaler.scale_\n",
    "})\n",
    "scaler_export.to_csv('deployment_scaler_params.csv', index=False)\n",
    "\n",
    "# 2. Centroids (in SCALED space — this is what we calculate distance against)\n",
    "centroids_export = pd.DataFrame(\n",
    "    centroids_scaled, \n",
    "    columns=clustering_features\n",
    ")\n",
    "centroids_export.index.name = 'cluster'\n",
    "centroids_export.to_csv('deployment_centroids_scaled.csv')\n",
    "\n",
    "# 3. Centroids in original scale (for human interpretation)\n",
    "centroids_original_export = pd.DataFrame(\n",
    "    centroids_original,\n",
    "    columns=clustering_features\n",
    ")\n",
    "centroids_original_export.index.name = 'cluster'\n",
    "centroids_original_export.to_csv('deployment_centroids_original.csv')\n",
    "\n",
    "print(\"✅ Deployment files exported:\")\n",
    "print(\"   deployment_scaler_params.csv       — for standardizing new data\")\n",
    "print(\"   deployment_centroids_scaled.csv    — for distance calculations\")\n",
    "print(\"   deployment_centroids_original.csv  — for human-readable profiles\")\n",
    "print()\n",
    "print(\"Next step: Build the Excel scorer using these files!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60b0d33e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# DEMO: Score a new customer in Python (before we build the Excel version)\n",
    "\n",
    "def score_new_customer(customer_data, scaler, kmeans_model, feature_names):\n",
    "    \"\"\"\n",
    "    Score a single new customer.\n",
    "    This is exactly what the Excel scorer will do — just with formulas instead of code.\n",
    "    \"\"\"\n",
    "    # Step 1: Extract features in the right order\n",
    "    features = np.array([customer_data[f] for f in feature_names]).reshape(1, -1)\n",
    "    \n",
    "    # Step 2: Standardize using the SAME scaler from training\n",
    "    features_scaled = scaler.transform(features)\n",
    "    \n",
    "    # Step 3: Find nearest centroid\n",
    "    cluster = kmeans_model.predict(features_scaled)[0]\n",
    "    \n",
    "    # Step 4: Get distances to ALL centroids (for confidence)\n",
    "    distances = np.sqrt(((features_scaled - kmeans_model.cluster_centers_) ** 2).sum(axis=1))\n",
    "    \n",
    "    return cluster, distances\n",
    "\n",
    "# Example: A new customer who looks like a Sneakerhead\n",
    "new_customer = {\n",
    "    'annual_spend': 1200,\n",
    "    'purchase_frequency': 8,\n",
    "    'avg_order_value': 150,\n",
    "    'app_sessions_monthly': 25,\n",
    "    'pct_running': 0.05,\n",
    "    'pct_training': 0.10,\n",
    "    'pct_lifestyle': 0.85,\n",
    "    'return_rate': 0.10,\n",
    "    'days_since_last_purchase': 15,\n",
    "    'channel_online_pct': 0.78,\n",
    "}\n",
    "\n",
    "cluster_id, distances = score_new_customer(new_customer, scaler, kmeans, clustering_features)\n",
    "\n",
    "print(\"NEW CUSTOMER SCORING RESULT\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Assigned to: Cluster {cluster_id}\")\n",
    "print(f\"\\nDistances to each centroid:\")\n",
    "for i, d in enumerate(distances):\n",
    "    marker = \" ◄ NEAREST\" if i == cluster_id else \"\"\n",
    "    print(f\"  Cluster {i}: {d:.3f}{marker}\")\n",
    "print(f\"\\nConfidence: The nearest cluster is {distances[cluster_id]:.1f} units away.\")\n",
    "print(f\"The next-nearest is {sorted(distances)[1]:.1f} units away.\")\n",
    "print(f\"Gap: {sorted(distances)[1] - distances[cluster_id]:.1f} units — {'strong' if sorted(distances)[1] - distances[cluster_id] > 1 else 'weak'} assignment\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "568e0c5b",
   "metadata": {},
   "source": [
    "## Part 5: Monitoring & Drift\n",
    "\n",
    "Once deployed, your clustering model isn't \"done.\" Over time:\n",
    "- **Customer behavior changes** — seasonal shifts, new product launches, economic changes\n",
    "- **New customer types emerge** — segments that didn't exist when you built the model\n",
    "- **Centroids drift** — the \"average\" customer in each segment shifts\n",
    "\n",
    "### What to Monitor\n",
    "1. **Cluster size stability** — if one segment suddenly doubles, something changed\n",
    "2. **Average distance to centroid** — if customers are getting farther from their centroids, the model is losing relevance\n",
    "3. **New customer assignment rates** — are new customers overwhelmingly landing in one segment?\n",
    "\n",
    "### When to Retrain\n",
    "- Re-run clustering quarterly on updated data\n",
    "- Compare new centroids to old ones\n",
    "- If centroids shift by more than 1 standard deviation, rebuild and relabel\n",
    "\n",
    "---\n",
    "### Key Takeaways\n",
    "\n",
    "| Supervised (ShieldScore) | Unsupervised (This Demo) |\n",
    "|-------------------------|-------------------------|\n",
    "| Has labeled training data | No labels — discovery mode |\n",
    "| Model outputs a prediction | Model outputs a segment assignment |\n",
    "| Deploy the model directly | Deploy the **artifacts** (centroids + scaler) |\n",
    "| Evaluate with accuracy, AUC | Evaluate with silhouette, business naming test |\n",
    "| Retrain when accuracy drops | Retrain when segments drift |\n",
    "\n",
    "**The bottom line:** Clustering deployment = export centroids + scaler → calculate distances in Excel → assign nearest cluster → trigger business action. That's it. No mystery.\n"
   ]
  }
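  ,
  {
   "cell_type": "markdown",
   "id": "sk-drift-check",
   "metadata": {},
   "source": [
    "The monitoring checks described in Part 5 can be sketched as two small helpers. The function names, the made-up labels and centroids, and the assumption that row i of both centroid arrays describes the same segment are all illustrative; real monitoring would first match old and new clusters.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def segment_shares(labels, k):\n",
    "    # Fraction of customers assigned to each of the k segments\n",
    "    return np.bincount(labels, minlength=k) / len(labels)\n",
    "\n",
    "def centroid_shift(old_centroids, new_centroids):\n",
    "    # Largest per-feature move of each centroid, in scaled (std) units.\n",
    "    # Assumes row i of both arrays describes the same segment.\n",
    "    return np.abs(new_centroids - old_centroids).max(axis=1)\n",
    "\n",
    "# Made-up numbers for illustration\n",
    "old_labels = np.array([0] * 40 + [1] * 30 + [2] * 20 + [3] * 10)\n",
    "new_labels = np.array([0] * 20 + [1] * 30 + [2] * 20 + [3] * 30)\n",
    "old_c = np.zeros((4, 3))\n",
    "new_c = old_c.copy()\n",
    "new_c[3, 1] = 1.2            # segment 3 moved 1.2 std on one feature\n",
    "\n",
    "share_change = segment_shares(new_labels, 4) - segment_shares(old_labels, 4)\n",
    "shift = centroid_shift(old_c, new_c)\n",
    "needs_review = shift > 1.0   # the 1-std cutoff from the retraining rule above\n",
    "\n",
    "print('share change:', np.round(share_change, 2))\n",
    "print('max shift   :', shift)\n",
    "print('needs review:', needs_review)\n",
    "```\n",
    "\n",
    "In this toy case segment 0 loses 20 points of share, segment 3 gains 20, and only segment 3 crosses the shift cutoff, so it is the one to inspect before deciding whether to retrain."
   ]
  }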
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}