{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83c\udfe6 Bank Customer Segmentation with Cluster Analysis\n", "## A Complete Walkthrough: From Raw Data to Business Recommendations\n", "\n", "**Course:** Predictive Analytics & Data Mining \n", "**Demo Type:** In-Class Guided Walkthrough \n", "**Dataset:** 800 retail banking customers with demographics, account behavior, and engagement metrics\n", "\n", "---\n", "\n", "### What We'll Do Today\n", "\n", "In this demo, we'll play the role of a data analytics team at a regional bank. Leadership wants to **stop treating all 800 customers the same** and start offering personalized services. Our job is to find natural customer segments using clustering\u2014no labels, no target variable, just patterns in the data.\n", "\n", "### The Business Question\n", "> *\"Who are our customers, really? And how should we serve each group differently?\"*\n", "\n", "### Our Workflow\n", "1. **Load & Inspect** the data \u2014 understand what we're working with\n", "2. **Clean & Prepare** \u2014 handle missing values, check distributions\n", "3. **Explore** \u2014 look for patterns before we model\n", "4. **Engineer Features & Scale** \u2014 get the data ready for clustering\n", "5. **Cluster** \u2014 K-Means with elbow method and silhouette analysis\n", "6. **Profile & Interpret** \u2014 who's in each cluster?\n", "7. **Recommend** \u2014 what should the bank actually *do* with these segments?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 1. Setup & Data Loading\n", "\n", "First, let's import our libraries and load the dataset." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Core libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Visualization\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Clustering & preprocessing\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.cluster import KMeans\n", "from sklearn.metrics import silhouette_score, silhouette_samples\n", "\n", "# Suppress warnings for clean output\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Plot settings\n", "plt.rcParams['figure.figsize'] = (10, 6)\n", "plt.rcParams['font.size'] = 12\n", "sns.set_style(\"whitegrid\")\n", "print(\"\u2705 All libraries loaded successfully!\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Load the dataset\n", "df = pd.read_excel(\"bank_customers.xlsx\", sheet_name=\"Customer_Data\")\n", "print(f\"Dataset shape: {df.shape[0]} customers \u00d7 {df.shape[1]} variables\")\n", "df.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \ud83d\udd0d Quick Orientation: What's in This Data?\n", "\n", "Let's check the data dictionary to understand what each column means and *why* it's relevant for segmentation." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Load the data dictionary\n", "dd = pd.read_excel(\"bank_customers.xlsx\", sheet_name=\"Data_Dictionary\")\n", "dd.style.set_properties(**{'text-align': 'left'}).set_table_styles(\n", " [{'selector': 'th', 'props': [('text-align', 'center')]}]\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2. Data Inspection & Cleaning\n", "\n", "Before we cluster anything, we need to understand the shape and quality of our data. 
Think of this as the \"trust but verify\" step \u2014 we need to know what's clean, what's messy, and what might trip us up later.\n", "\n", "### Why This Matters\n", "Garbage in, garbage out. If we feed the clustering algorithm data with missing values, wildly different scales, or incorrect types, the resulting segments won't be meaningful. In the real world, **data preparation consumes roughly 80% of an analytics project's time.**" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Data types and non-null counts\n", "print(\"=\" * 60)\n", "print(\"DATA TYPES & MISSING VALUES\")\n", "print(\"=\" * 60)\n", "df.info()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Check for missing values\n", "missing = df.isnull().sum()\n", "missing_pct = (missing / len(df) * 100).round(2)\n", "missing_report = pd.DataFrame({\n", " 'Missing Count': missing[missing > 0],\n", " 'Missing %': missing_pct[missing_pct > 0]\n", "}).sort_values('Missing Count', ascending=False)\n", "\n", "print(\"\\nVariables with Missing Values:\")\n", "print(missing_report)\n", "print(f\"\\nTotal rows: {len(df)} | Columns with missing data: {len(missing_report)}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values\n", "\n", "We have a small number of missing values in three columns. Since the missingness is very low (< 2%), we'll use **median imputation** for numeric fields. Median is more robust than mean when distributions might be skewed (which income and credit scores often are).\n", "\n", "> **\ud83d\udca1 Instructor Note:** In a production setting, you'd investigate *why* values are missing. Is it random? Or are certain customer types less likely to report income? That pattern itself could be informative." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Impute missing values with median\n", "for col in ['Annual_Income', 'Credit_Score', 'Satisfaction_Score']:\n", " median_val = df[col].median()\n", " n_missing = df[col].isnull().sum()\n", " df[col] = df[col].fillna(median_val)\n", " print(f\" {col}: filled {n_missing} missing values with median = {median_val}\")\n", "\n", "print(f\"\\n\u2705 Remaining missing values: {df.isnull().sum().sum()}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Summary statistics for numeric columns\n", "numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()\n", "numeric_cols.remove('Has_Credit_Card') # Binary flags - we'll look at these separately\n", "numeric_cols.remove('Has_Mortgage')\n", "numeric_cols.remove('Has_Personal_Loan')\n", "numeric_cols.remove('Has_Auto_Loan')\n", "numeric_cols.remove('Has_Investment_Account')\n", "\n", "print(\"DESCRIPTIVE STATISTICS (Continuous Variables)\")\n", "print(\"=\" * 60)\n", "df[numeric_cols].describe().round(2).T" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What to Notice in the Summary Stats\n", "\n", "Look at the **ranges**:\n", "- `Annual_Income` ranges from ~$28K to ~$350K \u2014 that's a 12x difference\n", "- `Account_Balance` ranges from ~$500 to ~$250K \u2014 a 500x difference!\n", "- `Mobile_Sessions_Month` ranges from 0 to 55\n", "\n", "**This is exactly why we need to scale our data before clustering.** Without scaling, high-magnitude features like income and balance would completely dominate the distance calculations, and low-magnitude features like overdraft count would be invisible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 3. Exploratory Data Analysis (EDA)\n", "\n", "Before we run any algorithm, let's *look* at the data. 
Good EDA helps us:\n", "- Spot potential clusters visually\n", "- Identify variables that might be most useful for segmentation\n", "- Catch outliers or oddities that could distort our results" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Distribution of key numeric variables\n", "fig, axes = plt.subplots(2, 3, figsize=(16, 10))\n", "fig.suptitle(\"Distribution of Key Customer Metrics\", fontsize=16, fontweight='bold', y=1.02)\n", "\n", "plot_vars = ['Age', 'Annual_Income', 'Account_Balance', \n", " 'Tenure_Years', 'Monthly_Transactions', 'Credit_Score']\n", "colors = ['#2A9D8F', '#E76F51', '#264653', '#E9C46A', '#F4A261', '#1B3A5C']\n", "\n", "for ax, var, color in zip(axes.flat, plot_vars, colors):\n", " ax.hist(df[var], bins=30, color=color, edgecolor='white', alpha=0.85)\n", " ax.set_title(var, fontsize=13, fontweight='bold')\n", " ax.axvline(df[var].median(), color='red', linestyle='--', alpha=0.7, label=f'Median: {df[var].median():.0f}')\n", " ax.legend(fontsize=9)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \ud83d\udd0d What Do You See?\n", "\n", "**Discussion questions for the class:**\n", "- Is `Annual_Income` normally distributed? What shape would you call it?\n", "- Look at `Account_Balance` \u2014 see that right tail? What does that tell us about our customer base?\n", "- `Tenure_Years` has an interesting shape. What might the two \"humps\" suggest?" 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Categorical variable distributions\n", "fig, axes = plt.subplots(1, 3, figsize=(16, 5))\n", "\n", "for ax, var, title in zip(axes, ['Education', 'Marital_Status', 'Account_Type'],\n", "                                ['Education Level', 'Marital Status', 'Account Type']):\n", "    order = df[var].value_counts().index\n", "    sns.countplot(data=df, y=var, order=order, ax=ax,\n", "                  palette='viridis', edgecolor='white')\n", "    ax.set_title(title, fontsize=13, fontweight='bold')\n", "    ax.set_ylabel('')\n", "    # Add count labels\n", "    for container in ax.containers:\n", "        ax.bar_label(container, padding=3, fontsize=10)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Digital vs. Traditional engagement: Are there natural groups?\n", "fig, ax = plt.subplots(figsize=(10, 7))\n", "\n", "scatter = ax.scatter(df['Mobile_Sessions_Month'], df['Branch_Visits_Year'],\n", "                     c=df['Age'], cmap='RdYlBu_r', alpha=0.6, edgecolors='gray', linewidth=0.3, s=50)\n", "\n", "ax.set_xlabel('Mobile App Sessions per Month', fontsize=13)\n", "ax.set_ylabel('Branch Visits per Year', fontsize=13)\n", "ax.set_title('Digital vs. Traditional Engagement\\n(colored by Age)', fontsize=15, fontweight='bold')\n", "\n", "cbar = plt.colorbar(scatter, ax=ax)\n", "cbar.set_label('Customer Age', fontsize=12)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \ud83d\udca1 Key Insight from This Scatter Plot\n", "\n", "Mobile sessions are on the x-axis and branch visits on the y-axis, so notice the pattern: **younger customers cluster in the lower-right** (high mobile, low branch) while **older customers cluster in the upper-left** (low mobile, high branch). 
This suggests that engagement channel preference could be an important dimension for segmentation.\n", "\n", "This is a great example of why EDA matters \u2014 we can already *see* potential clusters forming before we've run any algorithm." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Correlation heatmap of numeric features\n", "fig, ax = plt.subplots(figsize=(14, 10))\n", "\n", "# Select key numeric columns for correlation\n", "corr_cols = ['Age', 'Annual_Income', 'Credit_Score', 'Tenure_Years', \n", " 'Account_Balance', 'Num_Products', 'Monthly_Transactions',\n", " 'Avg_Transaction_Amount', 'Monthly_Deposits', 'Monthly_Withdrawals',\n", " 'Net_Monthly_Flow', 'Overdraft_Count_12mo',\n", " 'Online_Logins_Month', 'Mobile_Sessions_Month', \n", " 'Branch_Visits_Year', 'Satisfaction_Score']\n", "\n", "corr_matrix = df[corr_cols].corr()\n", "\n", "mask = np.triu(np.ones_like(corr_matrix, dtype=bool))\n", "sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',\n", " center=0, vmin=-1, vmax=1, square=True, linewidths=0.5,\n", " annot_kws={'size': 8}, ax=ax)\n", "\n", "ax.set_title('Feature Correlations\\n(Look for strongly correlated pairs \u2014 we may want to drop one)', \n", " fontsize=14, fontweight='bold')\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the Correlation Heatmap\n", "\n", "**Highly correlated pairs to watch:**\n", "- `Monthly_Deposits` and `Monthly_Withdrawals` are strongly correlated with `Annual_Income` \u2014 makes sense, higher earners move more money\n", "- `Online_Logins_Month` and `Mobile_Sessions_Month` are positively correlated \u2014 digitally engaged customers use both\n", "- `Age` and `Tenure_Years` are positively correlated \u2014 older customers have been around longer\n", "\n", "**Why does this matter for clustering?** When two features are highly correlated, they essentially tell the algorithm the 
same story twice, giving that dimension extra weight. We'll address this in feature engineering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 4. Feature Engineering & Scaling\n", "\n", "Now we need to prepare the data specifically for clustering. This involves three key decisions:\n", "\n", "1. **Which features to include** (and which to exclude)\n", "2. **Whether to create new features** from existing ones\n", "3. **How to scale** the data so no single feature dominates\n", "\n", "### Feature Selection Rationale\n", "\n", "We'll exclude:\n", "- `Customer_ID` \u2014 just an identifier, not a behavioral signal\n", "- Raw categorical text fields \u2014 K-Means needs numeric inputs, so we'll cross-tabulate these against the clusters afterward instead\n", "- `Monthly_Deposits` and `Monthly_Withdrawals` separately \u2014 we already have `Net_Monthly_Flow` which captures the net effect, plus `Monthly_Transactions` and `Avg_Transaction_Amount` capture volume and size\n", "\n", "We'll create:\n", "- `Digital_Engagement` \u2014 combines online logins + mobile sessions into one digital metric\n", "- `Total_Products_Held` \u2014 sum of all product flags for a single cross-sell depth measure" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# FEATURE ENGINEERING\n", "# -----------------------------------------------------------\n", "\n", "# Create composite features\n", "df['Digital_Engagement'] = df['Online_Logins_Month'] + df['Mobile_Sessions_Month']\n", "df['Total_Products_Held'] = (df['Has_Credit_Card'] + df['Has_Mortgage'] +\n", "                             df['Has_Personal_Loan'] + df['Has_Auto_Loan'] +\n", "                             df['Has_Investment_Account'])\n", "\n", "# Verify\n", "print(\"New Features Created:\")\n", "print(f\" Digital_Engagement \u2014 range: {df['Digital_Engagement'].min()} to {df['Digital_Engagement'].max()}\")\n", "print(f\" Total_Products_Held \u2014 range: {df['Total_Products_Held'].min()} to {df['Total_Products_Held'].max()}\")" ], "execution_count": 
null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# SELECT CLUSTERING FEATURES\n", "# -----------------------------------------------------------\n", "\n", "# These are the features we'll use for clustering.\n", "# Each one captures a different dimension of customer behavior.\n", "\n", "clustering_features = [\n", " 'Age', # Lifecycle stage\n", " 'Annual_Income', # Economic capacity\n", " 'Credit_Score', # Creditworthiness\n", " 'Tenure_Years', # Loyalty / relationship depth\n", " 'Account_Balance', # Customer value\n", " 'Total_Products_Held', # Cross-sell depth\n", " 'Monthly_Transactions', # Activity level\n", " 'Avg_Transaction_Amount', # Spending power per transaction\n", " 'Net_Monthly_Flow', # Saving vs. spending behavior\n", " 'Digital_Engagement', # Channel preference (digital)\n", " 'Branch_Visits_Year', # Channel preference (traditional)\n", " 'Overdraft_Count_12mo', # Financial stress indicator\n", " 'Satisfaction_Score', # Experience quality\n", "]\n", "\n", "X = df[clustering_features].copy()\n", "print(f\"Clustering matrix: {X.shape[0]} customers \u00d7 {X.shape[1]} features\")\n", "print(f\"\\nFeatures selected:\")\n", "for i, f in enumerate(clustering_features, 1):\n", " print(f\" {i:2d}. 
{f}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why These 13 Features?\n", "\n", "Each feature captures a **distinct dimension** of customer behavior:\n", "\n", "| Dimension | Feature(s) | What It Tells Us |\n", "|-----------|-----------|-----------------|\n", "| **Demographics** | Age | Lifecycle stage and generational preferences |\n", "| **Financial Capacity** | Annual_Income, Credit_Score | Ability to use products and manage credit |\n", "| **Relationship** | Tenure_Years, Total_Products_Held | How deep the banking relationship is |\n", "| **Account Value** | Account_Balance, Net_Monthly_Flow | How much value the customer represents |\n", "| **Activity** | Monthly_Transactions, Avg_Transaction_Amount | How actively they use the bank |\n", "| **Channel** | Digital_Engagement, Branch_Visits_Year | How they prefer to interact |\n", "| **Risk/Stress** | Overdraft_Count_12mo | Signs of financial difficulty |\n", "| **Experience** | Satisfaction_Score | How happy they are |\n", "\n", "> **\ud83d\udca1 Discussion:** Why didn't we include `Gender` or `Region`? Because for customer *behavior* segmentation, we want clusters driven by what customers *do*, not who they *are*. We can always cross-tabulate demographics with clusters after the fact." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# STANDARDIZE (Z-SCORE SCALING)\n", "# -----------------------------------------------------------\n", "\n", "scaler = StandardScaler()\n", "X_scaled = scaler.fit_transform(X)\n", "\n", "# Convert back to DataFrame for readability\n", "X_scaled_df = pd.DataFrame(X_scaled, columns=clustering_features)\n", "\n", "print(\"Before scaling (first 3 rows):\")\n", "print(X.head(3).to_string())\n", "print(\"\\nAfter scaling (first 3 rows):\")\n", "print(X_scaled_df.head(3).round(3).to_string())\n", "\n", "print(\"\\n\u2705 All features now have mean \u2248 0 and std \u2248 1\")\n", "print(f\" Mean range: {X_scaled_df.mean().min():.4f} to {X_scaled_df.mean().max():.4f}\")\n", "print(f\" Std range: {X_scaled_df.std().min():.4f} to {X_scaled_df.std().max():.4f}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 5. Finding the Right Number of Clusters\n", "\n", "This is the hardest part of K-Means: **How many groups should we create?**\n", "\n", "Too few clusters \u2192 we miss important differences between customers. \n", "Too many clusters \u2192 the segments become too granular to act on.\n", "\n", "We'll use two complementary methods:\n", "1. **The Elbow Method** \u2014 looks at how much \"error\" decreases as we add clusters\n", "2. 
**Silhouette Analysis** \u2014 measures how well each point fits its assigned cluster" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# ELBOW METHOD\n", "# -----------------------------------------------------------\n", "\n", "inertias = []\n", "sil_scores = []\n", "K_range = range(2, 11)\n", "\n", "for k in K_range:\n", " km = KMeans(n_clusters=k, n_init=20, random_state=42, max_iter=300)\n", " km.fit(X_scaled)\n", " inertias.append(km.inertia_)\n", " sil_scores.append(silhouette_score(X_scaled, km.labels_))\n", " print(f\" k={k}: Inertia={km.inertia_:,.0f} | Silhouette={sil_scores[-1]:.4f}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Plot both methods side by side\n", "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))\n", "\n", "# Elbow Plot\n", "ax1.plot(K_range, inertias, 'o-', color='#1B3A5C', linewidth=2, markersize=8)\n", "ax1.set_xlabel('Number of Clusters (k)', fontsize=13)\n", "ax1.set_ylabel('Inertia (Within-Cluster Sum of Squares)', fontsize=13)\n", "ax1.set_title('Elbow Method', fontsize=15, fontweight='bold')\n", "ax1.set_xticks(list(K_range))\n", "\n", "# Highlight the elbow region\n", "ax1.axvspan(3.5, 5.5, alpha=0.15, color='#E76F51', label='Elbow region')\n", "ax1.legend(fontsize=11)\n", "\n", "# Silhouette Plot\n", "ax2.plot(K_range, sil_scores, 's-', color='#2A9D8F', linewidth=2, markersize=8)\n", "ax2.set_xlabel('Number of Clusters (k)', fontsize=13)\n", "ax2.set_ylabel('Average Silhouette Score', fontsize=13)\n", "ax2.set_title('Silhouette Analysis', fontsize=15, fontweight='bold')\n", "ax2.set_xticks(list(K_range))\n", "\n", "# Highlight best k\n", "best_k_sil = list(K_range)[np.argmax(sil_scores)]\n", "ax2.axvline(best_k_sil, color='#E76F51', linestyle='--', alpha=0.7, \n", " label=f'Best silhouette at k={best_k_sil}')\n", "ax2.legend(fontsize=11)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", 
"print(f\"\\n\ud83d\udcca Silhouette scores suggest k={best_k_sil} has the best cluster separation.\")\n", "print(\"\ud83d\udcca The elbow plot shows diminishing returns after k=4\u20135.\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to Read These Plots\n", "\n", "**Elbow Method (left):** \n", "The y-axis shows \"inertia\" \u2014 the sum of squared distances from each point to its assigned cluster center. Lower is better, but it *always* decreases as k grows. We look for the \"elbow\" where the rate of improvement slows sharply. The shaded region shows this happens around k=4 to k=5.\n", "\n", "**Silhouette Score (right):** \n", "Ranges from -1 to +1. Higher is better. As a rough rule of thumb, a score above about 0.25 suggests the clusters have at least some real structure. The peak tells us where clusters are most distinct.\n", "\n", "> **\ud83d\udca1 The Real Answer:** There's no single \"correct\" k. We balance statistical evidence with business practicality. Even if k=3 has the best silhouette, k=5 might be more *actionable* if it reveals meaningful sub-groups. **The evidence here points to the k=4 to k=5 range.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 6. Fitting the K-Means Model\n", "\n", "Based on our analysis, we'll go with **k=5** clusters. Here's why:\n", "\n", "- The elbow suggests k=4\u20135 is the sweet spot\n", "- With 800 customers, 5 clusters give us ~160 customers per segment on average \u2014 large enough for meaningful analysis\n", "- From a business perspective, 5 segments is manageable for a marketing team to design distinct strategies\n", "\n", "We'll run the algorithm with `n_init=25` (25 different random starting positions) to reduce the risk of getting stuck in a local optimum." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# FIT K-MEANS WITH k=5\n", "# -----------------------------------------------------------\n", "\n", "kmeans = KMeans(n_clusters=5, n_init=25, random_state=42, max_iter=300)\n", "df['Cluster'] = kmeans.fit_predict(X_scaled)\n", "\n", "# Check cluster sizes\n", "cluster_sizes = df['Cluster'].value_counts().sort_index()\n", "print(\"CLUSTER SIZES\")\n", "print(\"=\" * 40)\n", "for cluster_id, count in cluster_sizes.items():\n", " pct = count / len(df) * 100\n", " print(f\" Cluster {cluster_id}: {count:>4} customers ({pct:.1f}%)\")\n", "\n", "print(f\"\\nSilhouette Score: {silhouette_score(X_scaled, df['Cluster']):.4f}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Silhouette plot by cluster \u2014 are any clusters poorly defined?\n", "fig, ax = plt.subplots(figsize=(10, 7))\n", "\n", "sample_silhouette_values = silhouette_samples(X_scaled, df['Cluster'])\n", "y_lower = 10\n", "\n", "colors = ['#264653', '#2A9D8F', '#E9C46A', '#F4A261', '#E76F51']\n", "\n", "for i in range(5):\n", " cluster_sil = sample_silhouette_values[df['Cluster'] == i]\n", " cluster_sil.sort()\n", " \n", " y_upper = y_lower + len(cluster_sil)\n", " ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,\n", " facecolor=colors[i], edgecolor=colors[i], alpha=0.7)\n", " ax.text(-0.05, y_lower + 0.5 * len(cluster_sil), str(i), fontsize=14, fontweight='bold')\n", " y_lower = y_upper + 10\n", "\n", "avg_sil = silhouette_score(X_scaled, df['Cluster'])\n", "ax.axvline(x=avg_sil, color=\"red\", linestyle=\"--\", linewidth=2, label=f'Average: {avg_sil:.3f}')\n", "\n", "ax.set_xlabel('Silhouette Coefficient', fontsize=13)\n", "ax.set_ylabel('Customers (sorted by cluster)', fontsize=13)\n", "ax.set_title('Silhouette Plot by Cluster\\n(Wider = better fit, no points below 0 = good)', \n", " fontsize=14, fontweight='bold')\n", 
"ax.legend(fontsize=12)\n", "ax.set_yticks([])\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the Silhouette Plot\n", "\n", "Each colored \"blade\" represents one cluster. The width of each blade shows how well individual customers fit their assigned cluster:\n", "\n", "- **Wide blades** = customers are well-matched to their cluster\n", "- **Points near 0** = customers are on the boundary between clusters\n", "- **Points below 0** = customers might be in the wrong cluster\n", "\n", "The red dashed line shows the overall average. Clusters that extend well past this line are especially well-defined." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 7. Cluster Profiling & Interpretation\n", "\n", "This is where clustering goes from math to **business insight**. We need to answer: *Who is in each cluster, and what makes them different?*\n", "\n", "We'll create profiles by comparing the average values of each feature across clusters." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# CLUSTER PROFILES: Mean values for each feature\n", "# -----------------------------------------------------------\n", "\n", "profile = df.groupby('Cluster')[clustering_features].mean().round(1)\n", "profile.index.name = 'Cluster'\n", "\n", "# Add cluster size\n", "profile.insert(0, 'n_customers', df['Cluster'].value_counts().sort_index())\n", "\n", "# Display\n", "print(\"CLUSTER PROFILES (Mean Values)\")\n", "print(\"=\" * 100)\n", "profile.T" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# VISUAL PROFILES: One radar (polar) chart per cluster\n", "# -----------------------------------------------------------\n", "\n", "# Normalize cluster means to 0-1 scale for visual comparison\n", "profile_norm = profile[clustering_features].copy()\n", "for col in clustering_features:\n", "    col_min = profile_norm[col].min()\n", "    col_max = profile_norm[col].max()\n", "    if col_max > col_min:\n", "        profile_norm[col] = (profile_norm[col] - col_min) / (col_max - col_min)\n", "    else:\n", "        # All clusters share the same mean, so place the feature mid-scale\n", "        profile_norm[col] = 0.5\n", "\n", "fig, axes = plt.subplots(1, 5, figsize=(22, 8), subplot_kw=dict(polar=True))\n", "fig.suptitle(\"Cluster Profiles: Radar Comparison\\n(each spoke = one feature, scaled 0\u20131)\",\n", "             fontsize=16, fontweight='bold', y=1.05)\n", "\n", "# Short labels for readability\n", "short_labels = ['Age', 'Income', 'Credit', 'Tenure', 'Balance',\n", "                'Products', 'Txns', 'Avg Txn$', 'NetFlow',\n", "                'Digital', 'Branch', 'Overdraft', 'Satisf']\n", "\n", "angles = np.linspace(0, 2 * np.pi, len(clustering_features), endpoint=False).tolist()\n", "angles += angles[:1]\n", "\n", "colors = ['#264653', '#2A9D8F', '#E9C46A', '#F4A261', '#E76F51']\n", "\n", "for idx, (ax, color) in enumerate(zip(axes.flat, colors)):\n", "    values = profile_norm.iloc[idx].tolist()\n", "    values += values[:1]  # repeat the first value to close the polygon\n", "\n", "    ax.plot(angles, values, 'o-', linewidth=2, color=color)\n", "    ax.fill(angles, values, alpha=0.25, color=color)\n", "    ax.set_xticks(angles[:-1])\n", "    ax.set_xticklabels(short_labels, fontsize=7)\n", "    ax.set_ylim(0, 1)\n", "    ax.set_title(f\"Cluster {idx}\\n(n={profile.iloc[idx]['n_customers']:.0f})\",\n", "                 fontsize=12, fontweight='bold', pad=20)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# HEATMAP: Standardized cluster differences\n", "# -----------------------------------------------------------\n", "\n", "# Z-score each feature's cluster means to highlight where clusters DIFFER most\n", "profile_z = profile[clustering_features].copy()\n", "for col in clustering_features:\n", "    col_mean = profile_z[col].mean()\n", "    col_std = profile_z[col].std()\n", "    if col_std > 0:\n", "        profile_z[col] = (profile_z[col] - col_mean) / col_std\n", "    else:\n", "        profile_z[col] = 0\n", "\n", "fig, ax = plt.subplots(figsize=(16, 6))\n", "sns.heatmap(profile_z.T, annot=True, fmt='.2f', cmap='RdYlGn', center=0,\n", "            linewidths=0.5, ax=ax, cbar_kws={'label': '\u2190 Below Average | Above Average \u2192'})\n", "\n", "ax.set_title(\"How Each Cluster Differs from the Overall Average\\n(Green = higher than avg, Red = lower than avg)\",\n", "             fontsize=14, fontweight='bold')\n", "ax.set_xlabel('Cluster', fontsize=13)\n", "ax.set_ylabel('')\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \ud83d\udd0d Reading the Heatmap\n", "\n", "This is the single most important chart in the analysis. 
Each cell shows **how far above or below average** each cluster is on each feature:\n", "\n", "- **Dark green** = this cluster scores much higher than others\n", "- **Dark red** = this cluster scores much lower than others\n", "- **White/yellow** = about average\n", "\n", "Look for **rows with strong contrast** \u2014 those are the features that do the most to differentiate the clusters. Look for **columns with clear color patterns** \u2014 those are the most distinctive segments." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# DEMOGRAPHIC BREAKDOWN BY CLUSTER\n", "# -----------------------------------------------------------\n", "\n", "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n", "\n", "for ax, var, title in zip(axes, ['Education', 'Marital_Status', 'Account_Type'],\n", " ['Education Level', 'Marital Status', 'Account Type']):\n", " ct = pd.crosstab(df['Cluster'], df[var], normalize='index') * 100\n", " ct.plot(kind='bar', stacked=True, ax=ax, colormap='Set2', edgecolor='white', linewidth=0.5)\n", " ax.set_title(f'{title} by Cluster', fontsize=13, fontweight='bold')\n", " ax.set_xlabel('Cluster')\n", " ax.set_ylabel('Percentage')\n", " ax.legend(fontsize=8, title=var, bbox_to_anchor=(1.0, 1.0))\n", " ax.set_xticklabels(ax.get_xticklabels(), rotation=0)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# NAME THE CLUSTERS \u2014 make them business-friendly\n", "# -----------------------------------------------------------\n", "\n", "# Based on the profiles above, let's give each cluster a descriptive name.\n", "# (In class, we'd discuss and debate these names together!)\n", "\n", "# First, let's print a clean summary to guide naming\n", "print(\"CLUSTER NAMING GUIDE\")\n", "print(\"=\" * 80)\n", "for c in range(5):\n", " 
cluster_data = df[df['Cluster'] == c]\n", " print(f\"\\nCluster {c} (n={len(cluster_data)}):\")\n", " print(f\" Avg Age: {cluster_data['Age'].mean():.0f} | \"\n", " f\"Avg Income: ${cluster_data['Annual_Income'].mean():,.0f} | \"\n", " f\"Avg Balance: ${cluster_data['Account_Balance'].mean():,.0f}\")\n", " print(f\" Avg Tenure: {cluster_data['Tenure_Years'].mean():.1f} yrs | \"\n", " f\"Avg Products: {cluster_data['Total_Products_Held'].mean():.1f} | \"\n", " f\"Avg Credit: {cluster_data['Credit_Score'].mean():.0f}\")\n", " print(f\" Digital Engagement: {cluster_data['Digital_Engagement'].mean():.0f} | \"\n", " f\"Branch Visits: {cluster_data['Branch_Visits_Year'].mean():.0f} | \"\n", " f\"Satisfaction: {cluster_data['Satisfaction_Score'].mean():.1f}\")\n", " print(f\" Overdrafts: {cluster_data['Overdraft_Count_12mo'].mean():.1f} | \"\n", " f\"Net Flow: ${cluster_data['Net_Monthly_Flow'].mean():,.0f}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# ASSIGN BUSINESS-FRIENDLY SEGMENT NAMES\n", "# -----------------------------------------------------------\n", "\n", "# Sort clusters by average income to create logical ordering\n", "cluster_income = df.groupby('Cluster')['Annual_Income'].mean().sort_values()\n", "print(\"Clusters ranked by average income:\")\n", "for c, inc in cluster_income.items():\n", " age = df[df['Cluster']==c]['Age'].mean()\n", " balance = df[df['Cluster']==c]['Account_Balance'].mean()\n", " digital = df[df['Cluster']==c]['Digital_Engagement'].mean()\n", " tenure = df[df['Cluster']==c]['Tenure_Years'].mean()\n", " print(f\" Cluster {c}: Income=${inc:>10,.0f} | Age={age:.0f} | \"\n", " f\"Balance=${balance:>10,.0f} | Digital={digital:.0f} | Tenure={tenure:.1f}yr\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \ud83c\udff7\ufe0f Naming the Clusters\n", "\n", "**This is a 
team exercise in class.** Based on the profiles above, work with your team to give each cluster a name that a bank executive would immediately understand. Good segment names are:\n", "\n", "- **Descriptive** \u2014 capture the key defining characteristic\n", "- **Actionable** \u2014 suggest what the bank should do differently\n", "- **Memorable** \u2014 a VP should be able to remember them in a meeting\n", "\n", "Examples of good names: *\"Digital-First Starters,\" \"High-Value Loyalists,\" \"Budget-Conscious Families\"*  \n", "Examples of bad names: *\"Cluster 3,\" \"Group B,\" \"Low-Income Segment\"*\n", "\n", "Below, I'll assign names based on the dominant patterns in each cluster's profile. Your names may differ \u2014 and that's fine! The important thing is that each name captures the story the data tells." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a mapping \u2014 adjust based on your actual cluster profiles\n", "# The rules below are hand-picked thresholds applied to each cluster's averages.\n", "# The first matching rule wins, so check the output for duplicate or fallback names.\n", "\n", "cluster_summary = df.groupby('Cluster').agg({\n", "    'Age': 'mean',\n", "    'Annual_Income': 'mean',\n", "    'Account_Balance': 'mean',\n", "    'Tenure_Years': 'mean',\n", "    'Digital_Engagement': 'mean',\n", "    'Branch_Visits_Year': 'mean',\n", "    'Total_Products_Held': 'mean',\n", "    'Overdraft_Count_12mo': 'mean',\n", "    'Credit_Score': 'mean',\n", "    'Satisfaction_Score': 'mean',\n", "    'Net_Monthly_Flow': 'mean',\n", "}).round(1)\n", "\n", "# Rule-based naming logic (loops over the clusters actually present, not a fixed count)\n", "segment_names = {}\n", "for c in cluster_summary.index:\n", "    row = cluster_summary.loc[c]\n", "\n", "    if row['Annual_Income'] > 120000 and row['Account_Balance'] > 60000:\n", "        segment_names[c] = \"\ud83d\udc8e Affluent Established\"\n", "    elif row['Age'] > 55 and row['Tenure_Years'] > 15:\n", "        segment_names[c] = \"\ud83c\udfe1 Mature Loyalists\"\n", "    elif row['Digital_Engagement'] > 40 and row['Age'] < 38:\n", "        segment_names[c] = \"\ud83d\udcf1 Digital-First Achievers\"\n", "    elif row['Overdraft_Count_12mo'] > 1.0 or (row['Net_Monthly_Flow'] < 300 and row['Annual_Income'] < 65000):\n", "        segment_names[c] = \"\ud83d\udcb0 Budget-Conscious Families\"\n", "    elif row['Age'] < 32 and row['Tenure_Years'] < 4:\n", "        segment_names[c] = \"\ud83d\ude80 Young Professionals\"\n", "    else:\n", "        segment_names[c] = f\"\ud83d\udcca Segment {c}\"\n", "\n", "# Apply names\n", "df['Segment_Name'] = df['Cluster'].map(segment_names)\n", "\n", "print(\"\\nFINAL SEGMENT NAMES\")\n", "print(\"=\" * 50)\n", "for c, name in sorted(segment_names.items()):\n", "    count = (df['Cluster'] == c).sum()\n", "    pct = count / len(df) * 100\n", "    print(f\" Cluster {c} \u2192 {name} ({count} customers, {pct:.1f}%)\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 8. Business Recommendations\n", "\n", "Now comes the most important part \u2014 the part that gets your analysis **deployed**. We need to translate each segment into a concrete action plan for the bank.\n", "\n", "For each segment, we'll answer:\n", "1. **Who are they?** (1-sentence description)\n", "2. **What do they value?** (channel, products, relationship style)\n", "3. **What should the bank do?** (specific marketing/product/service actions)\n", "4. 
**What's the risk if we don't act?** (churn risk, revenue at stake)" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# BUSINESS RECOMMENDATIONS SUMMARY\n", "# -----------------------------------------------------------\n", "\n", "print(\"=\" * 80)\n", "print(\"BANK CUSTOMER SEGMENTATION: STRATEGIC RECOMMENDATIONS\")\n", "print(\"=\" * 80)\n", "\n", "for c in sorted(segment_names.keys()):\n", " name = segment_names[c]\n", " seg = df[df['Cluster'] == c]\n", " \n", " print(f\"\\n{'\u2500' * 80}\")\n", " print(f\" {name}\")\n", " print(f\" {len(seg)} customers ({len(seg)/len(df)*100:.1f}% of base)\")\n", " print(f\"{'\u2500' * 80}\")\n", " \n", " print(f\" Profile:\")\n", " print(f\" Age: {seg['Age'].mean():.0f} avg | \"\n", " f\"Income: ${seg['Annual_Income'].mean():,.0f} | \"\n", " f\"Balance: ${seg['Account_Balance'].mean():,.0f}\")\n", " print(f\" Products: {seg['Total_Products_Held'].mean():.1f} avg | \"\n", " f\"Tenure: {seg['Tenure_Years'].mean():.1f} yrs | \"\n", " f\"Satisfaction: {seg['Satisfaction_Score'].mean():.1f}/10\")\n", " print(f\" Digital: {seg['Digital_Engagement'].mean():.0f}/mo | \"\n", " f\"Branch: {seg['Branch_Visits_Year'].mean():.0f}/yr | \"\n", " f\"Overdrafts: {seg['Overdraft_Count_12mo'].mean():.1f}/yr\")\n", " \n", " total_balance = seg['Account_Balance'].sum()\n", " print(f\"\\n Total Segment Value: ${total_balance:,.0f} in balances\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# DETAILED RECOMMENDATIONS TABLE\n", "# -----------------------------------------------------------\n", "\n", "recommendations = {\n", " \"\ud83d\udc8e Affluent Established\": {\n", " \"Who\": \"High-income, long-tenure customers with large balances and multiple products.\",\n", " \"Priority\": \"RETAIN \u2014 they are your most profitable customers\",\n", " 
\"Products\": \"Wealth management, premium credit cards, estate planning, exclusive investment options\",\n", " \"Channel\": \"Dedicated relationship manager; proactive outreach, not reactive service\",\n", " \"Risk\": \"Competitors actively target this segment. Losing one customer = losing $80K+ in deposits\",\n", " \"Quick Win\": \"Launch a 'Private Banking' tier with dedicated advisors and fee waivers\",\n", " },\n", " \"\ud83c\udfe1 Mature Loyalists\": {\n", " \"Who\": \"Older, long-tenured customers who prefer branches and have strong credit.\",\n", " \"Priority\": \"PROTECT \u2014 they are loyal but aging; plan for wealth transfer and retirement needs\",\n", " \"Products\": \"Retirement planning, annuities, trust services, simplified digital tools\",\n", " \"Channel\": \"Branch-first with gentle digital onboarding; never force them online-only\",\n", " \"Risk\": \"They'll leave if branches close. Their children may leave if the bank feels outdated\",\n", " \"Quick Win\": \"Offer 'Digital Companion' sessions at branches to build mobile comfort\",\n", " },\n", " \"\ud83d\udcf1 Digital-First Achievers\": {\n", " \"Who\": \"Younger, tech-savvy customers with rising incomes and high digital engagement.\",\n", " \"Priority\": \"GROW \u2014 invest in this segment now; they are tomorrow's affluent customers\",\n", " \"Products\": \"Robo-advisory, budgeting tools, early mortgage pre-approval, credit building\",\n", " \"Channel\": \"App-first everything; in-app chat support, push notifications, zero branch friction\",\n", " \"Risk\": \"They'll switch to a fintech in a heartbeat if the app experience falls behind\",\n", " \"Quick Win\": \"Build a 'Financial Milestone' feature in the app (track savings goals, credit growth)\",\n", " },\n", " \"\ud83d\udcb0 Budget-Conscious Families\": {\n", " \"Who\": \"Mid-career customers with moderate incomes, multiple loans, and occasional overdrafts.\",\n", " \"Priority\": \"SUPPORT \u2014 help them stabilize; reduce their pain 
points and they become loyal\",\n", " \"Products\": \"Overdraft protection, automated savings, debt consolidation, family accounts\",\n", " \"Channel\": \"Proactive alerts (low balance, upcoming bills); financial education content\",\n", " \"Risk\": \"Highest churn risk due to fee sensitivity and financial stress\",\n", " \"Quick Win\": \"Introduce fee-free overdraft buffer ($100 cushion) to reduce frustration\",\n", " },\n", " \"\ud83d\ude80 Young Professionals\": {\n", " \"Who\": \"Early-career customers with modest balances but high digital adoption.\",\n", " \"Priority\": \"ACQUIRE & DEEPEN \u2014 get them using more products before a competitor does\",\n", " \"Products\": \"First credit card, starter investment account, student loan refinancing\",\n", " \"Channel\": \"Social media, referral programs, app-based rewards for product adoption\",\n", " \"Risk\": \"Low balances = low switching cost. They'll leave for a better sign-up bonus\",\n", " \"Quick Win\": \"Launch 'First 5 Products' rewards \u2014 bonus for each new product activated\",\n", " },\n", "}\n", "\n", "# Print recommendations only for segment names that were actually assigned\n", "for name, rec in recommendations.items():\n", "    if name in segment_names.values():\n", "        print(f\"\\n{'\u2501' * 70}\")\n", "        print(f\" {name}\")\n", "        print(f\"{'\u2501' * 70}\")\n", "        for key, value in rec.items():\n", "            print(f\" {key:>10}: {value}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# -----------------------------------------------------------\n", "# FINAL VISUALIZATION: Segment value vs. 
engagement\n", "# -----------------------------------------------------------\n", "\n", "fig, ax = plt.subplots(figsize=(12, 8))\n", "\n", "colors_map = {\n", "    0: '#264653', 1: '#2A9D8F', 2: '#E9C46A', 3: '#F4A261', 4: '#E76F51'\n", "}\n", "\n", "# Loop over the clusters that were named instead of assuming exactly five;\n", "# any cluster without a preset color falls back to a neutral gray\n", "for c in sorted(segment_names):\n", "    seg = df[df['Cluster'] == c]\n", "    ax.scatter(seg['Digital_Engagement'], seg['Account_Balance'],\n", "               c=colors_map.get(c, '#6C757D'), alpha=0.5, s=60, edgecolors='gray', linewidth=0.3,\n", "               label=f\"{segment_names[c]} (n={len(seg)})\")\n", "\n", "ax.set_xlabel('Digital Engagement (Online + Mobile Sessions/Month)', fontsize=13)\n", "ax.set_ylabel('Account Balance ($)', fontsize=13)\n", "ax.set_title('Customer Segments: Digital Engagement vs. Account Value\\n'\n", "             'Each dot = one customer, colored by segment', fontsize=14, fontweight='bold')\n", "ax.legend(fontsize=10, loc='upper right', framealpha=0.9)\n", "ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x:,.0f}'))\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 9. Export Results & Next Steps\n", "\n", "### Saving the Segmented Dataset\n", "We'll export the data with cluster assignments so the marketing team can build campaigns." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Export segmented data\n", "output_cols = ['Customer_ID', 'Cluster', 'Segment_Name'] + clustering_features + \\\n", "              ['Gender', 'Marital_Status', 'Education', 'Region', 'Account_Type']\n", "\n", "df_export = df[output_cols].copy()\n", "df_export.to_excel(\"bank_customers_segmented.xlsx\", index=False, sheet_name=\"Segmented_Customers\")\n", "print(f\"\u2705 Exported {len(df_export)} customers with segment labels to 'bank_customers_segmented.xlsx'\")\n", "print(\"\\nSegment distribution:\")\n", "print(df_export['Segment_Name'].value_counts().to_string())" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What Would Come Next in a Real Project?\n", "\n", "1. **Stakeholder Presentation** \u2014 Present these segments to marketing, product, and branch teams. Use the heatmap and radar charts to make the story visual.\n", "\n", "2. **A/B Testing** \u2014 Before rolling out different strategies for each segment, test them:\n", "   - Does the \"fee-free overdraft buffer\" actually reduce churn for Budget-Conscious Families?\n", "   - Does the \"Private Banking\" tier increase retention for Affluent Established?\n", "\n", "3. **Monitoring & Refresh** \u2014 Customers change over time. A Young Professional today may become a Digital-First Achiever in 3 years. Re-run the clustering quarterly and track migration between segments.\n", "\n", "4. **Integration** \u2014 Feed segment labels into the CRM system so every customer-facing employee knows which segment they're talking to.\n", "\n", "5. 
**Try Other Algorithms** \u2014 Compare K-Means results with Hierarchical Clustering (to see how segments nest inside each other) and DBSCAN (to find any genuine outlier customers who don't fit any segment).\n", "\n", "---\n", "\n", "### \ud83d\udcdd Key Takeaways\n", "\n", "| Lesson | Why It Matters |\n", "|--------|---------------|\n", "| **Data prep is 80% of the work** | Scaling, missing values, and feature selection drive your results |\n", "| **The elbow isn't always obvious** | Use silhouette scores as a second opinion |\n", "| **Name your clusters** | Numbers mean nothing to a VP; stories drive action |\n", "| **Clustering is exploratory** | There's no \"right answer\" \u2014 the best solution is the one that gets deployed |\n", "| **The business question comes first** | Don't cluster because you can; cluster because someone needs to make a decision |" ] } ] }