{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Keep or Let Go? Predicting Customer Churn\n",
    "## Logistic Regression vs. Decision Trees — A Model Comparison Lab\n",
    "\n",
    "**Course:** GMBA 621 — Predictive Analytics & Data Mining  & FINC 332 — Dat Analytics, Data Mining, & Data Visualization\n",
    "\n",
    "**Gannon University — Dahlkemper School of Business**\n",
    "\n",
    "*By Dr. Benyawarath \"Yaa\" Nithithanatchinnapat, Feb 10, 2026*\n",
    "\n",
    "---\n",
    "\n",
    "### The Business Problem\n",
    "\n",
    "You're an analyst at a telecommunications company. Leadership is worried — customers are leaving for competitors, and every lost customer costs the company money. The marketing team wants to know:\n",
    "\n",
    "> **\"Can we predict which customers are about to leave so we can intervene before it's too late?\"**\n",
    "\n",
    "This is a **churn prediction** problem, and it's one of the most common use cases for predictive analytics across industries — telecom, banking, insurance, SaaS, you name it.\n",
    "\n",
    "### Why This Matters (The Business Case)\n",
    "\n",
    "Acquiring a new customer costs **5–7x more** than retaining an existing one. If we can identify at-risk customers early, the company can:\n",
    "\n",
    "- Offer targeted promotions or discounts\n",
    "- Proactively reach out through customer service\n",
    "- Fix service issues before the customer walks away\n",
    "\n",
    "Even a modest improvement in retention can translate to millions in saved revenue.\n",
    "\n",
    "### What You'll Learn in This Notebook\n",
    "\n",
    "We'll tackle this problem using **two different classification models** and then compare them head-to-head:\n",
    "\n",
    "1. **Logistic Regression** — A statistical model that estimates the *probability* of churn\n",
    "2. **Decision Tree** — A rule-based model that creates an *interpretable flowchart* of decisions\n",
    "3. **Model Comparison** — Which model should we actually deploy? We'll use multiple metrics to decide\n",
    "\n",
    "By the end, you'll be able to build both models, evaluate them properly, and make a recommendation to leadership about which one to use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 0: Setting Up Our Environment\n",
    "\n",
    "Before we dive in, let's load the Python libraries we'll need. Think of these as tools in a toolbox — each one serves a purpose:\n",
    "\n",
    "| Library | What It Does |\n",
    "|---|---|\n",
    "| `pandas` | Works with data tables (like Excel for Python) |\n",
    "| `numpy` | Handles math and numerical operations |\n",
    "| `matplotlib` / `seaborn` | Creates charts and visualizations |\n",
    "| `scikit-learn` | Our machine learning toolkit — models, metrics, everything |\n",
    "| `openpyxl` | Lets Python read Excel files (.xlsx) |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Core data libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Visualization\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "\n",
    "# Modeling\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier, plot_tree\n",
    "\n",
    "# Preprocessing & splitting\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Evaluation metrics\n",
    "from sklearn.metrics import (\n",
    "    accuracy_score, \n",
    "    confusion_matrix, \n",
    "    classification_report, \n",
    "    roc_auc_score, \n",
    "    roc_curve, \n",
    "    log_loss,\n",
    "    ConfusionMatrixDisplay\n",
    ")\n",
    "\n",
    "# Suppress warnings for cleaner output\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Set visual style\n",
    "sns.set_style('whitegrid')\n",
    "plt.rcParams['figure.figsize'] = (8, 5)\n",
    "\n",
    "print('All libraries loaded successfully!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 1: Understanding the Data\n",
    "\n",
    "### Loading the Dataset from Excel\n",
    "\n",
    "Our dataset is stored in an Excel file called **`TelcoChurn_Dataset.xlsx`**. This is what you'll typically encounter in the workplace — data comes in spreadsheets, CSVs, databases, not from some pre-cleaned Python library.\n",
    "\n",
    "The Excel file has two sheets:\n",
    "- **Customer_Data** — the actual data (200 customer records)\n",
    "- **Data_Dictionary** — descriptions of what each column means\n",
    "\n",
    "We use `pandas.read_excel()` to load it. Notice we specify:\n",
    "- `sheet_name` — which sheet to read\n",
    "- `skiprows` — because our Excel file has a title row at the top that isn't data\n",
    "\n",
    "**Make sure the Excel file is in the same folder as this notebook!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset from Excel\n",
    "# The data starts on row 4 (after the title and subtitle), so we skip 3 rows\n",
    "churn_df = pd.read_excel(\n",
    "    'TelcoChurn_Dataset.xlsx', \n",
    "    sheet_name='Customer_Data',\n",
    "    skiprows=3\n",
    ")\n",
    "\n",
    "# First look at the data\n",
    "print(f'Dataset shape: {churn_df.shape[0]} customers, {churn_df.shape[1]} columns\\n')\n",
    "churn_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's also peek at the Data Dictionary sheet so we know what we're working with\n",
    "data_dict = pd.read_excel(\n",
    "    'TelcoChurn_Dataset.xlsx', \n",
    "    sheet_name='Data_Dictionary',\n",
    "    skiprows=2\n",
    ")\n",
    "\n",
    "print('=== DATA DICTIONARY ===')\n",
    "print('(This tells us what each column means)\\n')\n",
    "for _, row in data_dict.iterrows():\n",
    "    print(f\"  {row['Column Name']:<25} → {row['Description']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quick Data Quality Check\n",
    "\n",
    "In the real world, the first thing you do after loading data is check for problems — missing values, wrong data types, unexpected entries. Let's do a quick health check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data quality check\n",
    "print('=== DATA QUALITY CHECK ===\\n')\n",
    "print(f'Total records:  {churn_df.shape[0]}')\n",
    "print(f'Total columns:  {churn_df.shape[1]}')\n",
    "print(f'\\nMissing values per column:')\n",
    "missing = churn_df.isnull().sum()\n",
    "if missing.sum() == 0:\n",
    "    print('  None found — great, clean data!')\n",
    "else:\n",
    "    print(missing[missing > 0])\n",
    "\n",
    "print(f'\\nData types:')\n",
    "print(churn_df.dtypes.to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting Our Features\n",
    "\n",
    "The `CustomerID` column is just an identifier — it has no predictive power. (If our model used customer IDs to predict churn, that would be a red flag!) Let's drop it and make sure our target variable is properly formatted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop the ID column — it's not a feature, just a label\n",
    "churn_df = churn_df.drop(columns=['CustomerID'])\n",
    "\n",
    "# Make sure target is integer\n",
    "churn_df['Churned'] = churn_df['Churned'].astype(int)\n",
    "\n",
    "print(f'Working with {churn_df.shape[1]} columns and {churn_df.shape[0]} rows')\n",
    "print(f'\\nColumns: {list(churn_df.columns)}')\n",
    "churn_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exploratory Data Analysis (EDA)\n",
    "\n",
    "Before jumping to modeling, let's explore the data. A good analyst always looks at their data first. Here's why:\n",
    "\n",
    "- **Class balance**: Is churn roughly 50/50, or are most customers staying? This matters for how we evaluate our models.\n",
    "- **Distributions**: Are there outliers or unusual patterns?\n",
    "- **Relationships**: Do any features seem obviously connected to churn?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How balanced is our target variable?\n",
    "churn_counts = churn_df['Churned'].value_counts()\n",
    "churn_pct = churn_df['Churned'].value_counts(normalize=True) * 100\n",
    "\n",
    "print('=== Target Variable Distribution ===')\n",
    "print(f'Stayed (0):  {churn_counts[0]} customers ({churn_pct[0]:.1f}%)')\n",
    "print(f'Churned (1): {churn_counts[1]} customers ({churn_pct[1]:.1f}%)')\n",
    "print(f'\\nChurn rate: {churn_pct[1]:.1f}%')\n",
    "\n",
    "# Visualize\n",
    "fig, ax = plt.subplots(figsize=(6, 4))\n",
    "colors = ['#2ecc71', '#e74c3c']\n",
    "churn_df['Churned'].value_counts().plot(kind='bar', color=colors, edgecolor='black', ax=ax)\n",
    "ax.set_xticklabels(['Stayed (0)', 'Churned (1)'], rotation=0)\n",
    "ax.set_ylabel('Number of Customers')\n",
    "ax.set_title('Customer Churn Distribution')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Thinking Point: Class Imbalance\n",
    "\n",
    "Look at the distribution above. Is churn balanced or imbalanced?\n",
    "\n",
    "**Why this matters:** If ~73% of customers stay, a model that *always* predicts \\\"stay\\\" would be 73% accurate — without learning anything useful! This is why **accuracy alone is a misleading metric** for imbalanced datasets. We'll need better metrics like precision, recall, and AUC."
   ]
  },
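  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The baseline argument above can be checked directly. The sketch below is illustrative only: it uses made-up labels (146 stayers and 54 churners, chosen to match the ~73% figure) rather than the lab dataset, and scikit-learn's `DummyClassifier` stands in for a model that always predicts the majority class. The names `y_demo`, `X_demo`, and `baseline` are hypothetical. Accuracy looks respectable here while recall on churners is zero, which is exactly why accuracy alone can mislead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.dummy import DummyClassifier\n",
    "from sklearn.metrics import accuracy_score, recall_score\n",
    "\n",
    "# Hypothetical labels (not the lab data): 146 stayers (0) and 54 churners (1)\n",
    "y_demo = np.array([0] * 146 + [1] * 54)\n",
    "X_demo = np.zeros((len(y_demo), 1))  # placeholder features; the baseline ignores them\n",
    "\n",
    "# A 'model' that always predicts the most frequent class (stay)\n",
    "baseline = DummyClassifier(strategy='most_frequent')\n",
    "baseline.fit(X_demo, y_demo)\n",
    "y_base = baseline.predict(X_demo)\n",
    "\n",
    "baseline_acc = accuracy_score(y_demo, y_base)\n",
    "baseline_recall = recall_score(y_demo, y_base)  # share of actual churners caught\n",
    "\n",
    "print(f'Baseline accuracy: {baseline_acc:.2f}')\n",
    "print(f'Baseline recall on churners: {baseline_recall:.2f}')"
   ]
  },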
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics — look for anything unusual\n",
    "churn_df.describe().round(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How do churners differ from non-churners? Let's compare averages.\n",
    "comparison = churn_df.groupby('Churned').mean(numeric_only=True).round(2)\n",
    "print('=== Average Feature Values by Churn Status ===')\n",
    "print('(Row 0 = Stayed, Row 1 = Churned)\\n')\n",
    "comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What Patterns Do You Notice?\n",
    "\n",
    "Look at the comparison table above. Ask yourself:\n",
    "\n",
    "- Do churners have longer or shorter tenure?\n",
    "- Is income higher or lower for churners?\n",
    "- Does employment length seem to matter?\n",
    "\n",
    "**Write down your observations.** We'll see if the models confirm your intuition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize feature distributions by churn status\n",
    "features_to_plot = ['Tenure_Months', 'Age', 'Household_Income_K', 'Years_Employed']\n",
    "\n",
    "fig, axes = plt.subplots(2, 2, figsize=(12, 8))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for i, feature in enumerate(features_to_plot):\n",
    "    ax = axes[i]\n",
    "    churn_df[churn_df['Churned'] == 0][feature].hist(alpha=0.6, color='#2ecc71', \n",
    "                                                       label='Stayed', bins=20, ax=ax)\n",
    "    churn_df[churn_df['Churned'] == 1][feature].hist(alpha=0.6, color='#e74c3c', \n",
    "                                                       label='Churned', bins=20, ax=ax)\n",
    "    ax.set_title(f'{feature} by Churn Status')\n",
    "    ax.legend()\n",
    "    ax.set_xlabel(feature)\n",
    "    ax.set_ylabel('Count')\n",
    "\n",
    "plt.suptitle('Feature Distributions: Churners vs. Non-Churners', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correlation heatmap — which features are related to each other and to churn?\n",
    "fig, ax = plt.subplots(figsize=(10, 7))\n",
    "corr_matrix = churn_df.corr(numeric_only=True).round(2)\n",
    "sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f',\n",
    "            linewidths=0.5, ax=ax, vmin=-1, vmax=1)\n",
    "ax.set_title('Correlation Heatmap\\n(Look at the bottom row — correlations with Churned)', \n",
    "             fontsize=13, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print('Tip: Focus on the last row/column (Churned).')\n",
    "print('Negative correlations mean: as the feature goes up, churn goes down.')\n",
    "print('Positive correlations mean: as the feature goes up, churn goes up.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 2: Preparing Data for Modeling\n",
    "\n",
    "Before we can train any model, we need to do three things:\n",
    "\n",
    "1. **Separate features (X) from the target (y)** — The model needs to know what it's predicting vs. what it's using as inputs.\n",
    "2. **Split into training and test sets** — We train on one portion and test on data the model has never seen.\n",
    "3. **Normalize the features** — Some algorithms (especially logistic regression) work better when all features are on the same scale.\n",
    "\n",
    "### Why Do We Split the Data?\n",
    "\n",
    "Imagine studying for an exam by memorizing the answer key. You'd ace *that* exam, but would you actually understand the material? Probably not. The same thing happens with models — if we test on the same data we trained on, we're just measuring memorization, not understanding. This is called **overfitting**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1: Separate features (X) from target (y)\n",
    "feature_cols = ['Tenure_Months', 'Age', 'Years_at_Address', 'Household_Income_K', \n",
    "                'Education_Level', 'Years_Employed', 'Num_Equipment']\n",
    "\n",
    "X = churn_df[feature_cols].values\n",
    "y = churn_df['Churned'].values\n",
    "\n",
    "print(f'Features (X) shape: {X.shape}  — {X.shape[0]} customers, {X.shape[1]} features')\n",
    "print(f'Target (y) shape:   {y.shape}  — {y.shape[0]} labels')\n",
    "print(f'\\nFeatures used: {feature_cols}')\n",
    "print(f'\\nNote: We excluded Has_CallCard and Has_Wireless to keep the feature set')\n",
    "print(f'focused on customer demographics and behavior.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 2: Split into training (80%) and test (20%) sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f'Training set: {X_train.shape[0]} customers')\n",
    "print(f'Test set:     {X_test.shape[0]} customers')\n",
    "print(f'\\nChurn rate in training: {y_train.mean():.1%}')\n",
    "print(f'Churn rate in test:     {y_test.mean():.1%}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 3: Normalize features (important for logistic regression!)\n",
    "# We fit the scaler on training data only, then apply it to both sets.\n",
    "# Why? To prevent \"data leakage\" — the test set should be truly unseen.\n",
    "\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)  # Fit AND transform on training\n",
    "X_test_scaled = scaler.transform(X_test)         # Only transform on test (no fitting!)\n",
    "\n",
    "print('Before scaling (first training row):')\n",
    "print([f'{v:.1f}' for v in X_train[0]])\n",
    "print('\\nAfter scaling (first training row):')\n",
    "print([f'{v:.2f}' for v in X_train_scaled[0]])\n",
    "print('\\n→ Values are now centered around 0 with similar ranges')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### A Note on Normalization\n",
    "\n",
    "**Why do we normalize?** Features like `Household_Income_K` (range: thousands) and `Education_Level` (range: 1–5) are on very different scales. Without normalization, a model might think income is more important simply because its numbers are bigger — not because it actually matters more.\n",
    "\n",
    "**Important:** Logistic regression is sensitive to feature scales, so normalization really helps. Decision trees, on the other hand, are **not** sensitive to scale — they split on individual feature values regardless of range. We'll use normalized data for logistic regression and raw data for the decision tree."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 3: Model 1 — Logistic Regression\n",
    "\n",
    "### What Is Logistic Regression?\n",
    "\n",
    "Despite the name, logistic regression is a **classification** algorithm, not a regression one. It answers the question:\n",
    "\n",
    "> **\"What's the probability that this customer will churn?\"**\n",
    "\n",
    "It works by fitting a curve (the \\\"sigmoid\\\" or \\\"logistic\\\" function) that squeezes any value into a range between 0 and 1 — perfect for probabilities.\n",
    "\n",
    "**How it makes decisions:**\n",
    "- If P(churn) >= 0.5 → Predict **churn** (1)\n",
    "- If P(churn) < 0.5 → Predict **stay** (0)\n",
    "\n",
    "**Business advantage:** You can rank customers by churn probability and focus retention efforts on the highest-risk ones."
   ]
  },
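  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the sigmoid concrete, here is a minimal sketch that is separate from the lab model. The `sigmoid` helper and the example scores are made up for illustration; in a fitted logistic regression the score would be the weighted sum of the (scaled) features plus an intercept. The last step applies the 0.5 decision rule described above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def sigmoid(z):\n",
    "    # Logistic function: maps any real number into the interval (0, 1)\n",
    "    return 1.0 / (1.0 + np.exp(-z))\n",
    "\n",
    "# Hypothetical linear scores (not produced by the lab model)\n",
    "scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])\n",
    "probs = sigmoid(scores)\n",
    "\n",
    "# Apply the 0.5 decision rule: P(churn) >= 0.5 means predict churn (1)\n",
    "labels = (probs >= 0.5).astype(int)\n",
    "\n",
    "for z, p, lab in zip(scores, probs, labels):\n",
    "    print(f'score {z:+.1f} -> P(churn) = {p:.3f} -> predict {lab}')"
   ]
  },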
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the logistic regression model\n",
    "log_reg = LogisticRegression(C=0.01, solver='liblinear', random_state=42)\n",
    "log_reg.fit(X_train_scaled, y_train)\n",
    "\n",
    "print('Logistic Regression model trained successfully!')\n",
    "print(f'Number of features used: {X_train_scaled.shape[1]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Making Predictions\n",
    "\n",
    "Now let's see what our model predicts. We'll get two things:\n",
    "\n",
    "1. **Class predictions** (`predict`): A hard yes/no label for each customer\n",
    "2. **Probability estimates** (`predict_proba`): The model's confidence level"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions\n",
    "y_pred_lr = log_reg.predict(X_test_scaled)\n",
    "y_prob_lr = log_reg.predict_proba(X_test_scaled)\n",
    "\n",
    "print('First 10 predictions vs. actual:')\n",
    "print(f'{\"Predicted\":>10} {\"Actual\":>10} {\"P(Churn)\":>10} {\"Correct?\":>10}')\n",
    "print('-' * 42)\n",
    "for i in range(10):\n",
    "    match = 'Yes' if y_pred_lr[i] == y_test[i] else 'No'\n",
    "    print(f'{y_pred_lr[i]:>10} {y_test[i]:>10} {y_prob_lr[i][1]:>10.3f} {match:>10}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature importance via coefficients\n",
    "coef_df = pd.DataFrame({\n",
    "    'Feature': feature_cols,\n",
    "    'Coefficient': log_reg.coef_[0]\n",
    "}).sort_values('Coefficient', ascending=True)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "colors = ['#e74c3c' if c > 0 else '#2ecc71' for c in coef_df['Coefficient']]\n",
    "ax.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, edgecolor='black')\n",
    "ax.axvline(x=0, color='black', linewidth=0.8)\n",
    "ax.set_xlabel('Coefficient Value')\n",
    "ax.set_title('Logistic Regression Coefficients\\n(Red = Increases Churn Risk, Green = Decreases Churn Risk)')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print('How to read this chart:')\n",
    "print('  Positive coefficients (red) → Higher values increase churn risk')\n",
    "print('  Negative coefficients (green) → Higher values decrease churn risk')\n",
    "print('  Larger bars → Stronger effect on the prediction')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion Question\n",
    "\n",
    "Look at the coefficients. Do the directions make business sense? If `Tenure_Months` has a negative coefficient, it means longer-tenured customers are *less* likely to churn. Does that seem right?\n",
    "\n",
    "**Business insight matters:** If a model produces results that contradict business logic, that's a red flag worth investigating."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating the Logistic Regression Model\n",
    "\n",
    "#### Quick Metrics Refresher\n",
    "\n",
    "| Metric | Plain English |\n",
    "|---|---|\n",
    "| **Accuracy** | \"How often is the model right?\" |\n",
    "| **Precision** | \"When we say someone will churn, how often are we right?\" |\n",
    "| **Recall** | \"Of all customers who actually left, what % did we identify?\" |\n",
    "| **F1 Score** | \"A combined measure of precision and recall\" |\n",
    "| **AUC** | \"How good is the model at ranking customers by risk?\" |"
   ]
  },
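  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first four metrics in the table can be computed by hand from the four cells of a confusion matrix. The sketch below uses invented counts for a 40-customer test set (they are not results from this lab), and the `_demo` names are hypothetical, chosen so they do not collide with variables used elsewhere in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical confusion-matrix counts for a 40-customer test set (not lab results)\n",
    "tn_demo, fp_demo, fn_demo, tp_demo = 25, 4, 6, 5\n",
    "\n",
    "total_demo = tn_demo + fp_demo + fn_demo + tp_demo\n",
    "accuracy_demo = (tp_demo + tn_demo) / total_demo\n",
    "precision_demo = tp_demo / (tp_demo + fp_demo)  # of those flagged as churners, how many churned\n",
    "recall_demo = tp_demo / (tp_demo + fn_demo)     # of those who churned, how many were flagged\n",
    "f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)\n",
    "\n",
    "print(f'Accuracy:  {accuracy_demo:.3f}')\n",
    "print(f'Precision: {precision_demo:.3f}')\n",
    "print(f'Recall:    {recall_demo:.3f}')\n",
    "print(f'F1 Score:  {f1_demo:.3f}')"
   ]
  },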
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === LOGISTIC REGRESSION EVALUATION ===\n",
    "\n",
    "print('=' * 50)\n",
    "print('LOGISTIC REGRESSION — EVALUATION RESULTS')\n",
    "print('=' * 50)\n",
    "\n",
    "acc_lr = accuracy_score(y_test, y_pred_lr)\n",
    "auc_lr = roc_auc_score(y_test, y_prob_lr[:, 1])\n",
    "logloss_lr = log_loss(y_test, y_prob_lr)\n",
    "\n",
    "print(f'\\nAccuracy:  {acc_lr:.4f} ({acc_lr:.1%})')\n",
    "print(f'AUC Score: {auc_lr:.4f}')\n",
    "print(f'Log Loss:  {logloss_lr:.4f}')\n",
    "\n",
    "print(f'\\n--- Classification Report ---')\n",
    "print(classification_report(y_test, y_pred_lr, target_names=['Stayed', 'Churned']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confusion Matrix for Logistic Regression\n",
    "fig, ax = plt.subplots(figsize=(6, 5))\n",
    "cm_lr = confusion_matrix(y_test, y_pred_lr)\n",
    "disp = ConfusionMatrixDisplay(confusion_matrix=cm_lr, display_labels=['Stayed', 'Churned'])\n",
    "disp.plot(cmap='Blues', ax=ax, values_format='d')\n",
    "ax.set_title('Logistic Regression — Confusion Matrix')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "tn, fp, fn, tp = cm_lr.ravel()\n",
    "print('How to read this matrix:')\n",
    "print(f'  True Negatives  (correctly predicted Stay):    {tn}')\n",
    "print(f'  False Positives (predicted Churn, but Stayed): {fp}')\n",
    "print(f'  False Negatives (predicted Stay, but Churned): {fn}  ← The costly mistakes!')\n",
    "print(f'  True Positives  (correctly predicted Churn):   {tp}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Business Insight: Which Errors Are More Costly?\n",
    "\n",
    "In churn prediction, **False Negatives are usually more expensive** than False Positives.\n",
    "\n",
    "- **False Negative**: We predicted the customer would stay, but they left. *We lost them because we didn't intervene.*\n",
    "- **False Positive**: We predicted churn, but they stayed. *We might give them an unnecessary discount, but we keep them.*\n",
    "\n",
    "This means **recall** (catching actual churners) is often more important than precision."
   ]
  },
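  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One practical consequence: the 0.5 cutoff is a default, not a requirement. If missed churners are the costlier error, lowering the threshold flags more customers, which tends to raise recall at the expense of precision. The sketch below illustrates the trade-off with invented labels and probabilities (not output from the lab model); `y_true_demo`, `p_churn_demo`, and `threshold_results` are hypothetical names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import precision_score, recall_score\n",
    "\n",
    "# Hypothetical true labels and predicted churn probabilities (not lab output)\n",
    "y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])\n",
    "p_churn_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.25, 0.40, 0.60, 0.80])\n",
    "\n",
    "# Store (recall, precision) for each threshold we try\n",
    "threshold_results = {}\n",
    "for threshold in (0.5, 0.3):\n",
    "    y_hat = (p_churn_demo >= threshold).astype(int)\n",
    "    rec = recall_score(y_true_demo, y_hat)\n",
    "    prec = precision_score(y_true_demo, y_hat)\n",
    "    threshold_results[threshold] = (rec, prec)\n",
    "    print(f'threshold {threshold:.1f}: recall = {rec:.2f}, precision = {prec:.2f}')"
   ]
  },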
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 4: Model 2 — Decision Tree\n",
    "\n",
    "### What Is a Decision Tree?\n",
    "\n",
    "A decision tree is a **flowchart of yes/no questions** that leads to a prediction. Think of it like the game \\\"20 Questions.\\\" For example:\n",
    "\n",
    "- *Is their tenure less than 12 months?* → Yes\n",
    "- *Is their income below $30K?* → Yes\n",
    "- → **Predict: Likely to churn**\n",
    "\n",
    "**Strengths:** Extremely interpretable, no normalization needed, captures non-linear patterns.\n",
    "\n",
    "**Weaknesses:** Can easily overfit, sensitive to small data changes.\n",
    "\n",
    "**Business advantage:** When leadership asks *\\\"Why did you flag this customer?\\\"*, you can give a clear, rule-based answer."
   ]
  },
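  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How does a tree decide which question to ask? The tree in this lab is built with `criterion='entropy'`, which scores candidate splits by how much they reduce entropy (a measure of how mixed the classes are in a node). The sketch below works through that calculation for a single made-up split; the `entropy` helper and the node counts are hypothetical and are not taken from the lab tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def entropy(counts):\n",
    "    # Shannon entropy (in bits) of a node, given its class counts\n",
    "    p = np.asarray(counts, dtype=float)\n",
    "    p = p / p.sum()\n",
    "    p = p[p > 0]  # skip empty classes, since 0 * log2(0) is treated as 0\n",
    "    return float(-(p * np.log2(p)).sum())\n",
    "\n",
    "# Hypothetical node counts as [stay, churn] (not taken from the lab tree)\n",
    "parent = [60, 40]\n",
    "left, right = [50, 10], [10, 30]\n",
    "\n",
    "# Weight each child's entropy by the share of customers it receives\n",
    "n_parent = sum(parent)\n",
    "weighted_child_entropy = (\n",
    "    sum(left) / n_parent * entropy(left)\n",
    "    + sum(right) / n_parent * entropy(right)\n",
    ")\n",
    "info_gain = entropy(parent) - weighted_child_entropy\n",
    "\n",
    "print(f'Parent entropy:         {entropy(parent):.3f}')\n",
    "print(f'Weighted child entropy: {weighted_child_entropy:.3f}')\n",
    "print(f'Information gain:       {info_gain:.3f}')"
   ]
  },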
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the decision tree — using UNSCALED data\n",
    "dt_model = DecisionTreeClassifier(\n",
    "    max_depth=4, \n",
    "    criterion='entropy',\n",
    "    random_state=42\n",
    ")\n",
    "dt_model.fit(X_train, y_train)\n",
    "\n",
    "print('Decision Tree model trained successfully!')\n",
    "print(f'Tree depth: {dt_model.get_depth()}')\n",
    "print(f'Number of leaves: {dt_model.get_n_leaves()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the tree\n",
    "fig, ax = plt.subplots(figsize=(20, 10))\n",
    "plot_tree(\n",
    "    dt_model, \n",
    "    feature_names=feature_cols, \n",
    "    class_names=['Stay', 'Churn'],\n",
    "    filled=True,\n",
    "    rounded=True,\n",
    "    fontsize=10,\n",
    "    ax=ax\n",
    ")\n",
    "ax.set_title('Decision Tree for Customer Churn Prediction', fontsize=16, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print('How to read this tree:')\n",
    "print('  Each box shows a question about a feature')\n",
    "print('  Left branch = Yes (condition is true), Right branch = No')\n",
    "print('  \"samples\" = how many training customers reached that node')\n",
    "print('  \"value\" = [count of Stay, count of Churn]')\n",
    "print('  Blue = leans Stay, Orange = leans Churn')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature importance\n",
    "importance_df = pd.DataFrame({\n",
    "    'Feature': feature_cols,\n",
    "    'Importance': dt_model.feature_importances_\n",
    "}).sort_values('Importance', ascending=True)\n",
    "\n",
    "importance_df = importance_df[importance_df['Importance'] > 0]\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "ax.barh(importance_df['Feature'], importance_df['Importance'], \n",
    "        color='#e67e22', edgecolor='black')\n",
    "ax.set_xlabel('Feature Importance')\n",
    "ax.set_title('Decision Tree — Feature Importance')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decision Tree evaluation\n",
    "y_pred_dt = dt_model.predict(X_test)\n",
    "y_prob_dt = dt_model.predict_proba(X_test)\n",
    "\n",
    "print('=' * 50)\n",
    "print('DECISION TREE — EVALUATION RESULTS')\n",
    "print('=' * 50)\n",
    "\n",
    "acc_dt = accuracy_score(y_test, y_pred_dt)\n",
    "auc_dt = roc_auc_score(y_test, y_prob_dt[:, 1])\n",
    "logloss_dt = log_loss(y_test, y_prob_dt)\n",
    "\n",
    "print(f'\\nAccuracy:  {acc_dt:.4f} ({acc_dt:.1%})')\n",
    "print(f'AUC Score: {auc_dt:.4f}')\n",
    "print(f'Log Loss:  {logloss_dt:.4f}')\n",
    "\n",
    "print(f'\\n--- Classification Report ---')\n",
    "print(classification_report(y_test, y_pred_dt, target_names=['Stayed', 'Churned']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confusion Matrix for Decision Tree\n",
    "fig, ax = plt.subplots(figsize=(6, 5))\n",
    "cm_dt = confusion_matrix(y_test, y_pred_dt)\n",
    "disp = ConfusionMatrixDisplay(confusion_matrix=cm_dt, display_labels=['Stayed', 'Churned'])\n",
    "disp.plot(cmap='Oranges', ax=ax, values_format='d')\n",
    "ax.set_title('Decision Tree — Confusion Matrix')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "tn, fp, fn, tp = cm_dt.ravel()\n",
    "print(f'  True Negatives:  {tn}')\n",
    "print(f'  False Positives: {fp}')\n",
    "print(f'  False Negatives: {fn}')\n",
    "print(f'  True Positives:  {tp}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 5: Head-to-Head Model Comparison\n",
    "\n",
    "Now for the big question: **Which model should we recommend to leadership?**\n",
    "\n",
    "The \\\"best\\\" model depends on what the business values most:\n",
    "\n",
    "- **Catch as many churners as possible** (even with false alarms) → prioritize **Recall**\n",
    "- **Be confident when predicting churn** (minimize wasted spending) → prioritize **Precision**\n",
    "- **Best overall discriminating ability** → prioritize **AUC**\n",
    "- **Explain predictions to stakeholders** → prioritize **Interpretability**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === SIDE-BY-SIDE COMPARISON TABLE ===\n",
    "\n",
    "cr_lr = classification_report(y_test, y_pred_lr, target_names=['Stayed', 'Churned'], output_dict=True)\n",
    "cr_dt = classification_report(y_test, y_pred_dt, target_names=['Stayed', 'Churned'], output_dict=True)\n",
    "\n",
    "print('\\n' + '=' * 65)\n",
    "print('  MODEL COMPARISON: LOGISTIC REGRESSION vs. DECISION TREE')\n",
    "print('=' * 65)\n",
    "print(f'\\n{\"Metric\":<25} {\"Logistic Regression\":>20} {\"Decision Tree\":>20}')\n",
    "print('-' * 65)\n",
    "print(f'{\"Accuracy\":<25} {acc_lr:>20.4f} {acc_dt:>20.4f}')\n",
    "print(f'{\"AUC Score\":<25} {auc_lr:>20.4f} {auc_dt:>20.4f}')\n",
    "print(f'{\"Log Loss (lower=better)\":<25} {logloss_lr:>20.4f} {logloss_dt:>20.4f}')\n",
    "print(f'{\"Recall (Churn)\":<25} {cr_lr[\"Churned\"][\"recall\"]:>20.4f} {cr_dt[\"Churned\"][\"recall\"]:>20.4f}')\n",
    "print(f'{\"Precision (Churn)\":<25} {cr_lr[\"Churned\"][\"precision\"]:>20.4f} {cr_dt[\"Churned\"][\"precision\"]:>20.4f}')\n",
    "print(f'{\"F1 Score (Churn)\":<25} {cr_lr[\"Churned\"][\"f1-score\"]:>20.4f} {cr_dt[\"Churned\"][\"f1-score\"]:>20.4f}')\n",
    "print('-' * 65)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === VISUAL COMPARISON: Key Metrics Bar Chart ===\n",
    "\n",
    "metrics = ['Accuracy', 'AUC', 'Precision\\n(Churn)', 'Recall\\n(Churn)', 'F1 Score\\n(Churn)']\n",
    "lr_scores = [acc_lr, auc_lr, cr_lr['Churned']['precision'], \n",
    "             cr_lr['Churned']['recall'], cr_lr['Churned']['f1-score']]\n",
    "dt_scores = [acc_dt, auc_dt, cr_dt['Churned']['precision'], \n",
    "             cr_dt['Churned']['recall'], cr_dt['Churned']['f1-score']]\n",
    "\n",
    "x = np.arange(len(metrics))\n",
    "width = 0.35\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 6))\n",
    "bars1 = ax.bar(x - width/2, lr_scores, width, label='Logistic Regression', \n",
    "               color='#3498db', edgecolor='black', alpha=0.85)\n",
    "bars2 = ax.bar(x + width/2, dt_scores, width, label='Decision Tree', \n",
    "               color='#e67e22', edgecolor='black', alpha=0.85)\n",
    "\n",
    "for bar in bars1:\n",
    "    ax.annotate(f'{bar.get_height():.3f}', \n",
    "                xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),\n",
    "                xytext=(0, 5), textcoords='offset points', ha='center', fontsize=10)\n",
    "for bar in bars2:\n",
    "    ax.annotate(f'{bar.get_height():.3f}', \n",
    "                xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),\n",
    "                xytext=(0, 5), textcoords='offset points', ha='center', fontsize=10)\n",
    "\n",
    "ax.set_ylim(0, 1.15)\n",
    "ax.set_ylabel('Score', fontsize=12)\n",
    "ax.set_title('Model Comparison: Logistic Regression vs. Decision Tree', fontsize=14, fontweight='bold')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(metrics, fontsize=11)\n",
    "ax.legend(fontsize=12)\n",
    "ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === CONFUSION MATRICES SIDE BY SIDE ===\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n",
    "\n",
    "ConfusionMatrixDisplay(\n",
    "    confusion_matrix=cm_lr, display_labels=['Stayed', 'Churned']\n",
    ").plot(cmap='Blues', ax=axes[0], values_format='d')\n",
    "axes[0].set_title('Logistic Regression', fontsize=13, fontweight='bold')\n",
    "\n",
    "ConfusionMatrixDisplay(\n",
    "    confusion_matrix=cm_dt, display_labels=['Stayed', 'Churned']\n",
    ").plot(cmap='Oranges', ax=axes[1], values_format='d')\n",
    "axes[1].set_title('Decision Tree', fontsize=13, fontweight='bold')\n",
    "\n",
    "plt.suptitle('Confusion Matrix Comparison', fontsize=14, fontweight='bold', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === ROC CURVE COMPARISON ===\n",
    "\n",
    "fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr[:, 1])\n",
    "fpr_dt, tpr_dt, _ = roc_curve(y_test, y_prob_dt[:, 1])\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "ax.plot(fpr_lr, tpr_lr, color='#3498db', linewidth=2, \n",
    "        label=f'Logistic Regression (AUC = {auc_lr:.3f})')\n",
    "ax.plot(fpr_dt, tpr_dt, color='#e67e22', linewidth=2, \n",
    "        label=f'Decision Tree (AUC = {auc_dt:.3f})')\n",
    "ax.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, \n",
    "        label='Random Guess (AUC = 0.500)')\n",
    "\n",
    "ax.set_xlabel('False Positive Rate', fontsize=12)\n",
    "ax.set_ylabel('True Positive Rate', fontsize=12)\n",
    "ax.set_title('ROC Curve Comparison', fontsize=14, fontweight='bold')\n",
    "ax.legend(fontsize=11, loc='lower right')\n",
    "ax.set_xlim([0, 1])\n",
    "ax.set_ylim([0, 1.05])\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print('How to read the ROC curve:')\n",
    "print('  The diagonal dashed line = random guessing (useless)')\n",
    "print('  A curve hugging the top-left corner = better at separating classes')\n",
    "print('  AUC closer to 1.0 = better; 0.5 = random')"
   ]
  },
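  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Why a tree's ROC curve can look angular (illustrative sketch).** If the decision tree's curve above looks like a few straight segments rather than a smooth arc, that is expected: every customer who lands in the same leaf gets the same probability, so a shallow tree produces only a handful of distinct scores. The cell below is a minimal sketch on synthetic data from `make_classification` (an assumption, not this lab's dataset). All `roc_`-prefixed names are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: synthetic data, not the lab's dataset\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "roc_X, roc_y = make_classification(n_samples=1000, n_features=6,\n",
    "                                   n_informative=4, random_state=42)\n",
    "roc_X_tr, roc_X_te, roc_y_tr, roc_y_te = train_test_split(\n",
    "    roc_X, roc_y, test_size=0.3, random_state=42, stratify=roc_y)\n",
    "\n",
    "roc_lr = LogisticRegression(max_iter=1000).fit(roc_X_tr, roc_y_tr)\n",
    "roc_dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(roc_X_tr, roc_y_tr)\n",
    "\n",
    "# Count the distinct predicted probabilities each model produces on the test set\n",
    "roc_n_lr = len(np.unique(roc_lr.predict_proba(roc_X_te)[:, 1]))\n",
    "roc_n_dt = len(np.unique(roc_dt.predict_proba(roc_X_te)[:, 1]))\n",
    "roc_leaves = roc_dt.get_n_leaves()\n",
    "\n",
    "print(f'Distinct probabilities, logistic regression: {roc_n_lr}')\n",
    "print(f'Distinct probabilities, decision tree:       {roc_n_dt}')\n",
    "print(f'Leaves in the tree:                          {roc_leaves}')"
   ]
  },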
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 6: Making Your Recommendation\n",
    "\n",
    "### Beyond the Numbers: The Full Picture\n",
    "\n",
    "| Factor | Logistic Regression | Decision Tree |\n",
    "|---|---|---|\n",
    "| **Interpretability** | Moderate — coefficients show direction and strength | High — visualize as a flowchart |\n",
    "| **Probability estimates** | Excellent — well-calibrated | Good — can be \\\"blocky\\\" |\n",
    "| **Non-linear patterns** | Limited — assumes linear log-odds | Strong — captures interactions |\n",
    "| **Overfitting risk** | Lower (with regularization) | Higher (control tree depth) |\n",
    "| **Explaining to stakeholders** | \\\"Tenure reduces churn risk\\\" | \\\"If tenure < 12 and income < 30K, then churn\\\" |\n",
    "\n",
    "### Think Like a Consultant\n",
    "\n",
    "When presenting to leadership, frame it in **business terms**:\n",
    "\n",
    "> *\\\"Based on our analysis, we recommend [Model X] for predicting customer churn. It achieved an AUC of [score], meaning it effectively ranks customers by their likelihood to leave. It identifies [recall]% of actual churners, translating to approximately [number] at-risk customers we can now proactively target.\\\"*"
   ]
  },
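  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Filling in the template (illustrative sketch).** The `[recall]` and `[number]` placeholders above come straight from a confusion matrix. The cell below works through the arithmetic on a made-up matrix (`biz_cm` is hypothetical, not this lab's result). It assumes the usual scikit-learn layout: rows are actual classes, columns are predicted classes, in the order Stayed, Churned. To use real numbers, one option is to swap `biz_cm` for `cm_lr` or `cm_dt` from earlier in the notebook, provided they follow that layout."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: a made-up confusion matrix, not the lab's results\n",
    "import numpy as np\n",
    "\n",
    "# Rows = actual (Stayed, Churned); columns = predicted (Stayed, Churned)\n",
    "biz_cm = np.array([[190, 30],\n",
    "                   [25, 55]])\n",
    "\n",
    "biz_tn, biz_fp, biz_fn, biz_tp = biz_cm.ravel()\n",
    "\n",
    "biz_recall = biz_tp / (biz_tp + biz_fn)      # share of actual churners the model catches\n",
    "biz_precision = biz_tp / (biz_tp + biz_fp)   # share of flagged customers who really churn\n",
    "biz_flagged = biz_tp + biz_fp                # customers marketing would contact\n",
    "\n",
    "print(f'Recall (Churn):    {biz_recall:.1%}')\n",
    "print(f'Precision (Churn): {biz_precision:.1%}')\n",
    "print(f'Customers flagged: {biz_flagged} ({biz_tp} true churners, {biz_fp} false alarms)')\n",
    "print(f'Churners missed:   {biz_fn}')"
   ]
  },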
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === FINAL SUMMARY WITH WINNERS ===\n",
    "\n",
    "print('\\n' + '=' * 65)\n",
    "print('  FINAL MODEL COMPARISON SUMMARY')\n",
    "print('=' * 65)\n",
    "print(f'\\n{\"Metric\":<25} {\"Logistic Regression\":>20} {\"Decision Tree\":>20}')\n",
    "print('-' * 65)\n",
    "print(f'{\"Accuracy\":<25} {acc_lr:>20.4f} {acc_dt:>20.4f}')\n",
    "print(f'{\"AUC Score\":<25} {auc_lr:>20.4f} {auc_dt:>20.4f}')\n",
    "print(f'{\"Log Loss (lower=better)\":<25} {logloss_lr:>20.4f} {logloss_dt:>20.4f}')\n",
    "print(f'{\"Recall (Churn)\":<25} {cr_lr[\"Churned\"][\"recall\"]:>20.4f} {cr_dt[\"Churned\"][\"recall\"]:>20.4f}')\n",
    "print(f'{\"Precision (Churn)\":<25} {cr_lr[\"Churned\"][\"precision\"]:>20.4f} {cr_dt[\"Churned\"][\"precision\"]:>20.4f}')\n",
    "print(f'{\"F1 Score (Churn)\":<25} {cr_lr[\"Churned\"][\"f1-score\"]:>20.4f} {cr_dt[\"Churned\"][\"f1-score\"]:>20.4f}')\n",
    "print('-' * 65)\n",
    "\n",
    "print('\\nMetric Winners:')\n",
    "print(f'  Accuracy:        {\"Logistic Regression\" if acc_lr >= acc_dt else \"Decision Tree\"}')\n",
    "print(f'  AUC:             {\"Logistic Regression\" if auc_lr >= auc_dt else \"Decision Tree\"}')\n",
    "print(f'  Log Loss:        {\"Logistic Regression\" if logloss_lr <= logloss_dt else \"Decision Tree\"}')\n",
    "print(f'  Recall (Churn):  {\"Logistic Regression\" if cr_lr[\"Churned\"][\"recall\"] >= cr_dt[\"Churned\"][\"recall\"] else \"Decision Tree\"}')\n",
    "print(f'  F1 (Churn):      {\"Logistic Regression\" if cr_lr[\"Churned\"][\"f1-score\"] >= cr_dt[\"Churned\"][\"f1-score\"] else \"Decision Tree\"}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 7: Your Turn — Practice Exercises\n",
    "\n",
    "### Exercise 1: Tune the Logistic Regression\n",
    "\n",
    "The regularization parameter `C` controls model complexity. Try different values.\n",
    "\n",
    "**Question:** Which value of C gives the best AUC? Does more complex always mean better?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1: Try different C values — uncomment and run\n",
    "\n",
    "# print(f'{\"C Value\":<10} {\"AUC\":>10} {\"Accuracy\":>10}')\n",
    "# print('-' * 32)\n",
    "# for c_val in [0.001, 0.01, 0.1, 1.0, 10.0]:\n",
    "#     lr_temp = LogisticRegression(C=c_val, solver='liblinear', random_state=42)\n",
    "#     lr_temp.fit(X_train_scaled, y_train)\n",
    "#     y_prob_temp = lr_temp.predict_proba(X_test_scaled)\n",
    "#     y_pred_temp = lr_temp.predict(X_test_scaled)\n",
    "#     auc_temp = roc_auc_score(y_test, y_prob_temp[:, 1])\n",
    "#     acc_temp = accuracy_score(y_test, y_pred_temp)\n",
    "#     print(f'{c_val:<10} {auc_temp:>10.4f} {acc_temp:>10.4f}')"
   ]
  },
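  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**What `C` does to the coefficients (illustrative sketch).** In scikit-learn, `C` is the inverse of regularization strength, so smaller values pull the coefficients toward zero. The cell below is a minimal sketch on synthetic data from `make_classification` (an assumption, not this lab's dataset). It reports the overall size of the coefficient vector for three values of `C`. All `reg_`-prefixed names are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: synthetic data, not the lab's dataset\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "reg_X, reg_y = make_classification(n_samples=500, n_features=6,\n",
    "                                   n_informative=4, random_state=0)\n",
    "reg_X = StandardScaler().fit_transform(reg_X)\n",
    "\n",
    "reg_norms = {}\n",
    "for reg_c in [0.001, 0.1, 10.0]:\n",
    "    reg_model = LogisticRegression(C=reg_c, solver='liblinear', random_state=42)\n",
    "    reg_model.fit(reg_X, reg_y)\n",
    "    # Euclidean length of the coefficient vector: a rough measure of model 'complexity'\n",
    "    reg_norms[reg_c] = float(np.linalg.norm(reg_model.coef_))\n",
    "    print(f'C = {reg_c:<6} coefficient size = {reg_norms[reg_c]:.3f}')"
   ]
  },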
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Tune the Decision Tree\n",
    "\n",
    "Try different `max_depth` values. What happens with `max_depth=None` (unlimited)? What does this tell you about overfitting?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2: Try different max_depth values — uncomment and run\n",
    "\n",
    "# print(f'{\"Depth\":<8} {\"AUC\":>10} {\"Accuracy\":>10} {\"Leaves\":>10}')\n",
    "# print('-' * 40)\n",
    "# for depth in [2, 3, 4, 5, 6, 8, None]:\n",
    "#     dt_temp = DecisionTreeClassifier(max_depth=depth, criterion='entropy', random_state=42)\n",
    "#     dt_temp.fit(X_train, y_train)\n",
    "#     y_pred_temp = dt_temp.predict(X_test)\n",
    "#     y_prob_temp = dt_temp.predict_proba(X_test)\n",
    "#     acc_temp = accuracy_score(y_test, y_pred_temp)\n",
    "#     auc_temp = roc_auc_score(y_test, y_prob_temp[:, 1])\n",
    "#     depth_str = str(depth) if depth else 'None'\n",
    "#     print(f'{depth_str:<8} {auc_temp:>10.4f} {acc_temp:>10.4f} {dt_temp.get_n_leaves():>10}')"
   ]
  },
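  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Spotting overfitting: training vs. test accuracy (illustrative sketch).** One common way to see overfitting is to compare accuracy on the training data with accuracy on the test data as the tree gets deeper. The cell below is a minimal sketch on noisy synthetic data from `make_classification` (an assumption, not this lab's dataset). All `depth_`-prefixed names are hypothetical. A large gap between the two columns suggests the tree is memorizing the training set rather than learning patterns that generalize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: synthetic data, not the lab's dataset\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# flip_y=0.1 randomly reassigns about 10% of labels to mimic real-world noise\n",
    "depth_X, depth_y = make_classification(n_samples=1000, n_features=8,\n",
    "                                       n_informative=4, flip_y=0.1, random_state=42)\n",
    "depth_X_tr, depth_X_te, depth_y_tr, depth_y_te = train_test_split(\n",
    "    depth_X, depth_y, test_size=0.3, random_state=42, stratify=depth_y)\n",
    "\n",
    "depth_results = {}\n",
    "print('Depth      Train acc   Test acc')\n",
    "for depth_val in [2, 4, 8, None]:\n",
    "    depth_tree = DecisionTreeClassifier(max_depth=depth_val, random_state=42)\n",
    "    depth_tree.fit(depth_X_tr, depth_y_tr)\n",
    "    depth_train_acc = depth_tree.score(depth_X_tr, depth_y_tr)\n",
    "    depth_test_acc = depth_tree.score(depth_X_te, depth_y_te)\n",
    "    depth_results[depth_val] = (depth_train_acc, depth_test_acc)\n",
    "    print(f'{str(depth_val):<10} {depth_train_acc:>9.3f} {depth_test_acc:>10.3f}')"
   ]
  },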
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: Include All Features\n",
    "\n",
    "We excluded `Has_CallCard` and `Has_Wireless`. Rebuild both models with **all 9 features**.\n",
    "\n",
    "**Questions:** Does including them improve either model? Are they important in the decision tree?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3: Rebuild with all 9 features\n",
    "# Your code here\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4: Write Your Business Recommendation\n",
    "\n",
    "Write a **2–3 paragraph recommendation** for the telecom company's VP of Marketing:\n",
    "\n",
    "1. Which model do you recommend and why?\n",
    "2. What are the most important churn predictors?\n",
    "3. What specific actions should the marketing team take?\n",
    "\n",
    "**Remember:** Business language, not technical jargon."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Write your recommendation here:*\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Key Takeaways\n",
    "\n",
    "**1. No single metric tells the whole story.**  \n",
    "Accuracy can be misleading with imbalanced data. Always look at precision, recall, AUC, and the confusion matrix together.\n",
    "\n",
    "**2. Different models have different strengths.**  \n",
    "Logistic regression gives smooth probabilities and clear coefficients. Decision trees give visual, rule-based explanations.\n",
    "\n",
    "**3. Model comparison is essential.**  \n",
    "Never deploy the first model you build. Always compare at least two approaches.\n",
    "\n",
    "**4. The best model is the one that gets deployed and creates value.**  \n",
    "A slightly less accurate model that leadership trusts is often more valuable than a complex \\\"black box.\\\"\n",
    "\n",
    "**5. Always connect your analysis to business decisions.**  \n",
    "Your job isn't just to build models — it's to help the business make smarter choices.\n",
    "\n",
    "---\n",
    "\n",
    "### What's Next?\n",
    "\n",
    "In future notebooks, we'll explore Random Forests, Feature Engineering, Cross-Validation, and Model Deployment.\n",
    "\n",
    "---\n",
    "*Notebook created for Predictive Analytics & Data Mining*  \n",
    "*By Dr. Benyawarath \"Yaa\" Nithithanatchinnapat, Feb 10, 2026*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}