{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Keep or Let Go? Predicting Customer Churn\n", "## Logistic Regression vs. Decision Trees — A Model Comparison Lab\n", "\n", "**Courses:** GMBA 621 — Predictive Analytics & Data Mining; FINC 332 — Data Analytics, Data Mining, & Data Visualization\n", "\n", "**Gannon University — Dahlkemper School of Business**\n", "\n", "*By Dr. Benyawarath \"Yaa\" Nithithanatchinnapat, Feb 10, 2026*\n", "\n", "---\n", "\n", "### The Business Problem\n", "\n", "You're an analyst at a telecommunications company. Leadership is worried — customers are leaving for competitors, and every lost customer costs the company money. The marketing team wants to know:\n", "\n", "> **\"Can we predict which customers are about to leave so we can intervene before it's too late?\"**\n", "\n", "This is a **churn prediction** problem, and it's one of the most common use cases for predictive analytics across industries — telecom, banking, insurance, SaaS, you name it.\n", "\n", "### Why This Matters (The Business Case)\n", "\n", "A commonly cited rule of thumb is that acquiring a new customer costs **5–7x more** than retaining an existing one. If we can identify at-risk customers early, the company can:\n", "\n", "- Offer targeted promotions or discounts\n", "- Proactively reach out through customer service\n", "- Fix service issues before the customer walks away\n", "\n", "Even a modest improvement in retention can translate to millions in saved revenue.\n", "\n", "### What You'll Learn in This Notebook\n", "\n", "We'll tackle this problem using **two different classification models** and then compare them head-to-head:\n", "\n", "1. **Logistic Regression** — A statistical model that estimates the *probability* of churn\n", "2. **Decision Tree** — A rule-based model that creates an *interpretable flowchart* of decisions\n", "3. **Model Comparison** — Which model should we actually deploy? 
We'll use multiple metrics to decide\n", "\n", "By the end, you'll be able to build both models, evaluate them properly, and make a recommendation to leadership about which one to use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 0: Setting Up Our Environment\n", "\n", "Before we dive in, let's load the Python libraries we'll need. Think of these as tools in a toolbox — each one serves a purpose:\n", "\n", "| Library | What It Does |\n", "|---|---|\n", "| `pandas` | Works with data tables (like Excel for Python) |\n", "| `numpy` | Handles math and numerical operations |\n", "| `matplotlib` / `seaborn` | Creates charts and visualizations |\n", "| `scikit-learn` | Our machine learning toolkit — models, metrics, everything |\n", "| `openpyxl` | Lets Python read Excel files (.xlsx) |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Core data libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Visualization\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "# Modeling\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier, plot_tree\n", "\n", "# Preprocessing & splitting\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Evaluation metrics\n", "from sklearn.metrics import (\n", " accuracy_score, \n", " confusion_matrix, \n", " classification_report, \n", " roc_auc_score, \n", " roc_curve, \n", " log_loss,\n", " ConfusionMatrixDisplay\n", ")\n", "\n", "# Suppress warnings for cleaner output\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Set visual style\n", "sns.set_style('whitegrid')\n", "plt.rcParams['figure.figsize'] = (8, 5)\n", "\n", "print('All libraries loaded successfully!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 1: 
Understanding the Data\n", "\n", "### Loading the Dataset from Excel\n", "\n", "Our dataset is stored in an Excel file called **`TelcoChurn_Dataset.xlsx`**. This is what you'll typically encounter in the workplace — data comes in spreadsheets, CSVs, databases, not from some pre-cleaned Python library.\n", "\n", "The Excel file has two sheets:\n", "- **Customer_Data** — the actual data (200 customer records)\n", "- **Data_Dictionary** — descriptions of what each column means\n", "\n", "We use `pandas.read_excel()` to load it. Notice we specify:\n", "- `sheet_name` — which sheet to read\n", "- `skiprows` — because our Excel file has a title row at the top that isn't data\n", "\n", "**Make sure the Excel file is in the same folder as this notebook!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the dataset from Excel\n", "# The data starts on row 4 (after the title and subtitle), so we skip 3 rows\n", "churn_df = pd.read_excel(\n", " 'TelcoChurn_Dataset.xlsx', \n", " sheet_name='Customer_Data',\n", " skiprows=3\n", ")\n", "\n", "# First look at the data\n", "print(f'Dataset shape: {churn_df.shape[0]} customers, {churn_df.shape[1]} columns\\n')\n", "churn_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's also peek at the Data Dictionary sheet so we know what we're working with\n", "data_dict = pd.read_excel(\n", " 'TelcoChurn_Dataset.xlsx', \n", " sheet_name='Data_Dictionary',\n", " skiprows=2\n", ")\n", "\n", "print('=== DATA DICTIONARY ===')\n", "print('(This tells us what each column means)\\n')\n", "for _, row in data_dict.iterrows():\n", " print(f\" {row['Column Name']:<25} → {row['Description']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick Data Quality Check\n", "\n", "In the real world, the first thing you do after loading data is check for problems — missing values, wrong data types, unexpected entries. 
Let's do a quick health check." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data quality check\n", "print('=== DATA QUALITY CHECK ===\\n')\n", "print(f'Total records: {churn_df.shape[0]}')\n", "print(f'Total columns: {churn_df.shape[1]}')\n", "print(f'\\nMissing values per column:')\n", "missing = churn_df.isnull().sum()\n", "if missing.sum() == 0:\n", " print(' None found — great, clean data!')\n", "else:\n", " print(missing[missing > 0])\n", "\n", "print(f'\\nData types:')\n", "print(churn_df.dtypes.to_string())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Our Features\n", "\n", "The `CustomerID` column is just an identifier — it has no predictive power. (If our model used customer IDs to predict churn, that would be a red flag!) Let's drop it and make sure our target variable is properly formatted." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop the ID column — it's not a feature, just a label\n", "churn_df = churn_df.drop(columns=['CustomerID'])\n", "\n", "# Make sure target is integer\n", "churn_df['Churned'] = churn_df['Churned'].astype(int)\n", "\n", "print(f'Working with {churn_df.shape[1]} columns and {churn_df.shape[0]} rows')\n", "print(f'\\nColumns: {list(churn_df.columns)}')\n", "churn_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploratory Data Analysis (EDA)\n", "\n", "Before jumping to modeling, let's explore the data. A good analyst always looks at their data first. Here's why:\n", "\n", "- **Class balance**: Is churn roughly 50/50, or are most customers staying? This matters for how we evaluate our models.\n", "- **Distributions**: Are there outliers or unusual patterns?\n", "- **Relationships**: Do any features seem obviously connected to churn?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# How balanced is our target variable?\n", "churn_counts = churn_df['Churned'].value_counts()\n", "churn_pct = churn_df['Churned'].value_counts(normalize=True) * 100\n", "\n", "print('=== Target Variable Distribution ===')\n", "print(f'Stayed (0): {churn_counts[0]} customers ({churn_pct[0]:.1f}%)')\n", "print(f'Churned (1): {churn_counts[1]} customers ({churn_pct[1]:.1f}%)')\n", "print(f'\\nChurn rate: {churn_pct[1]:.1f}%')\n", "\n", "# Visualize\n", "fig, ax = plt.subplots(figsize=(6, 4))\n", "colors = ['#2ecc71', '#e74c3c']\n", "churn_df['Churned'].value_counts().plot(kind='bar', color=colors, edgecolor='black', ax=ax)\n", "ax.set_xticklabels(['Stayed (0)', 'Churned (1)'], rotation=0)\n", "ax.set_ylabel('Number of Customers')\n", "ax.set_title('Customer Churn Distribution')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Thinking Point: Class Imbalance\n", "\n", "Look at the distribution above. Is churn balanced or imbalanced?\n", "\n", "**Why this matters:** If ~73% of customers stay, a model that *always* predicts \\\"stay\\\" would be 73% accurate — without learning anything useful! This is why **accuracy alone is a misleading metric** for imbalanced datasets. We'll need better metrics like precision, recall, and AUC." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Summary statistics — look for anything unusual\n", "churn_df.describe().round(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# How do churners differ from non-churners? 
Let's compare averages.\n", "comparison = churn_df.groupby('Churned').mean(numeric_only=True).round(2)\n", "print('=== Average Feature Values by Churn Status ===')\n", "print('(Row 0 = Stayed, Row 1 = Churned)\\n')\n", "comparison" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What Patterns Do You Notice?\n", "\n", "Look at the comparison table above. Ask yourself:\n", "\n", "- Do churners have longer or shorter tenure?\n", "- Is income higher or lower for churners?\n", "- Does employment length seem to matter?\n", "\n", "**Write down your observations.** We'll see if the models confirm your intuition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize feature distributions by churn status\n", "features_to_plot = ['Tenure_Months', 'Age', 'Household_Income_K', 'Years_Employed']\n", "\n", "fig, axes = plt.subplots(2, 2, figsize=(12, 8))\n", "axes = axes.flatten()\n", "\n", "for i, feature in enumerate(features_to_plot):\n", " ax = axes[i]\n", " churn_df[churn_df['Churned'] == 0][feature].hist(alpha=0.6, color='#2ecc71', \n", " label='Stayed', bins=20, ax=ax)\n", " churn_df[churn_df['Churned'] == 1][feature].hist(alpha=0.6, color='#e74c3c', \n", " label='Churned', bins=20, ax=ax)\n", " ax.set_title(f'{feature} by Churn Status')\n", " ax.legend()\n", " ax.set_xlabel(feature)\n", " ax.set_ylabel('Count')\n", "\n", "plt.suptitle('Feature Distributions: Churners vs. 
Non-Churners', fontsize=14, fontweight='bold')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Correlation heatmap — which features are related to each other and to churn?\n", "fig, ax = plt.subplots(figsize=(10, 7))\n", "corr_matrix = churn_df.corr(numeric_only=True).round(2)\n", "sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f',\n", " linewidths=0.5, ax=ax, vmin=-1, vmax=1)\n", "ax.set_title('Correlation Heatmap\\n(Look at the bottom row — correlations with Churned)', \n", " fontsize=13, fontweight='bold')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print('Tip: Focus on the last row/column (Churned).')\n", "print('Negative correlations mean: as the feature goes up, churn goes down.')\n", "print('Positive correlations mean: as the feature goes up, churn goes up.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 2: Preparing Data for Modeling\n", "\n", "Before we can train any model, we need to do three things:\n", "\n", "1. **Separate features (X) from the target (y)** — The model needs to know what it's predicting vs. what it's using as inputs.\n", "2. **Split into training and test sets** — We train on one portion and test on data the model has never seen.\n", "3. **Normalize the features** — Some algorithms (especially logistic regression) work better when all features are on the same scale.\n", "\n", "### Why Do We Split the Data?\n", "\n", "Imagine studying for an exam by memorizing the answer key. You'd ace *that* exam, but would you actually understand the material? Probably not. The same thing happens with models — if we test on the same data we trained on, we're just measuring memorization, not understanding. This is called **overfitting**." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Step 1: Separate features (X) from target (y)\n", "feature_cols = ['Tenure_Months', 'Age', 'Years_at_Address', 'Household_Income_K', \n", " 'Education_Level', 'Years_Employed', 'Num_Equipment']\n", "\n", "X = churn_df[feature_cols].values\n", "y = churn_df['Churned'].values\n", "\n", "print(f'Features (X) shape: {X.shape} — {X.shape[0]} customers, {X.shape[1]} features')\n", "print(f'Target (y) shape: {y.shape} — {y.shape[0]} labels')\n", "print(f'\\nFeatures used: {feature_cols}')\n", "print(f'\\nNote: We excluded Has_CallCard and Has_Wireless to keep the feature set')\n", "print(f'focused on customer demographics and behavior.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Step 2: Split into training (80%) and test (20%) sets\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=42\n", ")\n", "\n", "print(f'Training set: {X_train.shape[0]} customers')\n", "print(f'Test set: {X_test.shape[0]} customers')\n", "print(f'\\nChurn rate in training: {y_train.mean():.1%}')\n", "print(f'Churn rate in test: {y_test.mean():.1%}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Step 3: Normalize features (important for logistic regression!)\n", "# We fit the scaler on training data only, then apply it to both sets.\n", "# Why? 
To prevent \"data leakage\" — the test set should be truly unseen.\n", "\n", "scaler = StandardScaler()\n", "X_train_scaled = scaler.fit_transform(X_train) # Fit AND transform on training\n", "X_test_scaled = scaler.transform(X_test) # Only transform on test (no fitting!)\n", "\n", "print('Before scaling (first training row):')\n", "print([f'{v:.1f}' for v in X_train[0]])\n", "print('\\nAfter scaling (first training row):')\n", "print([f'{v:.2f}' for v in X_train_scaled[0]])\n", "print('\\n→ Values are now centered around 0 with similar ranges')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A Note on Normalization\n", "\n", "**Why do we normalize?** Features like `Household_Income_K` (range: thousands) and `Education_Level` (range: 1–5) are on very different scales. Without normalization, a model might think income is more important simply because its numbers are bigger — not because it actually matters more.\n", "\n", "**Important:** Logistic regression is sensitive to feature scales, so normalization really helps. Decision trees, on the other hand, are **not** sensitive to scale — they split on individual feature values regardless of range. We'll use normalized data for logistic regression and raw data for the decision tree." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 3: Model 1 — Logistic Regression\n", "\n", "### What Is Logistic Regression?\n", "\n", "Despite the name, logistic regression is a **classification** algorithm, not a regression one. 
It answers the question:\n", "\n", "> **\"What's the probability that this customer will churn?\"**\n", "\n", "It works by fitting a curve (the \\\"sigmoid\\\" or \\\"logistic\\\" function) that squeezes any value into a range between 0 and 1 — perfect for probabilities.\n", "\n", "**How it makes decisions:**\n", "- If P(churn) >= 0.5 → Predict **churn** (1)\n", "- If P(churn) < 0.5 → Predict **stay** (0)\n", "\n", "**Business advantage:** You can rank customers by churn probability and focus retention efforts on the highest-risk ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build the logistic regression model\n", "log_reg = LogisticRegression(C=0.01, solver='liblinear', random_state=42)\n", "log_reg.fit(X_train_scaled, y_train)\n", "\n", "print('Logistic Regression model trained successfully!')\n", "print(f'Number of features used: {X_train_scaled.shape[1]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions\n", "\n", "Now let's see what our model predicts. We'll get two things:\n", "\n", "1. **Class predictions** (`predict`): A hard yes/no label for each customer\n", "2. **Probability estimates** (`predict_proba`): The model's confidence level" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get predictions\n", "y_pred_lr = log_reg.predict(X_test_scaled)\n", "y_prob_lr = log_reg.predict_proba(X_test_scaled)\n", "\n", "print('First 10 predictions vs. 
actual:')\n", "print(f'{\"Predicted\":>10} {\"Actual\":>10} {\"P(Churn)\":>10} {\"Correct?\":>10}')\n", "print('-' * 42)\n", "for i in range(10):\n", " match = 'Yes' if y_pred_lr[i] == y_test[i] else 'No'\n", " print(f'{y_pred_lr[i]:>10} {y_test[i]:>10} {y_prob_lr[i][1]:>10.3f} {match:>10}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feature importance via coefficients\n", "coef_df = pd.DataFrame({\n", " 'Feature': feature_cols,\n", " 'Coefficient': log_reg.coef_[0]\n", "}).sort_values('Coefficient', ascending=True)\n", "\n", "fig, ax = plt.subplots(figsize=(8, 5))\n", "colors = ['#e74c3c' if c > 0 else '#2ecc71' for c in coef_df['Coefficient']]\n", "ax.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, edgecolor='black')\n", "ax.axvline(x=0, color='black', linewidth=0.8)\n", "ax.set_xlabel('Coefficient Value')\n", "ax.set_title('Logistic Regression Coefficients\\n(Red = Increases Churn Risk, Green = Decreases Churn Risk)')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print('How to read this chart:')\n", "print(' Positive coefficients (red) → Higher values increase churn risk')\n", "print(' Negative coefficients (green) → Higher values decrease churn risk')\n", "print(' Larger bars → Stronger effect on the prediction')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Discussion Question\n", "\n", "Look at the coefficients. Do the directions make business sense? If `Tenure_Months` has a negative coefficient, it means longer-tenured customers are *less* likely to churn. Does that seem right?\n", "\n", "**Business insight matters:** If a model produces results that contradict business logic, that's a red flag worth investigating." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluating the Logistic Regression Model\n", "\n", "#### Quick Metrics Refresher\n", "\n", "| Metric | Plain English |\n", "|---|---|\n", "| **Accuracy** | \"How often is the model right?\" |\n", "| **Precision** | \"When we say someone will churn, how often are we right?\" |\n", "| **Recall** | \"Of all customers who actually left, what % did we identify?\" |\n", "| **F1 Score** | \"A combined measure of precision and recall\" |\n", "| **AUC** | \"How good is the model at ranking customers by risk?\" |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === LOGISTIC REGRESSION EVALUATION ===\n", "\n", "print('=' * 50)\n", "print('LOGISTIC REGRESSION — EVALUATION RESULTS')\n", "print('=' * 50)\n", "\n", "acc_lr = accuracy_score(y_test, y_pred_lr)\n", "auc_lr = roc_auc_score(y_test, y_prob_lr[:, 1])\n", "logloss_lr = log_loss(y_test, y_prob_lr)\n", "\n", "print(f'\\nAccuracy: {acc_lr:.4f} ({acc_lr:.1%})')\n", "print(f'AUC Score: {auc_lr:.4f}')\n", "print(f'Log Loss: {logloss_lr:.4f}')\n", "\n", "print(f'\\n--- Classification Report ---')\n", "print(classification_report(y_test, y_pred_lr, target_names=['Stayed', 'Churned']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confusion Matrix for Logistic Regression\n", "fig, ax = plt.subplots(figsize=(6, 5))\n", "cm_lr = confusion_matrix(y_test, y_pred_lr)\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm_lr, display_labels=['Stayed', 'Churned'])\n", "disp.plot(cmap='Blues', ax=ax, values_format='d')\n", "ax.set_title('Logistic Regression — Confusion Matrix')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "tn, fp, fn, tp = cm_lr.ravel()\n", "print('How to read this matrix:')\n", "print(f' True Negatives (correctly predicted Stay): {tn}')\n", "print(f' False Positives (predicted Churn, but Stayed): {fp}')\n", "print(f' False Negatives (predicted 
Stay, but Churned): {fn} ← The costly mistakes!')\n", "print(f' True Positives (correctly predicted Churn): {tp}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Business Insight: Which Errors Are More Costly?\n", "\n", "In churn prediction, **False Negatives are usually more expensive** than False Positives.\n", "\n", "- **False Negative**: We predicted the customer would stay, but they left. *We lost them because we didn't intervene.*\n", "- **False Positive**: We predicted churn, but they stayed. *We might give them an unnecessary discount, but we keep them.*\n", "\n", "This means **recall** (catching actual churners) is often more important than precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 4: Model 2 — Decision Tree\n", "\n", "### What Is a Decision Tree?\n", "\n", "A decision tree is a **flowchart of yes/no questions** that leads to a prediction. Think of it like the game \\\"20 Questions.\\\" For example:\n", "\n", "- *Is their tenure less than 12 months?* → Yes\n", "- *Is their income below $30K?* → Yes\n", "- → **Predict: Likely to churn**\n", "\n", "**Strengths:** Extremely interpretable, no normalization needed, captures non-linear patterns.\n", "\n", "**Weaknesses:** Can easily overfit, sensitive to small data changes.\n", "\n", "**Business advantage:** When leadership asks *\\\"Why did you flag this customer?\\\"*, you can give a clear, rule-based answer." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build the decision tree — using UNSCALED data\n", "dt_model = DecisionTreeClassifier(\n", " max_depth=4, \n", " criterion='entropy',\n", " random_state=42\n", ")\n", "dt_model.fit(X_train, y_train)\n", "\n", "print('Decision Tree model trained successfully!')\n", "print(f'Tree depth: {dt_model.get_depth()}')\n", "print(f'Number of leaves: {dt_model.get_n_leaves()}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize the tree\n", "fig, ax = plt.subplots(figsize=(20, 10))\n", "plot_tree(\n", " dt_model, \n", " feature_names=feature_cols, \n", " class_names=['Stay', 'Churn'],\n", " filled=True,\n", " rounded=True,\n", " fontsize=10,\n", " ax=ax\n", ")\n", "ax.set_title('Decision Tree for Customer Churn Prediction', fontsize=16, fontweight='bold')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print('How to read this tree:')\n", "print(' Each box shows a question about a feature')\n", "print(' Left branch = Yes (condition is true), Right branch = No')\n", "print(' \"samples\" = how many training customers reached that node')\n", "print(' \"value\" = [count of Stay, count of Churn]')\n", "print(' Orange = leans Stay, Blue = leans Churn (darker = more one-sided)')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feature importance\n", "importance_df = pd.DataFrame({\n", " 'Feature': feature_cols,\n", " 'Importance': dt_model.feature_importances_\n", "}).sort_values('Importance', ascending=True)\n", "\n", "importance_df = importance_df[importance_df['Importance'] > 0]\n", "\n", "fig, ax = plt.subplots(figsize=(8, 5))\n", "ax.barh(importance_df['Feature'], importance_df['Importance'], \n", " color='#e67e22', edgecolor='black')\n", "ax.set_xlabel('Feature Importance')\n", "ax.set_title('Decision Tree — Feature Importance')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Decision Tree evaluation\n", "y_pred_dt = dt_model.predict(X_test)\n", "y_prob_dt = dt_model.predict_proba(X_test)\n", "\n", "print('=' * 50)\n", "print('DECISION TREE — EVALUATION RESULTS')\n", "print('=' * 50)\n", "\n", "acc_dt = accuracy_score(y_test, y_pred_dt)\n", "auc_dt = roc_auc_score(y_test, y_prob_dt[:, 1])\n", "logloss_dt = log_loss(y_test, y_prob_dt)\n", "\n", "print(f'\\nAccuracy: {acc_dt:.4f} ({acc_dt:.1%})')\n", "print(f'AUC Score: {auc_dt:.4f}')\n", "print(f'Log Loss: {logloss_dt:.4f}')\n", "\n", "print(f'\\n--- Classification Report ---')\n", "print(classification_report(y_test, y_pred_dt, target_names=['Stayed', 'Churned']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confusion Matrix for Decision Tree\n", "fig, ax = plt.subplots(figsize=(6, 5))\n", "cm_dt = confusion_matrix(y_test, y_pred_dt)\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm_dt, display_labels=['Stayed', 'Churned'])\n", "disp.plot(cmap='Oranges', ax=ax, values_format='d')\n", "ax.set_title('Decision Tree — Confusion Matrix')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "tn, fp, fn, tp = cm_dt.ravel()\n", "print(f' True Negatives: {tn}')\n", "print(f' False Positives: {fp}')\n", "print(f' False Negatives: {fn}')\n", "print(f' True Positives: {tp}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 5: Head-to-Head Model Comparison\n", "\n", "Now for the big question: **Which model should we recommend to leadership?**\n", "\n", "The \\\"best\\\" model depends on what the business values most:\n", "\n", "- **Catch as many churners as possible** (even with false alarms) → prioritize **Recall**\n", "- **Be confident when predicting churn** (minimize wasted spending) → prioritize **Precision**\n", "- **Best overall discriminating ability** → prioritize **AUC**\n", "- **Explain predictions to stakeholders** → prioritize 
**Interpretability**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === SIDE-BY-SIDE COMPARISON TABLE ===\n", "\n", "cr_lr = classification_report(y_test, y_pred_lr, target_names=['Stayed', 'Churned'], output_dict=True)\n", "cr_dt = classification_report(y_test, y_pred_dt, target_names=['Stayed', 'Churned'], output_dict=True)\n", "\n", "print('\\n' + '=' * 65)\n", "print(' MODEL COMPARISON: LOGISTIC REGRESSION vs. DECISION TREE')\n", "print('=' * 65)\n", "print(f'\\n{\"Metric\":<25} {\"Logistic Regression\":>20} {\"Decision Tree\":>20}')\n", "print('-' * 65)\n", "print(f'{\"Accuracy\":<25} {acc_lr:>20.4f} {acc_dt:>20.4f}')\n", "print(f'{\"AUC Score\":<25} {auc_lr:>20.4f} {auc_dt:>20.4f}')\n", "print(f'{\"Log Loss (lower=better)\":<25} {logloss_lr:>20.4f} {logloss_dt:>20.4f}')\n", "print(f'{\"Recall (Churn)\":<25} {cr_lr[\"Churned\"][\"recall\"]:>20.4f} {cr_dt[\"Churned\"][\"recall\"]:>20.4f}')\n", "print(f'{\"Precision (Churn)\":<25} {cr_lr[\"Churned\"][\"precision\"]:>20.4f} {cr_dt[\"Churned\"][\"precision\"]:>20.4f}')\n", "print(f'{\"F1 Score (Churn)\":<25} {cr_lr[\"Churned\"][\"f1-score\"]:>20.4f} {cr_dt[\"Churned\"][\"f1-score\"]:>20.4f}')\n", "print('-' * 65)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === VISUAL COMPARISON: Key Metrics Bar Chart ===\n", "\n", "metrics = ['Accuracy', 'AUC', 'Precision\\n(Churn)', 'Recall\\n(Churn)', 'F1 Score\\n(Churn)']\n", "lr_scores = [acc_lr, auc_lr, cr_lr['Churned']['precision'], \n", " cr_lr['Churned']['recall'], cr_lr['Churned']['f1-score']]\n", "dt_scores = [acc_dt, auc_dt, cr_dt['Churned']['precision'], \n", " cr_dt['Churned']['recall'], cr_dt['Churned']['f1-score']]\n", "\n", "x = np.arange(len(metrics))\n", "width = 0.35\n", "\n", "fig, ax = plt.subplots(figsize=(12, 6))\n", "bars1 = ax.bar(x - width/2, lr_scores, width, label='Logistic Regression', \n", " color='#3498db', edgecolor='black', alpha=0.85)\n", 
"bars2 = ax.bar(x + width/2, dt_scores, width, label='Decision Tree', \n", " color='#e67e22', edgecolor='black', alpha=0.85)\n", "\n", "for bar in bars1:\n", " ax.annotate(f'{bar.get_height():.3f}', \n", " xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),\n", " xytext=(0, 5), textcoords='offset points', ha='center', fontsize=10)\n", "for bar in bars2:\n", " ax.annotate(f'{bar.get_height():.3f}', \n", " xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),\n", " xytext=(0, 5), textcoords='offset points', ha='center', fontsize=10)\n", "\n", "ax.set_ylim(0, 1.15)\n", "ax.set_ylabel('Score', fontsize=12)\n", "ax.set_title('Model Comparison: Logistic Regression vs. Decision Tree', fontsize=14, fontweight='bold')\n", "ax.set_xticks(x)\n", "ax.set_xticklabels(metrics, fontsize=11)\n", "ax.legend(fontsize=12)\n", "ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === CONFUSION MATRICES SIDE BY SIDE ===\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n", "\n", "ConfusionMatrixDisplay(\n", " confusion_matrix=cm_lr, display_labels=['Stayed', 'Churned']\n", ").plot(cmap='Blues', ax=axes[0], values_format='d')\n", "axes[0].set_title('Logistic Regression', fontsize=13, fontweight='bold')\n", "\n", "ConfusionMatrixDisplay(\n", " confusion_matrix=cm_dt, display_labels=['Stayed', 'Churned']\n", ").plot(cmap='Oranges', ax=axes[1], values_format='d')\n", "axes[1].set_title('Decision Tree', fontsize=13, fontweight='bold')\n", "\n", "plt.suptitle('Confusion Matrix Comparison', fontsize=14, fontweight='bold', y=1.02)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === ROC CURVE COMPARISON ===\n", "\n", "fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr[:, 1])\n", "fpr_dt, tpr_dt, _ = roc_curve(y_test, y_prob_dt[:, 1])\n", "\n", "fig, 
ax = plt.subplots(figsize=(8, 6))\n", "ax.plot(fpr_lr, tpr_lr, color='#3498db', linewidth=2, \n", " label=f'Logistic Regression (AUC = {auc_lr:.3f})')\n", "ax.plot(fpr_dt, tpr_dt, color='#e67e22', linewidth=2, \n", " label=f'Decision Tree (AUC = {auc_dt:.3f})')\n", "ax.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, \n", " label='Random Guess (AUC = 0.500)')\n", "\n", "ax.set_xlabel('False Positive Rate', fontsize=12)\n", "ax.set_ylabel('True Positive Rate', fontsize=12)\n", "ax.set_title('ROC Curve Comparison', fontsize=14, fontweight='bold')\n", "ax.legend(fontsize=11, loc='lower right')\n", "ax.set_xlim([0, 1])\n", "ax.set_ylim([0, 1.05])\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print('How to read the ROC curve:')\n", "print(' The diagonal dashed line = random guessing (useless)')\n", "print(' A curve hugging the top-left corner = better at separating classes')\n", "print(' AUC closer to 1.0 = better; 0.5 = random')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 6: Making Your Recommendation\n", "\n", "### Beyond the Numbers: The Full Picture\n", "\n", "| Factor | Logistic Regression | Decision Tree |\n", "|---|---|---|\n", "| **Interpretability** | Moderate — coefficients show direction and strength | High — visualize as a flowchart |\n", "| **Probability estimates** | Smooth — often reasonably well-calibrated | Coarse — one value per leaf, so \\\"blocky\\\" |\n", "| **Non-linear patterns** | Limited — assumes linear log-odds | Strong — captures interactions |\n", "| **Overfitting risk** | Lower (with regularization) | Higher (control tree depth) |\n", "| **Explaining to stakeholders** | \\\"Tenure reduces churn risk\\\" | \\\"If tenure < 12 and income < 30K, then churn\\\" |\n", "\n", "### Think Like a Consultant\n", "\n", "When presenting to leadership, frame it in **business terms**:\n", "\n", "> *\\\"Based on our analysis, we recommend [Model X] for predicting customer churn. 
It achieved an AUC of [score], meaning it effectively ranks customers by their likelihood to leave. It identifies [recall]% of actual churners, translating to approximately [number] at-risk customers we can now proactively target.\\\"*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === FINAL SUMMARY WITH WINNERS ===\n", "\n", "print('\\n' + '=' * 65)\n", "print(' FINAL MODEL COMPARISON SUMMARY')\n", "print('=' * 65)\n", "print(f'\\n{\"Metric\":<25} {\"Logistic Regression\":>20} {\"Decision Tree\":>20}')\n", "print('-' * 65)\n", "print(f'{\"Accuracy\":<25} {acc_lr:>20.4f} {acc_dt:>20.4f}')\n", "print(f'{\"AUC Score\":<25} {auc_lr:>20.4f} {auc_dt:>20.4f}')\n", "print(f'{\"Log Loss (lower=better)\":<25} {logloss_lr:>20.4f} {logloss_dt:>20.4f}')\n", "print(f'{\"Recall (Churn)\":<25} {cr_lr[\"Churned\"][\"recall\"]:>20.4f} {cr_dt[\"Churned\"][\"recall\"]:>20.4f}')\n", "print(f'{\"Precision (Churn)\":<25} {cr_lr[\"Churned\"][\"precision\"]:>20.4f} {cr_dt[\"Churned\"][\"precision\"]:>20.4f}')\n", "print(f'{\"F1 Score (Churn)\":<25} {cr_lr[\"Churned\"][\"f1-score\"]:>20.4f} {cr_dt[\"Churned\"][\"f1-score\"]:>20.4f}')\n", "print('-' * 65)\n", "\n", "print('\\nMetric Winners:')\n", "print(f' Accuracy: {\"Logistic Regression\" if acc_lr >= acc_dt else \"Decision Tree\"}')\n", "print(f' AUC: {\"Logistic Regression\" if auc_lr >= auc_dt else \"Decision Tree\"}')\n", "print(f' Log Loss: {\"Logistic Regression\" if logloss_lr <= logloss_dt else \"Decision Tree\"}')\n", "print(f' Recall (Churn): {\"Logistic Regression\" if cr_lr[\"Churned\"][\"recall\"] >= cr_dt[\"Churned\"][\"recall\"] else \"Decision Tree\"}')\n", "print(f' F1 (Churn): {\"Logistic Regression\" if cr_lr[\"Churned\"][\"f1-score\"] >= cr_dt[\"Churned\"][\"f1-score\"] else \"Decision Tree\"}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 7: Your Turn — Practice Exercises\n", "\n", "### Exercise 1: Tune the Logistic Regression\n", 
"\n", "The regularization parameter `C` is the inverse of regularization strength: smaller values regularize more heavily (a simpler model), while larger values allow a more complex fit. Try different values.\n", "\n", "**Question:** Which value of C gives the best AUC? Does more complex always mean better?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Try different C values — uncomment and run\n", "\n", "# print(f'{\"C Value\":<10} {\"AUC\":>10} {\"Accuracy\":>10}')\n", "# print('-' * 32)\n", "# for c_val in [0.001, 0.01, 0.1, 1.0, 10.0]:\n", "#     lr_temp = LogisticRegression(C=c_val, solver='liblinear', random_state=42)\n", "#     lr_temp.fit(X_train_scaled, y_train)\n", "#     y_prob_temp = lr_temp.predict_proba(X_test_scaled)\n", "#     y_pred_temp = lr_temp.predict(X_test_scaled)\n", "#     auc_temp = roc_auc_score(y_test, y_prob_temp[:, 1])\n", "#     acc_temp = accuracy_score(y_test, y_pred_temp)\n", "#     print(f'{c_val:<10} {auc_temp:>10.4f} {acc_temp:>10.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2: Tune the Decision Tree\n", "\n", "Try different `max_depth` values. What happens with `max_depth=None` (unlimited)? What does this tell you about overfitting?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Try different max_depth values — uncomment and run\n", "\n", "# print(f'{\"Depth\":<8} {\"AUC\":>10} {\"Accuracy\":>10} {\"Leaves\":>10}')\n", "# print('-' * 40)\n", "# for depth in [2, 3, 4, 5, 6, 8, None]:\n", "#     dt_temp = DecisionTreeClassifier(max_depth=depth, criterion='entropy', random_state=42)\n", "#     dt_temp.fit(X_train, y_train)\n", "#     y_pred_temp = dt_temp.predict(X_test)\n", "#     y_prob_temp = dt_temp.predict_proba(X_test)\n", "#     acc_temp = accuracy_score(y_test, y_pred_temp)\n", "#     auc_temp = roc_auc_score(y_test, y_prob_temp[:, 1])\n", "#     depth_str = str(depth) if depth else 'None'\n", "#     print(f'{depth_str:<8} {auc_temp:>10.4f} {acc_temp:>10.4f} {dt_temp.get_n_leaves():>10}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3: Include All Features\n", "\n", "We excluded `Has_CallCard` and `Has_Wireless`. Rebuild both models with **all 9 features**.\n", "\n", "**Questions:** Does including them improve either model? Are they important in the decision tree?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Rebuild with all 9 features\n", "# Your code here\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4: Write Your Business Recommendation\n", "\n", "Write a **2–3 paragraph recommendation** for the telecom company's VP of Marketing:\n", "\n", "1. Which model do you recommend and why?\n", "2. What are the most important churn predictors?\n", "3. What specific actions should the marketing team take?\n", "\n", "**Remember:** Business language, not technical jargon." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Write your recommendation here:*\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Key Takeaways\n", "\n", "**1. 
No single metric tells the whole story.** \n", "Accuracy can be misleading with imbalanced data. Always look at precision, recall, AUC, and the confusion matrix together.\n", "\n", "**2. Different models have different strengths.** \n", "Logistic regression gives smooth probabilities and clear coefficients. Decision trees give visual, rule-based explanations.\n", "\n", "**3. Model comparison is essential.** \n", "Never deploy the first model you build. Always compare at least two approaches.\n", "\n", "**4. The best model is the one that gets deployed and creates value.** \n", "A slightly less accurate model that leadership trusts is often more valuable than a complex \\\"black box.\\\"\n", "\n", "**5. Always connect your analysis to business decisions.** \n", "Your job isn't just to build models — it's to help the business make smarter choices.\n", "\n", "---\n", "\n", "### What's Next?\n", "\n", "In future notebooks, we'll explore Random Forests, Feature Engineering, Cross-Validation, and Model Deployment.\n", "\n", "---\n", "*Notebook created for Predictive Analytics & Data Mining* \n", "*By Dr. Benyawarath \"Yaa\" Nithithanatchinnapat, Feb 10, 2026*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }