{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e4857927",
   "metadata": {},
   "source": [
    "# 🚗 FleetPulse: Forecasting Vehicle Demand\n",
    "## Enterprise Mobility — End-to-End Demand Analytics Demo\n",
    "\n",
    "**Course:** GMBA 621 / FINC 332 — Predictive Analytics & Data Mining  \n",
    "**Framework:** BizML Lifecycle + PAIR Framework  \n",
    "**Techniques:** EDA → Feature Engineering → Regression / Time Series → Model Evaluation → Deployment\n",
    "\n",
    "---\n",
    "\n",
    "### The Business Problem\n",
    "\n",
    "> *Enterprise Mobility wants to forecast daily rental demand at the branch level, 7 days ahead, so fleet operations can proactively position vehicles where they'll be needed — reducing idle inventory costs and customer walk-aways.*\n",
    "\n",
    "### PAIR Framework\n",
    "\n",
    "| Element | Application |\n",
    "|---------|------------|\n",
    "| **Prediction** | Number of rentals per branch per day, 7 days out |\n",
    "| **Action** | Pre-position vehicles via overnight transfers; adjust pricing to shift demand |\n",
    "| **Impact** | Reduced idle fleet cost ($35–45/day/vehicle), fewer walk-aways ($85+ lost), cheaper proactive transfers |\n",
    "| **Risk** | Over-forecasting wastes transport budget; under-forecasting loses customers; one-time events missed |\n",
    "\n",
    "### How This Differs from ShieldScore\n",
    "\n",
    "| | ShieldScore | FleetPulse |\n",
    "|---|---|---|\n",
    "| **Target** | Binary (claim yes/no) | Continuous (rental count) |\n",
    "| **Metrics** | AUC, sensitivity, specificity | MAE, RMSE, MAPE, R² |\n",
    "| **Split** | Random stratified | Temporal (train before test) |\n",
    "| **Key lesson** | Threshold selection | Forecast horizon & seasonality |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e402b67e",
   "metadata": {},
   "source": [
    "## Phase 1: Setup & Data Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc076df1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "plt.rcParams['figure.figsize'] = (12, 5)\n",
    "plt.rcParams['font.family'] = 'sans-serif'\n",
    "plt.rcParams['axes.spines.top'] = False\n",
    "plt.rcParams['axes.spines.right'] = False\n",
    "\n",
    "NAVY, TEAL, GOLD = '#1B2A4A', '#2A7B88', '#D4A843'\n",
    "CREAM, CORAL = '#FDF6E3', '#E07A5F'\n",
    "palette = [TEAL, NAVY, GOLD, CORAL, '#81B29A', '#3D405B']\n",
    "sns.set_palette(palette)\n",
    "\n",
    "print(\"✅ Libraries loaded\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c302e75",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('fleetpulse_enterprise_demand.csv', parse_dates=['DATE'])\n",
    "\n",
    "print(f\"Dataset: {df.shape[0]:,} rows × {df.shape[1]} columns\")\n",
    "print(f\"Branches: {df['BRANCH_ID'].nunique()}\")\n",
    "print(f\"Date range: {df['DATE'].min().date()} to {df['DATE'].max().date()}\")\n",
    "print(f\"Avg daily rentals: {df['RENTALS'].mean():.1f}\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff0133a4",
   "metadata": {},
   "source": [
    "---\n",
    "## Phase 2: Exploratory Data Analysis\n",
    "\n",
    "### 2.1 Data Quality Audit\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "938d30c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🔍 DATA QUALITY AUDIT\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "# Missing values\n",
    "print(\"\\n1️⃣  MISSING VALUES:\")\n",
    "missing = df.isnull().sum()\n",
    "for col in missing[missing > 0].index:\n",
    "    print(f\"   {col}: {missing[col]} ({missing[col]/len(df)*100:.1f}%)\")\n",
    "\n",
    "# Missing dates (gaps in time series)\n",
    "print(\"\\n2️⃣  MISSING DATES (time series gaps):\")\n",
    "gap_count = 0\n",
    "for bid in df['BRANCH_ID'].unique():\n",
    "    branch = df[df['BRANCH_ID'] == bid].sort_values('DATE')\n",
    "    full_range = pd.date_range(branch['DATE'].min(), branch['DATE'].max())\n",
    "    missing_dates = full_range.difference(branch['DATE'])\n",
    "    if len(missing_dates) > 0:\n",
    "        gap_count += len(missing_dates)\n",
    "print(f\"   {gap_count} missing branch-days across all branches\")\n",
    "\n",
    "# Outliers\n",
    "outliers = df[df['RENTALS'] > 300]\n",
    "print(f\"\\n3️⃣  RENTAL OUTLIERS (> 300): {len(outliers)} records\")\n",
    "if len(outliers) > 0:\n",
    "    print(f\"   Values: {outliers['RENTALS'].tolist()}\")\n",
    "\n",
    "# Negative walk-aways\n",
    "neg_wa = df[df['WALK_AWAYS'] < 0]\n",
    "print(f\"\\n4️⃣  NEGATIVE WALK_AWAYS: {len(neg_wa)} records\")\n",
    "\n",
    "# Duplicates\n",
    "dups = df.duplicated(subset=['BRANCH_ID', 'DATE'], keep=False)\n",
    "print(f\"\\n5️⃣  DUPLICATE BRANCH-DAYS: {dups.sum()} rows\")\n",
    "\n",
    "# Holiday miscoding check\n",
    "jul4_weekend = df[(df['DATE'].dt.month == 7) & (df['DATE'].dt.day.isin([5,6])) & (df['DATE'].dt.year == 2025)]\n",
    "print(f\"\\n6️⃣  JULY 4 WEEKEND 2025 (IS_HOLIDAY): {jul4_weekend['IS_HOLIDAY'].unique()} ← should include 1!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49f4c11c",
   "metadata": {},
   "source": [
    "### 2.2 Demand Patterns\n",
    "The most important patterns in fleet data are **weekly cycles** and **seasonality**.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ceffa107",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Weekly pattern by branch type ──\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Average rentals by day of week, by branch type\n",
    "day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']\n",
    "for btype in ['Airport', 'Suburban', 'Downtown', 'College']:\n",
    "    subset = df[df['BRANCH_TYPE'] == btype]\n",
    "    daily = subset.groupby('DAY_OF_WEEK')['RENTALS'].mean().reindex(day_order)\n",
    "    axes[0].plot(daily.index, daily.values, marker='o', linewidth=2, label=btype)\n",
    "\n",
    "axes[0].set_title('Weekly Demand Pattern by Branch Type', fontsize=13, fontweight='bold', color=NAVY)\n",
    "axes[0].set_ylabel('Avg Daily Rentals')\n",
    "axes[0].legend()\n",
    "axes[0].tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Monthly seasonality\n",
    "monthly = df.groupby('MONTH')['RENTALS'].mean()\n",
    "axes[1].bar(monthly.index, monthly.values, color=TEAL, edgecolor='white')\n",
    "axes[1].set_title('Monthly Seasonality (All Branches)', fontsize=13, fontweight='bold', color=NAVY)\n",
    "axes[1].set_xlabel('Month')\n",
    "axes[1].set_ylabel('Avg Daily Rentals')\n",
    "axes[1].set_xticks(range(1, 13))\n",
    "axes[1].set_xticklabels(['J','F','M','A','M','J','J','A','S','O','N','D'])\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"💡 Key patterns:\")\n",
    "print(\"   • Airport branches peak MIDWEEK (business travelers)\")\n",
    "print(\"   • Suburban branches peak on WEEKENDS (leisure rentals)\")\n",
    "print(\"   • Summer months (Jun-Aug) show ~30-40% higher demand than winter\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "039e122c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Time series view for two contrasting branches ──\n",
    "fig, axes = plt.subplots(2, 1, figsize=(14, 7), sharex=True)\n",
    "\n",
    "for ax, bid, label in zip(axes, ['BR-003', 'BR-004'], ['Pittsburgh Airport (Large)', 'Erie Suburban (Small)']):\n",
    "    branch = df[df['BRANCH_ID'] == bid].sort_values('DATE')\n",
    "    ax.plot(branch['DATE'], branch['RENTALS'], linewidth=0.8, color=TEAL, alpha=0.6)\n",
    "    # Rolling average overlay\n",
    "    ax.plot(branch['DATE'], branch['RENTALS'].rolling(14).mean(), linewidth=2, color=CORAL, label='14-day rolling avg')\n",
    "    ax.set_title(f'{label} — Daily Rentals Over Time', fontsize=12, fontweight='bold', color=NAVY)\n",
    "    ax.set_ylabel('Rentals')\n",
    "    ax.legend()\n",
    "\n",
    "plt.xlabel('Date')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a1349f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Event impact ──\n",
    "fig, ax = plt.subplots(figsize=(10, 4))\n",
    "event_order = ['None', 'Small', 'Medium', 'Large']\n",
    "event_means = df.groupby('LOCAL_EVENT')['RENTALS'].mean().reindex(event_order)\n",
    "colors_evt = [TEAL, GOLD, 'orange', CORAL]\n",
    "ax.bar(event_means.index, event_means.values, color=colors_evt, edgecolor='white')\n",
    "for i, v in enumerate(event_means.values):\n",
    "    mult = v / event_means['None']\n",
    "    ax.text(i, v + 1, f'{v:.0f}\\n({mult:.1f}x)', ha='center', fontsize=10, fontweight='bold')\n",
    "ax.set_title('Impact of Local Events on Demand', fontsize=13, fontweight='bold', color=NAVY)\n",
    "ax.set_ylabel('Avg Rentals')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "print(\"💡 Large events create 2x+ demand — but they're rare and hard to predict structurally.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd6faada",
   "metadata": {},
   "source": [
    "---\n",
    "## Phase 3: Data Preparation\n",
    "\n",
    "### ⚠️ Data Leakage Warning\n",
    "\n",
    "**Same-day AVAILABLE_INVENTORY must NOT be a predictor.** When inventory is low, rentals are mechanically capped. The model would learn the cap, not the demand. Use **yesterday's inventory** instead.\n",
    "\n",
    "**WALK_AWAYS is a consequence of demand**, not a predictor. Exclude it.\n",
    "\n",
    "### ⏰ Time-Based Split Warning\n",
    "\n",
    "**You CANNOT randomly split time series data.** Random splitting lets the model \"see the future\" — training on November to predict August. In real forecasting, you only ever predict FORWARD.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25f32fa9",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = df.copy()\n",
    "\n",
    "# 1. Remove duplicates\n",
    "data = data.drop_duplicates(subset=['BRANCH_ID', 'DATE'], keep='first')\n",
    "\n",
    "# 2. Remove rental outliers (>300)\n",
    "data = data[data['RENTALS'] <= 300]\n",
    "\n",
    "# 3. Fix negative walk-aways\n",
    "data.loc[data['WALK_AWAYS'] < 0, 'WALK_AWAYS'] = 0\n",
    "\n",
    "# 4. Fix holiday miscoding\n",
    "data.loc[(data['DATE'].dt.month == 7) & (data['DATE'].dt.day.isin([3,4,5,6])) & \n",
    "         (data['DATE'].dt.year == 2025), 'IS_HOLIDAY'] = 1\n",
    "\n",
    "# 5. Impute missing COMPETITOR_RATE with branch median\n",
    "data['COMPETITOR_RATE'] = pd.to_numeric(data['COMPETITOR_RATE'], errors='coerce')\n",
    "data['COMPETITOR_RATE'] = data.groupby('BRANCH_ID')['COMPETITOR_RATE'].transform(\n",
    "    lambda x: x.fillna(x.median()))\n",
    "\n",
    "# 6. Impute missing FLIGHT_ARRIVALS (0 for non-airport, branch median for airport)\n",
    "data['FLIGHT_ARRIVALS'] = pd.to_numeric(data['FLIGHT_ARRIVALS'], errors='coerce')\n",
    "data.loc[(data['BRANCH_TYPE'] != 'Airport') & (data['FLIGHT_ARRIVALS'].isnull()), 'FLIGHT_ARRIVALS'] = 0\n",
    "data['FLIGHT_ARRIVALS'] = data.groupby('BRANCH_ID')['FLIGHT_ARRIVALS'].transform(\n",
    "    lambda x: x.fillna(x.median()))\n",
    "\n",
    "print(f\"✅ Cleaned dataset: {len(data):,} rows\")\n",
    "print(f\"   Remaining nulls: {data.isnull().sum().sum()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b1cb0de",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Feature Engineering ──\n",
    "data = data.sort_values(['BRANCH_ID', 'DATE']).reset_index(drop=True)\n",
    "\n",
    "# Lag features (per branch)\n",
    "for lag in [1, 7]:\n",
    "    data[f'RENTALS_LAG_{lag}'] = data.groupby('BRANCH_ID')['RENTALS'].shift(lag)\n",
    "\n",
    "# Rolling averages\n",
    "data['ROLLING_AVG_7'] = data.groupby('BRANCH_ID')['RENTALS'].transform(\n",
    "    lambda x: x.shift(1).rolling(7, min_periods=1).mean())\n",
    "data['ROLLING_AVG_30'] = data.groupby('BRANCH_ID')['RENTALS'].transform(\n",
    "    lambda x: x.shift(1).rolling(30, min_periods=1).mean())\n",
    "\n",
    "# Lagged inventory (yesterday's, NOT today's)\n",
    "data['INVENTORY_LAG_1'] = data.groupby('BRANCH_ID')['AVAILABLE_INVENTORY'].shift(1)\n",
    "\n",
    "# Utilization lag\n",
    "data['UTILIZATION_LAG_1'] = data.groupby('BRANCH_ID').apply(\n",
    "    lambda g: g['RENTALS'].shift(1) / g['FLEET_CAPACITY']).reset_index(level=0, drop=True)\n",
    "\n",
    "# Days to next holiday\n",
    "holidays_2024_2025 = pd.to_datetime([\n",
    "    '2024-07-04','2024-09-02','2024-11-28','2024-12-25','2025-01-01',\n",
    "    '2025-05-26','2025-07-04','2025-09-01','2025-11-27','2025-12-25'])\n",
    "\n",
    "def days_to_holiday(date):\n",
    "    future = holidays_2024_2025[holidays_2024_2025 >= date]\n",
    "    return (future[0] - date).days if len(future) > 0 else 30\n",
    "\n",
    "data['DAYS_TO_HOLIDAY'] = data['DATE'].apply(days_to_holiday).clip(upper=30)\n",
    "\n",
    "# Event intensity (ordinal)\n",
    "event_map = {'None': 0, 'Small': 1, 'Medium': 2, 'Large': 3}\n",
    "data['EVENT_INTENSITY'] = data['LOCAL_EVENT'].map(event_map)\n",
    "\n",
    "# Day-of-week dummies\n",
    "data = pd.get_dummies(data, columns=['DAY_OF_WEEK'], drop_first=True, prefix='DOW')\n",
    "\n",
    "# Weather dummies\n",
    "data = pd.get_dummies(data, columns=['WEATHER_CONDITION'], drop_first=True, prefix='WX')\n",
    "\n",
    "# Branch type dummies\n",
    "data = pd.get_dummies(data, columns=['BRANCH_TYPE'], drop_first=True, prefix='BT')\n",
    "\n",
    "# Drop rows with NaN from lag features (first few days per branch)\n",
    "data = data.dropna(subset=['RENTALS_LAG_7', 'ROLLING_AVG_30'])\n",
    "\n",
    "print(f\"✅ Feature-engineered dataset: {len(data):,} rows × {len(data.columns)} columns\")\n",
    "new_feats = ['RENTALS_LAG_1','RENTALS_LAG_7','ROLLING_AVG_7','ROLLING_AVG_30',\n",
    "             'INVENTORY_LAG_1','UTILIZATION_LAG_1','DAYS_TO_HOLIDAY','EVENT_INTENSITY']\n",
    "for f in new_feats:\n",
    "    print(f\"   {f}: mean={data[f].mean():.1f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93cfffe7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Temporal Train/Validation/Test Split ──\n",
    "# Train: Jul 2024 – Sep 2025 | Val: Oct 2025 | Test: Nov–Dec 2025\n",
    "\n",
    "train = data[data['DATE'] < '2025-10-01']\n",
    "val = data[(data['DATE'] >= '2025-10-01') & (data['DATE'] < '2025-11-01')]\n",
    "test = data[data['DATE'] >= '2025-11-01']\n",
    "\n",
    "exclude = ['BRANCH_ID', 'BRANCH_NAME', 'DATE', 'RENTALS',  # IDs + target\n",
    "           'AVAILABLE_INVENTORY', 'RETURNS_TODAY', 'WALK_AWAYS',  # Leakage\n",
    "           'LOCAL_EVENT', 'LOCAL_EVENT_NAME', 'REGION', 'BRANCH_SIZE',  # Already encoded\n",
    "           'DAY_OF_WEEK', 'WEATHER_CONDITION', 'BRANCH_TYPE', 'QUARTER']  # Already encoded or redundant\n",
    "\n",
    "# Some of these might not exist if already dummified; filter safely\n",
    "feature_cols = [c for c in data.columns if c not in exclude and c in train.columns]\n",
    "\n",
    "X_train, y_train = train[feature_cols], train['RENTALS']\n",
    "X_val, y_val = val[feature_cols], val['RENTALS']\n",
    "X_test, y_test = test[feature_cols], test['RENTALS']\n",
    "\n",
    "print(f\"Train:      {len(X_train):,} rows  ({train['DATE'].min().date()} to {train['DATE'].max().date()})\")\n",
    "print(f\"Validation: {len(X_val):,} rows  ({val['DATE'].min().date()} to {val['DATE'].max().date()})\")\n",
    "print(f\"Test:       {len(X_test):,} rows  ({test['DATE'].min().date()} to {test['DATE'].max().date()})\")\n",
    "print(f\"\\n💡 Temporal split: we only ever predict FORWARD in time. No peeking at the future.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0ff9fd6",
   "metadata": {},
   "source": [
    "---\n",
    "## Phase 4: Build and Compare Models\n",
    "\n",
    "### Key Difference from ShieldScore\n",
    "We're doing **regression** now — predicting a *number* (how many rentals), not a *category* (yes/no claim). So we use MAE, RMSE, and R² instead of AUC and confusion matrices.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28efc680",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── This can take a couple of minutes, wait till you see the output before running the next block ──\n",
    "\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def evaluate(name, y_true, y_pred):\n",
    "    mae = mean_absolute_error(y_true, y_pred)\n",
    "    rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n",
    "    mape = np.mean(np.abs((y_true - y_pred) / y_true.clip(lower=1))) * 100\n",
    "    r2 = r2_score(y_true, y_pred)\n",
    "    return {\n",
    "        'Model': name,\n",
    "        'MAE': round(mae, 2),\n",
    "        'RMSE': round(rmse, 2),\n",
    "        'MAPE': f'{mape:.1f}%',\n",
    "        'R²': round(r2, 3)\n",
    "    }\n",
    "\n",
    "# ---- Remove rows with missing values ----\n",
    "train_mask = X_train.notna().all(axis=1)\n",
    "val_mask = X_val.notna().all(axis=1)\n",
    "\n",
    "X_train_clean = X_train.loc[train_mask]\n",
    "y_train_clean = y_train.loc[train_mask]\n",
    "\n",
    "X_val_clean = X_val.loc[val_mask]\n",
    "y_val_clean = y_val.loc[val_mask]\n",
    "\n",
    "results = []\n",
    "\n",
    "# ── Baseline: Same day last week ──\n",
    "baseline_pred = X_val_clean['RENTALS_LAG_7']\n",
    "results.append(evaluate('Naïve Baseline (lag 7)', y_val_clean, baseline_pred))\n",
    "\n",
    "# ── Model 1: Linear Regression ──\n",
    "lr = LinearRegression()\n",
    "lr.fit(X_train_clean, y_train_clean)\n",
    "lr_pred = lr.predict(X_val_clean)\n",
    "results.append(evaluate('Linear Regression', y_val_clean, lr_pred))\n",
    "\n",
    "# ── Model 2: Decision Tree ──\n",
    "dt = DecisionTreeRegressor(max_depth=8, random_state=42)\n",
    "dt.fit(X_train_clean, y_train_clean)\n",
    "dt_pred = dt.predict(X_val_clean)\n",
    "results.append(evaluate('Decision Tree (d=8)', y_val_clean, dt_pred))\n",
    "\n",
    "# ── Model 3: Random Forest ──\n",
    "rf = RandomForestRegressor(n_estimators=300, max_depth=12, random_state=42, n_jobs=-1)\n",
    "rf.fit(X_train_clean, y_train_clean)\n",
    "rf_pred = rf.predict(X_val_clean)\n",
    "results.append(evaluate('Random Forest (300)', y_val_clean, rf_pred))\n",
    "\n",
    "# ── Model 4: Gradient Boosting ──\n",
    "gb = GradientBoostingRegressor(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42)\n",
    "gb.fit(X_train_clean, y_train_clean)\n",
    "gb_pred = gb.predict(X_val_clean)\n",
    "results.append(evaluate('Gradient Boosting', y_val_clean, gb_pred))\n",
    "\n",
    "results_df = pd.DataFrame(results)\n",
    "print(\"📊 Model Comparison (Validation Set)\")\n",
    "print(\"=\" * 70)\n",
    "print(results_df.to_string(index=False))\n",
    "print(f\"\\n🏆 Best MAE: {results_df.loc[results_df['MAE'].idxmin(), 'Model']}\")\n",
    "print(f\"\\n💡 Every ML model beats the naïve baseline — the analytics adds real value.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ed0a8ad",
   "metadata": {},
   "source": [
    "---\n",
    "## Phase 5: Evaluate the Champion\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21f2b96b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Actual vs. Predicted scatter plot ──\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Gradient Boosting (champion)\n",
    "axes[0].scatter(y_val, gb_pred, alpha=0.4, s=20, color=TEAL, edgecolors='none')\n",
    "axes[0].plot([0, y_val.max()], [0, y_val.max()], 'k--', linewidth=1, label='Perfect forecast')\n",
    "axes[0].set_xlabel('Actual Rentals')\n",
    "axes[0].set_ylabel('Predicted Rentals')\n",
    "axes[0].set_title('Gradient Boosting: Actual vs. Predicted', fontsize=13, fontweight='bold', color=NAVY)\n",
    "axes[0].legend()\n",
    "\n",
    "# Residual distribution\n",
    "residuals = y_val.values - gb_pred\n",
    "axes[1].hist(residuals, bins=30, color=TEAL, edgecolor='white', alpha=0.8)\n",
    "axes[1].axvline(x=0, color=CORAL, linewidth=2, label=f'Mean residual: {residuals.mean():.1f}')\n",
    "axes[1].set_xlabel('Residual (Actual − Predicted)')\n",
    "axes[1].set_ylabel('Count')\n",
    "axes[1].set_title('Residual Distribution', fontsize=13, fontweight='bold', color=NAVY)\n",
    "axes[1].legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"📏 Mean residual: {residuals.mean():.2f} (near zero = unbiased)\")\n",
    "print(f\"   Std of residuals: {residuals.std():.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "818928c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Feature Importance ──\n",
    "importance = pd.Series(gb.feature_importances_, index=X_train.columns).sort_values(ascending=True)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 8))\n",
    "importance.tail(15).plot(kind='barh', ax=ax, color=TEAL, edgecolor='white')\n",
    "ax.set_title('Top 15 Features — Gradient Boosting', fontsize=14, fontweight='bold', color=NAVY)\n",
    "ax.set_xlabel('Feature Importance')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\n💡 Top demand drivers:\")\n",
    "for feat, imp in importance.tail(5).items():\n",
    "    print(f\"   {feat}: {imp:.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "983a0618",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Performance by Branch Type ──\n",
    "val_with_info = val.loc[X_val_clean.index, ['BRANCH_ID', 'DATE', 'RENTALS']].copy()\n",
    "val_with_info['PREDICTED'] = gb_pred\n",
    "\n",
    "if 'BRANCH_TYPE' in val.columns:\n",
    "    val_with_info['BRANCH_TYPE'] = val.loc[X_val_clean.index, 'BRANCH_TYPE']\n",
    "else:\n",
    "    branch_map = df[['BRANCH_ID', 'BRANCH_TYPE']].drop_duplicates().set_index('BRANCH_ID')['BRANCH_TYPE']\n",
    "    val_with_info['BRANCH_TYPE'] = val_with_info['BRANCH_ID'].map(branch_map)\n",
    "\n",
    "bt_results = []\n",
    "for bt in val_with_info['BRANCH_TYPE'].dropna().unique():\n",
    "    subset = val_with_info[val_with_info['BRANCH_TYPE'] == bt]\n",
    "    mae = mean_absolute_error(subset['RENTALS'], subset['PREDICTED'])\n",
    "    bt_results.append({\n",
    "        'Branch Type': bt,\n",
    "        'MAE': round(mae, 1),\n",
    "        'Count': len(subset)\n",
    "    })\n",
    "\n",
    "bt_df = pd.DataFrame(bt_results).sort_values('MAE')\n",
    "print(\"📊 MAE by Branch Type\")\n",
    "print(bt_df.to_string(index=False))\n",
    "print(\"\\n💡 If one branch type has much higher MAE, consider building separate models per type.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cac784b3",
   "metadata": {},
   "source": [
    "---\n",
    "## Phase 6: Deploy FleetPulse\n",
    "\n",
    "### 6.1 Scoring a Single Branch-Day\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c110131",
   "metadata": {},
   "outputs": [],
   "source": [
    "def fleetpulse_forecast(branch_data, model=gb, feature_columns=X_train.columns):\n",
    "    \"\"\"Generate a demand forecast for a single branch-day.\"\"\"\n",
    "    input_df = pd.DataFrame([branch_data])\n",
    "    for col in feature_columns:\n",
    "        if col not in input_df.columns:\n",
    "            input_df[col] = 0\n",
    "    input_df = input_df[feature_columns]\n",
    "    \n",
    "    forecast = max(0, int(round(model.predict(input_df)[0])))\n",
    "    capacity = branch_data.get('FLEET_CAPACITY', 100)\n",
    "    utilization = forecast / capacity\n",
    "    \n",
    "    if utilization >= 0.90: level = \"🔴 SURGE\"\n",
    "    elif utilization >= 0.75: level = \"🟠 HIGH\"\n",
    "    elif utilization >= 0.50: level = \"🟢 NORMAL\"\n",
    "    else: level = \"🔵 LOW\"\n",
    "    \n",
    "    return {\n",
    "        'Forecast': forecast,\n",
    "        'Capacity': capacity,\n",
    "        'Utilization': f'{utilization:.0%}',\n",
    "        'Level': level,\n",
    "        'Action': 'Request transfers' if utilization >= 0.85 else \n",
    "                  'Transfer out excess' if utilization < 0.40 else 'Hold steady'\n",
    "    }\n",
    "\n",
    "# Example forecasts\n",
    "airport_tuesday = {\n",
    "    'FLEET_CAPACITY': 180, 'MONTH': 7, 'IS_HOLIDAY': 0, 'IS_WEEKEND': 0,\n",
    "    'EVENT_INTENSITY': 0, 'RENTALS_LAG_7': 105, 'RENTALS_LAG_1': 98,\n",
    "    'ROLLING_AVG_7': 100, 'ROLLING_AVG_30': 95, 'INVENTORY_LAG_1': 80,\n",
    "    'UTILIZATION_LAG_1': 0.58, 'DAYS_TO_HOLIDAY': 15,\n",
    "    'AVG_DAILY_RATE': 68, 'COMPETITOR_RATE': 62, 'HIGH_TEMP_F': 82,\n",
    "    'FLIGHT_ARRIVALS': 250, 'UNEMPLOYMENT_RATE': 3.8, 'GAS_PRICE': 3.45,\n",
    "    'PCT_ECONOMY': 0.15, 'PCT_MIDSIZE': 0.35, 'PCT_SUV': 0.30,\n",
    "    'PCT_PREMIUM': 0.10, 'PCT_TRUCK': 0.10,\n",
    "    'BT_Downtown': 0, 'BT_Highway': 0, 'BT_Suburban': 0, 'BT_College': 0,\n",
    "    'DOW_Tuesday': 1, 'DOW_Wednesday': 0, 'DOW_Thursday': 0,\n",
    "    'DOW_Friday': 0, 'DOW_Saturday': 0, 'DOW_Sunday': 0,\n",
    "    'WX_Rain': 0, 'WX_Severe': 0, 'WX_Snow': 0,\n",
    "}\n",
    "\n",
    "suburban_saturday = airport_tuesday.copy()\n",
    "suburban_saturday.update({\n",
    "    'FLEET_CAPACITY': 45, 'RENTALS_LAG_7': 35, 'RENTALS_LAG_1': 20,\n",
    "    'ROLLING_AVG_7': 22, 'ROLLING_AVG_30': 20, 'INVENTORY_LAG_1': 30,\n",
    "    'UTILIZATION_LAG_1': 0.44, 'FLIGHT_ARRIVALS': 0, 'AVG_DAILY_RATE': 42,\n",
    "    'COMPETITOR_RATE': 38,\n",
    "    'BT_Suburban': 1, 'BT_Downtown': 0,\n",
    "    'DOW_Tuesday': 0, 'DOW_Saturday': 1,\n",
    "})\n",
    "\n",
    "print(\"🛫 Airport branch, Tuesday in July:\")\n",
    "for k, v in fleetpulse_forecast(airport_tuesday).items():\n",
    "    print(f\"   {k}: {v}\")\n",
    "\n",
    "print(\"\\n🏡 Suburban branch, Saturday in July:\")\n",
    "for k, v in fleetpulse_forecast(suburban_saturday).items():\n",
    "    print(f\"   {k}: {v}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a827747c",
   "metadata": {},
   "source": [
    "### 6.2 Batch Forecast — All Branches\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "129dfdc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Score the entire test set ──\n",
    "test_mask = X_test.notna().all(axis=1)\n",
    "X_test_clean = X_test.loc[test_mask]\n",
    "\n",
    "test_preds = gb.predict(X_test_clean)\n",
    "\n",
    "test_results = test.loc[X_test_clean.index, \n",
    "    ['BRANCH_ID', 'BRANCH_NAME', 'DATE', 'RENTALS', 'FLEET_CAPACITY']\n",
    "].copy()\n",
    "\n",
    "test_results['FORECAST'] = np.clip(np.round(test_preds).astype(int), 0, None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "766bcf7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── Forecast accuracy over time ──\n",
    "daily_mae = test_results.groupby('DATE').apply(\n",
    "    lambda g: mean_absolute_error(g['RENTALS'], g['FORECAST'])).reset_index()\n",
    "daily_mae.columns = ['DATE', 'MAE']\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 4))\n",
    "ax.plot(daily_mae['DATE'], daily_mae['MAE'], color=TEAL, linewidth=1.5)\n",
    "ax.axhline(y=daily_mae['MAE'].mean(), color=CORAL, linestyle='--', \n",
    "           label=f'Average MAE: {daily_mae[\"MAE\"].mean():.1f}')\n",
    "ax.fill_between(daily_mae['DATE'], daily_mae['MAE'], alpha=0.1, color=TEAL)\n",
    "ax.set_title('Forecast Error Over Time (Test Period)', fontsize=13, fontweight='bold', color=NAVY)\n",
    "ax.set_ylabel('MAE (vehicles)')\n",
    "ax.set_xlabel('Date')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "print(\"💡 Stable error over time = no concept drift. If MAE trends upward, the model needs retraining.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15a3e57e",
   "metadata": {},
   "source": [
    "---\n",
    "## Key Takeaways\n",
    "\n",
    "1. **Same framework, different technique** — PAIR and BizML work identically for regression (FleetPulse) and classification (ShieldScore)\n",
    "\n",
    "2. **Temporal splits are mandatory for time series** — random splitting lets the model cheat by seeing the future\n",
    "\n",
    "3. **Data leakage looks different here** — same-day inventory caps rentals mechanically; use lagged versions instead\n",
    "\n",
    "4. **MAE is the business metric** — \"off by 7 vehicles\" is what the operations VP understands, not R²\n",
    "\n",
    "5. **The naïve baseline matters** — if your ML model can't beat \"same as last Tuesday,\" it's not adding value\n",
    "\n",
    "6. **87% of ML projects never reach production** — the Excel demand planner that a branch manager uses daily beats a perfect API nobody logs into\n",
    "\n",
    "> *\"Methodology over tools. The thinking stays the same regardless of the industry.\"*\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}