{ "cells": [ { "cell_type": "markdown", "id": "e4857927", "metadata": {}, "source": [ "# πŸš— FleetPulse: Forecasting Vehicle Demand\n", "## Enterprise Mobility β€” End-to-End Demand Analytics Demo\n", "\n", "**Course:** GMBA 621 / FINC 332 β€” Predictive Analytics & Data Mining \n", "**Framework:** BizML Lifecycle + PAIR Framework \n", "**Techniques:** EDA β†’ Feature Engineering β†’ Regression / Time Series β†’ Model Evaluation β†’ Deployment\n", "\n", "---\n", "\n", "### The Business Problem\n", "\n", "> *Enterprise Mobility wants to forecast daily rental demand at the branch level, 7 days ahead, so fleet operations can proactively position vehicles where they'll be needed β€” reducing idle inventory costs and customer walk-aways.*\n", "\n", "### PAIR Framework\n", "\n", "| Element | Application |\n", "|---------|------------|\n", "| **Prediction** | Number of rentals per branch per day, 7 days out |\n", "| **Action** | Pre-position vehicles via overnight transfers; adjust pricing to shift demand |\n", "| **Impact** | Reduced idle fleet cost ($35–45/day/vehicle), fewer walk-aways ($85+ lost), cheaper proactive transfers |\n", "| **Risk** | Over-forecasting wastes transport budget; under-forecasting loses customers; one-time events missed |\n", "\n", "### How This Differs from ShieldScore\n", "\n", "| | ShieldScore | FleetPulse |\n", "|---|---|---|\n", "| **Target** | Binary (claim yes/no) | Continuous (rental count) |\n", "| **Metrics** | AUC, sensitivity, specificity | MAE, RMSE, MAPE, RΒ² |\n", "| **Split** | Random stratified | Temporal (train before test) |\n", "| **Key lesson** | Threshold selection | Forecast horizon & seasonality |\n" ] }, { "cell_type": "markdown", "id": "e402b67e", "metadata": {}, "source": [ "## Phase 1: Setup & Data Loading" ] }, { "cell_type": "code", "execution_count": null, "id": "bc076df1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "plt.rcParams['figure.figsize'] = (12, 5)\n", "plt.rcParams['font.family'] = 'sans-serif'\n", "plt.rcParams['axes.spines.top'] = False\n", "plt.rcParams['axes.spines.right'] = False\n", "\n", "NAVY, TEAL, GOLD = '#1B2A4A', '#2A7B88', '#D4A843'\n", "CREAM, CORAL = '#FDF6E3', '#E07A5F'\n", "palette = [TEAL, NAVY, GOLD, CORAL, '#81B29A', '#3D405B']\n", "sns.set_palette(palette)\n", "\n", "print(\"βœ… Libraries loaded\")" ] }, { "cell_type": "code", "execution_count": null, "id": "2c302e75", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('fleetpulse_enterprise_demand.csv', parse_dates=['DATE'])\n", "\n", "print(f\"Dataset: {df.shape[0]:,} rows Γ— {df.shape[1]} columns\")\n", "print(f\"Branches: {df['BRANCH_ID'].nunique()}\")\n", "print(f\"Date range: {df['DATE'].min().date()} to {df['DATE'].max().date()}\")\n", "print(f\"Avg daily rentals: {df['RENTALS'].mean():.1f}\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "ff0133a4", "metadata": {}, "source": [ "---\n", "## Phase 2: Exploratory Data Analysis\n", "\n", "### 2.1 Data Quality Audit\n" ] }, { "cell_type": "code", "execution_count": null, "id": "938d30c5", "metadata": {}, "outputs": [], "source": [ "print(\"πŸ” DATA QUALITY AUDIT\")\n", "print(\"=\" * 60)\n", "\n", "# Missing values\n", "print(\"\\n1️⃣ MISSING VALUES:\")\n", "missing = df.isnull().sum()\n", "for col in missing[missing > 0].index:\n", " print(f\" {col}: {missing[col]} ({missing[col]/len(df)*100:.1f}%)\")\n", "\n", "# Missing dates (gaps in time series)\n", "print(\"\\n2️⃣ MISSING DATES (time series gaps):\")\n", "gap_count = 0\n", "for bid in df['BRANCH_ID'].unique():\n", " branch = df[df['BRANCH_ID'] == bid].sort_values('DATE')\n", " full_range = pd.date_range(branch['DATE'].min(), branch['DATE'].max())\n", " missing_dates = full_range.difference(branch['DATE'])\n", " if len(missing_dates) > 0:\n", " gap_count += len(missing_dates)\n", "print(f\" {gap_count} missing branch-days across all branches\")\n", "\n", "# Outliers\n", "outliers = df[df['RENTALS'] > 300]\n", "print(f\"\\n3️⃣ RENTAL OUTLIERS (> 300): {len(outliers)} records\")\n", "if len(outliers) > 0:\n", " print(f\" Values: {outliers['RENTALS'].tolist()}\")\n", "\n", "# Negative walk-aways\n", "neg_wa = df[df['WALK_AWAYS'] < 0]\n", "print(f\"\\n4️⃣ NEGATIVE WALK_AWAYS: {len(neg_wa)} records\")\n", "\n", "# Duplicates\n", "dups = df.duplicated(subset=['BRANCH_ID', 'DATE'], keep=False)\n", "print(f\"\\n5️⃣ DUPLICATE BRANCH-DAYS: {dups.sum()} rows\")\n", "\n", "# Holiday miscoding check\n", "jul4_weekend = df[(df['DATE'].dt.month == 7) & (df['DATE'].dt.day.isin([5,6])) & (df['DATE'].dt.year == 2025)]\n", "print(f\"\\n6️⃣ JULY 4 WEEKEND 2025 (IS_HOLIDAY): {jul4_weekend['IS_HOLIDAY'].unique()} ← should include 1!\")" ] }, { "cell_type": "markdown", "id": "49f4c11c", "metadata": {}, "source": [ "### 2.2 Demand Patterns\n", "The most important patterns in fleet data are **weekly cycles** and **seasonality**.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ceffa107", "metadata": {}, "outputs": [], "source": [ "# ── Weekly pattern by branch type ──\n", "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", "\n", "# Average rentals by day of week, by branch type\n", "day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']\n", "for btype in ['Airport', 'Suburban', 'Downtown', 'College']:\n", " subset = df[df['BRANCH_TYPE'] == btype]\n", " daily = subset.groupby('DAY_OF_WEEK')['RENTALS'].mean().reindex(day_order)\n", " axes[0].plot(daily.index, daily.values, marker='o', linewidth=2, label=btype)\n", "\n", "axes[0].set_title('Weekly Demand Pattern by Branch Type', fontsize=13, fontweight='bold', color=NAVY)\n", "axes[0].set_ylabel('Avg Daily Rentals')\n", "axes[0].legend()\n", "axes[0].tick_params(axis='x', rotation=45)\n", "\n", "# Monthly seasonality\n", "monthly = df.groupby('MONTH')['RENTALS'].mean()\n", "axes[1].bar(monthly.index, monthly.values, color=TEAL, edgecolor='white')\n", "axes[1].set_title('Monthly Seasonality (All Branches)', fontsize=13, fontweight='bold', color=NAVY)\n", "axes[1].set_xlabel('Month')\n", "axes[1].set_ylabel('Avg Daily Rentals')\n", "axes[1].set_xticks(range(1, 13))\n", "axes[1].set_xticklabels(['J','F','M','A','M','J','J','A','S','O','N','D'])\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"πŸ’‘ Key patterns:\")\n", "print(\" β€’ Airport branches peak MIDWEEK (business travelers)\")\n", "print(\" β€’ Suburban branches peak on WEEKENDS (leisure rentals)\")\n", "print(\" β€’ Summer months (Jun-Aug) show ~30-40% higher demand than winter\")" ] }, { "cell_type": "code", "execution_count": null, "id": "039e122c", "metadata": {}, "outputs": [], "source": [ "# ── Time series view for two contrasting branches ──\n", "fig, axes = plt.subplots(2, 1, figsize=(14, 7), sharex=True)\n", "\n", "for ax, bid, label in zip(axes, ['BR-003', 'BR-004'], ['Pittsburgh Airport (Large)', 'Erie Suburban (Small)']):\n", " branch = df[df['BRANCH_ID'] == bid].sort_values('DATE')\n", " ax.plot(branch['DATE'], branch['RENTALS'], linewidth=0.8, color=TEAL, alpha=0.6)\n", " # Rolling average overlay\n", " ax.plot(branch['DATE'], branch['RENTALS'].rolling(14).mean(), linewidth=2, color=CORAL, label='14-day rolling avg')\n", " ax.set_title(f'{label} β€” Daily Rentals Over Time', fontsize=12, fontweight='bold', color=NAVY)\n", " ax.set_ylabel('Rentals')\n", " ax.legend()\n", "\n", "plt.xlabel('Date')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "0a1349f7", "metadata": {}, "outputs": [], "source": [ "# ── Event impact ──\n", "fig, ax = plt.subplots(figsize=(10, 4))\n", "event_order = ['None', 'Small', 'Medium', 'Large']\n", "event_means = df.groupby('LOCAL_EVENT')['RENTALS'].mean().reindex(event_order)\n", "colors_evt = [TEAL, GOLD, 'orange', CORAL]\n", "ax.bar(event_means.index, event_means.values, color=colors_evt, edgecolor='white')\n", "for i, v in enumerate(event_means.values):\n", " mult = v / event_means['None']\n", " ax.text(i, v + 1, f'{v:.0f}\\n({mult:.1f}x)', ha='center', fontsize=10, fontweight='bold')\n", "ax.set_title('Impact of Local Events on Demand', fontsize=13, fontweight='bold', color=NAVY)\n", "ax.set_ylabel('Avg Rentals')\n", "plt.tight_layout()\n", "plt.show()\n", "print(\"πŸ’‘ Large events create 2x+ demand β€” but they're rare and hard to predict structurally.\")" ] }, { "cell_type": "markdown", "id": "dd6faada", "metadata": {}, "source": [ "---\n", "## Phase 3: Data Preparation\n", "\n", "### ⚠️ Data Leakage Warning\n", "\n", "**Same-day AVAILABLE_INVENTORY must NOT be a predictor.** When inventory is low, rentals are mechanically capped. The model would learn the cap, not the demand. Use **yesterday's inventory** instead.\n", "\n", "**WALK_AWAYS is a consequence of demand**, not a predictor. Exclude it.\n", "\n", "### ⏰ Time-Based Split Warning\n", "\n", "**You CANNOT randomly split time series data.** Random splitting lets the model \"see the future\" β€” training on November to predict August. In real forecasting, you only ever predict FORWARD.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "25f32fa9", "metadata": {}, "outputs": [], "source": [ "data = df.copy()\n", "\n", "# 1. Remove duplicates\n", "data = data.drop_duplicates(subset=['BRANCH_ID', 'DATE'], keep='first')\n", "\n", "# 2. Remove rental outliers (>300)\n", "data = data[data['RENTALS'] <= 300]\n", "\n", "# 3. Fix negative walk-aways\n", "data.loc[data['WALK_AWAYS'] < 0, 'WALK_AWAYS'] = 0\n", "\n", "# 4. Fix holiday miscoding\n", "data.loc[(data['DATE'].dt.month == 7) & (data['DATE'].dt.day.isin([3,4,5,6])) & \n", " (data['DATE'].dt.year == 2025), 'IS_HOLIDAY'] = 1\n", "\n", "# 5. Impute missing COMPETITOR_RATE with branch median\n", "data['COMPETITOR_RATE'] = pd.to_numeric(data['COMPETITOR_RATE'], errors='coerce')\n", "data['COMPETITOR_RATE'] = data.groupby('BRANCH_ID')['COMPETITOR_RATE'].transform(\n", " lambda x: x.fillna(x.median()))\n", "\n", "# 6. Impute missing FLIGHT_ARRIVALS (0 for non-airport, branch median for airport)\n", "data['FLIGHT_ARRIVALS'] = pd.to_numeric(data['FLIGHT_ARRIVALS'], errors='coerce')\n", "data.loc[(data['BRANCH_TYPE'] != 'Airport') & (data['FLIGHT_ARRIVALS'].isnull()), 'FLIGHT_ARRIVALS'] = 0\n", "data['FLIGHT_ARRIVALS'] = data.groupby('BRANCH_ID')['FLIGHT_ARRIVALS'].transform(\n", " lambda x: x.fillna(x.median()))\n", "\n", "print(f\"βœ… Cleaned dataset: {len(data):,} rows\")\n", "print(f\" Remaining nulls: {data.isnull().sum().sum()}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1b1cb0de", "metadata": {}, "outputs": [], "source": [ "# ── Feature Engineering ──\n", "data = data.sort_values(['BRANCH_ID', 'DATE']).reset_index(drop=True)\n", "\n", "# Lag features (per branch)\n", "for lag in [1, 7]:\n", " data[f'RENTALS_LAG_{lag}'] = data.groupby('BRANCH_ID')['RENTALS'].shift(lag)\n", "\n", "# Rolling averages\n", "data['ROLLING_AVG_7'] = data.groupby('BRANCH_ID')['RENTALS'].transform(\n", " lambda x: x.shift(1).rolling(7, min_periods=1).mean())\n", "data['ROLLING_AVG_30'] = data.groupby('BRANCH_ID')['RENTALS'].transform(\n", " lambda x: x.shift(1).rolling(30, min_periods=1).mean())\n", "\n", "# Lagged inventory (yesterday's, NOT today's)\n", "data['INVENTORY_LAG_1'] = data.groupby('BRANCH_ID')['AVAILABLE_INVENTORY'].shift(1)\n", "\n", "# Utilization lag\n", "data['UTILIZATION_LAG_1'] = data.groupby('BRANCH_ID').apply(\n", " lambda g: g['RENTALS'].shift(1) / g['FLEET_CAPACITY']).reset_index(level=0, drop=True)\n", "\n", "# Days to next holiday\n", "holidays_2024_2025 = pd.to_datetime([\n", " '2024-07-04','2024-09-02','2024-11-28','2024-12-25','2025-01-01',\n", " '2025-05-26','2025-07-04','2025-09-01','2025-11-27','2025-12-25'])\n", "\n", "def days_to_holiday(date):\n", " future = holidays_2024_2025[holidays_2024_2025 >= date]\n", " return (future[0] - date).days if len(future) > 0 else 30\n", "\n", "data['DAYS_TO_HOLIDAY'] = data['DATE'].apply(days_to_holiday).clip(upper=30)\n", "\n", "# Event intensity (ordinal)\n", "event_map = {'None': 0, 'Small': 1, 'Medium': 2, 'Large': 3}\n", "data['EVENT_INTENSITY'] = data['LOCAL_EVENT'].map(event_map)\n", "\n", "# Day-of-week dummies\n", "data = pd.get_dummies(data, columns=['DAY_OF_WEEK'], drop_first=True, prefix='DOW')\n", "\n", "# Weather dummies\n", "data = pd.get_dummies(data, columns=['WEATHER_CONDITION'], drop_first=True, prefix='WX')\n", "\n", "# Branch type dummies\n", "data = pd.get_dummies(data, columns=['BRANCH_TYPE'], drop_first=True, prefix='BT')\n", "\n", "# Drop rows with NaN from lag features (first few days per branch)\n", "data = data.dropna(subset=['RENTALS_LAG_7', 'ROLLING_AVG_30'])\n", "\n", "print(f\"βœ… Feature-engineered dataset: {len(data):,} rows Γ— {len(data.columns)} columns\")\n", "new_feats = ['RENTALS_LAG_1','RENTALS_LAG_7','ROLLING_AVG_7','ROLLING_AVG_30',\n", " 'INVENTORY_LAG_1','UTILIZATION_LAG_1','DAYS_TO_HOLIDAY','EVENT_INTENSITY']\n", "for f in new_feats:\n", " print(f\" {f}: mean={data[f].mean():.1f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "93cfffe7", "metadata": {}, "outputs": [], "source": [ "# ── Temporal Train/Validation/Test Split ──\n", "# Train: Jul 2024 – Sep 2025 | Val: Oct 2025 | Test: Nov–Dec 2025\n", "\n", "train = data[data['DATE'] < '2025-10-01']\n", "val = data[(data['DATE'] >= '2025-10-01') & (data['DATE'] < '2025-11-01')]\n", "test = data[data['DATE'] >= '2025-11-01']\n", "\n", "exclude = ['BRANCH_ID', 'BRANCH_NAME', 'DATE', 'RENTALS', # IDs + target\n", " 'AVAILABLE_INVENTORY', 'RETURNS_TODAY', 'WALK_AWAYS', # Leakage\n", " 'LOCAL_EVENT', 'LOCAL_EVENT_NAME', 'REGION', 'BRANCH_SIZE', # Already encoded\n", " 'DAY_OF_WEEK', 'WEATHER_CONDITION', 'BRANCH_TYPE', 'QUARTER'] # Already encoded or redundant\n", "\n", "# Some of these might not exist if already dummified; filter safely\n", "feature_cols = [c for c in data.columns if c not in exclude and c in train.columns]\n", "\n", "X_train, y_train = train[feature_cols], train['RENTALS']\n", "X_val, y_val = val[feature_cols], val['RENTALS']\n", "X_test, y_test = test[feature_cols], test['RENTALS']\n", "\n", "print(f\"Train: {len(X_train):,} rows ({train['DATE'].min().date()} to {train['DATE'].max().date()})\")\n", "print(f\"Validation: {len(X_val):,} rows ({val['DATE'].min().date()} to {val['DATE'].max().date()})\")\n", "print(f\"Test: {len(X_test):,} rows ({test['DATE'].min().date()} to {test['DATE'].max().date()})\")\n", "print(f\"\\nπŸ’‘ Temporal split: we only ever predict FORWARD in time. No peeking at the future.\")" ] }, { "cell_type": "markdown", "id": "e0ff9fd6", "metadata": {}, "source": [ "---\n", "## Phase 4: Build and Compare Models\n", "\n", "### Key Difference from ShieldScore\n", "We're doing **regression** now β€” predicting a *number* (how many rentals), not a *category* (yes/no claim). So we use MAE, RMSE, and RΒ² instead of AUC and confusion matrices.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "28efc680", "metadata": {}, "outputs": [], "source": [ "# ── This can take a couple of minutes, wait till you see the output before running the next block ──\n", "\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n", "import numpy as np\n", "import pandas as pd\n", "\n", "def evaluate(name, y_true, y_pred):\n", " mae = mean_absolute_error(y_true, y_pred)\n", " rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n", " mape = np.mean(np.abs((y_true - y_pred) / y_true.clip(lower=1))) * 100\n", " r2 = r2_score(y_true, y_pred)\n", " return {\n", " 'Model': name,\n", " 'MAE': round(mae, 2),\n", " 'RMSE': round(rmse, 2),\n", " 'MAPE': f'{mape:.1f}%',\n", " 'RΒ²': round(r2, 3)\n", " }\n", "\n", "# ---- Remove rows with missing values ----\n", "train_mask = X_train.notna().all(axis=1)\n", "val_mask = X_val.notna().all(axis=1)\n", "\n", "X_train_clean = X_train.loc[train_mask]\n", "y_train_clean = y_train.loc[train_mask]\n", "\n", "X_val_clean = X_val.loc[val_mask]\n", "y_val_clean = y_val.loc[val_mask]\n", "\n", "results = []\n", "\n", "# ── Baseline: Same day last week ──\n", "baseline_pred = X_val_clean['RENTALS_LAG_7']\n", "results.append(evaluate('NaΓ―ve Baseline (lag 7)', y_val_clean, baseline_pred))\n", "\n", "# ── Model 1: Linear Regression ──\n", "lr = LinearRegression()\n", "lr.fit(X_train_clean, y_train_clean)\n", "lr_pred = lr.predict(X_val_clean)\n", "results.append(evaluate('Linear Regression', y_val_clean, lr_pred))\n", "\n", "# ── Model 2: Decision Tree ──\n", "dt = DecisionTreeRegressor(max_depth=8, random_state=42)\n", "dt.fit(X_train_clean, y_train_clean)\n", "dt_pred = dt.predict(X_val_clean)\n", "results.append(evaluate('Decision Tree (d=8)', y_val_clean, dt_pred))\n", "\n", "# ── Model 3: Random Forest ──\n", "rf = RandomForestRegressor(n_estimators=300, max_depth=12, random_state=42, n_jobs=-1)\n", "rf.fit(X_train_clean, y_train_clean)\n", "rf_pred = rf.predict(X_val_clean)\n", "results.append(evaluate('Random Forest (300)', y_val_clean, rf_pred))\n", "\n", "# ── Model 4: Gradient Boosting ──\n", "gb = GradientBoostingRegressor(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42)\n", "gb.fit(X_train_clean, y_train_clean)\n", "gb_pred = gb.predict(X_val_clean)\n", "results.append(evaluate('Gradient Boosting', y_val_clean, gb_pred))\n", "\n", "results_df = pd.DataFrame(results)\n", "print(\"πŸ“Š Model Comparison (Validation Set)\")\n", "print(\"=\" * 70)\n", "print(results_df.to_string(index=False))\n", "print(f\"\\nπŸ† Best MAE: {results_df.loc[results_df['MAE'].idxmin(), 'Model']}\")\n", "print(f\"\\nπŸ’‘ Every ML model beats the naΓ―ve baseline β€” the analytics adds real value.\")" ] }, { "cell_type": "markdown", "id": "5ed0a8ad", "metadata": {}, "source": [ "---\n", "## Phase 5: Evaluate the Champion\n" ] }, { "cell_type": "code", "execution_count": null, "id": "21f2b96b", "metadata": {}, "outputs": [], "source": [ "# ── Actual vs. Predicted scatter plot ──\n", "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", "\n", "# Gradient Boosting (champion)\n", "axes[0].scatter(y_val, gb_pred, alpha=0.4, s=20, color=TEAL, edgecolors='none')\n", "axes[0].plot([0, y_val.max()], [0, y_val.max()], 'k--', linewidth=1, label='Perfect forecast')\n", "axes[0].set_xlabel('Actual Rentals')\n", "axes[0].set_ylabel('Predicted Rentals')\n", "axes[0].set_title('Gradient Boosting: Actual vs. Predicted', fontsize=13, fontweight='bold', color=NAVY)\n", "axes[0].legend()\n", "\n", "# Residual distribution\n", "residuals = y_val.values - gb_pred\n", "axes[1].hist(residuals, bins=30, color=TEAL, edgecolor='white', alpha=0.8)\n", "axes[1].axvline(x=0, color=CORAL, linewidth=2, label=f'Mean residual: {residuals.mean():.1f}')\n", "axes[1].set_xlabel('Residual (Actual βˆ’ Predicted)')\n", "axes[1].set_ylabel('Count')\n", "axes[1].set_title('Residual Distribution', fontsize=13, fontweight='bold', color=NAVY)\n", "axes[1].legend()\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(f\"πŸ“ Mean residual: {residuals.mean():.2f} (near zero = unbiased)\")\n", "print(f\" Std of residuals: {residuals.std():.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "818928c2", "metadata": {}, "outputs": [], "source": [ "# ── Feature Importance ──\n", "importance = pd.Series(gb.feature_importances_, index=X_train.columns).sort_values(ascending=True)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 8))\n", "importance.tail(15).plot(kind='barh', ax=ax, color=TEAL, edgecolor='white')\n", "ax.set_title('Top 15 Features β€” Gradient Boosting', fontsize=14, fontweight='bold', color=NAVY)\n", "ax.set_xlabel('Feature Importance')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\nπŸ’‘ Top demand drivers:\")\n", "for feat, imp in importance.tail(5).items():\n", " print(f\" {feat}: {imp:.3f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "983a0618", "metadata": {}, "outputs": [], "source": [ "# ── Performance by Branch Type ──\n", "val_with_info = val.loc[X_val_clean.index, ['BRANCH_ID', 'DATE', 'RENTALS']].copy()\n", "val_with_info['PREDICTED'] = gb_pred\n", "\n", "if 'BRANCH_TYPE' in val.columns:\n", " val_with_info['BRANCH_TYPE'] = val.loc[X_val_clean.index, 'BRANCH_TYPE']\n", "else:\n", " branch_map = df[['BRANCH_ID', 'BRANCH_TYPE']].drop_duplicates().set_index('BRANCH_ID')['BRANCH_TYPE']\n", " val_with_info['BRANCH_TYPE'] = val_with_info['BRANCH_ID'].map(branch_map)\n", "\n", "bt_results = []\n", "for bt in val_with_info['BRANCH_TYPE'].dropna().unique():\n", " subset = val_with_info[val_with_info['BRANCH_TYPE'] == bt]\n", " mae = mean_absolute_error(subset['RENTALS'], subset['PREDICTED'])\n", " bt_results.append({\n", " 'Branch Type': bt,\n", " 'MAE': round(mae, 1),\n", " 'Count': len(subset)\n", " })\n", "\n", "bt_df = pd.DataFrame(bt_results).sort_values('MAE')\n", "print(\"πŸ“Š MAE by Branch Type\")\n", "print(bt_df.to_string(index=False))\n", "print(\"\\nπŸ’‘ If one branch type has much higher MAE, consider building separate models per type.\")" ] }, { "cell_type": "markdown", "id": "cac784b3", "metadata": {}, "source": [ "---\n", "## Phase 6: Deploy FleetPulse\n", "\n", "### 6.1 Scoring a Single Branch-Day\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7c110131", "metadata": {}, "outputs": [], "source": [ "def fleetpulse_forecast(branch_data, model=gb, feature_columns=X_train.columns):\n", " \"\"\"Generate a demand forecast for a single branch-day.\"\"\"\n", " input_df = pd.DataFrame([branch_data])\n", " for col in feature_columns:\n", " if col not in input_df.columns:\n", " input_df[col] = 0\n", " input_df = input_df[feature_columns]\n", " \n", " forecast = max(0, int(round(model.predict(input_df)[0])))\n", " capacity = branch_data.get('FLEET_CAPACITY', 100)\n", " utilization = forecast / capacity\n", " \n", " if utilization >= 0.90: level = \"πŸ”΄ SURGE\"\n", " elif utilization >= 0.75: level = \"🟠 HIGH\"\n", " elif utilization >= 0.50: level = \"🟒 NORMAL\"\n", " else: level = \"πŸ”΅ LOW\"\n", " \n", " return {\n", " 'Forecast': forecast,\n", " 'Capacity': capacity,\n", " 'Utilization': f'{utilization:.0%}',\n", " 'Level': level,\n", " 'Action': 'Request transfers' if utilization >= 0.85 else \n", " 'Transfer out excess' if utilization < 0.40 else 'Hold steady'\n", " }\n", "\n", "# Example forecasts\n", "airport_tuesday = {\n", " 'FLEET_CAPACITY': 180, 'MONTH': 7, 'IS_HOLIDAY': 0, 'IS_WEEKEND': 0,\n", " 'EVENT_INTENSITY': 0, 'RENTALS_LAG_7': 105, 'RENTALS_LAG_1': 98,\n", " 'ROLLING_AVG_7': 100, 'ROLLING_AVG_30': 95, 'INVENTORY_LAG_1': 80,\n", " 'UTILIZATION_LAG_1': 0.58, 'DAYS_TO_HOLIDAY': 15,\n", " 'AVG_DAILY_RATE': 68, 'COMPETITOR_RATE': 62, 'HIGH_TEMP_F': 82,\n", " 'FLIGHT_ARRIVALS': 250, 'UNEMPLOYMENT_RATE': 3.8, 'GAS_PRICE': 3.45,\n", " 'PCT_ECONOMY': 0.15, 'PCT_MIDSIZE': 0.35, 'PCT_SUV': 0.30,\n", " 'PCT_PREMIUM': 0.10, 'PCT_TRUCK': 0.10,\n", " 'BT_Downtown': 0, 'BT_Highway': 0, 'BT_Suburban': 0, 'BT_College': 0,\n", " 'DOW_Tuesday': 1, 'DOW_Wednesday': 0, 'DOW_Thursday': 0,\n", " 'DOW_Friday': 0, 'DOW_Saturday': 0, 'DOW_Sunday': 0,\n", " 'WX_Rain': 0, 'WX_Severe': 0, 'WX_Snow': 0,\n", "}\n", "\n", "suburban_saturday = airport_tuesday.copy()\n", "suburban_saturday.update({\n", " 'FLEET_CAPACITY': 45, 'RENTALS_LAG_7': 35, 'RENTALS_LAG_1': 20,\n", " 'ROLLING_AVG_7': 22, 'ROLLING_AVG_30': 20, 'INVENTORY_LAG_1': 30,\n", " 'UTILIZATION_LAG_1': 0.44, 'FLIGHT_ARRIVALS': 0, 'AVG_DAILY_RATE': 42,\n", " 'COMPETITOR_RATE': 38,\n", " 'BT_Suburban': 1, 'BT_Downtown': 0,\n", " 'DOW_Tuesday': 0, 'DOW_Saturday': 1,\n", "})\n", "\n", "print(\"πŸ›« Airport branch, Tuesday in July:\")\n", "for k, v in fleetpulse_forecast(airport_tuesday).items():\n", " print(f\" {k}: {v}\")\n", "\n", "print(\"\\n🏑 Suburban branch, Saturday in July:\")\n", "for k, v in fleetpulse_forecast(suburban_saturday).items():\n", " print(f\" {k}: {v}\")" ] }, { "cell_type": "markdown", "id": "a827747c", "metadata": {}, "source": [ "### 6.2 Batch Forecast β€” All Branches\n" ] }, { "cell_type": "code", "execution_count": null, "id": "129dfdc6", "metadata": {}, "outputs": [], "source": [ "# ── Score the entire test set ──\n", "test_mask = X_test.notna().all(axis=1)\n", "X_test_clean = X_test.loc[test_mask]\n", "\n", "test_preds = gb.predict(X_test_clean)\n", "\n", "test_results = test.loc[X_test_clean.index, \n", " ['BRANCH_ID', 'BRANCH_NAME', 'DATE', 'RENTALS', 'FLEET_CAPACITY']\n", "].copy()\n", "\n", "test_results['FORECAST'] = np.clip(np.round(test_preds).astype(int), 0, None)" ] }, { "cell_type": "code", "execution_count": null, "id": "766bcf7e", "metadata": {}, "outputs": [], "source": [ "# ── Forecast accuracy over time ──\n", "daily_mae = test_results.groupby('DATE').apply(\n", " lambda g: mean_absolute_error(g['RENTALS'], g['FORECAST'])).reset_index()\n", "daily_mae.columns = ['DATE', 'MAE']\n", "\n", "fig, ax = plt.subplots(figsize=(12, 4))\n", "ax.plot(daily_mae['DATE'], daily_mae['MAE'], color=TEAL, linewidth=1.5)\n", "ax.axhline(y=daily_mae['MAE'].mean(), color=CORAL, linestyle='--', \n", " label=f'Average MAE: {daily_mae[\"MAE\"].mean():.1f}')\n", "ax.fill_between(daily_mae['DATE'], daily_mae['MAE'], alpha=0.1, color=TEAL)\n", "ax.set_title('Forecast Error Over Time (Test Period)', fontsize=13, fontweight='bold', color=NAVY)\n", "ax.set_ylabel('MAE (vehicles)')\n", "ax.set_xlabel('Date')\n", "ax.legend()\n", "plt.tight_layout()\n", "plt.show()\n", "print(\"πŸ’‘ Stable error over time = no concept drift. If MAE trends upward, the model needs retraining.\")" ] }, { "cell_type": "markdown", "id": "15a3e57e", "metadata": {}, "source": [ "---\n", "## Key Takeaways\n", "\n", "1. **Same framework, different technique** β€” PAIR and BizML work identically for regression (FleetPulse) and classification (ShieldScore)\n", "\n", "2. **Temporal splits are mandatory for time series** β€” random splitting lets the model cheat by seeing the future\n", "\n", "3. **Data leakage looks different here** β€” same-day inventory caps rentals mechanically; use lagged versions instead\n", "\n", "4. **MAE is the business metric** β€” \"off by 7 vehicles\" is what the operations VP understands, not RΒ²\n", "\n", "5. **The naΓ―ve baseline matters** β€” if your ML model can't beat \"same as last Tuesday,\" it's not adding value\n", "\n", "6. **87% of ML projects never reach production** β€” the Excel demand planner that a branch manager uses daily beats a perfect API nobody logs into\n", "\n", "> *\"Methodology over tools. The thinking stays the same regardless of the industry.\"*\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.12" } }, "nbformat": 4, "nbformat_minor": 5 }