Why this decision matters
Every other decision in this course depends on this one. If you frame the wrong problem, or miss something hidden in the data, no amount of sophisticated modeling will save you. Practitioners commonly report spending most of their time on this stage (60–80% is the figure usually quoted), and a project lives or dies by it.
By the end of this topic you'll be able to
- Translate a vague business request into a precise analytics question.
- Identify whether you have the right data to answer it.
- Profile a dataset systematically.
- Clean common data issues (missing values, duplicates, type mismatches).
- Build the basic features that downstream models need.
Materials
Key concepts to know
- Problem framing — turning "we want more revenue" into "predict which existing customers will spend more if contacted in the next 30 days."
- Data profiling — every variable: type, range, missing %, distribution, weird values.
- Wide vs. long format — most ML algorithms expect wide; many data sources hand you long. You'll learn to reshape both ways.
- Missing data strategies — drop, impute, flag, or model. The right answer depends on why it's missing.
- Outliers — rare event you care about, or data quality problem? Investigation, not deletion.
- Feature engineering — derived columns (ratios, dates, encodings) that make patterns easier to learn.
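To make the concepts above concrete, here is a minimal sketch in pandas. The toy table, its column names (`customer_id`, `monthly_spend`, `tenure_months`), and the `profile` helper are all made up for illustration; they are not taken from the course datasets. The 1.5 × IQR rule used to flag outliers is one common convention, not the only reasonable choice.

```python
import pandas as pd

# Hypothetical toy data; every column name here is an assumption for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "monthly_spend": ["20.5", "35.0", "35.0", None, "400.0"],  # numbers stored as text
    "tenure_months": [12, 24, 24, 6, 0],
})


def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, percent missing, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_pct": frame.isna().mean() * 100,
        "n_unique": frame.nunique(),
    })


print(profile(df))

# Duplicates and type mismatches: drop exact duplicate rows, parse text as numbers.
clean = df.drop_duplicates().copy()
clean["monthly_spend"] = pd.to_numeric(clean["monthly_spend"], errors="coerce")

# Missing data: flag first (so the information is not lost), then impute the median.
clean["spend_missing"] = clean["monthly_spend"].isna()
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].median())

# Outliers: flag for investigation rather than delete (1.5 * IQR rule).
q1, q3 = clean["monthly_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["spend_outlier"] = (
    (clean["monthly_spend"] < q1 - 1.5 * iqr)
    | (clean["monthly_spend"] > q3 + 1.5 * iqr)
)

# Feature engineering: a simple derived column.
clean["is_new_customer"] = clean["tenure_months"] < 12

print(clean)
```

Flagging before imputing keeps the "was this value missing?" signal available to a downstream model, which matters when the reason a value is missing carries information.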
Cheat sheets & class notes
- Statistics & ML Cheat Sheet. One-page reference: distributions, hypothesis tests, ML terminology you'll reuse throughout the course.
- When Models Go Wrong: Case Studies. Five real-world stories of analyses that failed because the data prep was wrong.
- Predictive Analytics & Data Mining Handbook. Reference handbook covering the entire course; chapters 1–3 cover this module.
Hands-on: data prep demos
Three demo notebooks walk you through common prep scenarios. Unzip each and follow the README inside.
- Demo: Prepping a Churn Dataset. Customer churn data from the wild: types, missing values, encodings, outliers.
- Demo: Prepping Panel Data. Repeated observations per entity, and how to handle the time dimension.
- Demo: Reshaping to Wide Format. Pivot long-format data into the wide layout most ML libraries expect.
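As a preview of the reshaping idea (this is not the demo notebook itself), the sketch below pivots a small long-format table to wide and back using pandas. The table and its column names (`customer_id`, `month`, `spend`) are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per customer per month.
long_df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "month": ["2024-01", "2024-02", "2024-01", "2024-02"],
    "spend": [20.0, 25.0, 40.0, 38.0],
})

# Long -> wide: one row per customer, one column per month.
wide_df = long_df.pivot(index="customer_id", columns="month", values="spend").reset_index()
wide_df.columns.name = None  # drop the leftover "month" label on the column axis

# Wide -> long: melt the month columns back into rows.
back_to_long = wide_df.melt(id_vars="customer_id", var_name="month", value_name="spend")

print(wide_df)
print(back_to_long)
```

Note that `pivot` raises an error if the same customer and month pair appears more than once; in that case `pivot_table` with an explicit aggregation function is the usual alternative.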
Practice datasets
- SP26 Class Practice Workbook. Master practice workbook used throughout the data-prep weeks.
- Telco Churn Dataset. A classic: practice profiling and cleaning before we use it for modeling.
Practice with games · Big picture & orientation
Short browser games and explainers that build intuition for what analytics is, what kinds there are, and which decision fits which type.
- Analytics Big Picture. The whole discipline on one page: descriptive, diagnostic, predictive, prescriptive.
- Four Types of Analytics. Tour the four big categories with concrete examples.
- Analytics Terminology Guide. The vocabulary you'll hear in interviews: batch, real-time, edge, deployment types.
- Different Decisions Need Different Tools. Match the decision (strategic / tactical / operational) to the right analytics technique.
- Analytics Use Cases Across Industries. A walking tour of how different industries put analytics to work.
Practice with games · Data shape & problem framing
- Analytics-Ready Datasets (Long vs. Wide). Practice spotting which layout you need before you model.
- Analytics: The Moneyball Story. How asking a better question changed an entire industry.
- The "Glue Player" Simulation. Be the analytics translator who connects business and data, the role employers actually hire for.
Optional SQL warm-up
If your data lives in a database, you'll need a little SQL to pull it. These two short demos cover the basics.
- SQL Demo #1: Basics. SELECT, WHERE, ORDER BY, basic aggregation.
- SQL Demo #2: Joins. JOINs and how to pull from multiple tables.
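If you want to try these statements before opening the demos, the sketch below runs them against an in-memory SQLite database from Python's standard library. The `customers` and `orders` tables and their contents are invented for illustration and are not the schema used in the demos.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 30.0), (12, 2, 20.0);
""")

# SELECT / WHERE / ORDER BY: orders of at least 30, largest first.
big_orders = conn.execute(
    "SELECT order_id, amount FROM orders WHERE amount >= ? ORDER BY amount DESC",
    (30.0,),
).fetchall()

# JOIN + aggregation: total spend per region.
# LEFT JOIN keeps customers with no orders; COALESCE turns an all-NULL sum into 0.
totals = conn.execute("""
    SELECT c.region, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

conn.close()

print(big_orders)
print(totals)
```

Passing the threshold as a `?` parameter rather than formatting it into the query string is the habit worth building early, since it avoids SQL injection once inputs come from users.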
Homework
HW-1: EDA — due end of Week 3. Pick one of the practice datasets, profile it end-to-end, and write a short memo identifying three things you'd want to investigate before modeling. Submission instructions.