Decision 6 — Text Analytics

Why this decision matters

An often-cited estimate is that about 80% of enterprise data is unstructured, much of it text, and most of it goes unanalyzed. A team that can systematically read 10,000 customer reviews and report what they say has a real advantage. We'll cover both the no-API approach (you do the work locally in Python) and the API approach (you call a hosted model from OpenAI, Anthropic, or Google) so you understand the trade-offs.

By the end of this topic you'll be able to

Tokenize and clean text data.
Turn text into numeric features (TF-IDF, embeddings).
Score sentiment.
Extract topics.
Choose between a no-API workflow and a hosted-LLM workflow.
Pull text data from external APIs (Census, news, reviews).

Materials

Key concepts to know

Tokenization — splitting text into words / subwords. Where most preprocessing bugs live.
Stop words, stemming, lemmatization — classic preprocessing steps; not always needed with modern models.
TF-IDF — the workhorse representation: how often each term appears, weighted by how rare it is across documents.
Embeddings — modern approach: each word/sentence becomes a vector that captures meaning.
Sentiment scoring — rule-based vs. model-based; valence vs. emotion.
Topic modeling — LDA, BERTopic. Discover the themes in a corpus without labels.
API vs. local — calling a hosted LLM is faster to build but has cost, latency, and data-privacy implications.
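
To make the first few concepts concrete, here is a minimal sketch of tokenization, stop-word removal, and TF-IDF using only the Python standard library. The stop-word list, the sample reviews, and the function names `tokenize` and `tf_idf` are illustrative assumptions, not part of the course materials. The weighting shown is the plain textbook form (term frequency times log of inverse document frequency); library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "was", "this", "but"}


def tokenize(text):
    """Lowercase the text, keep runs of letters/digits/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document it appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample reviews, made up for this sketch.
reviews = [
    "The battery life is great",
    "The battery died and the screen cracked",
    "Great screen, great price",
]
scores = tf_idf(reviews)
```

In this toy corpus, "life" appears in only one review while "battery" appears in two, so "life" gets the higher weight in the first review. That is the "weighted by how rare it is" idea in practice.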

Readings & class notes

The Text Refinery: A practitioner's guide to turning raw text into business-ready features.
Text Analytics Essentials: Reference handout covering the techniques you need to know cold.
Self-Study Notes (Text + Time Series): Combined notes for the last two modules.

Worksheets

Unilever — Case Study Worksheet: A real Unilever text-analytics case to think through before lab.
Number or Text? — Decision Worksheet: Practice deciding when text needs converting to numeric features.
Text Analytics — Overview Worksheet
Customer Reviews — Worksheet
Census API — Worksheet
Text Analytics (No API) — Worksheet
Text Analytics (With API) — Worksheet

Python notebooks

Four parallel notebooks that reach the same end goal in different ways, plus an all-in-one reference notebook (0).

1 — Text Reviews (no API): Score sentiment in customer reviews using only local Python.
2 — Census API: Pull demographic data from the Census API to enrich your text analysis.
3 — Text API (no key): Use a hosted text API that doesn't require authentication.
4 — Text API (key required): Call a commercial LLM endpoint with an API key, the modern enterprise pattern.
0 — All-in-One Reference: Everything from notebooks 1–4 stitched together for end-to-end reference.
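
As a rough illustration of what "local Python only" sentiment scoring can look like, here is a minimal rule-based sketch. It is not taken from the notebooks: the lexicon, the negation rule, and the names `sentiment_score` and `sentiment_label` are all assumptions made for this example, and a real lexicon would be far larger.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "great": 1, "love": 1, "excellent": 1, "good": 1,
    "bad": -1, "terrible": -1, "broken": -1, "slow": -1,
}
# Words that flip the polarity of the word immediately after them.
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon values over the tokens, flipping a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if value and i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


def sentiment_label(text):
    """Map the numeric score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A rule like this is cheap and transparent, but it misses sarcasm, longer-range negation, and domain-specific vocabulary. Those gaps are a large part of the trade-off against model-based or hosted-LLM scoring.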

Datasets

Customer Reviews — Demo Dataset: A small set of customer reviews to practice on.
