Why this decision matters
A commonly cited estimate holds that roughly 80% of enterprise data is unstructured, much of it text, and most of it goes unanalyzed. A team that can systematically read 10,000 customer reviews and report what they say has a real advantage. We'll cover both the no-API approach (you do the work in Python) and the API approach (you call OpenAI / Anthropic / Google) so you understand the trade-offs.
By the end of this topic you'll be able to
Tokenize and clean text data; turn text into numeric features (TF-IDF, embeddings); score sentiment; extract topics; choose between a no-API workflow and a hosted-LLM workflow; pull data from external APIs (Census, news, reviews) to support your text analysis.
Materials
Key concepts to know
- Tokenization — splitting text into words / subwords. Where most preprocessing bugs live.
- Stop words, stemming, lemmatization — classic preprocessing steps; not always needed with modern models.
- TF-IDF — the workhorse representation: how often each term appears, weighted by how rare it is across documents.
- Embeddings — modern approach: each word/sentence becomes a vector that captures meaning.
- Sentiment scoring — rule-based vs. model-based; valence vs. emotion.
- Topic modeling — LDA, BERTopic. Discover the themes in a corpus without labels.
- API vs. local — calling a hosted LLM is faster to build but has cost, latency, and data-privacy implications.
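To make the first and third concepts concrete, here is a minimal sketch of tokenization and TF-IDF using only the Python standard library. The function names `tokenize` and `tf_idf` and the three sample reviews are made up for illustration; they are not taken from the course notebooks. The weighting is the plain textbook form (term frequency times `log(N / df)`), so libraries such as scikit-learn, which apply smoothing and normalization by default, will give different numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document it appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Illustrative reviews (invented for this sketch, not the demo dataset).
reviews = [
    "Great phone, great battery.",
    "Battery died fast.",
    "Great service!",
]
scores = tf_idf(reviews)

# Print the terms of the first review from most to least distinctive.
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first review, "phone" appears once but in no other review, while "battery" appears once and is shared with the second review, so "phone" gets the higher weight. A term that appears in every document gets a weight of zero under this formula, which is the "weighted by how rare it is" idea in action.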
Readings & class notes
- The Text Refinery: A practitioner's guide to turning raw text into business-ready features.
- Text Analytics Essentials: Reference handout covering the techniques you need to know cold.
- Self-Study Notes (Text + Time Series): Combined notes for the last two modules.
Worksheets
- Unilever — Case Study Worksheet: A real Unilever text-analytics case to think through before lab.
- Number or Text? — Decision Worksheet: Practice deciding when text needs converting to numeric features.
- Text Analytics — Overview Worksheet
- Customer Reviews — Worksheet
- Census API — Worksheet
- Text Analytics (No API) — Worksheet
- Text Analytics (With API) — Worksheet
Python notebooks
Four parallel notebooks that reach the same end goal in different ways, plus an all-in-one reference (notebook 0).
- 1 — Text Reviews (no API): Score sentiment in customer reviews using only local Python.
- 2 — Census API: Pull demographic data from the Census API to enrich your text analysis.
- 3 — Text API (no key): Use a hosted text API that doesn't require authentication.
- 4 — Text API (key required): Call a commercial LLM endpoint with an API key, the modern enterprise pattern.
- 0 — All-in-One Reference: Everything from notebooks 1–4 stitched together for end-to-end reference.
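As a rough picture of what "sentiment using only local Python" can look like, here is a minimal rule-based scorer. It is a sketch under stated assumptions, not the code from notebook 1: the six-word `LEXICON`, the `NEGATORS` set, and the function name `sentiment_score` are all hypothetical, and a real lexicon would contain thousands of scored terms.

```python
import re

# Hypothetical mini-lexicon for illustration only: +1 for positive, -1 for negative.
LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "terrible": -1, "slow": -1}

# Words that flip the polarity of the token immediately after them.
NEGATORS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon values over the tokens, flipping the sign right after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        if value != 0:
            score += -value if negate else value
        # The negation only reaches the very next token, then resets.
        negate = False
    return score


for review in ["Great phone, love it", "Not good", "Arrived on Tuesday"]:
    print(review, "->", sentiment_score(review))
```

The negation rule is deliberately crude: "not good" is flipped, but "not very good" is not, because the flag resets at "very". Gaps like that are a large part of the trade-off between rule-based and model-based scoring.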
Datasets
- Customer Reviews — Demo Dataset: A small set of customer reviews to practice on.