Decision 6 · Week 14

"What are they saying?"

Reviews, support tickets, surveys, contracts, social posts — most business data is text. Text analytics turns it into something you can count, compare, and act on.

NLP · Sentiment · Topic Modeling

Why this decision matters

By a commonly cited estimate, about 80% of enterprise data is unstructured, much of it text, and most of it goes unanalyzed. The team that can systematically read 10,000 customer reviews and report what they say has a real advantage. We'll cover both the no-API approach (you do the work in Python) and the API approach (you call OpenAI / Anthropic / Google) so you understand the trade-offs.

By the end of this topic you'll be able to

  • Tokenize and clean text data.
  • Turn text into numeric features (TF-IDF, embeddings).
  • Score sentiment.
  • Extract topics.
  • Choose between a no-API workflow and a hosted-LLM workflow.
  • Pull text data from external APIs (Census, news, reviews).

Materials

Key concepts to know
  • Tokenization — splitting text into words / subwords. Where most preprocessing bugs live.
  • Stop words, stemming, lemmatization — classic preprocessing steps; not always needed with modern models.
  • TF-IDF — the workhorse representation: how often each term appears, weighted by how rare it is across documents.
  • Embeddings — modern approach: each word/sentence becomes a vector that captures meaning.
  • Sentiment scoring — rule-based vs. model-based; valence vs. emotion.
  • Topic modeling — LDA, BERTopic. Discover the themes in a corpus without labels.
  • API vs. local — calling a hosted LLM is faster to build but has cost, latency, and data-privacy implications.
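
To make the first few concepts concrete, here is a minimal sketch of tokenization and TF-IDF using only the Python standard library. It is an illustration under stated assumptions, not the course's notebook code: the function names `tokenize` and `tf_idf`, the regex-based tokenizer, the unsmoothed `log(N / df)` weighting, and the three sample reviews are all made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes.

    A deliberately simple tokenizer (an assumption for this sketch);
    real projects often need more careful handling of punctuation and Unicode.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document it appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample data, invented for illustration.
reviews = [
    "Great battery life, great screen.",
    "Battery died after a week.",
    "The screen is great but shipping was slow.",
]

for review, scores in zip(reviews, tf_idf(reviews)):
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(review, "->", [term for term, _ in top])
```

With this weighting, a term that appears in every document gets a weight of zero, which is the "weighted by how rare it is" idea in its plainest form. Library implementations such as scikit-learn's `TfidfVectorizer` smooth the IDF term and normalize each document vector by default, so their numbers will differ from this sketch.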
Readings & class notes
Worksheets
Python notebooks

Four parallel notebooks reach the same end goal in different ways.

Datasets
