7  Overview of Natural Language Processing

In previous chapters, we have dealt with structured or tabular data, where each observation is represented by a fixed set of features (columns). However, a significant portion of the data generated today is unstructured text data, such as emails, news articles, social media posts, or policy reports. Industry estimates suggest that around 80% of enterprise data is unstructured (e.g., text, images, and videos),1 highlighting the importance of being able to process and analyze such data effectively.

In this part of the course, we will explore the fundamental concepts and techniques of Natural Language Processing (NLP), which will enable us to extract meaningful information from unstructured text data.

7.1 What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is the branch of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language. As we saw in Chapter 1, NLP is one of the core application areas of machine learning, powering technologies like chatbots, translation systems, and sentiment analysis tools.

NLP is often divided into two broad categories:

  • Natural Language Understanding (NLU): The task of comprehending text. This includes extracting meaning, identifying entities, determining sentiment, or classifying documents. It is what allows a search engine to understand your query or a spam filter to recognize unwanted emails.

  • Natural Language Generation (NLG): The task of producing text. This includes generating summaries, writing responses, or translating between languages. It is what powers chatbots like ChatGPT and machine translation systems.

In this part of the course, we will primarily focus on NLU tasks: transforming raw text into structured representations that machine learning models can analyze. The generative side of NLP will be covered in more depth in the final part of the course on generative AI.

7.2 What makes NLP difficult?

Natural language is inherently complex and ambiguous, making it challenging for computers to process. Unlike programming languages, which have strict syntax and unambiguous meanings, human language is full of nuances that we navigate effortlessly but that pose significant hurdles for machines. Here are some examples:

  • Lexical Ambiguity: A single word can have multiple meanings. “Bank” could refer to a financial institution or the side of a river; only context tells us which one is meant.

  • Syntactic Ambiguity: A sentence can be parsed in multiple ways. “I saw the man with the telescope” leaves unclear whether you used the telescope or he was holding it.

  • Referential Ambiguity: Pronouns like “it” or “they” require understanding what was mentioned earlier. In “The trophy didn’t fit in the suitcase because it was too big,” humans use physical intuition to know “it” is the trophy.

  • Figurative Language: Literal meanings fail entirely. “Kick the bucket” has nothing to do with buckets, and “this is the shit” means the opposite of “this is shit.”

  • Sarcasm and Irony: “Oh great, another meeting” probably doesn’t mean what it literally says.

Humans resolve these ambiguities unconsciously using world knowledge, common sense, and social context. These are capabilities that are difficult to encode in algorithms.

While we focus on text data in this course, speech recognition adds further challenges such as accents, background noise, homophones (“there” vs. “their”), and intonation (e.g., the same words can be a statement or question depending on tone).

Note: More Examples of Ambiguous Language

Leslie Nielsen’s comedy films are a treasure trove of examples, as much of their humor relies on deliberately misinterpreting ambiguous language.

Syntactic Ambiguity (from “Police Squad!” - watch clip)

Frank Drebin: Now do you think you can beat The Champ?
Buddy: I can take him blindfolded.
Frank Drebin: What if he’s not blindfolded?
Buddy: I can still beat him.

The modifier “blindfolded” could attach to either “I” or “him.”

Lexical Ambiguity (from “Airplane!” - watch clip)

Rumack: You’d better tell the Captain we’ve got to land as soon as we can. This woman has to be gotten to a hospital.
Elaine Dickinson: A hospital? What is it?
Rumack: It’s a big building with patients, but that’s not important right now.

“What is it?” asks about the situation, but Rumack interprets “it” as asking for a definition.

Phonetic Ambiguity (from “Police Squad!” - watch clip)

[Frank and Ed are interviewing a witness to a shooting]
Sally: Well, I first heard the shot, and as I turned, Jim fell.
Frank: Jim Fell’s the teller?
Sally: No, Jim Johnson.
Frank: Who’s Jim Fell?
Ed: He’s the auditor, Frank.
Sally: He had the flu, so Jim… filled in.
Frank: Phil who?
Ed: Phil Din. He’s the night watchman.
Sally: Oh, if only Phil had been here…

“Jim fell” sounds like “Jim Fell” (a name), and “filled in” sounds like “Phil Din.”

We will not attempt to dive deeply into all the linguistic challenges of natural language in this course. But it is important to be aware that these difficulties exist. Recent advances in NLP, particularly transformer-based models like BERT and GPT, have made significant progress in handling some of these complexities.

7.3 Applications

NLP powers a wide range of real-world applications that we encounter daily. Here are some of the most common:

  • Text Classification assigns predefined categories to documents. Spam detection is a classic example: your email provider uses NLP to automatically filter unwanted messages into your spam folder. Other examples include routing customer support tickets to the appropriate department or categorizing news articles by topic.

  • Sentiment Analysis determines the emotional tone of text, such as whether a review is positive, negative, or neutral. Companies use this to monitor brand perception on social media, analyze customer feedback at scale, or gauge public opinion on policy issues. For economists, sentiment analysis of news articles or earnings calls can serve as indicators for market movements or economic confidence.

  • Machine Translation automatically converts text from one language to another. Services like Google Translate and DeepL have made it possible to read content in languages we don’t speak. While not perfect, modern neural machine translation has dramatically improved quality, making cross-language communication more accessible than ever.

  • Information Retrieval powers search engines. When you type a query into Google, NLP techniques help match your question to relevant documents, even when the exact words don’t appear in the text. This involves understanding synonyms, query intent, and document relevance.

  • Text Summarization condenses long documents into shorter versions while preserving key information. This can be extractive (selecting the most important sentences) or abstractive (generating new sentences that capture the main points). Applications range from summarizing news articles to condensing legal documents or research papers.

  • Named Entity Recognition (NER) identifies and classifies entities in text, such as people, organizations, locations, dates, and monetary values. For example, given the sentence “Apple announced new products in Cupertino on Tuesday,” NER would identify “Apple” as an organization, “Cupertino” as a location, and “Tuesday” as a date. This is invaluable for extracting structured information from unstructured text.

  • Part-of-Speech Tagging labels each word in a sentence with its grammatical role (noun, verb, adjective, etc.). While this may seem like a purely linguistic exercise, it’s a fundamental building block for many higher-level NLP tasks and helps models understand sentence structure.

  • Topic Modeling discovers abstract themes that occur across a collection of documents. Given thousands of news articles, topic modeling might automatically identify clusters around politics, sports, technology, and entertainment, without being told these categories exist in advance.

  • Chatbots and Virtual Assistants combine multiple NLP capabilities to understand user requests and generate appropriate responses. From customer service bots that answer FAQs to virtual assistants like Siri and Alexa, these systems rely on intent recognition, entity extraction, and natural language generation.

  • Keyword Extraction automatically identifies the most important terms in a document. This is useful for tagging content, generating metadata, or quickly understanding what a document is about without reading it in full.

In this course, we will focus on text classification, sentiment analysis, topic modeling, and named entity recognition, as these tasks provide an excellent foundation for understanding NLP concepts while being directly applicable in economics and finance.
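To make the idea of sentiment analysis concrete, here is a deliberately minimal lexicon-based scorer: it counts positive and negative words and returns the difference. The word lists are invented for this illustration; real systems use large sentiment lexicons or trained classifiers, but the sketch shows the basic principle of mapping text to a sentiment score.

```python
# Minimal lexicon-based sentiment scoring (illustrative sketch only).
# The tiny word lists below are invented for this example; real systems
# use large sentiment lexicons or trained classifiers.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}

def sentiment_score(text: str) -> int:
    """Return (# positive words) minus (# negative words) in the text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

print(sentiment_score("What a great product, I love it!"))  # positive: 2
print(sentiment_score("Terrible service, really bad."))     # negative: -2
```

A positive score suggests positive sentiment, a negative score the opposite. This approach ignores negation (“not great”), sarcasm, and context, which is exactly why the ambiguities discussed above make sentiment analysis harder than it first appears.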

7.4 Brief History

The history of NLP can be broadly divided into three eras, each defined by the dominant approach to processing language.

  • Rule-based systems (1950s-1980s) relied on hand-crafted linguistic rules and pattern matching. One famous example is ELIZA, created by Joseph Weizenbaum at MIT in 1966. ELIZA simulated a Rogerian psychotherapist by using simple pattern-matching rules to transform user input into questions. Despite its simplicity, users often attributed human-like understanding to the program. Here is an excerpt from Weizenbaum’s original paper:

    User: Men are all alike.
    ELIZA: IN WHAT WAY
    User: They’re always bugging us about something or other.
    ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
    User: Well, my boyfriend made me come here.
    ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
    User: He says I’m depressed much of the time.
    ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED

    ELIZA had no actual understanding of language; it simply matched patterns like “my boyfriend” and reflected them back as “your boyfriend.” Yet this was enough to create the illusion of comprehension, a phenomenon now known as the “ELIZA effect.”

  • Statistical approaches (1990s-2000s) shifted focus from hand-written rules to learning patterns from data. Techniques like n-gram language models, hidden Markov models, and later support vector machines allowed systems to learn from large text corpora. This era saw major advances in machine translation, speech recognition, and information retrieval.

  • Neural approaches (2010s-present) use deep learning to automatically learn representations of language. Word embeddings (Word2Vec, GloVe) captured semantic relationships, recurrent neural networks handled sequential data, and most recently, transformer architectures (BERT, GPT) have achieved state-of-the-art results across virtually all NLP tasks.

7.5 Basic Concepts

Before diving into NLP methods, it helps to establish some core terminology that you will encounter throughout these chapters.

  • Corpus (plural: corpora): A collection of texts used for analysis, such as news articles, tweets, or legal documents.

  • Document: A single text unit within a corpus. Depending on context, this could be a sentence, paragraph, article, or entire book.

  • Token: The basic unit of text after splitting. Usually words, but can also be subwords, characters, or punctuation.

  • Tokenization: The process of splitting text into tokens. For example, “I can’t wait!” might become ["I", "can't", "wait", "!"] or ["I", "can", "'t", "wait", "!"] depending on the tokenizer.

  • Vocabulary: The set of unique tokens in a corpus.

  • Metadata: Information about documents that is not part of the text itself, such as author, publication date, source, or category. Metadata is often used alongside text for filtering, grouping, or as additional features in analysis.

The central challenge of NLP is turning raw text into something a machine learning model can work with. Unlike tabular data, where each feature is already a number, text must first be transformed into numerical representations. In the next chapter, we will walk through a typical NLP pipeline: preprocessing text (tokenization, normalization, stopword removal), representing it numerically using methods ranging from simple bag-of-words and TF-IDF to dense word embeddings like Word2Vec and BERT, and then applying these representations to tasks such as text classification, named entity recognition, and topic modeling.
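The first steps of this pipeline can be previewed with a short sketch using only the standard library: tokenize a tiny two-document corpus, build a vocabulary, and count token occurrences per document (a bag-of-words representation). The corpus and the naive whitespace tokenizer are invented for illustration; the next chapter covers these steps, and the libraries typically used for them, in detail.

```python
from collections import Counter

# A tiny corpus: each string is one "document".
corpus = [
    "The economy is growing.",
    "The economy is slowing, analysts say.",
]

def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer: lowercase and strip punctuation."""
    return [t.strip(".,!?").lower() for t in text.split()]

# Vocabulary: the set of unique tokens across the corpus.
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

def bag_of_words(doc: str) -> list[int]:
    """One count vector per document, aligned with the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[token] for token in vocab]

print(vocab)
for doc in corpus:
    print(bag_of_words(doc))
```

Each document becomes a fixed-length vector of counts, one entry per vocabulary token, which is exactly the kind of numerical representation a machine learning model can consume. Everything in the coming chapters, from TF-IDF to word embeddings, refines this basic text-to-numbers step.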


  1. The extent to which this figure is accurate can be debated, but it is widely accepted that unstructured data constitutes a large majority of all data.↩︎