9  Practice Session II

This practice session builds on the sentiment analysis exercise discussed in class. We will use the same dataset of labeled sentences from ECB, FED, and BIS speeches.

9.1 Problem Setup

We want to build a machine learning model that can predict the sentiment of a sentence from a central bank speech. The sentiment can be either positive (1) or negative (0). We will use the text of the sentences as features to train our model.

9.2 Dataset

We will use a pre-labeled dataset for sentence-level sentiment analysis of central bank speeches (Pfeifer and Marohl 2023), which is available on Hugging Face (Central Bank Communication Dataset). The dataset contains sentences from ECB, FED, and BIS speeches that have been labeled as positive or negative in terms of sentiment.

9.3 Putting the Problem into the Context of the Course

Given the description of the dataset, we can see that this is a supervised learning problem. We have a target variable that we want to predict, and we have features that we can use to predict this target variable. The target variable is binary, i.e., it can take two values: 0 or 1. The value 0 indicates that the sentiment is negative, while the value 1 indicates that the sentiment is positive. Thus, this is a binary classification problem.

9.4 Setting up the Environment

Let’s initialize our environment by importing the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from huggingface_hub import login
import spacy

Let’s load the dataset for this practice session. You can change the central_bank variable to load the dataset for FED or BIS speeches instead of ECB speeches if you want to work with a different central bank.

central_bank = "ECB" # Change this to "FED" or "BIS" if you want to use the dataset for FED or BIS speeches instead

df = pd.read_csv(f"hf://datasets/Moritz-Pfeifer/CentralBankCommunication/Sentiment/{central_bank}_prelabelled_sent.csv")

9.5 Exercises

Note that the exercises build on each other: you can sometimes skip one, but the results for later exercises depend on the previous ones. If you get stuck, skip ahead to the next exercise and come back to the previous one later.

9.5.1 Exercise 1: Familiarization with the Dataset

Tasks:

  1. Display the first 10 rows of the dataset. What columns does it contain?
  2. Use .info() to check the data types and see if there are any missing values
  3. Print a few sample sentences (5-10) along with their sentiment labels to get a feel for the data
  4. How many sentences are in the dataset in total?

Hints:

  • Use .head(10) to display the first rows
  • .info() shows column types and non-null counts
  • Use .sample(n) to randomly select a few rows for inspection
# Your code here. Add additional code cells as needed.
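If you want to check your approach, here is a minimal sketch on a toy stand-in DataFrame. The column names here ("sentence", "sentiment") are assumptions for illustration; inspect df.columns on the real loaded dataset, which may differ.

```python
import pandas as pd

# Toy stand-in for the loaded dataset; the real column names may differ.
df = pd.DataFrame({
    "sentence": ["Growth is strong.", "Risks are rising sharply.",
                 "Inflation remains stable.", "Demand is weakening."],
    "sentiment": [1, 0, 1, 0],
})

print(df.head(10))                    # first rows and the column names
df.info()                             # dtypes and non-null counts
print(df.sample(2, random_state=42))  # random rows for inspection
print("Total sentences:", len(df))
```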

9.5.2 Exercise 2: Understanding the Target Variable

Tasks:

  1. What is the proportion of positive vs negative sentences? Use value_counts(normalize=True) on the sentiment column
  2. Create a bar chart showing the distribution of sentiment labels
  3. Based on this distribution, would you say the dataset is balanced or imbalanced?
  4. Why is understanding class balance important for sentiment analysis?

Hints:

  • The target variable is sentiment (0 = negative, 1 = positive)
  • Use df['sentiment'].value_counts() and plot with .plot(kind='bar')
  • Consider what would happen if 90% of sentences were positive
# Your code here. Add additional code cells as needed.
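As a reference point, a quick sketch on toy labels (on the real data you would use the loaded df's sentiment column instead):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"sentiment": [1, 0, 1, 0, 1, 1]})  # toy labels

# Proportion of positive vs negative sentences
proportions = df["sentiment"].value_counts(normalize=True)
print(proportions)

# Bar chart of the label distribution
ax = df["sentiment"].value_counts().plot(kind="bar")
ax.set_xlabel("sentiment (0 = negative, 1 = positive)")
ax.set_ylabel("count")
plt.tight_layout()
```

If one class dominated (say 90% positive), a model predicting "positive" for everything would already score 90% accuracy, which is why checking the balance first matters.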

9.5.3 Exercise 3: Text Preprocessing

Tasks:

  1. Load the spaCy English model: nlp = spacy.load('en_core_web_sm')
  2. Create a preprocessing function that:
    • Takes a list/Series of texts as input
    • Uses nlp.pipe() to process texts efficiently
    • Disables unnecessary components: ["parser", "ner"]
    • Returns lemmatized, lowercase tokens while removing stopwords, punctuation, whitespace, and numbers
    • Returns a list of preprocessed texts (strings joined with spaces)
  3. Apply your function to the entire dataset and create a new column called processed_text
  4. Compare an original text with its preprocessed version. What changed?

Hints:

  • Look at the preprocess_texts() function in the lecture notes as reference
  • Use token.lemma_.lower() for lemmatization and lowercasing
  • Filter tokens with: not token.is_stop and not token.is_punct and not token.is_space and not token.like_num
  • Join tokens back: " ".join(tokens)
# Your code here. Add additional code cells as needed.
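One way such a function can be sketched, patterned after the preprocess_texts() function from the lecture notes (the exact filters there may differ). The demo call is left commented out because it requires the en_core_web_sm model to be downloaded first:

```python
import spacy

def preprocess_texts(texts, nlp, disable=("parser", "ner")):
    """Lemmatize and lowercase tokens, dropping stopwords, punctuation,
    whitespace, and numbers; returns one joined string per input text."""
    processed = []
    for doc in nlp.pipe(texts, disable=list(disable)):
        tokens = [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop
            and not token.is_punct
            and not token.is_space
            and not token.like_num
        ]
        processed.append(" ".join(tokens))
    return processed

# Requires: python -m spacy download en_core_web_sm
# nlp = spacy.load("en_core_web_sm")
# df["processed_text"] = preprocess_texts(df["sentence"], nlp)
```

Comparing an original sentence with its processed version, you should see stopwords, punctuation, and numbers gone, and remaining words reduced to lowercase lemmas.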

9.5.4 Exercise 4: Text Vectorization - Count Vectorizer

Tasks:

  1. Split your data into features (X) and target (y)
  2. Create train/test splits (80/20) using train_test_split with random_state=42 and stratify=y
  3. Import and create a CountVectorizer with max_features=1000
  4. Fit the vectorizer on the training data only and transform both train and test sets
  5. Print the shape of the resulting feature matrices. What do the dimensions represent?

Hints:

  • Use from sklearn.feature_extraction.text import CountVectorizer
  • Use from sklearn.model_selection import train_test_split
  • Fit only on training data: vectorizer.fit(X_train), then transform: X_train_vec = vectorizer.transform(X_train)
  • Never fit on test data (data leakage!)
# Your code here. Add additional code cells as needed.
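The split-then-vectorize pattern can be sketched as follows, using a few toy preprocessed sentences as a stand-in for the processed_text column:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

vectorizer = CountVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_vec = vectorizer.transform(X_test)        # reuse the fitted vocabulary

# Rows = sentences, columns = vocabulary terms
print(X_train_vec.shape, X_test_vec.shape)
```

Fitting only on the training split keeps test-set vocabulary from leaking into the features.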

9.5.5 Exercise 5: Training a Logistic Regression Model

Tasks:

  1. Import LogisticRegression from sklearn
  2. Train a logistic regression model with max_iter=1000 and random_state=42
  3. Make predictions on both train and test sets
  4. Calculate and print the following metrics for both sets:
    • Accuracy
    • Precision
    • Recall
    • ROC AUC score
  5. Create a confusion matrix for the test set predictions using confusion_matrix and visualize it with seaborn’s heatmap

Hints:

  • Import from: sklearn.linear_model, sklearn.metrics
  • Use clf.predict() for predictions and clf.predict_proba() for probabilities
  • For ROC AUC with binary classification, use the probability of the positive class: y_proba[:, 1]
  • Transpose the confusion matrix if needed to match lecture conventions
# Your code here. Add additional code cells as needed.
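A self-contained sketch of the train/evaluate loop on the same toy corpus (on the real data, substitute your vectorized splits from Exercise 4):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, confusion_matrix)
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
vec = CountVectorizer(max_features=1000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_vec, y_train)

for name, X_, y_ in [("train", X_train_vec, y_train),
                     ("test", X_test_vec, y_test)]:
    pred = clf.predict(X_)
    proba = clf.predict_proba(X_)[:, 1]  # probability of the positive class
    print(name,
          "accuracy:", accuracy_score(y_, pred),
          "precision:", precision_score(y_, pred, zero_division=0),
          "recall:", recall_score(y_, pred, zero_division=0),
          "roc_auc:", roc_auc_score(y_, proba))

# Confusion matrix on the test set, visualized as a heatmap
cm = confusion_matrix(y_test, clf.predict(X_test_vec))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("predicted")
plt.ylabel("actual")
```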

9.5.6 Exercise 6: Training a Decision Tree Model

Tasks:

  1. Train a Decision Tree classifier with random_state=42
  2. Evaluate using the same metrics as Exercise 5 (accuracy, precision, recall, ROC AUC)
  3. Print and visualize the confusion matrix
  4. How does the Decision Tree compare to Logistic Regression? Look at both training and test performance
  5. Is there evidence of overfitting? How can you tell?

Hints:

  • Import DecisionTreeClassifier from sklearn.tree
  • Compare train vs test metrics - large gaps suggest overfitting
  • Decision trees tend to overfit without constraints
# Your code here. Add additional code cells as needed.
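The overfitting check can be sketched on the same toy setup; the point to notice is the gap between training and test accuracy:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
vec = CountVectorizer(max_features=1000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_vec, y_train)

# An unconstrained tree typically fits the training set perfectly;
# a large train/test gap is the classic symptom of overfitting.
train_acc = accuracy_score(y_train, tree.predict(X_train_vec))
test_acc = accuracy_score(y_test, tree.predict(X_test_vec))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```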

9.5.7 Exercise 7: Training a Random Forest Model

Tasks:

  1. Import and train a Random Forest classifier with n_estimators=100 and random_state=42
  2. Evaluate using the same metrics as previous exercises (accuracy, precision, recall, ROC AUC)
  3. Print and visualize the confusion matrix
  4. How does Random Forest compare to Logistic Regression and Decision Tree? Look at both training and test performance
  5. Does Random Forest show less overfitting than a single Decision Tree? Why might this be?

Hints:

  • Import RandomForestClassifier from sklearn.ensemble
  • Random Forest is an ensemble method that combines multiple decision trees
  • Ensemble methods typically reduce overfitting compared to single models
  • Compare the gap between train and test metrics across all three models
# Your code here. Add additional code cells as needed.
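The ensemble variant follows the same pattern; a minimal sketch on the toy corpus:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
vec = CountVectorizer(max_features=1000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

# Each tree sees a bootstrap sample and random feature subsets, so the
# averaged ensemble tends to memorize the training set less than one tree.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_vec, y_train)

train_acc = accuracy_score(y_train, rf.predict(X_train_vec))
test_acc = accuracy_score(y_test, rf.predict(X_test_vec))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```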

9.5.8 Exercise 8: TF-IDF Vectorization and Model Retraining

Tasks:

  1. Import and create a TfidfVectorizer with max_features=1000
  2. Fit the vectorizer on the training data only and transform both train and test sets
  3. Retrain all three models (Logistic Regression, Decision Tree, Random Forest) using the TF-IDF features
  4. Evaluate each model and print metrics for test sets
  5. How does TF-IDF compare to Count Vectorizer? Which vectorization method works better for each model?

Hints:

  • Import TfidfVectorizer from sklearn.feature_extraction.text
  • TF-IDF weights terms by their importance across documents
  • You should end up with 6 total model-vectorizer combinations (3 models × 2 vectorizers)
  • TF-IDF often performs better than counts for text classification
# Your code here. Add additional code cells as needed.
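Retraining all three models on TF-IDF features fits naturally into a loop; a sketch on the toy corpus, reporting test accuracy only (your version should report all four metrics):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# TF-IDF downweights terms that appear in many documents
tfidf = TfidfVectorizer(max_features=1000)
X_train_vec = tfidf.fit_transform(X_train)  # fit on training data only
X_test_vec = tfidf.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test_vec))
    print(name, "test accuracy:", results[name])
```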

9.5.9 Exercise 9: Model Comparison and Analysis

Tasks:

  1. Create a comparison DataFrame with columns: Model, Vectorizer, Accuracy (Test), Precision (Test), Recall (Test), ROC AUC (Test)
  2. Include all model-vectorizer combinations you tested
  3. Which model performed best overall? Which metric did you use to decide?
  4. Test your best model on a few custom sentences:
    • “The economic outlook is very positive and growth is strong.”
    • “The central bank faces significant challenges and uncertainty.”
    • “Inflation remains stable and within target.”
  5. Do the predictions make sense?

Hints:

  • Use pd.DataFrame() to create the comparison table
  • To predict on new sentences: preprocess → vectorize → predict
  • Create a small function to make prediction testing easier
# Your code here. Add additional code cells as needed.
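To illustrate the shape of the comparison table and a helper for testing custom sentences, here is a sketch with a single toy-trained TF-IDF + logistic regression combination (your table should contain all six model-vectorizer combinations, and with the full pipeline you would preprocess the custom sentences before vectorizing):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

tfidf = TfidfVectorizer(max_features=1000)
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(tfidf.fit_transform(X_train), y_train)

# One row per model-vectorizer combination
comparison = pd.DataFrame([{
    "Model": "Logistic Regression",
    "Vectorizer": "TF-IDF",
    "Accuracy (Test)": accuracy_score(
        y_test, clf.predict(tfidf.transform(X_test))),
}])
print(comparison)

def predict_sentiment(sentences, vectorizer, model):
    """Vectorize raw sentences and return 0/1 predictions.
    (With the full pipeline, preprocess the sentences first.)"""
    return model.predict(vectorizer.transform(sentences))

custom = [
    "The economic outlook is very positive and growth is strong.",
    "The central bank faces significant challenges and uncertainty.",
    "Inflation remains stable and within target.",
]
print(predict_sentiment(custom, tfidf, clf))
```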

9.5.10 Exercise 10: Reflection and Discussion

Tasks:

  1. How did the different vectorization methods (Count vs TF-IDF) affect model performance? Why do you think this is?
  2. Which model architecture (Logistic Regression, Decision Tree, Random Forest) seems most suitable for this sentiment analysis task? Consider both performance and interpretability
  3. What are some potential issues with using this model in practice? Think about:
    • Domain specificity (trained on central bank speeches)
    • Temporal aspects (language changes over time)
    • Neutral sentences (the model only predicts positive/negative)
  4. How could you improve the model further? Consider:
    • More sophisticated preprocessing (n-grams, domain-specific stopwords)
    • Handling negations better
    • Using pre-trained language models (BERT, etc.)
    • Collecting more training data

Hints:

  • No code required; reflect on practical and methodological aspects
  • Think about the trade-offs between model complexity and interpretability
  • Consider real-world deployment challenges
# Your reflections here (as comments or markdown)