9  Practice Session II

This practice session builds on the sentiment analysis exercise discussed in class. We will use the same dataset of labeled sentences from ECB, FED, and BIS speeches.

9.1 Problem Setup

We want to build a machine learning model that can predict the sentiment of a sentence from a central bank speech. The sentiment can be either positive (1) or negative (0). We will use the text of the sentences as features to train our model.

9.2 Dataset

We will use a pre-labeled dataset for sentence-level sentiment analysis of central bank speeches (Pfeifer and Marohl 2023), which is available on Hugging Face (Central Bank Communication Dataset). The dataset contains sentences from ECB, FED, and BIS speeches that have been labeled as positive or negative in terms of sentiment.

9.3 Putting the Problem into the Context of the Course

Given the description of the dataset, we can see that this is a supervised learning problem. We have a target variable that we want to predict, and we have features that we can use to predict this target variable. The target variable is binary, i.e., it can take two values: 0 or 1. The value 0 indicates that the sentiment is negative, while the value 1 indicates that the sentiment is positive. Thus, this is a binary classification problem.

9.4 Setting up the Environment

Let’s initialize our environment by importing the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from huggingface_hub import login
import spacy

Let’s load the dataset for this practice session. You can change the central_bank variable to load the dataset for FED or BIS speeches instead of ECB speeches if you want to work with a different central bank.

central_bank = "ECB" # Change this to "FED" or "BIS" if you want to use the dataset for FED or BIS speeches instead

df = pd.read_csv(f"hf://datasets/Moritz-Pfeifer/CentralBankCommunication/Sentiment/{central_bank}_prelabelled_sent.csv")

9.5 Exercises

Note that the exercises build on each other: you can sometimes skip one, but the results for later exercises depend on the previous ones. If you get stuck, skip ahead to the next exercise and come back to the previous one later.

9.5.1 Exercise 1: Familiarization with the Dataset

Tasks:

  1. Display the first 10 rows of the dataset. What columns does it contain?
  2. Use .info() to check the data types and see if there are any missing values
  3. Print a few sample sentences (5-10) along with their sentiment labels to get a feel for the data
  4. How many sentences are in the dataset in total?

Hints:

  • Use .head(10) to display the first rows
  • .info() shows column types and non-null counts
  • Use .sample(n) to randomly select a few rows for inspection
# Your code here. Add additional code cells as needed.
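If you want to check your approach, here is a minimal sketch on a toy stand-in DataFrame. The column names here ("sentence", "sentiment") are assumptions for illustration; inspect df.columns on the real loaded dataset, which may differ.

```python
import pandas as pd

# Toy stand-in for the loaded dataset; the real column names may differ.
df = pd.DataFrame({
    "sentence": ["Growth is strong.", "Risks are rising sharply.",
                 "Inflation remains stable.", "Demand is weakening."],
    "sentiment": [1, 0, 1, 0],
})

print(df.head(10))                    # first rows and the column names
df.info()                             # dtypes and non-null counts
print(df.sample(2, random_state=42))  # random rows for inspection
print("Total sentences:", len(df))
```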

9.5.2 Exercise 2: Understanding the Target Variable

Tasks:

  1. What is the proportion of positive vs negative sentences? Use value_counts(normalize=True) on the sentiment column
  2. Create a bar chart showing the distribution of sentiment labels
  3. Based on this distribution, would you say the dataset is balanced or imbalanced?
  4. Why is understanding class balance important for sentiment analysis?

Hints:

  • The target variable is sentiment (0 = negative, 1 = positive)
  • Use df['sentiment'].value_counts() and plot with .plot(kind='bar')
  • Consider what would happen if 90% of sentences were positive
# Your code here. Add additional code cells as needed.
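As a reference point, a quick sketch on toy labels (on the real data you would use the loaded df's sentiment column instead):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"sentiment": [1, 0, 1, 0, 1, 1]})  # toy labels

# Proportion of positive vs negative sentences
proportions = df["sentiment"].value_counts(normalize=True)
print(proportions)

# Bar chart of the label distribution
ax = df["sentiment"].value_counts().plot(kind="bar")
ax.set_xlabel("sentiment (0 = negative, 1 = positive)")
ax.set_ylabel("count")
plt.tight_layout()
```

If one class dominated (say 90% positive), a model predicting "positive" for everything would already score 90% accuracy, which is why checking the balance first matters.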

9.5.3 Exercise 3: Text Preprocessing

Tasks:

  1. Load the spaCy English model: nlp = spacy.load('en_core_web_sm')
  2. Create a preprocessing function that:
    • Takes a list/Series of texts as input
    • Uses nlp.pipe() to process texts efficiently
    • Disables unnecessary components: ["parser", "ner"]
    • Returns lemmatized, lowercase tokens while removing stopwords, punctuation, whitespace, and numbers
    • Returns a list of preprocessed texts (strings joined with spaces)
  3. Apply your function to the entire dataset and create a new column called processed_text
  4. Compare an original text with its preprocessed version. What changed?

Hints:

  • Look at the preprocess_texts() function in the lecture notes as reference
  • Use token.lemma_.lower() for lemmatization and lowercasing
  • Filter tokens with: not token.is_stop and not token.is_punct and not token.is_space and not token.like_num
  • Join tokens back: " ".join(tokens)
# Your code here. Add additional code cells as needed.
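One way such a function can be sketched, patterned after the preprocess_texts() function from the lecture notes (the exact filters there may differ). The demo call is left commented out because it requires the en_core_web_sm model to be downloaded first:

```python
import spacy

def preprocess_texts(texts, nlp, disable=("parser", "ner")):
    """Lemmatize and lowercase tokens, dropping stopwords, punctuation,
    whitespace, and numbers; returns one joined string per input text."""
    processed = []
    for doc in nlp.pipe(texts, disable=list(disable)):
        tokens = [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop
            and not token.is_punct
            and not token.is_space
            and not token.like_num
        ]
        processed.append(" ".join(tokens))
    return processed

# Requires: python -m spacy download en_core_web_sm
# nlp = spacy.load("en_core_web_sm")
# df["processed_text"] = preprocess_texts(df["sentence"], nlp)
```

Comparing an original sentence with its processed version, you should see stopwords, punctuation, and numbers gone, and remaining words reduced to lowercase lemmas.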

9.5.4 Exercise 4: Text Vectorization - Count Vectorizer

Tasks:

  1. Split your data into features (X) and target (y)
  2. Create train/test splits (80/20) using train_test_split with random_state=42 and stratify=y
  3. Import and create a CountVectorizer with max_features=1000
  4. Fit the vectorizer on the training data only and transform both train and test sets
  5. Print the shape of the resulting feature matrices. What do the dimensions represent?

Hints:

  • Use from sklearn.feature_extraction.text import CountVectorizer
  • Use from sklearn.model_selection import train_test_split
  • Fit only on training data: vectorizer.fit(X_train), then transform: X_train_vec = vectorizer.transform(X_train)
  • Never fit on test data (data leakage!)
# Your code here. Add additional code cells as needed.
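The split-then-vectorize pattern can be sketched as follows, using a few toy preprocessed sentences as a stand-in for the processed_text column:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

vectorizer = CountVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_vec = vectorizer.transform(X_test)        # reuse the fitted vocabulary

# Rows = sentences, columns = vocabulary terms
print(X_train_vec.shape, X_test_vec.shape)
```

Fitting only on the training split keeps test-set vocabulary from leaking into the features.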

9.5.5 Exercise 5: Training a Logistic Regression Model

Tasks:

  1. Import LogisticRegression from sklearn
  2. Train a logistic regression model with max_iter=1000 and random_state=42
  3. Make predictions on both train and test sets
  4. Calculate and print the following metrics for both sets:
    • Accuracy
    • Precision
    • Recall
    • ROC AUC score
  5. Create a confusion matrix for the test set predictions using confusion_matrix and visualize it with seaborn’s heatmap

Hints:

  • Import from: sklearn.linear_model, sklearn.metrics
  • Use clf.predict() for predictions and clf.predict_proba() for probabilities
  • For ROC AUC with binary classification, use the probability of the positive class: y_proba[:, 1]
  • Transpose the confusion matrix if needed to match lecture conventions
# Your code here. Add additional code cells as needed.
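A self-contained sketch of the train/evaluate loop on the same toy corpus (on the real data, substitute your vectorized splits from Exercise 4):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, confusion_matrix)
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
vec = CountVectorizer(max_features=1000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_vec, y_train)

for name, X_, y_ in [("train", X_train_vec, y_train),
                     ("test", X_test_vec, y_test)]:
    pred = clf.predict(X_)
    proba = clf.predict_proba(X_)[:, 1]  # probability of the positive class
    print(name,
          "accuracy:", accuracy_score(y_, pred),
          "precision:", precision_score(y_, pred, zero_division=0),
          "recall:", recall_score(y_, pred, zero_division=0),
          "roc_auc:", roc_auc_score(y_, proba))

# Confusion matrix on the test set, visualized as a heatmap
cm = confusion_matrix(y_test, clf.predict(X_test_vec))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("predicted")
plt.ylabel("actual")
```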

9.5.6 Exercise 6: Training a Decision Tree Model

Tasks:

  1. Train a Decision Tree classifier with random_state=42
  2. Evaluate using the same metrics as Exercise 5 (accuracy, precision, recall, ROC AUC)
  3. Print and visualize the confusion matrix
  4. How does the Decision Tree compare to Logistic Regression? Look at both training and test performance
  5. Is there evidence of overfitting? How can you tell?

Hints:

  • Import DecisionTreeClassifier from sklearn.tree
  • Compare train vs test metrics - large gaps suggest overfitting
  • Decision trees tend to overfit without constraints
# Your code here. Add additional code cells as needed.
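The overfitting check can be sketched on the same toy setup; the point to notice is the gap between training and test accuracy:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
vec = CountVectorizer(max_features=1000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_vec, y_train)

# An unconstrained tree typically fits the training set perfectly;
# a large train/test gap is the classic symptom of overfitting.
train_acc = accuracy_score(y_train, tree.predict(X_train_vec))
test_acc = accuracy_score(y_test, tree.predict(X_test_vec))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```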

9.5.7 Exercise 7: Training a Random Forest Model

Tasks:

  1. Import and train a Random Forest classifier with n_estimators=100 and random_state=42
  2. Evaluate using the same metrics as previous exercises (accuracy, precision, recall, ROC AUC)
  3. Print and visualize the confusion matrix
  4. How does Random Forest compare to Logistic Regression and Decision Tree? Look at both training and test performance
  5. Does Random Forest show less overfitting than a single Decision Tree? Why might this be?

Hints:

  • Import RandomForestClassifier from sklearn.ensemble
  • Random Forest is an ensemble method that combines multiple decision trees
  • Ensemble methods typically reduce overfitting compared to single models
  • Compare the gap between train and test metrics across all three models
# Your code here. Add additional code cells as needed.
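The ensemble variant follows the same pattern; a minimal sketch on the toy corpus:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
vec = CountVectorizer(max_features=1000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

# Each tree sees a bootstrap sample and random feature subsets, so the
# averaged ensemble tends to memorize the training set less than one tree.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_vec, y_train)

train_acc = accuracy_score(y_train, rf.predict(X_train_vec))
test_acc = accuracy_score(y_test, rf.predict(X_test_vec))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```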

9.5.8 Exercise 8: TF-IDF Vectorization and Model Retraining

Tasks:

  1. Import and create a TfidfVectorizer with max_features=1000
  2. Fit the vectorizer on the training data only and transform both train and test sets
  3. Retrain all three models (Logistic Regression, Decision Tree, Random Forest) using the TF-IDF features
  4. Evaluate each model and print metrics for test sets
  5. How does TF-IDF compare to Count Vectorizer? Which vectorization method works better for each model?

Hints:

  • Import TfidfVectorizer from sklearn.feature_extraction.text
  • TF-IDF weights terms by their importance across documents
  • You should end up with 6 total model-vectorizer combinations (3 models × 2 vectorizers)
  • TF-IDF often performs better than counts for text classification
# Your code here. Add additional code cells as needed.
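Retraining all three models on TF-IDF features fits naturally into a loop; a sketch on the toy corpus, reporting test accuracy only (your version should report all four metrics):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# TF-IDF downweights terms that appear in many documents
tfidf = TfidfVectorizer(max_features=1000)
X_train_vec = tfidf.fit_transform(X_train)  # fit on training data only
X_test_vec = tfidf.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test_vec))
    print(name, "test accuracy:", results[name])
```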

9.5.9 Exercise 9: Model Comparison and Analysis

Tasks:

  1. Create a comparison DataFrame with columns: Model, Vectorizer, Accuracy (Test), Precision (Test), Recall (Test), ROC AUC (Test)
  2. Include all model-vectorizer combinations you tested
  3. Which model performed best overall? Which metric did you use to decide?
  4. Test your best model on a few custom sentences:
    • “The economic outlook is very positive and growth is strong.”
    • “The central bank faces significant challenges and uncertainty.”
    • “Inflation remains stable and within target.”
  5. Do the predictions make sense?

Hints:

  • Use pd.DataFrame() to create the comparison table
  • To predict on new sentences: preprocess → vectorize → predict
  • Create a small function to make prediction testing easier
# Your code here. Add additional code cells as needed.
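To illustrate the shape of the comparison table and a helper for testing custom sentences, here is a sketch with a single toy-trained TF-IDF + logistic regression combination (your table should contain all six model-vectorizer combinations, and with the full pipeline you would preprocess the custom sentences before vectorizing):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy preprocessed sentences standing in for df["processed_text"]
X = ["growth strong outlook positive", "inflation risk rise sharply",
     "recovery remain robust", "unemployment increase demand weak",
     "confidence improve steadily", "financial condition deteriorate",
     "export expand healthy pace", "downturn pose severe challenge"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

tfidf = TfidfVectorizer(max_features=1000)
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(tfidf.fit_transform(X_train), y_train)

# One row per model-vectorizer combination
comparison = pd.DataFrame([{
    "Model": "Logistic Regression",
    "Vectorizer": "TF-IDF",
    "Accuracy (Test)": accuracy_score(
        y_test, clf.predict(tfidf.transform(X_test))),
}])
print(comparison)

def predict_sentiment(sentences, vectorizer, model):
    """Vectorize raw sentences and return 0/1 predictions.
    (With the full pipeline, preprocess the sentences first.)"""
    return model.predict(vectorizer.transform(sentences))

custom = [
    "The economic outlook is very positive and growth is strong.",
    "The central bank faces significant challenges and uncertainty.",
    "Inflation remains stable and within target.",
]
print(predict_sentiment(custom, tfidf, clf))
```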

9.5.10 Exercise 10: Reflection and Discussion

Tasks:

  1. How did the different vectorization methods (Count vs TF-IDF) affect model performance? Why do you think this is?
  2. Which model architecture (Logistic Regression, Decision Tree, Random Forest) seems most suitable for this sentiment analysis task? Consider both performance and interpretability
  3. What are some potential issues with using this model in practice? Think about:
    • Domain specificity (trained on central bank speeches)
    • Temporal aspects (language changes over time)
    • Neutral sentences (the model only predicts positive/negative)
  4. How could you improve the model further? Consider:
    • More sophisticated preprocessing (n-grams, domain-specific stopwords)
    • Handling negations better
    • Using pre-trained language models (BERT, etc.)
    • Collecting more training data

Hints:

  • No code required; reflect on practical and methodological aspects
  • Think about the trade-offs between model complexity and interpretability
  • Consider real-world deployment challenges
# Your reflections here (as comments or markdown)