12  Practice Session III

This practice session builds on the sentiment analysis exercises from Practice Session II. We will use the same dataset of labelled sentences from central bank speeches but now apply methods from the Generative AI chapter, including sentence embeddings, pre-trained Hugging Face pipelines, and LLM-based classification via the Ollama API.

12.1 Problem Setup

We want to classify the sentiment of sentences from central bank speeches as positive (1) or negative (0). In Practice Session II, we used hand-crafted text features like TF-IDF as input for machine learning models. Now, we will use pre-trained language models to encode sentences and classify sentiment.

12.2 Dataset

We will use a pre-labelled dataset for sentence-level sentiment analysis of central bank speeches (Pfeifer and Marohl 2023), which is available on Hugging Face (Central Bank Communication Dataset). The dataset contains sentences from ECB, FED, and BIS speeches that have been labelled as positive or negative in terms of sentiment.

12.3 Setting up the Environment

Let’s initialize our environment by importing the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, recall_score, precision_score

Let’s load the dataset for this practice session:

central_bank = "FED"

df = pd.read_csv(f"hf://datasets/Moritz-Pfeifer/CentralBankCommunication/Sentiment/{central_bank}_prelabelled_sent.csv")

12.4 Exercises

Note that the exercises build on each other. You can sometimes skip an exercise, but later exercises may depend on results from earlier ones. If you get stuck, skip ahead to the next exercise and come back to the earlier one later.

12.4.1 Exercise 1: Sentence Embeddings

In Practice Session II, we used TF-IDF to represent sentences as numerical vectors. Now we will use a pre-trained sentence transformer model to generate dense vector representations (embeddings) of our sentences.

Tasks:

  1. Load the pre-trained sentence transformer model "all-MiniLM-L6-v2" using SentenceTransformer
  2. Pick two sentences from the dataset that you would expect to have similar sentiment and two that you would expect to have different sentiment. Encode all four sentences using model.encode()
  3. Compute the pairwise cosine similarity between the four sentences using model.similarity(). Do the results match your expectations?
  4. Encode all sentences in the dataset using the sentence transformer model and store the resulting embeddings in a new column called "embedding"

Hints:

  • Use SentenceTransformer("all-MiniLM-L6-v2") to load the model
  • Use model.encode(list_of_sentences) to get embeddings
  • Use model.similarity(embeddings, embeddings) to compute pairwise cosine similarity
  • To encode the full dataset: df["embedding"] = list(model.encode(df["text"].tolist()))
  • The embeddings have 384 dimensions, which is much smaller than a typical TF-IDF vocabulary
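The pairwise values returned by model.similarity() are cosine similarities. As a minimal sketch of what is being computed, cosine similarity can be reproduced directly with NumPy; the vectors below are small stand-ins, not real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the two L2-normalised vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors (real embeddings from "all-MiniLM-L6-v2" have 384 dimensions)
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 1.0])
w = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(u, v))  # identical direction -> 1.0
print(cosine_similarity(u, w))  # orthogonal -> 0.0
```

Semantically similar sentences should end up with embeddings whose cosine similarity is close to 1, while unrelated sentences score closer to 0.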
# Your code here. Add additional code cells as needed.

12.4.2 Exercise 2: Sentiment Classification with Embeddings

Now that we have sentence embeddings, we can use them as input features for a machine learning model instead of TF-IDF vectors.

Tasks:

  1. Split the data into training (80%) and test (20%) sets using train_test_split with random_state=42 and stratify=y. Use the embeddings as features (X) and the sentiment column as the target (y)
  2. Train a Logistic Regression model with max_iter=1000 and random_state=42 on the training data
  3. Train a Random Forest classifier with n_estimators=100 and random_state=42 on the training data
  4. Evaluate both models on the test set using accuracy, precision, recall, and ROC AUC
  5. How do these results compare to the TF-IDF-based models from Practice Session II? Why might embeddings perform differently?

Hints:

  • Convert embeddings to a list for sklearn: X = df["embedding"].to_list()
  • Use clf.predict() for class predictions and clf.predict_proba()[:, 1] for ROC AUC
  • Sentence embeddings capture semantic meaning, while TF-IDF captures word frequency patterns
  • Think about what information each representation preserves and what it loses
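The split-train-evaluate workflow from the tasks above can be sketched as follows. Synthetic 384-dimensional features stand in for the real sentence embeddings here, so the scores themselves are not meaningful; only the structure of the code is:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Synthetic features stand in for the 384-dimensional sentence embeddings
X, y = make_classification(n_samples=500, n_features=384, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

y_pred = clf.predict(X_test)                 # hard class predictions
y_prob = clf.predict_proba(X_test)[:, 1]     # probability of the positive class

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("ROC AUC:  ", roc_auc_score(y_test, y_prob))
```

The Random Forest follows the same pattern; only the estimator changes.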
# Your code here. Add additional code cells as needed.

12.4.3 Exercise 3: Pre-trained Sentiment Analysis Pipeline

Instead of training our own classifier, we can use a pre-trained model from Hugging Face that is already fine-tuned for sentiment analysis.

Tasks:

  1. Create a sentiment analysis pipeline using pipeline("sentiment-analysis")
  2. Test the pipeline on the following two sentences and inspect the output format:
    • “The economy is growing at a strong and sustainable pace.”
    • “Inflation risks remain elevated and economic uncertainty persists.”
  3. Apply the pipeline to all sentences in the test set (from Exercise 2) using batch_size=32. Map the output labels to match our dataset format: POSITIVE = 1, NEGATIVE = 0
  4. Evaluate the pre-trained model on the test set using accuracy, precision, recall, and ROC AUC. For ROC AUC, use the model’s confidence score as the predicted probability (use 1 - score when the label is NEGATIVE)
  5. How does this pre-trained model compare to the classifiers you trained in Exercise 2? What might explain the differences?

Hints:

  • The pipeline returns a list of dictionaries with label and score keys
  • Map labels: int(r['label'] == 'POSITIVE') for predicted sentiment
  • For the probability score: r['score'] if r['label'] == 'POSITIVE' else 1 - r['score']
  • The default model is DistilBERT fine-tuned on SST-2 (movie reviews), not central bank speeches
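The label-mapping logic in the hints can be sketched as below. The results list is a mocked pipeline output for illustration; in practice it would come from calling the pipeline on the test sentences with batch_size=32:

```python
# Mocked pipeline output (a real call would return one dict per input sentence)
results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.75},
]

# Predicted class: 1 for POSITIVE, 0 for NEGATIVE
y_pred = [int(r["label"] == "POSITIVE") for r in results]

# Predicted probability of the positive class, as needed for ROC AUC
y_prob = [r["score"] if r["label"] == "POSITIVE" else 1 - r["score"] for r in results]

print(y_pred)  # [1, 0]
print(y_prob)  # [0.98, 0.25]
```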
# Your code here. Add additional code cells as needed.

12.4.4 Exercise 4: Zero-Shot Classification with Hugging Face

Zero-shot classification allows us to classify text into categories without any task-specific training data. The model uses its general language understanding to match text to candidate labels.

Tasks:

  1. Create a zero-shot classification pipeline using pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
  2. Test the pipeline on a few example sentences from the dataset. Use ["positive", "negative"] as candidate labels
  3. Apply the zero-shot classifier to the test set. Map the predictions to our dataset format: positive = 1, negative = 0. Since zero-shot classification can be slow, you may want to test on a smaller subset first (e.g., the first 100 sentences of the test set)
  4. Evaluate the zero-shot classifier using accuracy, precision, recall, and ROC AUC
  5. Try different candidate labels (e.g., ["optimistic", "pessimistic"]). Does the choice of labels affect the results?

Hints:

  • The classifier returns a dictionary with labels (sorted by score) and scores
  • The top label is result['labels'][0] and its score is result['scores'][0]
  • For ROC AUC, use the score for the “positive” label as the predicted probability
  • Careful phrasing of candidate labels can significantly improve zero-shot performance
  • If applying to many sentences, consider using a loop with progress tracking
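Extracting a prediction and a positive-class probability from one zero-shot result can be sketched as below. The result dict is mocked for illustration; a real call would be classifier(sentence, candidate_labels=["positive", "negative"]):

```python
# Mocked zero-shot output; labels are sorted by score, highest first
result = {
    "sequence": "The economy is growing at a strong and sustainable pace.",
    "labels": ["positive", "negative"],
    "scores": [0.91, 0.09],
}

# Predicted class from the top-ranked label
pred = int(result["labels"][0] == "positive")

# Score assigned to "positive" regardless of its rank, for ROC AUC
prob_positive = result["scores"][result["labels"].index("positive")]

print(pred, prob_positive)  # 1 0.91
```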
# Your code here. Add additional code cells as needed.

12.4.5 Exercise 5: LLM-Based Classification via Ollama [Optional]


Important: Note on Computational Resources

This exercise is optional since the models you can run through Nuvolos may not be powerful enough to achieve good performance on this task. If you run this on a personal computer with a GPU, you are likely to see much better results.

We can also use a locally hosted LLM via Ollama and its OpenAI-compatible API to perform sentiment classification through prompt engineering.

Before starting, make sure Ollama is running (ollama serve in the terminal) and that you have pulled a model (e.g., ollama pull gemma3:1b).

Tasks:

  1. Set up the OpenAI client to connect to your local Ollama server at http://localhost:11434/v1
  2. Write a function classify_zero_shot(sentence) that sends a prompt to the LLM asking it to classify the sentiment of a sentence as “positive” or “negative”. The function should return the model’s response as a lowercase string
  3. Write a function classify_few_shot(sentence) that includes 4-5 example sentences with their correct labels in the prompt before asking the model to classify the given sentence
  4. Apply both functions to a small subset of the test set (e.g., 20-30 sentences) and evaluate the results using accuracy. How do zero-shot and few-shot compare?

Hints:

  • Use OpenAI(base_url="http://localhost:11434/v1", api_key="") to create the client
  • Structure your prompt clearly: describe the task, provide the sentence, and specify the exact output format
  • For few-shot, include examples in the prompt before the sentence to classify
  • Use response.choices[0].message.content.strip().lower() to clean the output
  • Small models may struggle with this task, so do not expect perfect results. Consider using a larger model (e.g., gemma3:4b) if available
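One way to structure the two prompts is sketched below. The helper names and the last two example sentences are illustrative (the first two examples are taken from Exercise 3), and the actual API call is shown commented out since it requires a running Ollama server:

```python
def build_zero_shot_prompt(sentence: str) -> str:
    """Zero-shot prompt: task description, the sentence, and the exact output format."""
    return (
        "Classify the sentiment of the following sentence from a central bank speech. "
        "Answer with exactly one word: positive or negative.\n\n"
        f"Sentence: {sentence}"
    )

def build_few_shot_prompt(sentence: str) -> str:
    """Few-shot prompt: labelled examples precede the sentence to classify."""
    examples = [  # illustrative labelled examples
        ("The economy is growing at a strong and sustainable pace.", "positive"),
        ("Inflation risks remain elevated and economic uncertainty persists.", "negative"),
        ("Labour market conditions have improved markedly.", "positive"),
        ("Credit conditions have tightened and downside risks have increased.", "negative"),
    ]
    shots = "\n".join(f"Sentence: {s}\nSentiment: {lab}" for s, lab in examples)
    return (
        "Classify the sentiment of sentences from central bank speeches. "
        "Answer with exactly one word: positive or negative.\n\n"
        f"{shots}\n\nSentence: {sentence}\nSentiment:"
    )

# The prompt is then sent to the local Ollama server, e.g.:
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="")
# response = client.chat.completions.create(
#     model="gemma3:1b",
#     messages=[{"role": "user", "content": build_zero_shot_prompt(sentence)}],
# )
# label = response.choices[0].message.content.strip().lower()
```

Ending the few-shot prompt with "Sentiment:" nudges the model to complete the pattern with a single label rather than a full sentence.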
# Your code here. Add additional code cells as needed.