12 Practice Session III
This practice session builds on the sentiment analysis exercises from Practice Session II. We will use the same dataset of labelled sentences from central bank speeches but now apply methods from the Generative AI chapter, including sentence embeddings, pre-trained Hugging Face pipelines, and LLM-based classification via the Ollama API.
12.1 Problem Setup
We want to classify the sentiment of sentences from central bank speeches as positive (1) or negative (0). In Practice Session II, we used hand-crafted text features such as TF-IDF as input for machine learning models. Now we will use pre-trained language models to encode sentences and classify their sentiment.
12.2 Dataset
We will use a pre-labelled dataset for sentence-level sentiment analysis of central bank speeches (Pfeifer and Marohl 2023), which is available on Hugging Face (Central Bank Communication Dataset). The dataset contains sentences from ECB, FED, and BIS speeches that have been labelled as positive or negative in sentiment.
12.3 Setting up the Environment
Let's initialize our environment by importing the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, recall_score, precision_score
Let's load the dataset for this practice session:
central_bank = "FED"
df = pd.read_csv(f"hf://datasets/Moritz-Pfeifer/CentralBankCommunication/Sentiment/{central_bank}_prelabelled_sent.csv")
12.4 Exercises
Note that the exercises build on each other. You can sometimes skip an exercise, but the results for later exercises will depend on the previous ones. If you get stuck, skip to the next exercise and come back to the previous one later.
12.4.1 Exercise 1: Sentence Embeddings
In Practice Session II, we used TF-IDF to represent sentences as numerical vectors. Now we will use a pre-trained sentence transformer model to generate dense vector representations (embeddings) of our sentences.
Tasks:
- Load the pre-trained sentence transformer model "all-MiniLM-L6-v2" using SentenceTransformer
- Pick two sentences from the dataset that you would expect to have similar sentiment and two that you would expect to have different sentiment. Encode all four sentences using model.encode()
- Compute the pairwise cosine similarity between the four sentences using model.similarity(). Do the results match your expectations?
- Encode all sentences in the dataset using the sentence transformer model and store the resulting embeddings in a new column called "embedding"
Hints:
- Use SentenceTransformer("all-MiniLM-L6-v2") to load the model
- Use model.encode(list_of_sentences) to get embeddings
- Use model.similarity(embeddings, embeddings) to compute pairwise cosine similarity
- To encode the full dataset: df["embedding"] = list(model.encode(df["text"].tolist()))
- The embeddings have 384 dimensions, which is much smaller than a typical TF-IDF vocabulary
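The cosine similarity that model.similarity() computes can be sketched directly with NumPy. The toy 4-dimensional vectors below are illustrative stand-ins for the 384-dimensional MiniLM embeddings, chosen so the parallel and orthogonal cases are easy to verify by hand:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for 384-dimensional sentence embeddings.
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 4.0, 6.0, 8.0])    # same direction as u
w = np.array([4.0, -3.0, 2.0, -1.0])  # orthogonal to u (dot product is 0)

print(cosine_similarity(u, v))  # approximately 1.0 (parallel)
print(cosine_similarity(u, w))  # 0.0 (orthogonal)
```

Real embeddings of different sentences rarely hit these extremes; similar-sentiment sentences typically score closer to 1 than dissimilar ones.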
# Your code here. Add additional code cells as needed.
12.4.2 Exercise 2: Sentiment Classification with Embeddings
Now that we have sentence embeddings, we can use them as input features for a machine learning model instead of TF-IDF vectors.
Tasks:
- Split the data into training (80%) and test (20%) sets using train_test_split with random_state=42 and stratify=y. Use the embeddings as features (X) and the sentiment column as the target (y)
- Train a Logistic Regression model with max_iter=1000 and random_state=42 on the training data
- Train a Random Forest classifier with n_estimators=100 and random_state=42 on the training data
- Evaluate both models on the test set using accuracy, precision, recall, and ROC AUC
- How do these results compare to the TF-IDF-based models from Practice Session II? Why might embeddings perform differently?
Hints:
- Convert embeddings to a list for sklearn: X = df["embedding"].to_list()
- Use clf.predict() for class predictions and clf.predict_proba()[:, 1] for ROC AUC
- Sentence embeddings capture semantic meaning, while TF-IDF captures word frequency patterns
- Think about what information each representation preserves and what it loses
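One possible sketch of the train/evaluate loop, using random synthetic vectors in place of the real embeddings (on the actual data, X would come from df["embedding"].to_list() and y from the sentiment column; the synthetic labels here are constructed from the first coordinate so the task is learnable):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Synthetic stand-ins for the 384-dimensional sentence embeddings.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 384))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scores = {}
for clf in (LogisticRegression(max_iter=1000, random_state=42),
            RandomForestClassifier(n_estimators=100, random_state=42)):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1, for ROC AUC
    scores[type(clf).__name__] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, s in scores.items():
    print(name, s)
```

The same loop applies unchanged to the real embeddings; only the construction of X and y differs.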
# Your code here. Add additional code cells as needed.
12.4.3 Exercise 3: Pre-trained Sentiment Analysis Pipeline
Instead of training our own classifier, we can use a pre-trained model from Hugging Face that is already fine-tuned for sentiment analysis.
Tasks:
- Create a sentiment analysis pipeline using pipeline("sentiment-analysis")
- Test the pipeline on the following two sentences and inspect the output format:
  - “The economy is growing at a strong and sustainable pace.”
  - “Inflation risks remain elevated and economic uncertainty persists.”
- Apply the pipeline to all sentences in the test set (from Exercise 2) using batch_size=32. Map the output labels to match our dataset format: POSITIVE = 1, NEGATIVE = 0
- Evaluate the pre-trained model on the test set using accuracy, precision, recall, and ROC AUC. For ROC AUC, use the model’s confidence score as the predicted probability (use 1 - score when the label is NEGATIVE)
- How does this pre-trained model compare to the classifiers you trained in Exercise 2? What might explain the differences?
Hints:
- The pipeline returns a list of dictionaries with label and score keys
- Map labels: int(r['label'] == 'POSITIVE') for predicted sentiment
- For the probability score: r['score'] if r['label'] == 'POSITIVE' else 1 - r['score']
- The default model is DistilBERT fine-tuned on SST-2 (movie reviews), not central bank speeches
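The label-mapping step can be sketched on hand-written results shaped like the pipeline's output; the dictionaries below are illustrative, not actual model output:

```python
# Illustrative results shaped like the output of pipeline("sentiment-analysis");
# the scores are made up for demonstration.
results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.95},
]

# Map labels to the dataset format: POSITIVE = 1, NEGATIVE = 0.
preds = [int(r["label"] == "POSITIVE") for r in results]

# Confidence that the sentiment is positive, usable as the predicted
# probability for ROC AUC (1 - score when the label is NEGATIVE).
probs = [r["score"] if r["label"] == "POSITIVE" else 1 - r["score"] for r in results]

print(preds)  # -> [1, 0]
print(probs)  # approximately [0.98, 0.05]
```

On the real output, the same two comprehensions convert a whole batch of pipeline results at once.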
# Your code here. Add additional code cells as needed.
12.4.4 Exercise 4: Zero-Shot Classification with Hugging Face
Zero-shot classification allows us to classify text into categories without any task-specific training data. The model uses its general language understanding to match text to candidate labels.
Tasks:
- Create a zero-shot classification pipeline using pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
- Test the pipeline on a few example sentences from the dataset. Use ["positive", "negative"] as candidate labels
- Apply the zero-shot classifier to the test set. Map the predictions to our dataset format: positive = 1, negative = 0. Since zero-shot classification can be slow, you may want to test on a smaller subset first (e.g., the first 100 sentences of the test set)
- Evaluate the zero-shot classifier using accuracy, precision, recall, and ROC AUC
- Try different candidate labels (e.g., ["optimistic", "pessimistic"]). Does the choice of labels affect the results?
Hints:
- The classifier returns a dictionary with labels (sorted by score) and scores
- The top label is result['labels'][0] and its score is result['scores'][0]
- For ROC AUC, use the score for the “positive” label as the predicted probability
- Careful phrasing of candidate labels can significantly improve zero-shot performance
- If applying to many sentences, consider using a loop with progress tracking
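Parsing a zero-shot result can likewise be sketched on a hand-written dictionary shaped like the pipeline's output; the scores below are illustrative, not actual model output:

```python
# Illustrative result shaped like the output of the zero-shot pipeline:
# labels are sorted by descending score; the numbers are made up.
result = {
    "sequence": "The economy is growing at a strong and sustainable pace.",
    "labels": ["positive", "negative"],
    "scores": [0.91, 0.09],
}

# The top-ranked label gives the predicted class.
prediction = int(result["labels"][0] == "positive")

# The score of the "positive" label (wherever it ranks) serves as the
# predicted probability for ROC AUC.
positive_prob = result["scores"][result["labels"].index("positive")]

print(prediction, positive_prob)  # -> 1 0.91
```

Looking the "positive" score up by index rather than taking scores[0] keeps the probability correct even when "negative" is the top label.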
# Your code here. Add additional code cells as needed.
12.4.5 Exercise 5: LLM-Based Classification via Ollama [Optional]
This exercise is optional since the models you can run through Nuvolos may not be powerful enough to achieve good performance on this task. If you run this on a personal computer with a GPU, you are likely to see much better results.
We can also use a locally hosted LLM via Ollama and the OpenAI-compatible API to perform sentiment classification using prompt engineering.
Before starting, make sure Ollama is running (ollama serve in the terminal) and that you have pulled a model (e.g., ollama pull gemma3:1b).
Tasks:
- Set up the OpenAI client to connect to your local Ollama server at http://localhost:11434/v1
- Write a function classify_zero_shot(sentence) that sends a prompt to the LLM asking it to classify the sentiment of a sentence as “positive” or “negative”. The function should return the model’s response as a lowercase string
- Write a function classify_few_shot(sentence) that includes 4-5 example sentences with their correct labels in the prompt before asking the model to classify the given sentence
- Apply both functions to a small subset of the test set (e.g., 20-30 sentences) and evaluate the results using accuracy. How do zero-shot and few-shot compare?
Hints:
- Use OpenAI(base_url="http://localhost:11434/v1", api_key="") to create the client
- Structure your prompt clearly: describe the task, provide the sentence, and specify the exact output format
- For few-shot, include examples in the prompt before the sentence to classify
- Use response.choices[0].message.content.strip().lower() to clean the output
- Small models may struggle with this task, so do not expect perfect results. Consider using a larger model (e.g., gemma3:4b) if available
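A minimal sketch of the approach, assuming a running Ollama server; the helper names, prompt wording, and example sentences are illustrative, and only the prompt construction runs without the server:

```python
def build_prompt(sentence, examples=()):
    """Build a sentiment prompt; passing labelled examples makes it few-shot."""
    lines = [
        "Classify the sentiment of the last sentence as 'positive' or 'negative'.",
        "Answer with a single word.",
    ]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nSentiment: {label}")
    lines.append(f"Sentence: {sentence}\nSentiment:")
    return "\n".join(lines)

def classify(client, sentence, examples=(), model="gemma3:1b"):
    """Send the prompt to the OpenAI-compatible Ollama endpoint and
    return the model's answer as a lowercase string."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(sentence, examples)}],
    )
    return response.choices[0].message.content.strip().lower()

# Prompt construction works without a server (example sentences are made up):
examples = [
    ("The outlook has improved markedly.", "positive"),
    ("Downside risks to growth have intensified.", "negative"),
]
print(build_prompt("Inflation expectations remain well anchored.", examples))

# With Ollama running (ollama serve), classification would look like:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="")
# classify(client, "Inflation expectations remain well anchored.", examples)
```

Calling classify with examples=() gives the zero-shot variant, so both required functions can be thin wrappers around it.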
# Your code here. Add additional code cells as needed.