3 Basic Concepts in Supervised Learning
We have seen in the previous chapter that supervised learning is a type of machine learning where the model is trained on a labeled dataset, meaning that each input data point \(x\) has a corresponding output label \(y\). The goal of supervised learning is to learn a mapping from inputs to outputs that can be used to make predictions on new, unseen data. We can distinguish between two main types of supervised learning tasks:
- Regression tasks, where the output variable \(y\) is continuous (e.g., predicting house prices, stock prices, etc.).
- Classification tasks, where the output variable \(y\) is categorical (e.g., spam detection, image recognition, etc.).
Depending on the type of task, different models and algorithms are used in supervised learning. This section provides an overview of some common supervised learning models and the key concepts involved in evaluating and improving their performance.
A key element of supervised learning is the training process, where the model learns from the labeled data by adjusting its parameters to minimize a loss function. The loss function quantifies the difference between the predicted outputs and the true labels, guiding the model to improve its predictions over time. This process is typically done using optimization algorithms such as gradient descent. In some special cases, closed-form solutions exist (e.g., linear regression), but in most cases, numerical optimization methods are required. Furthermore, note that the loss function used for regression tasks is typically different from the one used for classification tasks but the overall goal remains the same: to minimize the loss and improve the model’s predictive accuracy.
We will start this section by placing two supervised learning models you are already familiar with, linear regression and logistic regression, in a machine-learning context. We then take a brief look at another important supervised learning model, k-nearest neighbors. Later, we will discuss how to evaluate regression and classification models and introduce the concepts of generalization and overfitting. Finally, we will implement some of these concepts in Python.
3.1 Basic Supervised Learning Models
In this subsection, we will first quickly review linear regression and logistic regression in a machine-learning context. Since you are already familiar with these methods, we will use them to introduce some new concepts such as the decision boundary and feature engineering. Then, we will briefly discuss the k-nearest neighbors algorithm, which is a simple yet powerful supervised learning model that can be used for both regression and classification tasks.
3.1.1 Linear Regression in a ML Context
You have already studied linear regression extensively in earlier courses of the program, so we will not discuss it in much detail. In machine learning, it is common to talk about weights \(w_i\) and a bias \(b\) instead of coefficients \(\beta_i\) and an intercept \(\beta_0\), i.e., the linear regression model would be written as
\[y_n = b + \sum_{i=1}^m w_i x_{i,n} + \varepsilon_n \qquad n=1,\ldots,N\]
where \(w_i\) are the weights, \(b\) is the bias, and \(N\) is the sample size. The weights and the bias are found by minimizing the empirical risk function, here the mean squared error (MSE) loss, which measures how well the model fits the data.
\[\text{MSE}(y,x; w, b) = \frac{1}{N}\sum_{n=1}^N (y_n - \hat{y}_n)^2\]
where \(y_n\) is the true value, \(\hat{y}_n\) is the predicted (or fitted) value for observation \(n\).
In the case of linear regression, there is a closed-form solution for the weights and the bias that minimize the MSE. In many other machine learning models, however, there is no closed-form solution and the weights and biases have to be found numerically. One can think of this as the machine learning algorithm automatically moving the slope slider in Figure 3.2 until the loss is minimized (i.e., the red dot sits at the lowest point of the loss curve) and the model fits the data as well as possible.
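To make this concrete, here is a minimal sketch of such a numerical search, using plain gradient descent on the MSE for a single-slope model in NumPy. The data is made up for illustration and is not the data shown in Figure 3.2.
import numpy as np
# Hypothetical toy data: y is roughly 2*x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
def mse(w):
    return np.mean((y - w * x) ** 2)
# Gradient descent on the single weight w (no bias, for simplicity)
w = 0.0                  # initial slope
learning_rate = 0.1
for _ in range(200):
    grad = -2 * np.mean(x * (y - w * x))  # derivative of the MSE with respect to w
    w -= learning_rate * grad             # move the "slider" downhill
print(w, mse(w))         # the slope ends up close to 2
Each step moves the slope in the direction that reduces the MSE, so after enough steps the slope settles near the value the closed-form solution would give.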
3.1.2 Logistic Regression in a ML Context
Logistic regression is a widely used classification model \(p(y|x;w,b)\) where \(x\in \mathbb{R}^m\) is an input vector, and \(y\in\{0,1,\ldots,C\}\) is a class label. We will focus on the binary case, meaning that \(y\in\{0,1\}\) but it is also possible to extend this to more than two classes. The probability that \(y_n\) is equal to \(1\) for observation \(n\) is given by \[p(y_n=1|x_n;w,b)= \frac{1}{1+\exp(-b-\sum_{i=1}^m w_i x_{i,n})}\]
where \(w=[w_1, \ldots, w_m]'\in \mathbb{R}^m\) is a weight vector, and \(b\) is a bias term. Combining the probabilities for each observation \(n\), we can write the likelihood function as
\[\mathcal{L}(w,b) = \prod_{n=1}^N p(y_n=1|x_n;w,b)^{y_n} \left(1-p(y_n=1|x_n;w,b)\right)^{1-y_n}\]
or taking the natural logarithm of the likelihood function, we get the log-likelihood function
\[\log\mathcal{L}(w,b) = \sum_{n=1}^N y_n \log p(y_n=1|x_n;w,b) + (1-y_n) \log \left(1-p(y_n=1|x_n;w,b)\right).\]
To find the weights and biases, we need to numerically maximize the log-likelihood function (or minimize \(-\log\mathcal{L}(w,b)\)).
Adding a classification threshold \(t\) to a logistic regression yields a decision rule of the form
\[\hat{y}=1 \Leftrightarrow p(y=1|x;w,b) > t,\]
i.e., the model predicts that \(y=1\) if \(p(y=1|x;w,b) > t\).
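As an illustration, the following sketch evaluates the negative log-likelihood on simulated data for a grid of values of \(w\) (with \(b=0\)), picks the best one, and then applies the decision rule with a threshold \(t\). The data and the grid search are made up for illustration; in practice a proper numerical optimizer would be used.
import numpy as np
# Simulated data with a single feature
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(int)
def neg_log_likelihood(w, b):
    p = 1.0 / (1.0 + np.exp(-(b + w * x)))   # p(y=1 | x; w, b)
    eps = 1e-12                               # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
# Crude grid search over w, keeping b fixed at 0
w_grid = np.linspace(-5, 5, 201)
w_hat = w_grid[np.argmin([neg_log_likelihood(w, 0.0) for w in w_grid])]
# Decision rule with classification threshold t
t = 0.5
p_hat = 1.0 / (1.0 + np.exp(-w_hat * x))
y_hat = (p_hat > t).astype(int)
print(w_hat, np.mean(y_hat == y))             # estimated weight and in-sample accuracy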
Warning: Terminology: Regression vs. Classification
Do not get confused by the fact that it is called logistic regression but is used for classification tasks. Logistic regression provides an estimate of the probability that \(y=1\) for a given \(x\), i.e., an estimate of \(p(y=1|x;w,b)\). To turn this into a classification model, we also need a classification threshold value for \(p(y=1|x;w,b)\) above which we classify an observation as \(y=1\).
Figure 3.4 shows an interactive example of a logistic regression model. The left-hand side shows the data points and the regression line. The right-hand side shows the log-likelihood function with the red dot showing the value of the log-likelihood for the current value of \(w\). The goal is to find the weight \(w\) in the regression line that maximizes the log-likelihood function (we assumed \(b=0\) for simplicity).
If you enable the classification threshold \(t\), a data point is shown in dark blue if \(p(y=1|x;w,b) > t\); otherwise, it is shown in light blue. Note how the value of the threshold affects the classification of the data points in the middle. Essentially, each classification threshold gives us a different classification model. But how do we choose the classification threshold? This is a topic that we will discuss in the next section.
Logistic regression belongs to the class of generalized linear models with logit as the link function. We could write
\[\log\left(\frac{p}{1-p}\right)= b+\sum_{i=1}^m w_i x_{i,n}\]
where \(p=p(y_n=1|x_n;w,b)\), which separates the linear part on the right-hand side from the logit on the left-hand side.
This linearity also shows up in the linear decision boundary produced by a logistic regression in Figure 3.6. A decision boundary shows how a machine-learning model separates the different classes in our data, i.e., how it would classify an arbitrary combination of \((x_1,x_2)\). This linearity of the decision boundary can pose a problem if the two classes are not linearly separable, as in Figure 3.6. We can remedy this issue by including higher-order terms for \(x_1\) and \(x_2\) such as \(x_1^2\) or \(x_2^3\), which is a form of feature engineering. However, there are many forms of non-linearity that the decision boundary can take and we cannot try all of them. You might know the following line from Tolstoy's Anna Karenina:
“Happy families are all alike; every unhappy family is unhappy in its own way.”
In the context of non-linear functions, people sometimes say
“Linear functions are all alike; every non-linear function is non-linear in its own way.”
During the course, we will learn more advanced machine-learning techniques that can produce non-linear decision boundaries without the need for feature engineering.
3.1.3 K-Nearest Neighbors
The K-Nearest Neighbors (KNN) algorithm is a simple and intuitive method for classification and regression. It uses the \(K\) nearest neighbors of a data point to make a prediction. For example, in the case of a regression task, the prediction \(\hat{y}\) for a new data point \(x\) is
\[\hat{y} = \frac{1}{K}\sum_{x_i\in N_K(x)} y_i\]
i.e., the average of the targets \(y_i\) of the \(K\) nearest neighbors \(N_K(x)\) of \(x\). In the case of a classification task, the prediction \(\hat{y}\) is the majority class among the \(K\) nearest neighbors of \(x\).
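The following sketch implements this prediction rule directly on hypothetical data: it computes squared Euclidean distances to find the \(K\) nearest neighbors and then averages their targets (regression) or takes a majority vote (classification).
import numpy as np
def knn_predict(X_train, y_train, x_new, K=5, task="regression"):
    # Squared Euclidean distance from x_new to every training point
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    nearest = np.argsort(dists)[:K]           # indices of the K closest points
    if task == "regression":
        return y_train[nearest].mean()        # average of the neighbors' targets
    # Classification: majority vote among the neighbors
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]
# Hypothetical usage with random data
rng = np.random.default_rng(2)
X_train = rng.normal(size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
print(knn_predict(X_train, y_train, np.array([0.2, -0.1]), K=5, task="classification"))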

Figure 3.7 shows an example of the K-Nearest Neighbors algorithm applied to the dataset that was used for the logistic regression example in Figure 3.4. For each point in the feature space, the KNN algorithm looks at the \(K=5\) nearest neighbors and classifies the point based on the majority class of these neighbors. Note that the color of a data point shows its true class, while the shaded area shows the classification made by the KNN algorithm.
The decision boundary is highly non-linear and can adapt to the shape of the data. However, this flexibility comes at a cost. The KNN algorithm can be computationally expensive, especially for large datasets, since it needs to compute the distance between the new data point and all other data points in the training set. Additionally, the KNN algorithm is sensitive to the choice of \(K\) and the distance metric used to compute the nearest neighbors.
In the example above, we used the Euclidean distance to determine the nearest neighbors. The squared Euclidean distance between two points \(x_i\) and \(x_j\) in \(p\)-dimensional space is defined as
\[d(x_i, x_j) = \sum_{k=1}^p (x_{ik} - x_{jk})^2=\lVert x_i - x_j \rVert^2\]
where \(x_{ik}\) and \(x_{jk}\) are the \(k\)-th feature of the \(i\)-th and \(j\)-th observation in our dataset, respectively.
Note that the Euclidean distance can be used in both regression and classification problems since it only applies to the features \(x\) and not the target variable \(y\). However, if the features include categorical variables, we need to use a different distance metric that can accommodate such variables.
3.2 Model Evaluation
Suppose our machine learning model has learned the weights and biases that minimize the loss function. How do we know if the model is any good? In this section, we will discuss how to evaluate regression and classification models.
3.2.1 Regression Models
In the case of regression models, we can use the mean squared error (MSE) as a measure of how well the model fits the data. The MSE is defined as
\[\text{MSE} = \frac{1}{N}\sum_{n=1}^N (y_n - \hat{y}_n)^2,\]
where \(y_n\) is the true value, \(\hat{y}_n\) is the predicted value for observation \(n\) and \(N\) is the sample size. A low MSE indicates a good fit, while a high MSE indicates a poor fit. In the ideal case, the MSE is zero, meaning that the model perfectly fits the data. Related to the MSE is the root mean squared error (RMSE), which is the square root of the MSE
\[\text{RMSE} = \sqrt{\text{MSE}}.\]
The RMSE is in the same unit as the target variable \(y\) and is easier to interpret than the MSE.
Regression models are sometimes also evaluated based on the coefficient of determination \(R^2\). The \(R^2\) is defined as
\[R^2 = 1 - \frac{\sum_{n=1}^N (y_n - \hat{y}_n)^2}{\sum_{n=1}^N (y_n - \bar{y})^2},\]
where \(\bar{y}\) is the mean of the true values \(y_n\). The \(R^2\) is a measure of how well the model fits the data compared to a simple model that predicts the mean of the true values for all observations. The \(R^2\) can take values between \(-\infty\) and \(1\). A value of \(1\) indicates a perfect fit, while a value of \(0\) indicates that the model does not perform better than the simple model that predicts the mean of the true values for all observations. Note that the \(R^2\) is a normalized version of the MSE
\[R^2 = 1 - \frac{N\times\text{MSE}}{\sum_{n=1}^N (y_n - \bar{y})^2}.\]
Thus, we would rank models based on the \(R^2\) in the same way as we would rank them based on the MSE or the RMSE.
There are many more metrics, but at this stage we will only look at one more: the mean absolute error (MAE). The MAE is defined as
\[\text{MAE} = \frac{1}{N}\sum_{n=1}^N |y_n - \hat{y}_n|.\]
The MAE is the average of the absolute differences between the true values and the predicted values. Note that the MAE does not penalize large errors as much as the MSE does.
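As a small illustration, all of these metrics can be computed in a few lines of NumPy (the true and predicted values below are made up); scikit-learn also provides them as ready-made functions.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Hypothetical true and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(y_true - y_pred))
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mse, rmse, mae, r2)
# The same metrics computed with scikit-learn
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))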
3.2.2 Classification Models
In the case of classification models, we need different metrics to evaluate the performance of the model. We will discuss some of the most common metrics in the following subsections.
Basic Metrics
A key measure to evaluate a classification model, for both binary and multiclass classification, is how often it predicts the correct class. This is called the accuracy of a model \[\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}.\]
Related to this, one could also compute the misclassification rate \[\text{Misclassification Rate} = \frac{\text{Number of incorrect predictions}}{\text{Total number of predictions}}.\]
While these measures are probably the most intuitive measures to assess the performance of a classification model, they can be misleading in some cases. For example, if we have a dataset with 95% of the observations in class 1 and 5% in class 0, a model that always predicts \(y=1\) (class 1) would have an accuracy of 95%. However, this model would not be very useful.
Confusion Matrices
In this and the following subsection, we focus on binary classification problems.
Let \(\hat{y}\) denote the predicted class and \(y\) the true class. In a binary classification problem, we can make two types of errors. First, we can make an error because we predicted \(\hat{y}=1\) when \(y=0\), which is called a false positive (or a “false alarm”). Sometimes this is also called a type I error. Second, we can make an error because we predicted \(\hat{y}=0\) when \(y=1\), which is called a false negative (or a “missed detection”). Sometimes this is referred to as a type II error.
We can summarize the predictions of a classification model in a confusion matrix as seen in Figure 3.8. The confusion matrix is a \(2\times 2\) matrix that shows the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) of a binary classification model.

There is a tradeoff between the two types of errors. For example, you could get fewer false negatives by predicting \(\hat{y}=1\) more often, but this would increase the number of false positives. In the extreme case, if you only predict \(\hat{y}=1\) for all observations, you would have no false negatives at all. However, you would also have no true negatives making the model of questionable usefulness.
Warning: Confusion Matrix: Dependence on Classification Threshold \(t\)
The number of true positives, true negatives, false positives, and false negatives in the confusion matrix depends on the classification threshold \(t\).
Note that we can compute the accuracy measure as a function of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) \[\text{Accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}},\]
while the misclassification rate is given by \[\text{Misclassification Rate} = \frac{\text{FP}+\text{FN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}.\]
Another useful measure that can be derived from the confusion matrix is the precision. It measures the fraction of positive predictions that were actually correct, i.e., \[\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}\]
The true positive rate (TPR) or recall or sensitivity measures the fraction of actual positives that were correctly predicted, i.e. \[\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}.\]
Analogously, true negative rate (TNR) or specificity measures the fraction of actual negatives that were correctly predicted, i.e., \[\text{TNR} = \frac{\text{TN}}{\text{FP}+\text{TN}}\]
Finally, the false positive rate (FPR) measures the fraction of actual negatives that were incorrectly predicted to be positive, i.e., \[\text{FPR} = 1-\text{TNR} = \frac{\text{FP}}{\text{FP}+\text{TN}}\]
Note that all of these measures can be computed for a given classification threshold \(t\). They capture different aspects of the quality of the predictions of a classification model.
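For example, starting from hypothetical confusion-matrix counts at some threshold \(t\), all of these measures follow directly:
# Hypothetical confusion-matrix counts for a given threshold t
TP, TN, FP, FN = 80, 900, 30, 20
accuracy          = (TP + TN) / (TP + TN + FP + FN)
misclassification = (FP + FN) / (TP + TN + FP + FN)
precision         = TP / (TP + FP)
recall            = TP / (TP + FN)   # true positive rate (sensitivity)
tnr               = TN / (TN + FP)   # true negative rate (specificity)
fpr               = FP / (FP + TN)   # false positive rate = 1 - TNR
print(accuracy, misclassification, precision, recall, tnr, fpr)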
Note: Multiclass Classification
In the case of multiclass classification, the confusion matrix is a \(K\times K\) matrix, where \(K\) is the number of classes. The diagonal elements of the confusion matrix represent the number of correct predictions for each class, while the off-diagonal elements represent the number of incorrect predictions.
Note that we can binarize multiclass classification problems, which allows us to use the same metrics as in binary classification. Two such binarization schemes, both illustrated in the short sketch after this list, are
- One-vs-Rest (or One-vs-All): In this scheme, we train \(K\) binary classifiers, one for each class to distinguish it from all other classes. We can then use the class with the highest score as the predicted class for a new observation.
- One-vs-One: In this scheme, we train \(K(K-1)/2\) binary classifiers, one for each pair of classes. We can then use a majority vote to determine the class of a new observation.
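Both schemes are available in scikit-learn. The sketch below uses the classic three-class iris dataset purely as a stand-in example, since the dataset used later in this chapter is binary.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
X_iris, y_iris = load_iris(return_X_y=True)   # toy dataset with K = 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)  # K binary classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)   # K(K-1)/2 binary classifiers
print(len(ovr.estimators_), len(ovo.estimators_))  # 3 and 3 for K = 3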
Receiver Operating Characteristic (ROC) Curves and Area Under the Curve (AUC)
Figure 3.9 shows a Receiver Operating Characteristic (ROC) curve which is a graphical representation of the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) for different classification thresholds. The ROC curve is a useful tool to visualize the performance of a classification model. The diagonal line in the ROC curve represents a random classifier. A classifier that is better than random will have a ROC curve above the diagonal line. The closer the ROC curve is to the top-left corner, the better the classifier.

The Area Under the Curve (AUC) of the ROC curve is a measure to compare different classification models. The AUC is a value between 0 and 1, where a value of 1 indicates a perfect classifier and a value of 0.5 indicates a random classifier. Figure 3.10 shows the AUC of a classifier as the shaded area under the ROC curve. Note that the AUC summarizes the ROC curve, which itself represents the quality of predictions of our classification model at different thresholds, in a single number.
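To make the connection between thresholds, the ROC curve, and the AUC explicit, the sketch below sweeps over thresholds on simulated scores, records the (FPR, TPR) pairs, and approximates the AUC with the trapezoidal rule. In the Python section we will simply use scikit-learn's roc_curve and roc_auc_score.
import numpy as np
# Simulated true labels and predicted scores (purely illustrative)
rng = np.random.default_rng(3)
y_true  = rng.integers(0, 2, size=500)
y_score = np.clip(0.6 * y_true + 0.3 * rng.normal(size=500) + 0.2, 0, 1)
# Sweep over thresholds and record the (FPR, TPR) pair for each
thresholds = np.linspace(0, 1, 101)
tpr = np.array([np.mean(y_score[y_true == 1] > t) for t in thresholds])
fpr = np.array([np.mean(y_score[y_true == 0] > t) for t in thresholds])
# Approximate the area under the ROC curve with the trapezoidal rule
order = np.argsort(fpr)
fpr_s, tpr_s = fpr[order], tpr[order]
auc = np.sum(np.diff(fpr_s) * (tpr_s[1:] + tpr_s[:-1]) / 2)
print(auc)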

3.3 Generalization and Overfitting
Typically, we are not just interested in having a good fit for the dataset on which we train a classification (or regression) model; after all, we already have the actual classes or realizations of the target variable in that dataset. What we are really interested in is whether the classification or regression model generalizes to new data.
However, since the models that we are using are highly flexible, it can be the case that we have a very high accuracy during the training of our model but it does not provide good predictions when used on new data. This situation is called overfitting: we have a very good fit in our training dataset, but predictions for new data inputs are bad.



Figure 3.11 provides examples of overfitting and underfitting. The blue dots represent the training data \(x\) and \(y\), the orange curve represents the fit of the model to the training data. The left plot shows an example of underfitting: the model is too simple to capture the underlying structure of the data. The middle plot shows a “good fit”: the model captures the underlying structure of the data. The right plot shows an example of overfitting: the model is too complex and captures the noise in the data.
Figure 3.13 shows an example where the true data generating process (DGP) is a quadratic function but there is some added noise when we sample from the DGP. The blue dots represent the training data, the orange curve represents the fitted polynomial of order \(k\). For example, if \(k=3\), we run a regression of \(y\) on a constant, \(x\), \(x^2\), and \(x^3\). The left panel shows the training data and the fitted polynomial, the middle panel shows the training RMSE as a function of the polynomial order \(k\), and the right panel shows the test RMSE as a function of the polynomial order \(k\). As we increase the polynomial order (a measure of complexity in our model), the training RMSE decreases (we use more and more flexible models to fit the data), but the test RMSE eventually increases, indicating overfitting.
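The pattern in Figure 3.13 can be reproduced in a few lines. The sketch below assumes a quadratic DGP with additive Gaussian noise (the exact DGP and sample sizes used for the figure may differ) and prints the training and test RMSE for increasing polynomial orders.
import numpy as np
rng = np.random.default_rng(4)
def sample(n):
    # Assumed quadratic DGP with additive noise
    x = rng.uniform(-2, 2, n)
    return x, 1.0 + 0.5 * x - x**2 + rng.normal(scale=0.5, size=n)
x_train, y_train = sample(30)
x_test,  y_test  = sample(1000)
for k in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, deg=k)          # fit a polynomial of order k
    rmse_train = np.sqrt(np.mean((y_train - np.polyval(coeffs, x_train)) ** 2))
    rmse_test  = np.sqrt(np.mean((y_test  - np.polyval(coeffs, x_test)) ** 2))
    print(k, round(rmse_train, 3), round(rmse_test, 3))   # train RMSE keeps falling, test RMSE eventually rises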
3.3.1 Bias-Variance Tradeoff
The concepts of bias and variance are useful to understand the tradeoff between underfitting and overfitting. Suppose that data is generated from the true model \(Y=f(X)+\epsilon\), where \(\epsilon\) is a random error term such that \(\mathbb{E}[\epsilon]=0\) and \(\text{Var}[\epsilon]=\sigma^2\). Let \(\hat{f}(x)\) be the prediction of the model at \(x\). One can show that the expected prediction error (or generalization error) of a model can be decomposed into three parts
\[\text{EPE}(x_0) = \mathbb{E}[(Y-\hat{f}(x_0))^2|X=x_0] = \text{Bias}^2(\hat{f}(x_0)) + \text{Var}(\hat{f}(x_0)) + \sigma^2,\]
where \(\text{Bias}(\hat{f}(x_0)) = \mathbb{E}[\hat{f}(x_0)]-f(x_0)\) is the bias at \(x_0\), \(\text{Var}(\hat{f}(x_0)) = \mathbb{E}[\hat{f}(x_0)^2]-\mathbb{E}[\hat{f}(x_0)]^2\) is the variance at \(x_0\), and \(\sigma^2\) is the irreducible error, i.e., the error that cannot be reduced by any model. As model complexity increases, the bias tends to decrease, but the variance tends to increase. The following quote from Cornell lecture notes summarizes the bias-variance tradeoff well:
Variance: Captures how much your classifier changes if you train on a different training set. How “over-specialized” is your classifier to a particular training set (overfitting)? If we have the best possible model for our training data, how far off are we from the average classifier?
Bias: What is the inherent error that you obtain from your classifier even with infinite training data? This is due to your classifier being “biased” to a particular kind of solution (e.g. linear classifier). In other words, bias is inherent to your model.
Noise: How big is the data-intrinsic noise? This error measures ambiguity due to your data distribution and feature representation. You can never beat this, it is an aspect of the data.

Figure 3.14 shows the relationship between the model complexity and the prediction error. A more complex model can reduce the prediction error only up to a certain point. After this point, the model starts to overfit the training data (it learns noise in the data), and the prediction error for the test data (i.e., data not used for model training) increases. Ideally, we would like to find the model complexity that minimizes the prediction error for the test data. We have seen this behavior in Figure 3.13 where the test RMSE first decreases and then increases as we increase the polynomial order \(k\).
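The decomposition can be illustrated by simulation: draw many training sets from an assumed DGP, refit the model each time, and look at how the predictions at a fixed point \(x_0\) behave. The sketch below does this for polynomials of different orders; the DGP, noise level, and evaluation point are all made up for illustration.
import numpy as np
rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * x)    # assumed "true" function f(X)
sigma, x0 = 0.3, 0.5           # noise level and the point at which we evaluate bias and variance
def fit_and_predict(order):
    # Draw a fresh training set, fit a polynomial of the given order, predict at x0
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(scale=sigma, size=30)
    return np.polyval(np.polyfit(x, y, order), x0)
for order in (1, 3, 9):
    preds = np.array([fit_and_predict(order) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
    var = preds.var()                     # variance at x0
    print(order, round(bias2, 4), round(var, 4), round(bias2 + var + sigma**2, 4))
In this simulation, the low-order polynomial has a larger squared bias but a small variance, while the high-order polynomial has a small bias but a larger variance, in line with Figure 3.14.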
3.3.2 Regularization
One approach to avoid overfitting is to use regularization. Regularization adds a penalty term to the loss function that penalizes large weights. The most common regularization techniques are L1 regularization and L2 regularization. L1 regularization adds the sum of the absolute values of the weights to the loss function, while L2 regularization adds the sum of the squared weights to the loss function.
These techniques are applicable across a large range of ML models and depending on the type of model additional regularization techniques might be available. For example, in neural networks, dropout regularization is a common regularization technique that randomly removes a set of artificial neurons during training.
In the context of linear regressions, L1 regularization is also called LASSO regression. The loss function of LASSO regression is given by
\[\text{Loss} = \text{MSE}(y,x; w) + \lambda \sum_{i=1}^m |w_i|,\]
where \(\text{MSE}(y,x; w)\) refers to the mean squared error (the standard loss function of a linear regression) and \(\lambda\) is a hyperparameter that controls the strength of the regularization. Note that LASSO regression can also be used for feature selection, as it tends to set the weights of irrelevant features to zero. Figure 3.15 shows the LASSO regression loss for different levels of \(\lambda\).

An L2 regularization in a linear regression context is called a Ridge regression. Its loss function is given by
\[\text{Loss} = \text{MSE}(y,x; w) + \lambda \sum_{i=1}^m w_i^2.\]
We will have a closer look at regularization in the application sections. For now, it is important to understand that regularization works by constraining the weights of the model (i.e., keeping the weights small), which can help to avoid overfitting (which might require some weights to be very large). Figure 3.16 shows the Ridge regression loss for different levels of \(\lambda\). Note how the Ridge regression loss is smoother than the LASSO regression loss and that the weights are never set to exactly zero but just get closer and closer to zero.
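Both penalties are available in scikit-learn, which calls the regularization strength alpha rather than \(\lambda\). The sketch below uses simulated data in which only two of ten features matter and counts how many weights each method sets exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
# Simulated data: only the first two of ten features are relevant
rng = np.random.default_rng(6)
X_sim = rng.normal(size=(200, 10))
y_sim = 3 * X_sim[:, 0] - 2 * X_sim[:, 1] + rng.normal(scale=0.5, size=200)
for lam in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=lam).fit(X_sim, y_sim)
    ridge = Ridge(alpha=lam).fit(X_sim, y_sim)
    # Number of weights set exactly to zero by each method
    print(lam, np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
As \(\lambda\) increases, LASSO sets more and more weights exactly to zero, whereas Ridge only shrinks them towards zero.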

3.3.3 Training, Validation, and Test Datasets
Regularization, discussed in the previous section, is a method to directly prevent overfitting. Another approach to the issue is to adjust our evaluation procedure in a way that allows us to detect overfitting. To do this, we can split the dataset into several parts. The first option, shown in Figure 3.17, is to split the dataset into a training dataset and a test dataset. The training dataset is used to train the model, while the test dataset is used to evaluate the model. Why does this help to detect overfitting? If the model performs well on the training dataset but poorly on the test dataset, this is a sign of overfitting. If the model performs well on the test dataset, this is a sign that the model generalizes well to new data.
Note: Difference with Terminology in Econometrics/Statistics
In econometrics/statistics, it is more common to talk about in-sample and out-of-sample performance. The idea is the same: the in-sample performance is the performance of the model on the training dataset, while the out-of-sample performance is the performance of the model on the test dataset.

The second option shown in Figure 3.18 is to split the dataset into a training dataset, a validation dataset, and a test dataset. The training dataset is used to train the model, the validation dataset is used to tune the hyperparameters of the model, and the test dataset is used to evaluate the model.

Common splits are 70% training and 30% test, or 80% training and 20% test in Option A. In Option B, a common split is 70% training, 15% validation, and 15% test.
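With scikit-learn, both options can be obtained by applying train_test_split once (Option A) or twice (Option B). A minimal sketch with a hypothetical dataset:
import numpy as np
from sklearn.model_selection import train_test_split
# Hypothetical dataset with 1000 observations and 5 features
rng = np.random.default_rng(7)
X_sim = rng.normal(size=(1000, 5))
y_sim = rng.integers(0, 2, size=1000)
# Option B: 70% training, 15% validation, 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X_sim, y_sim, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 700, 150, 150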
3.3.4 Cross-Validation
Another approach to detect overfitting is to use cross-validation. There are different types of cross-validation but k-fold cross-validation is probably the most common. In k-fold cross-validation, shown in Figure 3.19, the dataset is split into \(k\) parts (called folds). The model is trained on \(k-1\) folds and evaluated on the remaining fold. This process is repeated \(k\) times, each time using a different fold as the test fold. The performance of the model is then averaged over the \(k\) iterations. In practice, \(k=10\) is a common choice. If we set \(k=N\), where \(N\) is the number of observations in the dataset, we call this leave-one-out cross-validation or LOOCV.

The advantage of cross-validation is that it allows us to use all the data for training and testing. The disadvantage is that it is computationally more expensive than a simple training-test split.
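The mechanics can be written out explicitly with scikit-learn's KFold; the data below is simulated, and in the Python section we will use the one-line cross_val_score shortcut instead.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Simulated dataset
rng = np.random.default_rng(8)
X_sim = rng.normal(size=(500, 3))
y_sim = (X_sim[:, 0] + X_sim[:, 1] > 0).astype(int)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_sim):
    model = LogisticRegression().fit(X_sim[train_idx], y_sim[train_idx])            # train on k-1 folds
    scores.append(accuracy_score(y_sim[test_idx], model.predict(X_sim[test_idx])))  # evaluate on the held-out fold
print(np.mean(scores))   # average performance over the k folds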
3.4 Python Implementation
Let’s have a look at how to implement a logistic regression model in Python. First, we need to import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, recall_score, precision_score, roc_curve
pd.set_option('display.max_columns', 50) # Display up to 50 columns
Let’s download the dataset automatically, unzip it, and place it in a folder called data if you haven’t done so already
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import os.path
# Check if the file exists
if not os.path.isfile('data/card_transdata.csv'):
print('Downloading dataset...')
# Define the dataset to be downloaded
zipurl = 'https://www.kaggle.com/api/v1/datasets/download/dhanushnarayananr/credit-card-fraud'
# Download and unzip the dataset in the data folder
with urlopen(zipurl) as zipresp:
with ZipFile(BytesIO(zipresp.read())) as zfile:
zfile.extractall('data')
print('DONE!')
else:
print('Dataset already downloaded!')
Then, we can load the data into a DataFrame using the read_csv function from the pandas library
df = pd.read_csv('data/card_transdata.csv')
Note that it is common to call this variable df, which is short for DataFrame.
This is a dataset of credit card transactions from Kaggle.com. The target variable \(y\) is fraud, which indicates whether the transaction is fraudulent or not. The other variables are the features \(x\) of the transactions.
3.4.1 Data Exploration & Preprocessing
The first step whenever you load a new dataset is to familiarize yourself with it. You need to understand what the variables represent, what the target variable is, and what the data looks like. This is called data exploration. Depending on the dataset, you might need to preprocess it (e.g., check for missing values and duplicates, or create new variables) before you can use it to train a machine-learning model. This is called data preprocessing.
Basic Dataframe Operations
Let’s see how many rows and columns the dataset has
df.shape
The dataset has 1 million rows (observations) and 8 columns (variables)! Now, let’s have a look at the first few rows of the dataset with the head() method
df.head().T
If you would like to see more entries in the dataset, you can use the head() method with an argument corresponding to the number of rows, e.g.,
df.head(20)
Note that analogously you can also use the tail() method to see the last few rows of the dataset.
We can also check what the variables in our dataset are called
df.columns
and the data types of the variables
df.dtypes
In this case, all our variables are floating-point numbers (float). This means that they are numbers that have a fractional part such as 1.5, 3.14, etc. The number after float, 64 in this case, refers to the number of bits that are used to represent this number in the computer’s memory. With 64 bits you can store more decimals than you could with, for example, 32, meaning that the results of computations can be more precise. But for the topics discussed in this course, this is not very important. Other common data types that you might encounter are integers (int) such as 1, 3, 5, etc., or strings (str) such as 'hello', 'world', etc.
Let’s dig deeper into the dataset and see some summary statistics
df.describe().T
With the describe() method we can see the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of each variable in the dataset.
Checking for Missing Values and Duplicated Rows
It is also important to check for missing values and duplicated rows in the dataset. Missing values can be problematic for machine learning models, as they might not be able to handle them. Duplicated rows can also be problematic, as they might introduce bias in the model.
We can check for missing values (NA) that are encoded as None or numpy.NaN (Not a Number) with the isna() method. This method returns a boolean DataFrame (i.e., a DataFrame with True and False values) with the same shape as the original DataFrame, where True values indicate missing values.
df.isna()
or to make it easier to see, we can sum the number of missing values for each variable
df.isna().sum()
Luckily, there seem to be no missing values. However, you need to be careful! Sometimes missing values are encoded as empty strings '' or numpy.inf (infinity), which are not considered missing values by the isna() method. If you suspect that this might be the case, you need to make additional checks.
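For example, two quick checks one could run here are to look for infinities in the numeric columns and for empty strings in any string (object) columns; since all columns in this dataset are floats, the second check returns nothing.
# Infinities are not flagged by isna()
np.isinf(df.select_dtypes(include=[np.number])).sum()
# Empty strings in string (object) columns, if any
(df.select_dtypes(include=['object']) == '').sum()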
As an alternative, we could also look at the info() method, which provides a summary of the DataFrame, including the number of non-null values in each column. If there are missing values, the number of non-null values will be less than the number of rows in the dataset.
df.info()
We can also check for duplicated rows with the duplicated() method.
df.loc[df.duplicated()]
Luckily, there are also no duplicated rows.
Data Visualization
Let’s continue with some data visualization. We can use the matplotlib library to create plots. We have already imported the library at the beginning of the notebook.
Let’s start by plotting the distribution of the target variable fraud which can only take values zero and one. We can type
df['fraud'].value_counts()
to get the count of each value. We can also use the normalize=True argument to get the fraction of observations instead of the count
df['fraud'].value_counts(normalize=True)
We can then plot it as follows
df['fraud'].value_counts(normalize=True).plot(kind='bar')
plt.xlabel('Fraud')
plt.ylabel('Fraction of Observations')
plt.title('Distribution of Fraud')
ax = plt.gca()
ax.set_ylim([0.0, 1.0])
plt.show()
Alternatively, we can plot it as a pie chart
df.value_counts("fraud").plot.pie(autopct = "%.1f")
plt.ylabel('')
plt.show()
Our dataset seems to be quite imbalanced, as only 8.7% of the transactions are fraudulent. This is a common problem in fraud detection datasets, as fraudulent transactions are usually very rare. We will need to keep this in mind when evaluating our machine learning model: the accuracy measure will be very high even for bad models, as the model can just predict that all transactions are not fraudulent and still get an accuracy of 91.3%.
Let’s look at some distributions. Most of the variables in the dataset are binary (0 or 1) variables. However, we also have some continuous variables. Let’s plot the distribution of the variable ratio_to_median_purchase_price, which is a continuous variable.
df['ratio_to_median_purchase_price'].hist(bins = 50, range=[0, 30])
plt.xlabel('Ratio to Median Purchase Price')
plt.ylabel('Count')
plt.show()
We can also plot the distribution of the variable ratio_to_median_purchase_price by the target variable fraud to see if there are any differences between fraudulent and non-fraudulent transactions
fig, ax = plt.subplots(1,2)
df['ratio_to_median_purchase_price'].hist(bins = 50, range=[0, 30], by=df['fraud'], ax = ax)
ax[0].set_xlabel('Ratio to Median Purchase Price')
ax[1].set_xlabel('Ratio to Median Purchase Price')
ax[0].set_ylabel('Count')
ax[0].set_title('No Fraud')
ax[1].set_title('Fraud')
plt.show()
There are indeed some differences between fraudulent and non-fraudulent transactions. For example, fraudulent transactions seem to have a higher ratio to the median purchase price, which is expected as fraudsters might try to make large transactions to maximize their profit.
We can also look at the correlation between the variables in the dataset. The correlation is a measure of how two variables move together
df.corr() # Pearson correlation (for linear relationships)
df.corr('spearman') # Spearman correlation (for monotonic relationships)
This is still a bit hard to read. We can visualize the correlation matrix with a heatmap using the Seaborn library, which we have already imported at the beginning of the notebook.
corr = df.corr('spearman')
cmap = sns.diverging_palette(10, 255, as_cmap=True) # Create a color map
mask = np.triu(np.ones_like(corr, dtype=bool)) # Create a mask to only show the lower triangle of the matrix
sns.heatmap(corr, cmap=cmap, vmax=1, center=0, mask=mask) # Create a heatmap of the correlation matrix (Note: vmax=1 makes sure that the color map goes up to 1 and center=0 are used to center the color map at 0)
plt.show()
Note how ratio_to_median_purchase_price is positively correlated with fraud, which is expected as we saw in the previous plot that fraudulent transactions have a higher ratio to the median purchase price. Furthermore, used_chip and used_pin_number are negatively correlated with fraud, which makes sense as transactions, where the chip or the pin is used, are supposed to be more secure.
We can also plot boxplots to visualize the distribution of the variables
selector = ['distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price'] # Select the variables we want to plot
plt.figure()
ax = sns.boxplot(data = df[selector], orient = 'h')
ax.set(xscale = "log") # Set the x-axis to a logarithmic scale to better visualize the data
plt.show()
Boxplots are a good way to visualize the distribution of a variable, as they show the median, the interquartile range, and the outliers. Each of the distributions shown in the boxplots above has a long right tail, which explains the large number of outliers. However, you have to be careful: you cannot just remove these outliers since these are likely to be fraudulent transactions.
Let’s see how many fraudulent transactions we would remove if we blindly remove the outliers according to the interquartile range
# Compute the interquartile range
Q1 = df['ratio_to_median_purchase_price'].quantile(0.25)
Q3 = df['ratio_to_median_purchase_price'].quantile(0.75)
IQR = Q3 - Q1
# Identify outliers based on the interquartile range
threshold = 1.5
outliers = df[(df['ratio_to_median_purchase_price'] < Q1 - threshold * IQR) | (df['ratio_to_median_purchase_price'] > Q3 + threshold * IQR)]
# Count the number of fraudulent transactions among our selected outliers
outliers['fraud'].value_counts()
df['fraud'].value_counts()
53092 of 87403 fraudulent transactions (more than half!) would be removed if we had blindly removed the outliers according to the interquartile range. This is a significant number of observations, which would likely hurt the performance of our machine-learning model. Therefore, we should not remove these outliers; doing so would also make the imbalance of our dataset even worse.
Splitting the Data into Training and Test Sets
Before we can train a machine learning model, we need to split our dataset into a training set and a test set.
X = df.drop('fraud', axis=1) # All variables except `fraud`
y = df['fraud'] # Only our fraud variable
The training set is used to train the model, while the test set is used to evaluate the model. We will use the train_test_split function from the sklearn.model_selection module to split our dataset. We will use 70% of the data for training and 30% for testing. We will also set the stratify argument to y to make sure that the distribution of the target variable is the same in the training and test sets. Otherwise, we might randomly not have any fraudulent transactions in the test set, which would make it impossible to correctly evaluate our model.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size = 0.3, random_state = 42)
Scaling Features
To improve the performance of our machine learning model, we should scale the features. This is especially important for models that are sensitive to the scale of the features, such as logistic regression. We will use the StandardScaler class from the sklearn.preprocessing module to scale the features. The StandardScaler class scales the features so that they have a mean of 0 and a standard deviation of 1. Since we don’t want to scale features that are binary (0 or 1), we will define a small function that scales only the features that we want
def scale_features(scaler, df, col_names, only_transform=False):
# Extract the features we want to scale
features = df[col_names]
# Fit the scaler to the features and transform them
if only_transform:
features = scaler.transform(features.values)
else:
features = scaler.fit_transform(features.values)
# Replace the original features with the scaled features
df[col_names] = features
Then, we need to run the function
col_names = ['distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price']
scaler = StandardScaler()
scale_features(scaler, X_train, col_names)
scale_features(scaler, X_test, col_names, only_transform=True)
Note that we only fit the scaler to the training set and then use it to transform both the training and test sets. This ensures that the same feature values are transformed in the same way in both sets. If we fit the scaler to the test data as well, information from the test set would leak into the preprocessing step, which would distort the evaluation of the model.
Note: Mini-Exercise
Try switching to MinMaxScaler instead of StandardScaler and see how it affects the performance of the model. MinMaxScaler scales the features so that they are between 0 and 1.
3.4.2 Implementing Logistic Regression
Now that we have explored and preprocessed our dataset, we can move on to the next step: training a machine learning model. We will use a logistic regression model to predict whether a transaction is fraudulent or not.
Using the LogisticRegression class from the sklearn.linear_model module, fitting the model to the data is straightforward using the fit method
clf = LogisticRegression().fit(X_train, y_train)
We can then use the predict method to predict the class of the test set
clf.predict(X_test.head(5))
The actual classes of the first five observations in the test dataset are
y_test.head(5)
This seems to match quite well. Let’s have a look at different performance metrics
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"ROC AUC: {roc_auc_score(y_test, y_proba[:, 1])}")
As expected, the accuracy is quite high since we do not have many fraudulent transactions. Recall that the precision (\(\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}\)) is the fraction of correctly predicted fraudulent transactions among all transactions predicted to be fraudulent. The recall (\(\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}\)) is the fraction of correctly predicted fraudulent transactions among the actual fraudulent transactions. The ROC AUC is the area under the curve for the receiver operating characteristic (ROC) curve.
# Compute the ROC curve
y_proba = clf.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:,1])
# Plot the ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.show()
The confusion matrix for the test set can be computed as follows
conf_mat = confusion_matrix(y_test, y_pred, labels=[1, 0]).transpose() # Transpose the sklearn confusion matrix to match the convention in the lecture
conf_mat
We can also plot the confusion matrix as a heatmap
sns.heatmap(conf_mat, annot=True, cmap='Blues', fmt='g', xticklabels=['Fraud', 'No Fraud'], yticklabels=['Fraud', 'No Fraud'])
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.show()
As you can see, we have mostly true negatives and true positives. However, there is still a significant number of false negatives, which means that we are missing fraudulent transactions, and a significant number of false positives, which means that we are predicting transactions as fraudulent that are not fraudulent.
If we would like to use a threshold other than 0.5 to predict the class of the test set, we can do so as follows
# Alternative threshold
threshold = 0.1
# Predict the class of the test set
y_pred_alt = (y_proba[:, 1] >= threshold).astype(int)
# Show the performance metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred_alt)}")
print(f"Precision: {precision_score(y_test, y_pred_alt)}")
print(f"Recall: {recall_score(y_test, y_pred_alt)}")
Setting a lower threshold increases the recall but decreases the precision. This is because we are more likely to predict a transaction as fraudulent, which increases the number of true positives but also the number of false positives.
What the correct threshold is depends on the problem at hand. For example, if the cost of missing a fraudulent transaction is very high, you might want to set a lower threshold to increase the recall. If the cost of falsely predicting a transaction as fraudulent is very high, you might want to set a higher threshold to increase the precision.
We can also plot the performance metrics for different thresholds
N = 50
thresholds_array = np.linspace(0.0, 0.999, N)
accuracy_array = np.zeros(N)
precision_array = np.zeros(N)
recall_array = np.zeros(N)
# Compute the performance metrics for different thresholds
for ii, thresh in enumerate(thresholds_array):
y_pred_alt_tmp = (y_proba[:, 1] > thresh).astype(int)
accuracy_array[ii] = accuracy_score(y_test, y_pred_alt_tmp)
precision_array[ii] = precision_score(y_test, y_pred_alt_tmp)
recall_array[ii] = recall_score(y_test, y_pred_alt_tmp)
# Plot the performance metrics
plt.plot(thresholds_array, accuracy_array, label='Accuracy')
plt.plot(thresholds_array, precision_array, label='Precision')
plt.plot(thresholds_array, recall_array, label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.legend()
plt.show()
3.4.3 Implementing K-Nearest Neighbors
As an alternative to logistic regression, we can also use a K-Nearest Neighbors (KNN) classifier. The KNN classifier is a simple and intuitive machine learning model that classifies a new observation based on the classes of its k-nearest neighbors in the training set.
Let’s restrict ourselves to variables that are continuous for the KNN classifier
continuous_vars = ['distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price']
X_train_knn = X_train[continuous_vars]
X_test_knn = X_test[continuous_vars]
We can create a KNN classifier with k=5 and fit it to the training data
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_knn, y_train)
y_pred_knn = knn.predict(X_test_knn)
We can then evaluate the performance of the KNN classifier using the same performance metrics as before
print(f"Accuracy: {accuracy_score(y_test, y_pred_knn)}")
print(f"Precision: {precision_score(y_test, y_pred_knn)}")
print(f"Recall: {recall_score(y_test, y_pred_knn)}")
print(f"ROC AUC: {roc_auc_score(y_test, knn.predict_proba(X_test_knn)[:, 1])}")
The KNN classifier seems to perform slightly worse than the logistic regression model. Part of the reason for this is that we only used three features for the KNN classifier, while we used all features for the logistic regression model.
Note: Mini-Exercise
Try different values of n_neighbors (k) and plot how it affects the performance of the KNN classifier. Hint: You can use a loop to train the KNN classifier for different values of k and store the performance metrics in arrays. Then, you can plot the performance metrics as a function of k.
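A possible starting point, reusing X_train_knn, X_test_knn, y_train, and y_test from above (refitting the KNN classifier for each k on the full training set can take a while):
k_values = [1, 3, 5, 10, 20]
recalls = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train_knn, y_train)
    recalls.append(recall_score(y_test, knn_k.predict(X_test_knn)))
plt.plot(k_values, recalls, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Recall')
plt.show()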
Note: Mini-Exercise
Implement a 5-fold cross-validation for the logistic regression and K-Nearest Neighbors classifiers. Use the cross_val_score function from the sklearn.model_selection module.
# Import the cross_val_score function
from sklearn.model_selection import cross_val_score
# Apply 5-fold cross-validation to the classifier clf
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
# Mean of the cross-validation scores
cv_scores.mean()
3.4.4 Conclusions
In this section, we have seen how to implement a logistic regression model and a K-Nearest Neighbors classifier in Python. We have loaded a dataset, explored and preprocessed it, and trained a logistic regression model and a K-Nearest Neighbors classifier to predict whether a transaction is fraudulent or not. We have evaluated the model using different performance metrics and have seen how the choice of threshold affects the performance of the model.
There are many ways to improve the performance of the models. For example, we could try different machine learning models, or engineer new features. We could also try to deal with the imbalanced dataset by using techniques such as oversampling or undersampling. However, this is beyond the scope of this section.