{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "format:\n", "  html:\n", "    code-links:\n", "      - text: \"Notebook: Practice Session I\"\n", "        icon: file-code\n", "        href: /notebooks/practice_session_I.ipynb\n", "        target: _blank\n", "      - text: \"Google Colab: Practice Session I\"\n", "        icon: file-code\n", "        href: https://colab.research.google.com/github/jmarbet/ai-big-data-course/blob/main/notebooks/practice_session_I.ipynb\n", "---\n", "\n", "# Practice Session I\n", "\n", "The application in this practice session is inspired by the empirical example in \"Measuring the model risk-adjusted performance of machine learning algorithms in credit default prediction\" by @AlonsoRobisco2022. However, since we are not interested in model risk-adjusted performance, the application focuses purely on implementing machine learning algorithms for loan default prediction.\n", "\n", "## Problem Setup\n", "\n", "The dataset we will use comes from the Kaggle competition [\"Give Me Some Credit\"](https://www.kaggle.com/c/GiveMeSomeCredit). The competition description reads as follows:\n", "\n", ">Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. \n", ">\n", ">Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted.
This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.\n", ">\n", ">The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.\n", ">\n", ">Historical data are provided on 250,000 borrowers and the prize pool is $5,000 ($3,000 for first, $1,500 for second and $500 for third).\n", "\n", "Unfortunately, there won't be any prize money today. However, the experience that you can gain from working through an application like this can be invaluable. So, in a way, you are still winning!\n", "\n", "\n", "## Dataset\n", "\n", "Let's download the dataset, unzip it, and place it in a folder called `data`, unless you have already done so." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-01-19T18:49:52.722444Z", "iopub.status.busy": "2026-01-19T18:49:52.722186Z", "iopub.status.idle": "2026-01-19T18:49:52.728608Z", "shell.execute_reply": "2026-01-19T18:49:52.728017Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset already downloaded!\n" ] } ], "source": [ "from io import BytesIO\n", "from urllib.request import urlopen\n", "from zipfile import ZipFile\n", "import os.path\n", "\n", "# Check whether both dataset files already exist\n", "if not os.path.isfile('data/Data Dictionary.xls') or not os.path.isfile('data/cs-training.csv'):\n", "\n", "    print('Downloading dataset...')\n", "\n", "    # Define the dataset to be downloaded\n", "    zipurl = 'https://www.kaggle.com/api/v1/datasets/download/brycecf/give-me-some-credit-dataset'\n", "\n", "    # Download and unzip the dataset in the data folder\n", "    with urlopen(zipurl) as zipresp:\n", "        with ZipFile(BytesIO(zipresp.read())) as zfile:\n", "            zfile.extractall('data')\n", "\n", "    print('DONE!')\n", "\n", "else:\n", "\n", "    print('Dataset already downloaded!')" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Next, we can look at the data dictionary provided with the dataset. It lists the available variables and what they represent." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-01-19T18:49:52.769750Z", "iopub.status.busy": "2026-01-19T18:49:52.769479Z", "iopub.status.idle": "2026-01-19T18:49:53.972534Z", "shell.execute_reply": "2026-01-19T18:49:53.972048Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "
| Variable Name | Description | Type |
|---|---|---|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| DebtRatio | Monthly debt payments, alimony,living costs divided by monthy gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
| NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer |