{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "format:\n",
    "  html:\n",
    "    code-links:\n",
    "      - text: \"Notebook: Practice Session I\"\n",
    "        icon: file-code\n",
    "        href: /notebooks/practice_session_I.ipynb\n",
    "        target: _blank\n",
    "      - text: \"Google Colab: Practice Session I\"\n",
    "        icon: file-code\n",
    "        href: https://colab.research.google.com/github/jmarbet/ai-big-data-course/blob/main/notebooks/practice_session_I.ipynb\n",
    "---\n",
    "\n",
    "# Practice Session I\n",
    "\n",
    "The application in this practice session is inspired by the empirical example in \"Measuring the model risk-adjusted performance of machine learning algorithms in credit default prediction\" by @AlonsoRobisco2022. However, since we are not interested in model risk-adjusted performance, the application will purely focus on the implementation of machine learning algorithms for loan default prediction.\n",
    "\n",
    "## Problem Setup\n",
    "\n",
    "The dataset comes from the Kaggle competition [\"Give Me Some Credit\"](https://www.kaggle.com/c/GiveMeSomeCredit). The competition description reads as follows:\n",
    "\n",
    ">Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. \n",
    ">\n",
    ">Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.\n",
    ">\n",
    ">The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.\n",
    ">\n",
    ">Historical data are provided on 250,000 borrowers and the prize pool is \\$5,000 (\\$3,000 for first, \\$1,500 for second and \\$500 for third).\n",
    "\n",
    "Unfortunately, there won't be any prize money today. However, the experience that you can gain from working through an application like this can be invaluable. So, in a way, you are still winning!\n",
    "\n",
    "\n",
    "## Dataset\n",
    "\n",
    "Let's download the dataset, unzip it, and place it in a folder called `data`, unless the files are already there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:52.722444Z",
     "iopub.status.busy": "2026-01-19T18:49:52.722186Z",
     "iopub.status.idle": "2026-01-19T18:49:52.728608Z",
     "shell.execute_reply": "2026-01-19T18:49:52.728017Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset already downloaded!\n"
     ]
    }
   ],
   "source": [
    "from io import BytesIO\n",
    "from urllib.request import urlopen\n",
    "from zipfile import ZipFile\n",
    "import os.path\n",
    "\n",
    "# Download only if one of the expected files is missing\n",
    "if not os.path.isfile('data/Data Dictionary.xls') or not os.path.isfile('data/cs-training.csv'):\n",
    "\n",
    "    print('Downloading dataset...')\n",
    "\n",
    "    # Define the dataset to be downloaded\n",
    "    zipurl = 'https://www.kaggle.com/api/v1/datasets/download/brycecf/give-me-some-credit-dataset'\n",
    "\n",
    "    # Download and unzip the dataset in the data folder\n",
    "    with urlopen(zipurl) as zipresp:\n",
    "        with ZipFile(BytesIO(zipresp.read())) as zfile:\n",
    "            zfile.extractall('data')\n",
    "\n",
    "    print('DONE!')\n",
    "\n",
    "else:\n",
    "\n",
    "    print('Dataset already downloaded!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we look at the data dictionary provided with the dataset. It lists the available variables and what they represent:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:52.769750Z",
     "iopub.status.busy": "2026-01-19T18:49:52.769479Z",
     "iopub.status.idle": "2026-01-19T18:49:53.972534Z",
     "shell.execute_reply": "2026-01-19T18:49:53.972048Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "</style>\n",
       "<table id=\"T_065c5\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th id=\"T_065c5_level0_col0\" class=\"col_heading level0 col0\" >Variable Name</th>\n",
       "      <th id=\"T_065c5_level0_col1\" class=\"col_heading level0 col1\" >Description</th>\n",
       "      <th id=\"T_065c5_level0_col2\" class=\"col_heading level0 col2\" >Type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row0_col0\" class=\"data row0 col0\" >SeriousDlqin2yrs</td>\n",
       "      <td id=\"T_065c5_row0_col1\" class=\"data row0 col1\" >Person experienced 90 days past due delinquency or worse </td>\n",
       "      <td id=\"T_065c5_row0_col2\" class=\"data row0 col2\" >Y/N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row1_col0\" class=\"data row1 col0\" >RevolvingUtilizationOfUnsecuredLines</td>\n",
       "      <td id=\"T_065c5_row1_col1\" class=\"data row1 col1\" >Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits</td>\n",
       "      <td id=\"T_065c5_row1_col2\" class=\"data row1 col2\" >percentage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row2_col0\" class=\"data row2 col0\" >age</td>\n",
       "      <td id=\"T_065c5_row2_col1\" class=\"data row2 col1\" >Age of borrower in years</td>\n",
       "      <td id=\"T_065c5_row2_col2\" class=\"data row2 col2\" >integer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row3_col0\" class=\"data row3 col0\" >NumberOfTime30-59DaysPastDueNotWorse</td>\n",
       "      <td id=\"T_065c5_row3_col1\" class=\"data row3 col1\" >Number of times borrower has been 30-59 days past due but no worse in the last 2 years.</td>\n",
       "      <td id=\"T_065c5_row3_col2\" class=\"data row3 col2\" >integer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row4_col0\" class=\"data row4 col0\" >DebtRatio</td>\n",
       "      <td id=\"T_065c5_row4_col1\" class=\"data row4 col1\" >Monthly debt payments, alimony,living costs divided by monthy gross income</td>\n",
       "      <td id=\"T_065c5_row4_col2\" class=\"data row4 col2\" >percentage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row5_col0\" class=\"data row5 col0\" >MonthlyIncome</td>\n",
       "      <td id=\"T_065c5_row5_col1\" class=\"data row5 col1\" >Monthly income</td>\n",
       "      <td id=\"T_065c5_row5_col2\" class=\"data row5 col2\" >real</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row6_col0\" class=\"data row6 col0\" >NumberOfOpenCreditLinesAndLoans</td>\n",
       "      <td id=\"T_065c5_row6_col1\" class=\"data row6 col1\" >Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards)</td>\n",
       "      <td id=\"T_065c5_row6_col2\" class=\"data row6 col2\" >integer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row7_col0\" class=\"data row7 col0\" >NumberOfTimes90DaysLate</td>\n",
       "      <td id=\"T_065c5_row7_col1\" class=\"data row7 col1\" >Number of times borrower has been 90 days or more past due.</td>\n",
       "      <td id=\"T_065c5_row7_col2\" class=\"data row7 col2\" >integer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row8_col0\" class=\"data row8 col0\" >NumberRealEstateLoansOrLines</td>\n",
       "      <td id=\"T_065c5_row8_col1\" class=\"data row8 col1\" >Number of mortgage and real estate loans including home equity lines of credit</td>\n",
       "      <td id=\"T_065c5_row8_col2\" class=\"data row8 col2\" >integer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row9_col0\" class=\"data row9 col0\" >NumberOfTime60-89DaysPastDueNotWorse</td>\n",
       "      <td id=\"T_065c5_row9_col1\" class=\"data row9 col1\" >Number of times borrower has been 60-89 days past due but no worse in the last 2 years.</td>\n",
       "      <td id=\"T_065c5_row9_col2\" class=\"data row9 col2\" >integer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_065c5_row10_col0\" class=\"data row10 col0\" >NumberOfDependents</td>\n",
       "      <td id=\"T_065c5_row10_col1\" class=\"data row10 col1\" >Number of dependents in family excluding themselves (spouse, children etc.)</td>\n",
       "      <td id=\"T_065c5_row10_col2\" class=\"data row10 col2\" >integer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x117b11be0>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
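    "import pandas as pd\n",
    "\n",
    "# Reading .xls files requires the xlrd package; header=1 takes the column names from the second row of the sheet\n",
    "data_dict = pd.read_excel('data/Data Dictionary.xls', header=1)\n",
    "data_dict.style.hide()  # Hide the index when displaying the table"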
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variable $y$ that we want to predict is `SeriousDlqin2yrs`, which indicates whether a person became 90 days or more past due on a payment (serious delinquency) within the two-year window that the competition description refers to (\"the next two years\"). This target variable is $1$ if the loan defaults (i.e., serious delinquency occurred) and $0$ if it does not (i.e., no serious delinquency occurred). The other variables, such as the age and the monthly income of the borrower, are features that we can use to predict this target.\n",
    "\n",
    "\n",
    "## Putting the Problem into the Context of the Course\n",
    "\n",
    "Given the description of the competition and the dataset, we can see that this is a **supervised learning problem**. We have a target variable that we want to predict, and we have features that we can use to predict this target variable. The target variable is binary, i.e., it can take two values: 0 or 1. The value 0 indicates that the loan will not default, while the value 1 indicates that the loan will default. Thus, this is a **binary classification problem**.\n",
    "\n",
    "\n",
    "## Setting up the Environment\n",
    "\n",
    "We start by importing the necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:53.975194Z",
     "iopub.status.busy": "2026-01-19T18:49:53.974726Z",
     "iopub.status.idle": "2026-01-19T18:49:55.316089Z",
     "shell.execute_reply": "2026-01-19T18:49:55.315499Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and loading the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.318837Z",
     "iopub.status.busy": "2026-01-19T18:49:55.318512Z",
     "iopub.status.idle": "2026-01-19T18:49:55.461063Z",
     "shell.execute_reply": "2026-01-19T18:49:55.460405Z"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('data/cs-training.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "Note that the exercises build on each other. You can sometimes skip an exercise, but the results of later exercises depend on the earlier ones. If you get stuck, move on to the next exercise and come back to the previous one later.\n",
    "\n",
    "\n",
    "### Exercise 1: Familiarization with the Dataset\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Display the first 5 rows of the dataset. What do you notice about the column names?\n",
    "2. There appears to be an unnecessary index column. Identify it and remove it from the DataFrame\n",
    "3. Use `.info()` to check the data types and identify which columns have missing values\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- The `.head()` method shows the first rows\n",
    "- Look for columns that seem to duplicate the index\n",
    "- The `axis` parameter in `.drop()` specifies whether you're dropping rows or columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.463721Z",
     "iopub.status.busy": "2026-01-19T18:49:55.463490Z",
     "iopub.status.idle": "2026-01-19T18:49:55.466126Z",
     "shell.execute_reply": "2026-01-19T18:49:55.465541Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Understanding the Target Variable\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. What is the proportion of defaulted vs non-defaulted loans in the dataset? Use `value_counts(normalize=True)`\n",
    "2. Based on this distribution, would you say the dataset is balanced or imbalanced?\n",
    "3. Why might class imbalance be problematic for machine learning? What evaluation metrics should we be careful about?\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- The target variable is `SeriousDlqin2yrs`\n",
    "- Think about what accuracy would be if a model just predicted the majority class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.468753Z",
     "iopub.status.busy": "2026-01-19T18:49:55.468505Z",
     "iopub.status.idle": "2026-01-19T18:49:55.471164Z",
     "shell.execute_reply": "2026-01-19T18:49:55.470591Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: Handling Missing Values and Data Quality Issues\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Use `.dropna()` combined with `value_counts()` to check if dropping missing values significantly changes the target variable distribution\n",
    "2. Drop the rows with missing values for the rest of the exercises.\n",
    "3. How many rows were dropped due to missing values?\n",
    "4. Verify that there are no missing values remaining in the dataset.\n",
    "5. Check for duplicate rows. How many are there? Should you remove them?\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- Use `df.loc[df.isna().any(axis=1)]` to select rows with any missing values\n",
    "- Compare the mean and standard deviation of the target (e.g., with `.describe()`) before and after dropping the rows with missing values\n",
    "\n",
    ":::{.callout-note}\n",
    "\n",
    "In a real application, you would want to consider carefully how to handle missing data rather than just dropping rows. Depending on the context, imputation methods or models that handle missing data directly might be more appropriate. Furthermore, in this specific dataset, dropping the rows with missing values also happens to remove some other data quality issues. In practice, you would want to investigate and address these issues separately.\n",
    "\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.473479Z",
     "iopub.status.busy": "2026-01-19T18:49:55.473251Z",
     "iopub.status.idle": "2026-01-19T18:49:55.475752Z",
     "shell.execute_reply": "2026-01-19T18:49:55.475246Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4: Exploratory Data Analysis\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Create a pie chart (or histogram) showing the distribution of the target variable in your cleaned dataset\n",
    "2. Generate a pair plot for `age`, `MonthlyIncome`, `DebtRatio`, and `SeriousDlqin2yrs` using seaborn's `pairplot()` with `hue='SeriousDlqin2yrs'`\n",
    "3. Calculate and visualize correlation matrices using a heatmap\n",
    "4. Which features appear most correlated with loan default?\n",
    "5. Are there any features that are highly correlated with each other? What issues could this cause?\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- Use `sns.pairplot()` with the `hue` parameter for coloring by class\n",
    "- Use `df.corr()` for Pearson correlation\n",
    "- `sns.heatmap()` can visualize correlation matrices\n",
    "- Use `np.triu()` to create a mask for the upper triangle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.478332Z",
     "iopub.status.busy": "2026-01-19T18:49:55.478099Z",
     "iopub.status.idle": "2026-01-19T18:49:55.480633Z",
     "shell.execute_reply": "2026-01-19T18:49:55.480131Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5: Preparing Data for Machine Learning Algorithms\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Separate features (`X`) from the target variable (`y`) \n",
    "2. Split the data into training (80%) and test (20%) sets using `train_test_split`. Use `stratify=y` to maintain class proportions and `random_state=42` for reproducibility\n",
    "3. Apply `MinMaxScaler` to normalize the features. **Important:** Fit the scaler only on training data, then transform both training and test data\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- Use `df.drop('column_name', axis=1)` for features\n",
    "- The `stratify` parameter keeps the class proportions the same in the training and test sets; it does not balance the classes\n",
    "- Fitting the scaler on test data causes \"data leakage\" - avoid this!\n",
    "- Create a helper function for scaling if you want cleaner code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.482750Z",
     "iopub.status.busy": "2026-01-19T18:49:55.482537Z",
     "iopub.status.idle": "2026-01-19T18:49:55.484964Z",
     "shell.execute_reply": "2026-01-19T18:49:55.484478Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6: Defining Evaluation Metrics\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Write a function `evaluate_model(clf, X_train, y_train, X_test, y_test, label='')` that:\n",
    "   - Computes predictions and predicted probabilities\n",
    "   - Prints Accuracy, Precision, Recall, and ROC AUC for both training and test sets\n",
    "   - Plots the ROC curve for both training and test sets\n",
    "2. Why is it important to evaluate on both training and test data?\n",
    "3. Given our imbalanced dataset, which metric(s) should we focus on and why?\n",
    "\n",
    "**Hints:**\n",
    "\n",
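    "- Use `clf.predict()` for class predictions and `clf.predict_proba()` for probabilities; column 1 of its output (`[:, 1]`) holds the probability of class 1, which is what `roc_auc_score` and `roc_curve` need\n",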
    "- Import metrics from `sklearn.metrics`: `accuracy_score`, `precision_score`, `recall_score`, `roc_auc_score`, `roc_curve`\n",
    "- Plot both curves on the same figure for comparison\n",
    "- Add a diagonal reference line for the ROC plot\n",
    "- Use `label` parameter to differentiate models in outputs, e.g., `label='Logistic Regression'`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.487140Z",
     "iopub.status.busy": "2026-01-19T18:49:55.486939Z",
     "iopub.status.idle": "2026-01-19T18:49:55.489338Z",
     "shell.execute_reply": "2026-01-19T18:49:55.488825Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 7: Training Classification Models\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "Train the following models and evaluate each using your evaluation function:\n",
    "\n",
    "1. **Logistic Regression**: Use `penalty=None`, `solver='lbfgs'`, `max_iter=5000`\n",
    "2. **Decision Tree**: Use `max_depth=7`\n",
    "3. **Random Forest**: Use `max_depth=20`, `n_estimators=100`\n",
    "4. **XGBoost**: Use `max_depth=5`, `n_estimators=40`, `random_state=0`\n",
    "5. **Neural Network**: Use `MLPClassifier` with `activation='relu'`, `solver='adam'`, `hidden_layer_sizes=(300, 200, 100)`, `max_iter=300`, `random_state=42`\n",
    "\n",
    "For each model:\n",
    "\n",
    "- Fit on training data\n",
    "- Evaluate using your evaluation function\n",
    "- Note the training vs test performance\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- Import from: `sklearn.linear_model`, `sklearn.tree`, `sklearn.ensemble`, `sklearn.neural_network`, `xgboost`\n",
    "- Use `.fit(X_train, y_train)` to train each model\n",
    "- Watch for signs of overfitting (training >> test performance)\n",
    "- Training the neural network may take several minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.491679Z",
     "iopub.status.busy": "2026-01-19T18:49:55.491464Z",
     "iopub.status.idle": "2026-01-19T18:49:55.493990Z",
     "shell.execute_reply": "2026-01-19T18:49:55.493433Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 8: Results Comparison\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Create a DataFrame comparing all models with columns: Model, ROC AUC (Train), ROC AUC (Test)\n",
    "2. Which model performed best on the test set?\n",
    "3. Which model showed the largest gap between training and test performance? What does this suggest?\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- Use `pd.DataFrame()` with a dictionary\n",
    "- A large gap between training and test performance indicates overfitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.496171Z",
     "iopub.status.busy": "2026-01-19T18:49:55.495961Z",
     "iopub.status.idle": "2026-01-19T18:49:55.498419Z",
     "shell.execute_reply": "2026-01-19T18:49:55.497935Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 9: Feature Engineering\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. Create squared versions of all features and add them to the dataset (use `.pow(2)` and `.add_suffix('_sq')`)\n",
    "2. Re-split and re-scale the data with the new features\n",
    "3. Retrain all models with the expanded feature set\n",
    "4. Compare the new results with the original. Did feature engineering help?\n",
    "\n",
    "**Optional:** Add a Logistic Regression with L1 (LASSO) penalty using `penalty='l1'` and `solver='liblinear'`. How does it perform?\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- Use `X.assign(**X.pow(2).add_suffix('_sq'))` for compact feature creation\n",
    "- Remember to fit a new scaler on the new training data\n",
    "- LASSO can help with feature selection when you have many features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-01-19T18:49:55.500577Z",
     "iopub.status.busy": "2026-01-19T18:49:55.500354Z",
     "iopub.status.idle": "2026-01-19T18:49:55.502802Z",
     "shell.execute_reply": "2026-01-19T18:49:55.502331Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here. Add additional code cells as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 10: Reflection and Discussion\n",
    "\n",
    "**Tasks:**\n",
    "\n",
    "1. What additional steps could improve model performance (e.g., hyperparameter tuning, handling class imbalance, more feature engineering)?\n",
    "2. In a real banking context, would you prefer a model with higher precision or higher recall? Why?\n",
    "3. What are the ethical considerations when deploying such a model for loan decisions?\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "- No code required; reflect on practical and ethical aspects"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3",
   "path": "/usr/local/share/jupyter/kernels/python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}