10 Overview of Generative AI
In previous chapters, we have learned about various machine learning techniques and their applications. In this chapter, we will focus on a specific subset of artificial intelligence known as Generative AI, which has been a primary driver of the recent surge of interest in the field. We will explore what Generative AI is, how it differs from other types of AI, and some of its applications.
10.1 What is Generative AI?
Generative AI refers to a class of artificial intelligence models that are designed to generate new content, such as text, images, audio, or even video. These models learn from existing data and can create new data that is similar in style and content to the training data. Generative AI has been used in various applications, including natural language processing, computer vision, and creative arts.
In machine learning, one typically distinguishes between discriminative and generative models. All of the models we have seen so far (e.g., logistic regression, decision trees, random forests, and feedforward neural networks) are discriminative. They learn the conditional distribution \(P(y \mid x)\), i.e. the mapping from inputs to outputs. A discriminative model can tell you whether an email is spam or not, but it cannot write you a new email.
Generative models, by contrast, learn the joint distribution of the data, \(P(x, y)\), or simply \(P(x)\) when there are no labels. Because they capture how the data itself is structured, they can produce new samples that resemble the training data. A generative language model, for instance, learns how words and sentences are distributed in natural language, which enables it to generate coherent new text.
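To make this concrete, here is a deliberately minimal generative model of text: a bigram model that estimates \(P(x_t \mid x_{t-1})\) by counting word pairs and then samples new word sequences from the learned distribution. The tiny corpus is invented for illustration; a real generative language model learns a vastly richer distribution from billions of words.

```python
import random
from collections import Counter, defaultdict

# A tiny, made-up corpus (purely illustrative).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Estimate P(next word | current word) by counting bigrams; together
# with the distribution of the first word, these conditionals
# factorize a joint distribution P(x) over word sequences.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(start, length, seed=0):
    """Sample a new word sequence from the learned distribution."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        options = counts[words[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return " ".join(words)

print(generate("the", 8))
```

Because the model captures how the data itself is structured, every sequence it produces is made of locally plausible word transitions, even though the sequence as a whole never appeared in the corpus.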
10.2 From Task-Specific Models to Foundation Models
In the previous chapters, we followed a common workflow: for each new task, such as credit-default prediction or sentiment analysis, we trained a dedicated model from scratch on a task-specific dataset. This task-specific paradigm works well when labeled data is plentiful, but it has limitations. Every new task requires its own large labeled dataset. Because each model starts from scratch, it has no prior knowledge and must learn everything from the task-specific data, which is wasteful when a related task shares a similar structure (e.g., predicting loan defaults in a different sector or country). Moreover, training a new model for every task is computationally expensive and time-consuming.
Recent years have seen a fundamental shift. Instead of training small, task-specific models, the idea is to first train a single, very large model on a massive, general-purpose dataset and then adapt it to specific tasks. These broadly capable models are known as foundation models (Bommasani et al. 2021). The training process is typically divided into two stages:
Pre-training: The model learns from a large, diverse dataset (e.g., all of Wikipedia, or a large portion of the internet) using a self-supervised learning objective (e.g., predicting the next word in a sentence). This allows the model to acquire a broad understanding of language, concepts, and even some reasoning abilities.
Fine-tuning: The pre-trained model is further trained on a smaller, task-specific dataset to adapt it to a particular application (e.g., sentiment analysis, question answering).
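The two stages can be illustrated with a toy sketch (the class, method names, and corpora below are invented for illustration). The key point is that the same self-supervised objective, predicting the next word, is applied first to a general corpus and then to a small domain-specific one: the "label" for each word is simply the word that follows it in the raw text, so no human annotation is needed.

```python
from collections import Counter, defaultdict

class TinyLM:
    """Toy next-word model illustrating pre-training and fine-tuning."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def pretrain(self, corpus):
        # Self-supervised: the target for each word is the next word
        # in the raw text, so the data labels itself.
        for prev, nxt in zip(corpus, corpus[1:]):
            self.counts[prev][nxt] += 1

    def finetune(self, domain_corpus):
        # Same objective on a small task-specific corpus; the
        # pre-trained counts are updated rather than re-learned.
        self.pretrain(domain_corpus)

    def predict_next(self, word):
        if not self.counts[word]:
            return None
        return self.counts[word].most_common(1)[0][0]

lm = TinyLM()
lm.pretrain("the rate was low and the rate was stable".split())
lm.finetune("the policy rate rose".split())
print(lm.predict_next("rate"))  # combines general and domain knowledge
```

After fine-tuning, the model retains what it learned from the general corpus while also reflecting the domain corpus, which is the essence of adapting a pre-trained model rather than starting over.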
This workflow is enabled by transfer learning: the knowledge acquired during pre-training transfers to new tasks, which can then be learned with far less data and compute than training from scratch. In some cases, the pre-trained model performs well on new tasks with little or no fine-tuning, for instance when it is simply given the right prompts; this is known as zero-shot or few-shot learning.
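As an illustration of few-shot learning via prompting, the snippet below constructs a prompt that specifies a sentiment-classification task purely through in-context examples (the sentences are invented for illustration). A capable model completes the final line without any update to its parameters.

```python
# A few-shot prompt: the task is demonstrated by examples inside the
# input itself; the model's weights are never updated.
prompt = """Classify the sentiment of each sentence as positive or negative.

Sentence: The central bank's report was reassuring.
Sentiment: positive

Sentence: Markets reacted badly to the announcement.
Sentiment: negative

Sentence: Growth forecasts were revised upwards.
Sentiment:"""

print(prompt)
```

With zero examples in the prompt, the same setup would be a zero-shot task: the model must infer what "positive" and "negative" mean from the instruction alone.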
The same three factors we discussed in the introductory chapter (see Section 1.5.2) have converged to make foundation models possible: the scale of available data, advances in compute (GPUs, TPUs), and improvements in model architectures, in particular the Transformer. Because pre-training is extremely computationally expensive, it is typically carried out by large technology companies or research labs. The resulting models are then released, either as commercial APIs or as openly available model weights, so that the broader community can fine-tune and deploy them without bearing the cost of training from scratch.
Well-known examples include BERT (Google, 2018), the GPT series (OpenAI, 2018–), Claude (Anthropic, 2023–), LLaMA (Meta, 2023–), and Gemini (Google, 2023–). The pace of development has been remarkable: within just a few years, these models have gone from research prototypes to widely deployed products. In the next chapter, we will take a closer look at large language models (LLMs), the family of foundation models specialized in generating and understanding text, and arguably the most impactful class of foundation models to date.
10.3 Applications of Generative AI
Generative AI has found applications across a wide range of domains. While large language models have received the most public attention, generative models for images, audio, video, and structured data are equally transformative. Below we highlight some broad application areas.
Text generation and analysis. LLMs can summarize documents, translate between languages, classify text, extract structured information, and draft new content. In economics and central banking, this includes tasks like condensing policy reports, analyzing the sentiment of central bank communication, or assisting with literature reviews.
Image and video generation. Models such as DALL-E, Midjourney, and Stable Diffusion can generate realistic images from text descriptions. Video generation models (e.g., Sora) extend this to moving images. Applications range from creative industries and marketing to generating synthetic training data for computer vision systems.
Code generation. Specialized models (e.g., GitHub Copilot) and general-purpose LLMs can write, explain, and debug code. This has transformed software development workflows, enabling faster prototyping and automated testing and making programming more accessible to domain experts who are not professional software engineers.
Audio and speech. Generative models can synthesize realistic speech from text (text-to-speech), transcribe audio recordings (speech-to-text), and even generate music. These capabilities are used in virtual assistants, accessibility tools, and media production.