Large Language Models (LLMs) are revolutionizing the world of artificial intelligence—from powering chatbots and search engines to transforming how we write, code, and make decisions. But what exactly are these models, and why are they so powerful? In this beginner-friendly guide, we break down the core ideas behind foundation models, how they are built using pre-training and self-supervised learning, and what makes them so effective across a wide range of tasks. Whether you’re a developer, entrepreneur, or simply curious about AI, this post will give you a clear and practical understanding of the technologies shaping the future of language and intelligence.
What Are Large Language Models?
In recent years, Large Language Models (LLMs) have emerged as the driving force behind some of the most transformative applications in artificial intelligence—enabling machines to understand, generate, and even reason with human language. But how do these models actually work? What makes them so powerful? And why are they referred to as “foundation models”?
To answer these questions, we turn to the influential 2025 publication Foundations of Large Language Models by Tong Xiao and Jingbo Zhu — two leading researchers from the NLP Lab at Northeastern University and NiuTrans Research. Their book offers a clear, structured introduction to the core concepts behind LLMs, covering topics such as pre-training, prompting, model alignment, and scaling.
In this blog post, we distill the key insights from their work into a beginner-friendly guide. Whether you’re a developer, tech enthusiast, or AI newcomer, you’ll learn what LLMs are, how they’re trained, and why they matter in today’s AI-powered world.
Main Ideas, Arguments, and Concepts
1. The Paradigm Shift in NLP
The book underscores a fundamental shift in NLP: from task-specific, supervised models to a pre-training–fine-tuning paradigm. Large language models (LLMs) learn generalized linguistic and world knowledge from massive unlabeled datasets through self-supervised pre-training.
2. Architectures of LLMs
Three core model types are outlined: encoder-only, decoder-only, and encoder-decoder architectures. Each serves different purposes—understanding, generation, or both—and supports diverse applications such as classification, translation, and question answering.
3. Pre-training and Fine-tuning
The book details how LLMs are initially trained on massive corpora using objectives like masked language modeling or autoregressive generation. These models are then fine-tuned or prompted for downstream tasks.
4. Prompting and In-context Learning
Prompting is presented as a flexible alternative to fine-tuning, allowing models to be steered using natural language instructions. Techniques like zero-shot and few-shot learning via in-context examples are emphasized.
5. Scaling and Efficiency
Chapters explain the practical challenges of training massive models, including data preparation, distributed computing, and handling long sequences using memory and architecture optimizations.
6. Alignment with Human Intent
Aligning LLMs to human values and instructions is critical. This is done through instruction fine-tuning and reinforcement learning from human feedback (RLHF), ensuring the models are safe and useful in real-world applications.
Practical Lessons for Leaders and Entrepreneurs
1. Invest in Generalization, Not Specialization
Building on foundation models enables rapid adaptation across tasks without rebuilding from scratch, saving time and resources.
2. Prioritize Data Quality and Diversity
The effectiveness of LLMs hinges on the scale and variety of training data. Entrepreneurs should prioritize data acquisition strategies.
3. Utilize Prompting for Agility
Instead of investing in multiple task-specific models, leaders can use prompting techniques to pivot models toward new use cases rapidly.
4. Embrace Scalable Infrastructure
Training and deploying LLMs require scalable cloud or distributed infrastructure. Investing in this early supports long-term growth.
5. Focus on Human Alignment
User trust and product effectiveness depend on models behaving in expected, safe ways. Incorporating human feedback mechanisms is key.
6. Prepare for Multi-lingual and Cross-domain Applications
Foundation models are inherently capable of generalizing across languages and domains, opening international and cross-sector opportunities.
Chapter 1: Pre-training
Chapter 1 of Foundations of Large Language Models presents a comprehensive overview of the pre-training phase in natural language processing (NLP), which underpins the functionality of modern large language models (LLMs). The chapter explores various pre-training methodologies, model types, and techniques, culminating in a focused discussion on BERT (Bidirectional Encoder Representations from Transformers) as a representative example. It emphasizes the transition from task-specific supervised models to a more efficient and general paradigm based on large-scale pre-training and fine-tuning.
1.1 Pre-training NLP Models
Pre-training is described as a foundational process that enables models to learn general-purpose language features from large unlabeled datasets. The process involves two primary steps. First, the model parameters are optimized on a pre-training task. Unlike traditional supervised learning, these tasks are not directly tied to specific downstream objectives. Second, the pre-trained model is adapted for various applications using labeled data or prompts.
Three major types of pre-training are introduced:
- Unsupervised Pre-training
This involves optimizing a model without explicit task labels, using objectives such as minimizing reconstruction errors in autoencoders. While historically useful, its role today is largely preparatory, aiding in stability and convergence of subsequent training.
- Supervised Pre-training
This approach uses labeled data for a task different from the target application. The pre-trained model is later adapted by fine-tuning on the target task. It is straightforward but limited by the availability of large-scale labeled data.
- Self-supervised Pre-training
Self-supervision uses the data itself to generate labels, allowing the model to learn from raw text. It has become the dominant strategy, enabling models to generalize well across diverse tasks with minimal additional supervision.
1.2 Self-supervised Pre-training Tasks
Self-supervised tasks vary depending on the architecture—decoder-only, encoder-only, or encoder-decoder. Each approach structures the pre-training objective differently.
- Decoder-only Pre-training
A decoder-only model predicts the next token given preceding tokens. The objective is to maximize the likelihood of a sequence, making it suitable for language generation tasks. This setup uses a cross-entropy loss over each position in the sequence and aligns with traditional language modeling.
- Encoder-only Pre-training
These models aim to build strong sentence or document representations. Masked Language Modeling (MLM) is the core task, where some tokens in a sequence are masked and the model must predict them. This allows the model to learn from both left and right contexts, offering a bidirectional understanding.
- Permuted Language Modeling
To address limitations of MLM, such as the discrepancy between training and inference, this method predicts tokens in a randomized order. It enhances the model’s contextual awareness and can be implemented through attention masks without changing token order.
- Discriminative Training
This strategy trains encoders as classifiers using self-generated labels. Two examples are Next Sentence Prediction (NSP), where the model determines if one sentence follows another, and ELECTRA-style replaced token detection, where a discriminator identifies whether a token is original or altered.
- Encoder-Decoder Pre-training
Encoder-decoder models combine input comprehension with output generation. In masked encoder-decoder training, the model fills in masked spans or tokens. Denoising autoencoding is another variant, where corrupted input sequences are reconstructed, teaching both understanding and generation.
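To make the masked language modeling objective more concrete, here is a minimal sketch (not taken from the book) of how MLM training examples can be constructed: a fraction of token positions is chosen at random, replaced with a mask symbol, and the model is trained to recover the original tokens at exactly those positions. The mask symbol and 15% rate follow common BERT-style defaults.

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15  # assumed mask symbol and masking rate

def make_mlm_example(tokens):
    """Mask a random subset of tokens; the model must predict the originals."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < MASK_RATE:
            inputs.append(MASK)
            labels.append(tok)      # prediction target at this position
        else:
            inputs.append(tok)
            labels.append(None)     # no loss is computed at this position
    return inputs, labels

inputs, labels = make_mlm_example("the cat sat on the mat".split())
print(inputs)  # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(labels)  # e.g. [None, 'cat', None, None, None, None]
```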
1.3 Example: BERT
BERT exemplifies the encoder-only pre-training approach. Its standard model uses MLM and NSP as joint objectives, making it capable of capturing deep bidirectional context. The model includes several innovations:
- The Standard Model
BERT uses a Transformer encoder and is trained on MLM and NSP. Input sequences are tokenized, masked, and processed to predict missing elements and sentence relationships. It supports fine-tuning across various tasks.
- More Training and Larger Models
Extended training on larger datasets and scaling model size have shown significant performance improvements, demonstrating the benefits of over-parameterization when coupled with sufficient data.
- More Efficient Models
Variants like DistilBERT and MobileBERT aim to reduce model size and latency while retaining performance. Techniques include knowledge distillation and architectural modifications.
- Multi-lingual Models
Multilingual BERT (mBERT) is trained on text from over 100 languages, allowing for cross-lingual transfer learning and applications in multilingual contexts.
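A quick way to see the MLM objective in action is to query a pre-trained BERT checkpoint for a masked position. This short sketch assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; it is illustrative rather than anything prescribed by the book.

```python
from transformers import pipeline

# Ask BERT to fill in the masked token and print the top candidates with scores.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```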
1.4 Applying BERT Models
BERT can be applied to downstream tasks in two primary ways:
- Fine-tuning
The pre-trained encoder is combined with a task-specific head (e.g., a classifier). The model is then fine-tuned on labeled data for that task. This can involve updating all parameters or freezing the encoder and only training the task-specific head.
- Prompting
Although more common with generative models, prompting techniques can be adapted for encoder-based models. Inputs are structured as questions or instructions to guide the model’s outputs without additional training.
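The fine-tuning setup above can be sketched in a few lines. This is a minimal illustration, assuming PyTorch and the Hugging Face transformers library: a pre-trained encoder plus a small task-specific classification head, with an option to freeze the encoder and train only the head.

```python
import torch.nn as nn
from transformers import AutoModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels, freeze_encoder=False):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        if freeze_encoder:                      # optionally train only the head
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # [CLS] token representation
        return self.head(cls)                   # task-specific logits
```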
1.5 Summary
Chapter 1 concludes by summarizing the centrality of pre-training in modern NLP. It highlights the effectiveness of self-supervised learning, especially masked language modeling, in enabling flexible and powerful language models. The chapter also emphasizes that architectural choices and training strategies must be tailored to the intended use case, whether it’s language understanding, generation, or translation.
Chapter 2: Generative Models
Generative language models represent one of the most transformative advancements in modern NLP. These models, particularly large language models (LLMs), have led to the emergence of systems capable of understanding, generating, and even reasoning with natural language in a human-like manner. Chapter 2 begins by tracing the historical evolution from probabilistic models such as n-grams to powerful Transformer-based architectures. It emphasizes that the probabilistic modeling goal—predicting the likelihood of a token sequence—remains unchanged, even as the tools have drastically evolved.
The chapter introduces the idea that LLMs, like GPT, aim to model sequences by learning the conditional probabilities of tokens. These models are trained using deep neural networks to output probability distributions over vocabularies conditioned on preceding tokens, forming the basis of text generation.
2.1. Building LLMs: Core Concepts and Architectures
The foundational architecture discussed is the decoder-only Transformer. In this model, the next token is predicted based on all previous tokens in the sequence. This autoregressive behavior enables natural language generation. The Transformer blocks are composed of attention layers and feed-forward layers, and training is carried out by maximum likelihood estimation, where the objective is to maximize the probability the model assigns to the actual next tokens across a massive corpus.
Training LLMs involves the following steps:
- Tokenize Text Inputs
Raw text is divided into tokens, which serve as the model’s basic processing units.
- Apply the Transformer Decoder
At each position in a sequence, the decoder receives all previous tokens and computes a probability distribution for the next token.
- Optimize with Cross-Entropy Loss
The difference between predicted and true tokens is measured and used to update the model’s parameters through gradient descent.
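The three steps can be compressed into a single training step. This is a minimal sketch assuming PyTorch; `model` stands for any decoder that maps token ids to next-token logits. Shifting the sequence by one position turns ordinary text into (input, target) pairs for the cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def training_step(model, token_ids, optimizer):
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]    # predict the next token
    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                          # gradient update
    optimizer.step()
    return loss.item()
```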
2.2. Fine-tuning LLMs for Applications
Once a language model has been pre-trained, it can be fine-tuned for specific applications. The fine-tuning process uses a much smaller labeled dataset and allows the model to specialize in tasks like text classification, summarization, or dialogue generation.
The process includes:
- Define Input-Output Pairs
The dataset is reformatted into context (input) and response (output) sequences.
- Initialize with Pre-trained Weights
The model starts with the weights learned during the pre-training phase.
- Optimize for Target Task
Fine-tuning adjusts the parameters to optimize performance on the new data.
This allows models to generalize even to tasks not explicitly seen during fine-tuning, due to the broad knowledge acquired in pre-training.
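The following sketch walks through the three fine-tuning steps, assuming the Hugging Face transformers library; the gpt2 checkpoint and the example fields are placeholders, not choices made in the book.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # any pre-trained LM
model = AutoModelForCausalLM.from_pretrained("gpt2")     # initialize with pre-trained weights

example = {"input": "Summarize: The meeting covered next year's budget planning.",
           "output": "The meeting focused on budget planning for next year."}

# Reformat into a single training sequence; here the loss covers the whole
# sequence, though implementations often restrict it to the response tokens.
text = example["input"] + "\n" + example["output"] + tokenizer.eos_token
batch = tokenizer(text, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss    # optimize for the target task
```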
2.3. Aligning LLMs with Human Intent
A critical component in making LLMs useful and safe is aligning them with human goals. Pre-training does not guarantee models behave as expected, so additional supervision is often applied to tune their behavior.
This section emphasizes:
- Instruction Tuning
Models are fine-tuned on examples containing explicit instructions and desired responses. This helps activate general capabilities developed during pre-training.
- Supervised Learning on Diverse Tasks
Using labeled instruction datasets, LLMs are exposed to a wide variety of tasks, enhancing their zero-shot and few-shot learning capabilities.
- Balancing Pre-training and Fine-tuning
Although fine-tuning uses less data, it plays a vital role in shaping model behavior, especially for instruction-following.
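For a sense of what instruction-tuning data looks like, here are a few invented records in the common instruction-response format: each example pairs an explicit instruction with the desired output, and the examples deliberately span different task types.

```python
# Illustrative (invented) instruction-tuning records.
instruction_data = [
    {"instruction": "Translate to French: Good morning.",
     "response": "Bonjour."},
    {"instruction": "Classify the sentiment of: 'I loved this film.'",
     "response": "Positive"},
    {"instruction": "Answer in one sentence: Why is the sky blue?",
     "response": "Sunlight is scattered by air molecules, and blue light scatters the most."},
]
```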
2.4. Prompting LLMs for Task Execution
Prompting is the process of formatting inputs to guide a model to produce desired outputs without changing its parameters. It’s a lightweight, flexible alternative to fine-tuning and is central to the usability of modern LLMs.
The prompting process involves:
- Constructing Natural Language Prompts
A task is described in plain text.
- In-context Learning
Few-shot learning can be achieved by including a few task examples in the prompt. This helps the model generalize to unseen examples.
- Zero-shot and One-shot Approaches
LLMs can often perform new tasks without explicit examples, provided that the prompt is well-crafted.
Prompting reveals the emergent capabilities of LLMs and offers a powerful method to steer models without training overhead.
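To illustrate the difference between zero-shot and few-shot prompting, here is a small invented sentiment task. Both prompts are sent to the model as plain text; no parameters are updated in either case.

```python
task = "Decide whether the review is Positive or Negative."

zero_shot = f"{task}\nReview: The plot dragged and the acting was flat.\nSentiment:"

few_shot = (
    f"{task}\n"
    "Review: A delightful surprise from start to finish.\nSentiment: Positive\n"
    "Review: I want those two hours of my life back.\nSentiment: Negative\n"
    "Review: The plot dragged and the acting was flat.\nSentiment:"
)
# The few-shot prompt adds in-context examples that demonstrate the expected format.
```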
2.5. Training LLMs at Scale
Scaling up training is essential for high performance but comes with significant challenges in computation and data management. The section outlines four key areas:
- Data Preparation
Trillions of tokens are needed, and data quality is crucial. Filtering out low-quality or toxic content is necessary to ensure safe and reliable outputs.
- Model Modifications
Transformer architectures are adapted to enable stable training at scale. Examples include pre-norm residual connections and specialized normalization techniques like RMSNorm.
- Distributed Training
Massive parallelism is used to train models across many GPUs. This involves strategies for gradient synchronization and memory optimization.
- Scaling Laws
Empirical laws guide how performance scales with model size, dataset size, and compute. These insights help plan efficient training regimens.
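As an example of the architectural modifications mentioned above, here is a minimal PyTorch sketch of RMSNorm: instead of subtracting a mean and dividing by a standard deviation as LayerNorm does, it rescales activations by their root-mean-square and applies a learnable gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))    # learnable gain

    def forward(self, x):
        # Normalize by the root-mean-square over the feature dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```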
2.6. Handling Long Sequences
Traditional Transformers struggle with long sequences due to quadratic attention costs. This section explores solutions:
- Optimization from HPC Perspectives
Memory-efficient architectures and reduced precision computations help manage large sequence lengths.
- Efficient Architectures
Approaches such as sparse attention and linear attention are introduced to reduce computational overhead.
- KV Cache Management
Memory usage is improved by using fixed-size or compressed caches. This limits the memory growth during inference.
- External Memory Integration
k-nearest neighbor (k-NN) and other memory-based models augment the attention mechanism with past contexts stored in a database.
- Rotary Positional Encoding
Tokens are rotated in vector space to encode positions while preserving semantic similarity, enabling position extrapolation and interpolation.
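A minimal sketch of one common rotary positional encoding (RoPE) variant, assuming PyTorch, may help: pairs of dimensions in each query or key vector are rotated by an angle that grows with the token's position, so relative positions show up as rotations between vectors.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, dim) with dim even; rotate dimension pairs by position-dependent angles.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2 / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```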
Chapter 2 builds a clear understanding of how generative LLMs work—from modeling language probabilities to training massive networks and guiding them via fine-tuning or prompts. It highlights that scaling, prompting, and aligning are all essential facets of modern LLMs. These models are not just statistical tools but foundational systems capable of general reasoning, multi-tasking, and understanding instructions across domains.
Chapter 3: Prompting
Prompting is the method of crafting an input to a large language model (LLM) in order to produce a specific output. Chapter 3 emphasizes that prompting is foundational for interacting with LLMs, as it shapes how models interpret and respond to tasks. The chapter outlines a comprehensive taxonomy of prompt strategies, from basic input formats to advanced techniques like chain-of-thought and self-refinement. It explores prompt engineering as both a practical tool and a scientific field, bridging model capabilities and human intent.
3.1. General Prompt Design
The chapter begins by defining a prompt as any input text that conditions an LLM’s predictions. The goal is to maximize the likelihood of the correct output, Pr(y|x), where x is the prompt and y is the model’s response. Effective prompt design can improve accuracy, contextual relevance, and interpretability.
Here are steps for designing effective prompts:
- Clarify the Task
Clearly state the task using explicit instructions. Ambiguous or vague prompts often produce unreliable results. Specificity helps focus the model’s generative capabilities on the intended task.
- Provide Context or Role-based Instruction
Assign a role to the model for better guidance, such as “You are a computer scientist” or “You are a grammar expert.” This primes the model for the appropriate domain of knowledge and tone.
- Add Demonstrations for In-context Learning
Present example inputs and outputs to teach the model the structure of a task. This is especially useful for few-shot and one-shot learning where patterns matter.
- Use Structured Formats
Organize prompts using labels or code-style formatting. This reduces ambiguity and aligns the input with the model’s training data distribution.
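One way to combine the four guidelines into a single structured prompt is sketched below; the role, task, and demonstration are invented for illustration and are not prescribed by the book.

```python
role = "You are a grammar expert."                        # role-based instruction
task = "Correct the grammar of the sentence below."       # explicit task statement
demo = ("Input: She go to school every days.\n"
        "Output: She goes to school every day.")          # demonstration for in-context learning
query = "Input: Him and me was late to the meeting.\nOutput:"

prompt = "\n\n".join([role, task, demo, query])           # structured, labeled format
```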
3.2. In-context Learning
In-context learning (ICL) is a powerful feature of LLMs where they learn how to perform a task by observing examples in the prompt, without any weight updates or additional training.
Three primary types of ICL are discussed:
- Zero-shot Learning
The model performs a new task with only a task description, no examples. This method tests the generalization capabilities of the LLM.
- One-shot Learning
A single input-output pair is added as an example in the prompt. This serves as a guide for the model to infer task expectations.
- Few-shot Learning
Multiple demonstrations are included. This approach often yields the best performance, especially when examples span the diversity of the task.
These methods highlight the LLM’s emergent ability to adapt quickly using just contextual cues.
3.3. Prompt Engineering Strategies
Prompt engineering refers to empirically crafting prompts to improve model performance. It includes several strategies:
- Guide Reasoning with Instructions
Prompts like “Let’s think step-by-step” activate reasoning capabilities. This simple heuristic has been shown to dramatically improve performance on reasoning-heavy tasks.
- Include External Information
By integrating reference documents or retrieved context, prompts can steer models to focus on specific knowledge, as in retrieval-augmented generation (RAG).
- Refine Prompt Format
Experimenting with phrasing, order, and structure can yield better results. Small changes like adjusting the sentence order or tone may lead to improved accuracy or alignment.
- Simplify When Needed
Prompts can be pruned to remove unnecessary information. This reduces cognitive load for the model and improves clarity.
3.4. Advanced Prompting Methods
The chapter then introduces powerful advanced techniques designed to enhance task-solving capabilities.
- Chain of Thought (CoT) Prompting
This technique prompts models to generate intermediate reasoning steps before reaching a conclusion. CoT improves performance on arithmetic, logic, and multi-step reasoning tasks. It mirrors human problem-solving by decomposing a task into smaller parts.
- Problem Decomposition
Users can manually structure prompts to break a task into segments. For instance, writing an article can be scaffolded with a pre-defined outline. The model then fills in content for each section, improving coherence.
- Self-refinement
After producing an initial answer, the model is prompted again to evaluate and refine its own response. This encourages self-correction and enhances answer quality.
- Ensembling via Multiple Prompts
Generating multiple answers using different prompts or random seeds allows users to select or aggregate the best response. This technique reduces the variance in LLM outputs.
- Retrieval-augmented Generation (RAG)
Prompts include externally retrieved information, guiding models to base responses on provided sources. This is useful for fact-intensive applications like QA and summarization.
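Here is an invented example of a chain-of-thought prompt: the demonstration includes its intermediate reasoning, which encourages the model to reason step by step before stating the final answer to the new question.

```python
cot_prompt = (
    "Q: A shop sells pens in packs of 12. If Maria buys 3 packs and gives away "
    "7 pens, how many does she have left?\n"
    "A: Let's think step by step. 3 packs contain 3 * 12 = 36 pens. "
    "After giving away 7, she has 36 - 7 = 29 pens. The answer is 29.\n"
    "Q: A train has 8 carriages with 45 seats each. If 290 seats are taken, "
    "how many are free?\n"
    "A: Let's think step by step."
)
# The model is expected to continue with the reasoning and the final answer.
```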
3.5. Learning to Prompt
The final section explores methods to automatically learn or optimize prompts, rather than hand-crafting them.
- Prompt Optimization
Techniques like reinforcement learning or gradient-based optimization can be used to discover prompts that yield better performance.
- Soft Prompting
Instead of discrete text, continuous vectors are used as prompts. These soft prompts can be learned using backpropagation and often outperform manually written ones.
- Prompt Compression
When prompt length becomes a constraint, methods for reducing the number of tokens without sacrificing performance are introduced. These include removing redundancies or using more abstract representations.
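Soft prompting can be made concrete with a short PyTorch sketch, offered here as an assumption-laden illustration rather than the book's recipe: a small matrix of learnable "virtual token" embeddings is prepended to the input embeddings, and only these vectors are trained while the LLM's weights stay frozen.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, num_virtual_tokens, hidden_size):
        super().__init__()
        # Learnable continuous "prompt" vectors, trained by backpropagation.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds):                   # (batch, seq, hidden)
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)  # prepend soft prompt
```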
Chapter 3 reveals that prompting is both an art and a science. It empowers users to unlock the full potential of LLMs without retraining or model modification. From designing basic prompts to leveraging advanced reasoning scaffolds, prompting transforms how models engage with tasks. As models grow more capable, prompt engineering continues to be a key lever in deploying safe, useful, and effective AI systems.
Chapter 4: Alignment
The concept of alignment in natural language processing (NLP) has evolved from traditional tasks like sentence alignment in machine translation to the broader objective of ensuring that the outputs of large language models (LLMs) are consistent with human values, goals, and expectations. Chapter 4 explores alignment not merely as an accuracy challenge, but as a multifaceted problem that encompasses safety, ethics, interpretability, and user intent. Simply pre-training LLMs is insufficient to guarantee they behave responsibly or helpfully; therefore, alignment remains a crucial post-training step.
4.1. Understanding LLM Alignment
The chapter begins by identifying three primary methods for aligning LLMs:
- Supervised Fine-tuning (SFT)
Fine-tuning involves training pre-trained LLMs on instruction-response pairs. Unlike general pre-training, which focuses on predicting the next token, SFT optimizes for generating appropriate responses based on specific inputs. This method teaches LLMs to follow instructions accurately and is highly effective when tasks are clearly defined.
- Reinforcement Learning from Human Feedback (RLHF)
RLHF uses a reward model trained on human preference data to refine LLM behavior. Human evaluators compare model outputs to express preferences, and these comparisons train the reward model. The LLM is then optimized to maximize the reward signal, encouraging responses that align more closely with nuanced human values.
- Inference-time Alignment
Rather than modifying the model’s parameters, inference-time alignment methods rerank or select the most aligned output from several candidates. Techniques like Best-of-N (BoN) sampling fall under this approach, offering a lightweight alternative when fine-tuning is too resource-intensive.
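Best-of-N sampling is simple enough to sketch directly. In the snippet below, `generate` and `reward_model` are assumed interfaces standing in for a text generator and a scoring model; they are not specific library calls.

```python
def best_of_n(generate, reward_model, prompt, n=8):
    candidates = [generate(prompt) for _ in range(n)]         # sample N completions
    scores = [reward_model(prompt, c) for c in candidates]    # score each for alignment/quality
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]                                   # return the highest-scoring response
```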
4.2. Instruction Alignment
Instruction alignment, a major theme of the chapter, involves guiding LLMs to follow user instructions explicitly. This requires the model to not only understand task descriptions but also generate responses that satisfy user expectations.
To achieve instruction alignment, the following steps are typically followed:
- Create a Dataset of Instruction-response Pairs
Each pair includes a clear instruction and the desired model output. This dataset forms the basis of supervised fine-tuning.
- Fine-tune the Model Using This Data
The model learns to map instructions to expected outputs. Unlike pre-training, where the objective is predicting the next token, this method prioritizes instruction-following accuracy.
- Optimize with Task-specific Objectives
During fine-tuning, only the relevant portion of the output sequence (usually following the instruction) is considered in the loss function, making the model more task-aware.
Instruction alignment significantly improves the model’s usability in real-world applications, especially when users need consistent and reliable responses to diverse queries.
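The loss masking described in the third step can be sketched in PyTorch as follows: token positions belonging to the instruction are labeled -100, which the cross-entropy loss ignores, so only the response tokens contribute to the objective. The function and its arguments are illustrative.

```python
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, token_ids, prompt_len):
    labels = token_ids.clone()
    labels[:, :prompt_len] = -100                      # ignore the instruction portion
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predict token t+1 from position t
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```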
4.3. Human Preference Alignment: Reinforcement Learning from Human Feedback (RLHF)
RLHF addresses the limitations of SFT, particularly in cases where human values are hard to encode into simple labels. The method treats model fine-tuning as a reinforcement learning task, where outputs are guided by a reward model.
Steps involved in RLHF include:
- Generate Multiple Outputs for Each Input
The LLM produces several candidate completions for a given instruction.
- Train a Reward Model from Human Preferences
Annotators rank the outputs, and these preferences are used to train a model that scores each response.
- Optimize the LLM Using the Reward Model
The model is trained using policy gradient methods, such as Advantage Actor-Critic (A2C), to maximize the reward signal associated with high-quality outputs.
The result is a model that not only follows instructions but also generates contextually appropriate and ethically sound content.
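The reward-model training step is often driven by a pairwise preference loss of the Bradley-Terry kind mentioned in the next section. A minimal PyTorch sketch: for each human comparison, the loss pushes the score of the preferred (chosen) response above the score of the rejected one; `reward_model` is an assumed interface returning a scalar score.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)        # scalar score for the preferred output
    r_rejected = reward_model(prompt, rejected)    # scalar score for the other output
    return -F.logsigmoid(r_chosen - r_rejected)    # maximize the score margin
```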
4.4. Improvements in Human Preference Alignment
This section introduces refinements to reward modeling and alignment training:
- Better Reward Modeling
Techniques such as the Bradley-Terry model are used to train reward functions that can effectively rank output quality. These models are sensitive to subtle differences between alternatives.
- Direct Preference Optimization
Instead of converting preferences into a reward model, this method directly uses pairwise preference data to optimize LLM outputs, simplifying the alignment process.
- Automatic Data Generation
Leveraging AI systems to generate preference data can enhance scalability and objectivity, especially in well-defined tasks.
- Step-by-step Alignment
Rather than only supervising final answers, this approach introduces supervision for intermediate reasoning steps. This is particularly useful in complex reasoning or multi-step problem-solving tasks.
- Inference-time Reranking
By scoring multiple candidate outputs at inference using the reward model, systems can prioritize the most aligned response without retraining the model.
4.5. Summary and Significance
Chapter 4 emphasizes that alignment is essential for ensuring LLMs behave safely, ethically, and helpfully. It explains why pre-training alone cannot fully address the challenges of human value alignment and presents a sequence of alignment strategies: from supervised fine-tuning to RLHF, and finally to reranking during inference.
The chapter concludes by reinforcing that alignment is not just a technical issue, but also a deeply human one—requiring the integration of social, ethical, and cultural considerations. The progress in this field is ongoing and crucial for the responsible deployment of LLMs in society.