MongoDB Aggregation Pipeline: From Basics to Behind the Scenes

In this blog, you are going to understand the internals of the MongoDB aggregation pipeline. Most of us, when we first face an aggregation pipeline, get confused about why there are $ signs inside the [] brackets. The [] brackets hold the stages, and each stage is written with {} braces. Each stage uses an operator, prefixed with $, to transform the data, e.g. $group, $project, $addFields, $lookup.
The aggregation pipeline looks like:
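A minimal sketch of such a pipeline, written here as a Python list of stage documents. The field names (status, customerId, price, quantity) are assumptions made for illustration and do not come from a real schema.

```python
# A pipeline is an ordered list ([]) of stages; each stage is a document ({})
# with exactly one key, the stage operator, which starts with "$".
pipeline = [
    {"$match": {"status": "paid"}},  # keep only paid orders
    {"$addFields": {"totalPrice": {"$multiply": ["$price", "$quantity"]}}},
    {"$group": {"_id": "$customerId", "total": {"$sum": "$totalPrice"}}},
    {"$sort": {"total": -1}},  # largest total first
]

# Collect the operator name of each stage, in order.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
```

With a driver such as PyMongo, a list like this would be passed to collection.aggregate(pipeline).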
Behind this aggregation pipeline there is a mini query engine that handles each step of the aggregation process. Let's walk through those steps; I have also listed the technical terms for each one.
Step 1: Parsing the Pipeline
Parsing, Logical Plan, AST(Abstract Syntax Tree)
First, MongoDB reads the pipeline and turns it into a structured sequence of steps. This lets it analyze and execute the query more efficiently.
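The parsing idea can be sketched in a few lines. This is a toy illustration, not MongoDB's parser: ParsedStage and parse_pipeline are hypothetical names, and the only rule checked is that each stage is a document with a single $-prefixed key.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ParsedStage:
    name: str  # stage operator, e.g. "$match"
    spec: Any  # the stage's argument document


def parse_pipeline(pipeline: list) -> list:
    """Validate the raw pipeline and turn it into a list of typed stage nodes."""
    parsed = []
    for position, stage in enumerate(pipeline):
        if not isinstance(stage, dict) or len(stage) != 1:
            raise ValueError(f"stage {position} must be a document with exactly one field")
        (name, spec), = stage.items()
        if not name.startswith("$"):
            raise ValueError(f"stage {position}: {name!r} is not a stage operator")
        parsed.append(ParsedStage(name, spec))
    return parsed


plan = parse_pipeline([{"$match": {"status": "paid"}}, {"$sort": {"total": -1}}])
print([stage.name for stage in plan])
```

Once the pipeline is held as structured nodes rather than raw text, later steps can inspect and rearrange it.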
Step 2: Query Optimizer
Predicate Pushdown, Stage Re-ordering
The optimizer reorders the stages so the pipeline executes faster, without changing the result. For example, a filter is moved as early as possible so that later stages process fewer documents.
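One such rewrite is moving a $match in front of a $sort that directly precedes it, so fewer documents have to be sorted. The sketch below applies only that single rule; the function name push_match_before_sort is hypothetical and the real optimizer has many more rules.

```python
def push_match_before_sort(pipeline):
    """Return a new pipeline where a $match directly after a $sort is moved in front of it."""
    stages = list(pipeline)
    changed = True
    while changed:
        changed = False
        for i in range(len(stages) - 1):
            if "$sort" in stages[i] and "$match" in stages[i + 1]:
                # Filtering first does not change the result, only the work done.
                stages[i], stages[i + 1] = stages[i + 1], stages[i]
                changed = True
    return stages


original = [{"$sort": {"amount": -1}}, {"$match": {"status": "paid"}}]
optimized = push_match_before_sort(original)
print(optimized)
```

The swap is safe here because filtering and sorting commute: the same documents come out in the same order either way.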
Step 3: Query Planner
IXSCAN, COLLSCAN, B-Tree Index
The planner decides how to fetch the data. IXSCAN is an index-based scan, whereas COLLSCAN scans the entire collection. The index used in an IXSCAN can be on any field, not just _id, and it can also be a compound index over several fields, such as { _id: 1, name: 1 }.
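A heavily simplified sketch of that decision: use an index when the filter touches the index's first key field, otherwise scan the whole collection. The function choose_access_path and the index layout are assumptions for illustration; the real planner weighs several candidate plans.

```python
def choose_access_path(filter_doc, indexes):
    """Pick IXSCAN when some index starts with a filtered field, else COLLSCAN.

    indexes maps an index name to its ordered list of key fields.
    """
    for name, key_fields in indexes.items():
        if key_fields[0] in filter_doc:
            return ("IXSCAN", name)
    return ("COLLSCAN", None)


indexes = {"_id_": ["_id"], "customerId_1_name_1": ["customerId", "name"]}
by_customer = choose_access_path({"customerId": 1}, indexes)
by_amount = choose_access_path({"amount": 10000}, indexes)
print(by_customer, by_amount)
```

In a real deployment, the chosen access path shows up as IXSCAN or COLLSCAN in the output of explain().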
Step 4: Physical Plan
Execution Plan, Operators
This step converts the optimized plan into a physical execution plan, where MongoDB decides how each stage will actually run using specific operators and algorithms.
Step 5: Execution Engine
Slot Based Execution Engine(SBE)
SBE keeps the values it needs in small memory boxes (slots) instead of moving full documents between stages, which makes execution faster and more efficient.
{
"customerId": 1,
"amount": 10000,
"total": 1000
}
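The slot idea can be illustrated with a toy sketch. This is not MongoDB's actual SBE: the slot layout and the names needed_fields, load_slots and amount_at_least are assumptions chosen to show values being read from fixed positions instead of from a whole document.

```python
# Compile step: decide once which fields the plan needs and give each a slot index.
needed_fields = ["customerId", "amount"]
slot_of = {field: index for index, field in enumerate(needed_fields)}


def load_slots(document):
    """Copy only the needed field values into a flat slot array."""
    return [document.get(field) for field in needed_fields]


def amount_at_least(slots, minimum):
    """An operator that reads its input from a slot, never from the document."""
    return slots[slot_of["amount"]] >= minimum


doc = {"customerId": 1, "amount": 10000, "total": 1000}
slots = load_slots(doc)
print(slots, amount_at_least(slots, 5000))
```

The unused "total" field never leaves the document, which is the point: operators only touch the values they need.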
Step 6: Iterator Model
Volcano Model, next()
Each stage processes data by pulling one document at a time from the previous stage using the next() function.
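A minimal sketch of this pull-based model, assuming three hypothetical stage classes (Scan, Match, Project) that each expose a next() method and return None when the input is exhausted:

```python
class Scan:
    """Leaf stage: hands out documents from an in-memory list one at a time."""

    def __init__(self, documents):
        self._iterator = iter(documents)

    def next(self):
        return next(self._iterator, None)  # None signals end of stream


class Match:
    """Pulls from its child until a document satisfies the predicate."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while (doc := self.child.next()) is not None:
            if self.predicate(doc):
                return doc
        return None


class Project:
    """Pulls one document from its child and keeps only the listed fields."""

    def __init__(self, child, fields):
        self.child, self.fields = child, fields

    def next(self):
        doc = self.child.next()
        if doc is None:
            return None
        return {field: doc[field] for field in self.fields if field in doc}


docs = [{"customerId": 1, "amount": 10000}, {"customerId": 2, "amount": 50}]
root = Project(Match(Scan(docs), lambda d: d["amount"] > 100), ["customerId"])

results = []
while (row := root.next()) is not None:
    results.append(row)
print(results)
```

The caller only ever talks to the last stage; each call to next() ripples down the chain until a document comes back up.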
Step 7: Streaming vs Blocking
Streaming means processing one document at a time as it arrives. Blocking means a stage can produce output only after it has received all of its input. Stages like $match and $project are streaming, while $group and $sort are blocking.
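The difference is easy to observe with Python generators. In this sketch (the helper names are hypothetical), the source records every document it hands out, so we can count how many inputs each kind of stage consumes before it emits its first output.

```python
consumed = []


def source(documents):
    for doc in documents:
        consumed.append(doc["n"])  # record every document pulled from the source
        yield doc


def match_stage(rows, predicate):
    """Streaming: emits each qualifying row as soon as it arrives."""
    for row in rows:
        if predicate(row):
            yield row


def sort_stage(rows, key):
    """Blocking: must read every input row before emitting the first one."""
    yield from sorted(rows, key=key)


docs = [{"n": 3}, {"n": 1}, {"n": 2}]

first_streamed = next(match_stage(source(docs), lambda r: r["n"] > 0))
consumed_by_match = len(consumed)

consumed.clear()
first_sorted = next(sort_stage(source(docs), key=lambda r: r["n"]))
consumed_by_sort = len(consumed)

print(consumed_by_match, consumed_by_sort)  # prints: 1 3
```

The match stage needed one input to produce one output, while the sort stage had to drain the whole source first.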
Step 8: Stage Algorithms
Hash Aggregation, External Merge Sort, Nested Loop Join
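This step defines which algorithm each stage uses to perform its operation.
e.g. $group uses a hash table; $sort splits the data into chunks, sorts each chunk, and merges the sorted chunks into the result; $lookup can use a nested loop join.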
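Two of these algorithms can be sketched with the standard library. The functions below are simplified illustrations with hypothetical names: the grouping keeps one hash-table entry per key, and the sort keeps its chunks in memory, whereas a real external merge sort would write them to disk.

```python
import heapq
from collections import defaultdict


def hash_group_sum(rows, key_field, value_field):
    """Hash aggregation: one hash-table entry per group key, updated row by row."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_field]] += row[value_field]
    return dict(totals)


def chunked_merge_sort(values, chunk_size):
    """Sort fixed-size chunks independently, then merge the sorted runs."""
    runs = [sorted(values[i:i + chunk_size]) for i in range(0, len(values), chunk_size)]
    return list(heapq.merge(*runs))


orders = [
    {"customerId": 1, "amount": 100},
    {"customerId": 2, "amount": 50},
    {"customerId": 1, "amount": 25},
]
grouped = hash_group_sum(orders, "customerId", "amount")
ordered = chunked_merge_sort([5, 3, 9, 1, 7, 2], chunk_size=2)
print(grouped, ordered)
```

Both functions read their input once, which is why they fit naturally behind a pull-based stage.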
Step 9: Expression Evaluation
Expression Tree, Evaluation Engine
MongoDB uses an expression evaluation engine to process expressions (like $sum or $multiply): it converts them into expression trees and evaluates them for each document.
What happens internally for an expression such as totalPrice = price × quantity:
Doc1 → price = 100, quantity = 2 → totalPrice = 200
Doc2 → price = 50, quantity = 3 → totalPrice = 150
For each document, MongoDB evaluates expressions like $multiply by reading the field values, applying the operation, and storing the result.
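A small recursive evaluator makes the expression-tree idea concrete. It is a sketch that supports only field paths, literals, $multiply and $add; the evaluate function is hypothetical and is not how MongoDB implements its engine.

```python
import math


def evaluate(expression, document):
    """Recursively evaluate a small expression tree against one document."""
    if isinstance(expression, str) and expression.startswith("$"):
        return document[expression[1:]]  # field path such as "$price"
    if isinstance(expression, dict):
        (operator, operands), = expression.items()
        values = [evaluate(operand, document) for operand in operands]
        if operator == "$multiply":
            return math.prod(values)
        if operator == "$add":
            return sum(values)
        raise ValueError(f"unsupported operator: {operator}")
    return expression  # literal constant


total_price = {"$multiply": ["$price", "$quantity"]}
docs = [{"price": 100, "quantity": 2}, {"price": 50, "quantity": 3}]
totals = [evaluate(total_price, doc) for doc in docs]
print(totals)  # prints: [200, 150]
```

Because the evaluator recurses into operands, nested expressions such as an $add inside a $multiply work without extra code.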
Step 10: Memory Management
Memory Threshold, Disk Spill, External Processing
MongoDB processes small data sets in RAM, but when a stage needs more memory than the limit (100 MB per stage by default), it temporarily spills data to disk, provided disk use is allowed (the allowDiskUse option).
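The spill behaviour can be sketched with a buffer that writes itself to a temporary file whenever it fills up. This is a toy model under stated assumptions: the limit is counted in rows rather than bytes, and buffer_with_spill is a hypothetical helper.

```python
import json
import tempfile


def buffer_with_spill(rows, max_rows_in_memory):
    """Buffer rows in memory; write the buffer to a temp file whenever it fills up.

    Returns (rows still in memory, list of open spill files).
    """
    in_memory, spill_files = [], []
    for row in rows:
        in_memory.append(row)
        if len(in_memory) >= max_rows_in_memory:
            spill = tempfile.TemporaryFile(mode="w+")
            for buffered in in_memory:
                spill.write(json.dumps(buffered) + "\n")
            spill.seek(0)  # rewind so the file can be read back later
            spill_files.append(spill)
            in_memory = []
    return in_memory, spill_files


rows = [{"n": n} for n in range(5)]
remaining, spills = buffer_with_spill(rows, max_rows_in_memory=2)
spilled = [json.loads(line) for f in spills for line in f]
for f in spills:
    f.close()
print(len(spills), len(spilled), len(remaining))
```

Memory use stays bounded by the buffer size no matter how many rows arrive; the cost is the extra disk I/O.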
Step 11: Final output (Cursor)
Cursor, Batching, Lazy Fetching
Instead of sending all results at once, MongoDB returns them gradually, in batches, through a cursor. This saves memory, gives a faster first response, and scales to large datasets.
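Batching and lazy fetching can be sketched with a generator. The helper names batches and lazy_results are assumptions; the produced list only exists to show that results beyond the first batch are not computed until they are requested.

```python
from itertools import islice


def batches(results, batch_size):
    """Yield lists of at most batch_size results, computed only when requested."""
    iterator = iter(results)
    while batch := list(islice(iterator, batch_size)):
        yield batch


produced = []


def lazy_results():
    for n in range(5):
        produced.append(n)  # record when a result is actually computed
        yield {"n": n}


cursor = batches(lazy_results(), batch_size=2)
first_batch = next(cursor)
produced_after_first_batch = list(produced)
print(first_batch, produced_after_first_batch)
```

Only two results exist after the first batch is fetched; the remaining three are computed when the client asks for more.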
The aggregation pipeline is not just a sequence of operations. It is a mini query engine that processes data efficiently using different strategies and algorithms.
Understanding this makes you a better developer when working with MongoDB.


