Thursday, May 21, 2026

Exadata architecture deep dive: how the database and storage cells work together

The architecture post that makes everything else make sense. Covers how Database Nodes communicate with Storage Cells over RDMA, how I/O offloading works at the cell level, the role of iDB (Intelligent Database Protocol), Smart Scan internals, and why Exadata can process data faster than traditional storage. Includes the full I/O path from SQL query to result.

Exadata Architecture Deep Dive — How Database and Storage Cells Work Together

Exadata — Basics to Pro Series 1. What Is Exadata · 2. Hardware Components · 3. Architecture Deep Dive · 4. Smart Scan, Storage Indexes, HCC · 5. Monitoring · 6. Performance Tuning · 7. Administration · 8. Patching · 9. EBS on Exadata · 10. OCI Exadata

Articles 1 and 2 gave you the what and the what-is-inside. This article gives you the how. How does Oracle Database on the database server actually talk to Exadata System Software on the storage cell? What happens inside the network when a query runs? What does the storage cell actually do with a SQL predicate? And why does any of this make a query run faster?

This is the article that makes everything else in the series click into place. Once you understand the architecture (the three layers, the iDB protocol, the Smart Scan flow, the RDMA path), you will understand every performance metric, every monitoring command, and every tuning decision at a deeper level.

This article assumes you have read Articles 1 and 2. If you have not, start there. The component names and concepts used here — database server, storage cell, cellsrv, RoCE, XRMEM — are introduced in those articles.

The Three-Layer Architecture

Exadata architecture divides cleanly into three logical layers. Every I/O operation, every Smart Scan, every RDMA read passes through these layers in sequence. Understanding the layers is the foundation of understanding everything else.

Layer | Physical Component | Key Software | Role
Compute Layer | Database Servers (DB Nodes) | Oracle Database, Grid Infrastructure, LIBCELL | SQL parsing, query optimisation, transaction management, result assembly
Network Layer | RoCE switches, NICs on every server | RoCE fabric, iDB protocol, RDMA | High-speed low-latency communication between compute and storage
Storage Layer | Storage Cells | Exadata System Software (cellsrv, MS, RS), CELLOFLSRVn | Data storage, Smart Scan, predicate filtering, column projection, I/O offloading

The critical difference between Exadata and traditional storage is in the storage layer. A traditional SAN has no software intelligence — it reads blocks and returns them. The Exadata storage layer runs real database-aware software and can process SQL logic close to the data before anything travels across the network.

The Bridge Between Database and Storage — LIBCELL

On each database server, Oracle Database does not communicate with storage cells directly through the OS I/O stack. Instead, it uses a special Oracle library called LIBCELL — linked directly into the Oracle Database kernel.

LIBCELL is the client-side component of the Exadata communication stack. It knows how to package SQL predicates, column lists, and I/O requests into iDB messages, send them to storage cells over the RoCE network, and receive the filtered results back. When Oracle Database performs a large table scan eligible for Smart Scan, it calls LIBCELL instead of the standard OS file I/O functions.

This design means the Exadata data path bypasses the operating system I/O stack. There is no file system layer, no OS I/O scheduler, and no block device driver in the critical path between Oracle Database and the storage cells. LIBCELL talks directly to the network card, and the network card uses RDMA to communicate with storage cell memory.

This is why Exadata requires Oracle Database to run on the database servers — not just any database engine. LIBCELL is an Oracle-proprietary library integrated into the Oracle Database kernel. No other database software can take advantage of Exadata's Smart Scan or RDMA capabilities. Oracle Database is the only software that knows how to use iDB.

The iDB Protocol — How the Database Talks to the Cell

iDB stands for Intelligent Database Protocol. It is Oracle's purpose-built, proprietary protocol for communication between database servers and storage cells over the RoCE network. iDB is what transforms Exadata from fast hardware into an intelligent database machine.

What makes iDB different from standard storage protocols

Traditional storage protocols — iSCSI, Fibre Channel, NFS — operate at the block level. They say: give me blocks 1000 to 2000 from disk. They carry no semantic information about what the data means or why it is being requested. The storage array has no idea the request is for a database query.

iDB operates at the database intelligence level. An iDB message says: I need to scan this extent of table T. The WHERE clause is amount > 200. I only need columns customer_name and order_date. Here are the bloom filter values for the join. Skip any data region where the Storage Index shows that the maximum value of amount is 200 or less, because no row in that region can match.

The storage cell receives this rich iDB message, understands it completely, and acts on it — filtering rows, projecting only the needed columns, applying bloom filters, and returning a result set rather than raw blocks.

What iDB carries in each message

  • The type of I/O operation — Smart Scan, conventional I/O, RMAN backup offload, or fast file creation
  • The SQL WHERE clause predicates to be evaluated at the cell — row filtering instructions
  • The column list — column projection so the cell returns only needed columns, not entire rows
  • Bloom filter vectors — for join offloading, so the cell can pre-filter rows that will not survive a join
  • Storage Index hints — the optimizer's knowledge about data distribution to assist Storage Index pruning
  • The Oracle Database version — because the CELLOFLSRVn sub-process is version-specific
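To make the contrast with a block protocol concrete, here is a minimal Python sketch of the two kinds of request. The real iDB wire format is proprietary and not described in this article, so the class names, field names, and the "19c" version label below are purely illustrative assumptions, not Oracle's actual message layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BlockReadRequest:
    """All a block protocol conveys: a range of block addresses."""
    first_block: int
    last_block: int


@dataclass
class SmartIORequest:
    """Hypothetical stand-in for an iDB message (field names are illustrative)."""
    operation: str                       # e.g. "smart_scan" or "conventional_io"
    db_version: str                      # lets the cell choose a matching offload server
    predicates: List[str] = field(default_factory=list)   # WHERE clause conditions
    projection: List[str] = field(default_factory=list)   # columns to return
    bloom_filter: Optional[bytes] = None                   # join pre-filter, if any


def can_filter_at_storage(request) -> bool:
    """Storage can only filter when the request says what the query wants."""
    return isinstance(request, SmartIORequest) and bool(
        request.predicates or request.projection or request.bloom_filter
    )


block_req = BlockReadRequest(first_block=1000, last_block=2000)
idb_req = SmartIORequest(
    operation="smart_scan",
    db_version="19c",
    predicates=["amount > 200"],
    projection=["customer_name", "order_date"],
)

print(can_filter_at_storage(block_req))  # False
print(can_filter_at_storage(idb_req))    # True
```

The block request carries nothing the storage side could act on, which is the whole point of the comparison: filtering at the cell is only possible because the request describes the query.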

Oracle Database, running on the Exadata Database Servers, leverages the purpose-built iDB protocol and RDMA over Converged Ethernet to communicate with Exadata Storage. iDB is used to direct Smart I/O operations on the Storage Servers — including Smart Scan (SQL Offload), Fast File Initialization, and RMAN Incremental Backup Offload.

CELLOFLSRVn — the version-specific offload server

Inside each storage cell, a sub-process called CELLOFLSRVn (Cell Offload Server) handles the actual SQL offload processing. The n in the name indicates the Oracle Database version — each version of Oracle Database has its own associated CELLOFLSRVn process on the storage cell. This design allows multiple different Oracle Database versions to run concurrently on the Exadata system, each with their own compatible offload processing code on the cell.
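The idea of one offload server per database version can be pictured as a simple dispatch table. The sketch below is a toy model only: the registry class, the version labels, and the fallback of returning unfiltered rows when no handler matches are all assumptions made for illustration, not documented cell behaviour.

```python
from typing import Callable, Dict, List

Row = dict
OffloadHandler = Callable[[List[Row]], List[Row]]


class OffloadServerRegistry:
    """Toy registry: one offload handler per database version string."""

    def __init__(self) -> None:
        self._handlers: Dict[str, OffloadHandler] = {}

    def register(self, db_version: str, handler: OffloadHandler) -> None:
        self._handlers[db_version] = handler

    def process(self, db_version: str, rows: List[Row]) -> List[Row]:
        handler = self._handlers.get(db_version)
        if handler is None:
            # Toy behaviour only: with no matching handler there is no
            # offload, so the rows go back unfiltered.
            return rows
        return handler(rows)


def filter_amount_over_200(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["amount"] > 200]


registry = OffloadServerRegistry()
registry.register("19c", filter_amount_over_200)   # hypothetical version label

rows = [{"amount": 150}, {"amount": 320}]
print(registry.process("19c", rows))    # [{'amount': 320}]
print(registry.process("11g", rows))    # both rows, no handler registered
```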

The I/O Path — Traditional Storage vs Exadata

The best way to understand the Exadata architecture is to trace the complete I/O path for the same query on traditional storage and on Exadata, side by side.

The query: SELECT customer_name FROM orders WHERE amount > 200 — the orders table is 1 TB, only 1 GB of rows match the predicate.

Traditional storage — the conventional I/O path

Traditional Storage — Full I/O Path for a 1 TB Table Scan
  1. Oracle optimizer generates execution plan [DB Node]. The optimizer determines a full table scan is needed, and the database issues read requests via the OS I/O stack.
  2. OS sends block read requests to the SAN [Network]. The storage array receives block-level read requests. It has no knowledge of the SQL query or predicates.
  3. SAN reads all 1 TB from disk and returns it [Network]. The storage array reads every block of the table from disk and sends all 1 TB back across the storage network. The SAN cannot filter because it does not know what the query wants.
  4. Oracle loads 1 TB into the buffer cache [DB Node]. All 1 TB of raw blocks is loaded into database server memory. The buffer cache must accommodate it or page data in and out.
  5. Oracle applies the WHERE clause and column filter [DB Node]. Oracle scans all 1 TB, evaluates WHERE amount > 200 on every row, and projects only the customer_name column. About 99.9% of the data is discarded after being read and transferred.
  6. 1 GB result returned to user [DB Node]. After moving 1 TB across the network and loading it into memory, only 1 GB of rows is ultimately returned. Wasted I/O: about 99.9%.

Exadata — the Smart Scan I/O path

Exadata Smart Scan — Full I/O Path for the Same 1 TB Table Scan
  1. Oracle optimizer generates execution plan with Direct Path Read [DB Node]. The optimizer determines a full table scan is needed and that it qualifies for Smart Scan via Direct Path Read. LIBCELL is invoked instead of standard OS I/O.
  2. LIBCELL builds and sends iDB messages to all cells in parallel [iDB / RoCE]. LIBCELL packages the WHERE clause (amount > 200) and column list (customer_name) into iDB messages. All storage cells receive their portion of the scan simultaneously, giving full parallelism across all cells.
  3. Storage Index check: cells pre-eliminate I/O [Storage Cell]. Before reading any data from disk, each storage cell checks its Storage Index against the predicate it has received. Any 1 MB storage region whose maximum amount is ≤ 200 cannot contain a matching row and is skipped entirely. Zero I/O for eliminated regions.
  4. cellsrv reads from disk, only the regions remaining after the Storage Index check [Storage Cell]. cellsrv reads only the storage regions that the Storage Index did not eliminate. For each block read, CELLOFLSRVn applies predicate filtering (WHERE amount > 200) and column projection (only customer_name) directly on the storage cell's CPUs.
  5. Cells return filtered result rows, not raw blocks [iDB / RoCE]. Each cell sends back only the rows that matched the predicate, with only the requested columns. The result is not block images but a row set. Only 1 GB of matching data travels the network instead of 1 TB.
  6. DB node assembles result with no buffer cache involvement [DB Node]. Because Smart Scan results are row sets, not block images, they bypass the buffer cache entirely. The DB node assembles the results from all cells, performs any final aggregations, and returns to the user.

The outcome: traditional storage moved 1 TB across the network and loaded it into memory to return 1 GB of results. Exadata moved approximately 1 GB across the network to return 1 GB of results. The I/O saving is about 99.9%. The CPU saving on the database server is also substantial, because the predicate and column filtering work happened on the storage cell CPUs, not the database server CPUs. Those freed database server CPUs are now available for other concurrent workloads.
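The saving in this example is easy to check by hand. The short Python calculation below uses the article's own figures (a 1 TB table, 1 GB of matching data) and assumes binary units. The helper name is made up for illustration. The result comes out at about 99.9%, so quoting a round 99% is a conservative way of stating it.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def network_saving(table_bytes: int, returned_bytes: int) -> float:
    """Fraction of the table that never has to cross the network."""
    return 1 - returned_bytes / table_bytes


traditional_moved = 1 * TB   # every block of the table is shipped
smart_scan_moved = 1 * GB    # only matching rows are shipped

saving = network_saving(traditional_moved, smart_scan_moved)
print(f"Data moved, traditional: {traditional_moved / GB:.0f} GB")  # 1024 GB
print(f"Data moved, Smart Scan:  {smart_scan_moved / GB:.0f} GB")   # 1 GB
print(f"Network saving: {saving:.2%}")                              # 99.90%
```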

Smart Scan Internals — What the Cell Actually Does

Smart Scan is the name for Exadata's SQL offload capability. When a query qualifies for Smart Scan, the storage cell performs three primary functions that eliminate the vast majority of data movement.

1. Predicate Filtering

Rather than transporting all rows to the database server for predicate evaluation, Smart Scan predicate filtering ensures that the database server receives only the rows matching the query criteria.

The supported conditional operators include =, !=, <, >, <=, >=, IS NULL, IS NOT NULL, LIKE, BETWEEN, NOT BETWEEN, IN, NOT IN, EXISTS, NOT, AND, and OR. In addition, Exadata Storage Server evaluates most common SQL functions during predicate filtering — functions like UPPER(), LOWER(), TO_DATE(), SUBSTR(), and many others can be evaluated on the cell without the row ever reaching the database server.

2. Column Filtering (Column Projection)

Rather than transporting entire rows to the database server, Smart Scan column filtering ensures that the database server receives only the requested columns. For tables with many columns — or columns containing LOBs — the I/O bandwidth saved by column filtering can be very substantial.

For example, if a table has 50 columns but your SELECT only references 3 of them, the cell returns only those 3 columns per row. The other 47 columns are never transmitted across the network.
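Predicate filtering and column projection together amount to "keep matching rows, then keep requested columns". The Python sketch below is a conceptual model only: the function name and sample rows are invented for illustration, and a real cell works on database blocks rather than Python dictionaries.

```python
def smart_scan(rows, predicate, columns):
    """Return only matching rows, and only the requested columns of each."""
    return [{col: row[col] for col in columns} for row in rows if predicate(row)]


orders = [
    {"order_id": 1, "customer_name": "Asha", "amount": 150, "region": "EMEA"},
    {"order_id": 2, "customer_name": "Bruno", "amount": 320, "region": "APAC"},
    {"order_id": 3, "customer_name": "Chen", "amount": 210, "region": "AMER"},
]

# Equivalent of: SELECT customer_name FROM orders WHERE amount > 200
result = smart_scan(orders, lambda r: r["amount"] > 200, ["customer_name"])
print(result)  # [{'customer_name': 'Bruno'}, {'customer_name': 'Chen'}]
```

Three full rows of four columns go in and two single-column rows come out. On a wide table with a selective predicate, the same two steps are what shrink the data sent over the network.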

3. Join Offloading via Bloom Filters

When the optimizer determines a query involves a join between a large fact table and a smaller dimension table, it can pass a Bloom filter to the storage cell along with the Smart Scan request. The Bloom filter encodes the join key values from the dimension table side. The storage cell applies this filter while scanning the fact table — rows that cannot possibly satisfy the join condition are eliminated at the storage layer, before they reach the network.

This means even the join elimination work — traditionally done at the database server — is pushed down to the storage cell where the data lives. The result set that travels across the network is already pre-filtered for the join.
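A Bloom filter is a compact bit array that can say "definitely not present" or "possibly present" for a key. The sketch below is a generic textbook Bloom filter in Python, not Oracle's implementation. The bit-array size, hash count, and sample data are arbitrary assumptions chosen to keep the example small.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Database server side: build the filter from the dimension table's join keys.
dimension_keys = {10, 20, 30}
bloom = BloomFilter()
for key in dimension_keys:
    bloom.add(key)

# Storage cell side: drop fact rows whose key cannot possibly join.
fact_rows = [{"customer_id": cid, "amount": 100 + cid} for cid in (10, 11, 20, 12, 30, 13)]
survivors = [row for row in fact_rows if bloom.might_contain(row["customer_id"])]

# Database server side again: the exact join removes any false positives.
joined = [row for row in survivors if row["customer_id"] in dimension_keys]
print(len(fact_rows), len(survivors), len(joined))
```

Because a Bloom filter can return false positives, the database server still performs the real join. The filter's job is only to reduce how many rows are sent to it.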

Example — Smart Scan with predicate and column filtering
-- This query qualifies for Smart Scan on Exadata:
SELECT customer_name
FROM   orders
WHERE  amount > 200;

-- What the cell receives via iDB:
--   Predicate:  amount > 200
--   Projection: only customer_name column
--
-- What the cell does:
--   1. Reads blocks from disk (after Storage Index eliminates regions)
--   2. Evaluates amount > 200 on every row in the block
--   3. For matching rows: extracts only customer_name
--   4. Returns the filtered, projected result row set
--
-- What does NOT travel the network:
--   - Rows where amount <= 200
--   - All columns other than customer_name
--   - Any storage region Storage Index eliminated

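-- Verify Smart Scan is active for your session.
-- Run this in the same session after the query; non-zero values for
-- statistics such as 'cell physical IO interconnect bytes returned by smart scan'
-- indicate that offload took place.
SELECT sn.name, ms.value
FROM   v$mystat ms
JOIN   v$statname sn ON sn.statistic# = ms.statistic#
WHERE  sn.name LIKE 'cell%smart%'
ORDER  BY sn.name;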

The RDMA Path — Bypassing the OS for Ultra-Low Latency

Not all Exadata I/O goes through cellsrv and the iDB protocol. For certain workloads — particularly OLTP reads that hit data in the storage cell's XRMEM (Exadata RDMA Memory) — the database server uses RDMA to read directly from storage cell memory without any software involvement on the cell side.

The RDMA read path

  1. The database server's network card issues an RDMA read request for a specific memory address in the storage cell's XRMEM
  2. The storage cell's network card services the request directly from DRAM — no cellsrv involvement, no OS call, no CPU interrupt on the storage cell
  3. The data arrives at the database server's network card and is placed directly into database buffer cache memory — again without OS involvement on the DB node side
  4. The entire round trip — from DB node requesting data to DB node receiving it — completes in approximately 17 microseconds

This is the lowest-latency data access path Exadata supports. It is used automatically for frequently accessed hot data that Exadata System Software has placed in XRMEM. The decision about what to place in XRMEM is made automatically by Exadata System Software based on I/O access patterns — no DBA configuration is required.

The three storage tiers and how data moves between them

Tier | Media | Access Latency | Access Path | Managed By
XRMEM | DDR5 DRAM (1.25 TB per cell) | 17 microseconds | Direct RDMA, no cellsrv | Automatic (Exadata System Software)
Smart Flash Cache | NVMe PCIe flash (6.8 TB per card, four cards per HC cell) | Sub-millisecond | iDB via cellsrv | Automatic (Exadata System Software)
Hard Disk | 22 TB hard disk drives (HC cells) | Milliseconds | iDB via cellsrv | Automatic (ASM manages placement)

Exadata System Software automatically moves data between these three tiers based on access patterns. Hot data migrates up to XRMEM. Warm data stays in Smart Flash Cache. Cold data remains on disk. The DBA does not manage this tiering manually — it is fully automatic and transparent.
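The tiering behaviour can be pictured with a toy cache hierarchy. The Python sketch below is an illustration only: the promotion thresholds (cache in flash after 2 reads, in XRMEM after 4) are invented for the example, and the actual placement policy in Exadata System Software is not described in this article.

```python
from collections import Counter


class TieredCell:
    """Toy storage cell: reads are served from the fastest tier holding the block."""

    def __init__(self, disk_blocks, flash_after: int = 2, xrmem_after: int = 4) -> None:
        self.disk = dict(disk_blocks)   # every block lives on disk
        self.flash = {}                 # warm copies
        self.xrmem = {}                 # hot copies
        self.hits = Counter()
        self.flash_after = flash_after  # hypothetical threshold
        self.xrmem_after = xrmem_after  # hypothetical threshold

    def read(self, block_id):
        self.hits[block_id] += 1
        if block_id in self.xrmem:
            tier, data = "xrmem", self.xrmem[block_id]
        elif block_id in self.flash:
            tier, data = "flash", self.flash[block_id]
        else:
            tier, data = "disk", self.disk[block_id]
        # After serving the read, cache a copy in a faster tier if the block is hot enough.
        if self.hits[block_id] >= self.xrmem_after:
            self.xrmem[block_id] = data
        elif self.hits[block_id] >= self.flash_after:
            self.flash[block_id] = data
        return tier, data


cell = TieredCell({7: b"hot row", 8: b"cold row"})
tiers = [cell.read(7)[0] for _ in range(5)]
print(tiers)            # ['disk', 'disk', 'flash', 'flash', 'xrmem']
print(cell.read(8)[0])  # disk
```

The frequently read block climbs from disk to flash to XRMEM without any caller involvement, while the block read once stays on disk, which is the transparent behaviour the paragraph above describes.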

Inside cellsrv — The Heart of the Storage Cell

cellsrv is the primary daemon running on every Exadata storage cell. It is a multi-threaded program that services all I/O requests from database servers. Every Smart Scan, every conventional I/O request, every RMAN backup offload flows through cellsrv.

What cellsrv does with an incoming iDB request

  1. Receives the iDB message from the database server via LIBCELL over the RoCE network
  2. Unpacks the message — identifies the operation type, predicates, column list, bloom filters
  3. Consults the Storage Index for the relevant data regions — eliminates regions where predicates cannot be satisfied
  4. Issues disk read requests for the remaining regions — reads from Smart Flash Cache if available, otherwise from disk
  5. Invokes CELLOFLSRVn for the Oracle Database version in use — this sub-process applies predicate filtering and column projection to each block read
  6. Assembles the filtered result rows and returns them to the database server via the iDB response message

The three cell daemons and their roles

Daemon | Full Name | Role
cellsrv | Cell Server | Handles all I/O processing: Smart Scan, conventional I/O, RMAN offload. The primary work process on every cell. Heavy CPU usage during Smart Scan is normal.
MS | Management Server | Provides the management interface: handles cellcli commands, monitoring queries, and out-of-band management communication. Does not handle data I/O.
RS | Restart Server | Monitors cellsrv and MS. If either process fails, RS automatically restarts it. RS is the watchdog process: it starts first when the cell boots and is the last to stop.

Check cell daemon status on all cells using dcli
# Check all three daemons on every storage cell simultaneously
dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root \
  "cellcli -e list cell attributes name,cellsrvStatus,msStatus,rsStatus"

# Expected output: one line per cell, with each status shown as running, e.g.
# cell01: cell01   running   running   running

# If cellsrv or MS is not running on a specific cell:
ssh root@cell01
cellcli -e alter cell startup services cellsrv
cellcli -e alter cell startup services ms

What Else Gets Offloaded Beyond Smart Scan

Smart Scan is the most well-known offload operation but it is not the only one. iDB and cellsrv support several additional offload operations that reduce database server I/O and CPU for non-query workloads.

Offload Operation | What It Does | Benefit
Smart Scan (SQL Offload) | Predicate and column filtering for large table and index scans | Reduces I/O by up to 99% for selective analytic queries
RMAN Incremental Backup Offload | The storage cell identifies changed blocks for incremental backups rather than the database server | Backup time and I/O reduced significantly, because the cell does the changed-block detection work
Fast File Initialization | When Oracle creates a new datafile, the cell initialises the space at storage speed rather than database server speed | New datafile creation that takes minutes on standard storage takes seconds on Exadata
Join Offloading (Bloom Filter) | Join pre-filtering using Bloom filter vectors sent from the DB optimizer to the storage cell | Fact table rows that cannot survive a join are eliminated at the cell before network transmission
HCC Decompression | Hybrid Columnar Compressed data is decompressed on the storage cell's CPUs before Smart Scan filtering | Compression and Smart Scan work together: HCC reduces disk I/O, Smart Scan reduces network I/O
Storage Index Elimination | Data regions are eliminated before disk reads begin based on in-memory min/max statistics | Zero disk I/O for eliminated regions, the fastest possible I/O avoidance

How ASM Fits Into the Exadata Architecture

Automatic Storage Management (ASM) runs on the database servers as part of Oracle Grid Infrastructure. ASM manages the disk groups that Exadata storage presents to Oracle Database — it handles striping, mirroring, and disk group management.

In the Exadata architecture, ASM sees the storage cells as disk providers. Each storage cell presents its physical disks or flash to ASM as logical units called grid disks. ASM assembles grid disks from multiple cells into disk groups and stripes data across all cells automatically.

This striping is what enables full parallelism during Smart Scan. When a table is stored across 14 storage cells, a Smart Scan runs simultaneously on all 14 cells — each cell processes its share of the table independently and in parallel. The result sets from all 14 cells are assembled at the database server. The larger the number of storage cells, the more parallelism available for large scans.
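The effect of striping on scan parallelism can be modelled in a few lines. In the Python sketch below, round-robin placement is a deliberate simplification of how ASM distributes extents, threads stand in for independent storage cells, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe(extents, num_cells):
    """Distribute extents across cells round-robin (a simplification of ASM striping)."""
    cells = [[] for _ in range(num_cells)]
    for i, extent in enumerate(extents):
        cells[i % num_cells].append(extent)
    return cells


def scan_cell(cell_extents, threshold):
    """Each cell filters only the extents it holds."""
    return [value for extent in cell_extents for value in extent if value > threshold]


def parallel_scan(extents, num_cells, threshold):
    cells = stripe(extents, num_cells)
    with ThreadPoolExecutor(max_workers=num_cells) as pool:
        partials = list(pool.map(lambda c: scan_cell(c, threshold), cells))
    # The database server merges the per-cell result sets.
    return sorted(value for part in partials for value in part)


extents = [[120, 250], [90, 310], [205, 40], [500, 199], [201, 10], [75, 999]]
print(parallel_scan(extents, num_cells=3, threshold=200))
# [201, 205, 250, 310, 500, 999]
```

Every cell works on its own share of the extents, and the merged result is the same as a serial scan would produce. Adding cells only changes how the work is divided.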

Check ASM disk group composition on Exadata
-- Connect to ASM instance
sqlplus / as sysasm

-- List disk groups and their redundancy
SELECT name, type, state, total_mb, free_mb
FROM   v$asm_diskgroup
ORDER  BY name;

-- List griddisks in a disk group to see which cells contribute
SELECT dg.name AS disk_group,
       d.name  AS disk_name,
       d.path  AS cell_path
FROM   v$asm_disk d, v$asm_diskgroup dg
WHERE  d.group_number = dg.group_number
ORDER  BY dg.name, d.name;

Why Exadata Is Faster Than Traditional Storage — The Five Reasons

The architecture described above delivers performance advantages through five distinct mechanisms. Understanding all five helps you explain Exadata's value clearly to management and architects.

  • I/O elimination via Storage Indexes: data regions are skipped before disk is read. Zero I/O is the fastest I/O. A traditional storage system cannot eliminate I/O this way because it has no knowledge of the data's min/max values.
  • I/O reduction via Smart Scan: only matching rows and requested columns cross the network. A 1 TB scan that returns 1 GB of results moves 1 GB, not 1 TB. Traditional storage always moves the full dataset.
  • Processing parallelism across all cells: every storage cell processes its portion of the scan simultaneously. With 14 cells scanning in parallel, aggregate scan throughput is roughly 14 times that of a single cell. A traditional SAN can spread block reads across devices, but it cannot parallelise the filtering itself.
  • Database server CPU freed for real work: because the storage cells handle predicate filtering, column projection, bloom filter joins, and decompression, the database server CPUs are free for other work. In a consolidation environment this means more concurrent workloads can run simultaneously.
  • Ultra-low latency RDMA for hot data: OLTP reads that hit XRMEM complete in about 17 microseconds via direct RDMA. Traditional Fibre Channel SAN latency is measured in hundreds of microseconds to milliseconds, so the difference is an order of magnitude or more.

Architecture in Numbers — Exadata X10M

Metric | Value
XRMEM read latency (RDMA) | 17 microseconds end-to-end
Internal network bandwidth per server | 200 Gb/sec (2 x 100 Gb/sec RoCE active-active)
Storage cell read IOPS per cell | 2.8 million 8K read IOPS per storage cell
Total read IOPS, full rack | Up to 25.2 million IOPS per rack
Storage throughput | Nearly 1 TB/sec per rack
I/O reduction for selective analytics | Up to 99% with Smart Scan and Storage Indexes combined
OLTP throughput vs prior gen (X9M) | Up to 3x higher with X10M
Analytic query speed vs prior gen | Up to 3.6x faster with X10M

Summary — The Architecture in Five Points

  • Exadata has three layers — Compute (DB nodes), Network (RoCE + iDB), and Storage (cells + cellsrv). Every I/O passes through all three.
  • LIBCELL bridges Oracle Database to the Exadata network — bypassing the OS I/O stack entirely. Only Oracle Database can use LIBCELL.
  • iDB is not a block protocol — it carries SQL predicates, column lists, and bloom filters so the storage cell can perform intelligent offloaded processing.
  • cellsrv on each storage cell receives iDB messages, consults Storage Indexes, reads from disk or flash, applies CELLOFLSRVn offload processing, and returns row sets — not raw blocks.
  • RDMA bypasses cellsrv entirely for hot data in XRMEM — delivering 17 microsecond end-to-end latency for the most time-sensitive OLTP workloads.
