Exadata Architecture Deep Dive — How Database and Storage Cells Work Together
Articles 1 and 2 gave you the what and the what-is-inside. This article gives you the how. How does Oracle Database on the database server actually talk to Exadata System Software on the storage cell? What happens inside the network when a query runs? What does the storage cell actually do with a SQL predicate? And why does any of this make a query run faster?
This is the article that makes everything else in the series click into place. Once you understand the architecture (the three layers, the iDB protocol, the Smart Scan flow, the RDMA path), you will understand every performance metric, monitoring command, and tuning decision at a deeper level.
This article assumes you have read Articles 1 and 2. If you have not, start there. The component names and concepts used here — database server, storage cell, cellsrv, RoCE, XRMEM — are introduced in those articles.
The Three-Layer Architecture
Exadata architecture divides cleanly into three logical layers. Every I/O operation, every Smart Scan, every RDMA read passes through these layers in sequence. Understanding the layers is the foundation of understanding everything else.
| Layer | Physical Component | Key Software | Role |
|---|---|---|---|
| Compute Layer | Database Servers (DB Nodes) | Oracle Database, Grid Infrastructure, LIBCELL | SQL parsing, query optimisation, transaction management, result assembly |
| Network Layer | RoCE switches, NICs on every server | RoCE fabric, iDB protocol, RDMA | High-speed low-latency communication between compute and storage |
| Storage Layer | Storage Cells | Exadata System Software (cellsrv, MS, RS), CELLOFLSRVn | Data storage, Smart Scan, predicate filtering, column projection, I/O offloading |
The critical difference between Exadata and traditional storage is in the storage layer. A traditional SAN has no database awareness: it reads blocks and returns them. The Exadata storage layer runs database-aware software and can process SQL logic close to the data before anything travels across the network.
The Bridge Between Database and Storage — LIBCELL
On each database server, Oracle Database does not communicate with storage cells directly through the OS I/O stack. Instead, it uses a special Oracle library called LIBCELL — linked directly into the Oracle Database kernel.
LIBCELL is the client-side component of the Exadata communication stack. It knows how to package SQL predicates, column lists, and I/O requests into iDB messages, send them to storage cells over the RoCE network, and receive the filtered results back. When Oracle Database performs a large table scan eligible for Smart Scan, it calls LIBCELL instead of the standard OS file I/O functions.
This design means the entire Exadata intelligence path bypasses the operating system I/O stack entirely. There is no file system layer, no OS I/O scheduler, no block device driver involved in the critical data path between Oracle Database and the storage cells. LIBCELL talks directly to the network card, and the network card uses RDMA to communicate with storage cell memory.
This is why Exadata requires Oracle Database to run on the database servers — not just any database engine. LIBCELL is an Oracle-proprietary library integrated into the Oracle Database kernel. No other database software can take advantage of Exadata's Smart Scan or RDMA capabilities. Oracle Database is the only software that knows how to use iDB.
The iDB Protocol — How the Database Talks to the Cell
iDB stands for Intelligent Database Protocol. It is Oracle's purpose-built, proprietary protocol for communication between database servers and storage cells over the RoCE network. iDB is what transforms Exadata from fast hardware into an intelligent database machine.
What makes iDB different from standard storage protocols
Traditional storage protocols — iSCSI, Fibre Channel, NFS — operate at the block level. They say: give me blocks 1000 to 2000 from disk. They carry no semantic information about what the data means or why it is being requested. The storage array has no idea the request is for a database query.
iDB operates at the database intelligence level. An iDB message says: I need to scan this extent of table T. The WHERE clause is amount > 200. I only need columns customer_name and order_date. Here are the bloom filter values for the join. Skip any data region where the Storage Index shows that the maximum value of amount is 200 or less, because no row there can match.
The storage cell receives this rich iDB message, understands it completely, and acts on it — filtering rows, projecting only the needed columns, applying bloom filters, and returning a result set rather than raw blocks.
What iDB carries in each message
- The type of I/O operation — Smart Scan, conventional I/O, RMAN backup offload, or fast file creation
- The SQL WHERE clause predicates to be evaluated at the cell — row filtering instructions
- The column list — column projection so the cell returns only needed columns, not entire rows
- Bloom filter vectors — for join offloading, so the cell can pre-filter rows that will not survive a join
- Storage Index hints — the optimizer's knowledge about data distribution to assist Storage Index pruning
- The Oracle Database version — because the CELLOFLSRVn sub-process is version-specific
Oracle Database, running on the Exadata Database Servers, leverages the purpose-built iDB protocol and RDMA over Converged Ethernet to communicate with Exadata Storage. iDB is used to direct Smart I/O operations on the Storage Servers — including Smart Scan (SQL Offload), Fast File Initialization, and RMAN Incremental Backup Offload.
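To make the contrast with a block protocol concrete, the sketch below models the two request shapes as Python dataclasses. The class and field names (`BlockReadRequest`, `SmartScanRequest`, `table_extent`, and so on) are hypothetical and chosen only for illustration. The real iDB wire format is proprietary and is not represented here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BlockReadRequest:
    """What a block protocol conveys: where to read, nothing about why."""
    first_block: int
    last_block: int


@dataclass(frozen=True)
class SmartScanRequest:
    """Hypothetical model of the information the article says iDB carries."""
    operation: str                        # e.g. "smart_scan"
    table_extent: str                     # which extent of which table to scan
    predicates: Tuple[str, ...]           # WHERE clause conditions
    columns: Tuple[str, ...]              # column projection list
    bloom_filter: Optional[bytes] = None  # join filter, if any
    db_version: str = "unknown"           # selects the matching offload server


def describe(request) -> str:
    """Summarise how much query context a request gives the storage side."""
    if isinstance(request, BlockReadRequest):
        return f"read blocks {request.first_block} to {request.last_block}"
    return (f"{request.operation} on {request.table_extent}: "
            f"keep rows where {' AND '.join(request.predicates)}, "
            f"return only {', '.join(request.columns)}")


block_req = BlockReadRequest(first_block=1000, last_block=2000)
scan_req = SmartScanRequest(
    operation="smart_scan",
    table_extent="orders/extent-1",       # made-up identifier
    predicates=("amount > 200",),
    columns=("customer_name", "order_date"),
    db_version="19c",
)
print(describe(block_req))
print(describe(scan_req))
```

The block request gives the storage side nothing to act on beyond an address range, while the scan request carries enough context for the cell to filter and project before replying.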
CELLOFLSRVn — the version-specific offload server
Inside each storage cell, a sub-process called CELLOFLSRVn (Cell Offload Server) handles the actual SQL offload processing. The n in the name indicates the Oracle Database version — each version of Oracle Database has its own associated CELLOFLSRVn process on the storage cell. This design allows multiple different Oracle Database versions to run concurrently on the Exadata system, each with their own compatible offload processing code on the cell.
The I/O Path — Traditional Storage vs Exadata
The best way to understand the Exadata architecture is to trace the complete I/O path for the same query on traditional storage and on Exadata, side by side.
The query: SELECT customer_name FROM orders WHERE amount > 200 — the orders table is 1 TB, only 1 GB of rows match the predicate.
Traditional storage — the conventional I/O path
Exadata — the Smart Scan I/O path
The outcome: traditional storage moved 1 TB across the network and loaded it into database server memory to return 1 GB of results. Exadata moved at most about 1 GB across the network to return the same results, so roughly 99.9% of the table never left the storage cells. The CPU saving on the database server is also substantial: the predicate and column filtering work happened on the storage cell CPUs, not the database server CPUs. Those freed database server CPUs are now available for other concurrent workloads.
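The arithmetic behind that comparison is short enough to work through. This is a minimal sketch that assumes binary units (1 TB = 1024 GB) and treats the 1 GB of matching rows as the full amount returned, which ignores the further reduction from column projection.

```python
GB = 1024 ** 3
TB = 1024 ** 4

table_bytes = 1 * TB      # size of the orders table
matching_bytes = 1 * GB   # rows that satisfy amount > 200


def network_saving(scanned: int, returned: int) -> float:
    """Fraction of the scanned bytes that never cross the network."""
    return 1 - returned / scanned


saving = network_saving(table_bytes, matching_bytes)
print(f"Traditional path moves {table_bytes // GB} GB")
print(f"Smart Scan path moves about {matching_bytes // GB} GB")
print(f"Network traffic avoided: {saving:.1%}")
```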
Smart Scan Internals — What the Cell Actually Does
Smart Scan is the name for Exadata's SQL offload capability. When a query qualifies for Smart Scan, the storage cell performs three primary functions that eliminate the vast majority of data movement.
1. Predicate Filtering
Rather than transporting all rows to the database server for predicate evaluation, Smart Scan predicate filtering ensures that the database server receives only the rows matching the query criteria.
The supported conditional operators include =, !=, <, >, <=, >=, IS NULL, IS NOT NULL, LIKE, BETWEEN, NOT BETWEEN, IN, NOT IN, EXISTS, NOT, AND, and OR. In addition, Exadata Storage Server evaluates most common SQL functions during predicate filtering — functions like UPPER(), LOWER(), TO_DATE(), SUBSTR(), and many others can be evaluated on the cell without the row ever reaching the database server.
2. Column Filtering (Column Projection)
Rather than transporting entire rows to the database server, Smart Scan column filtering ensures that the database server receives only the requested columns. For tables with many columns — or columns containing LOBs — the I/O bandwidth saved by column filtering can be very substantial.
For example, if a table has 50 columns but your SELECT only references 3 of them, the cell returns only those 3 columns per row. The other 47 columns are never transmitted across the network.
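Predicate filtering and column projection can be sketched together in a few lines. The example below is a toy model with made-up sample rows. The function name `smart_scan` is hypothetical, and real cells operate on Oracle blocks rather than Python dictionaries.

```python
# Made-up sample rows standing in for the contents of one storage region.
rows = [
    {"order_id": 1, "customer_name": "Asha", "order_date": "2024-01-05", "amount": 150},
    {"order_id": 2, "customer_name": "Ben",  "order_date": "2024-01-06", "amount": 320},
    {"order_id": 3, "customer_name": "Chen", "order_date": "2024-01-07", "amount": 275},
    {"order_id": 4, "customer_name": "Dara", "order_date": "2024-01-08", "amount": 90},
]


def smart_scan(rows, predicate, columns):
    """Keep rows that satisfy the predicate, then keep only the listed columns."""
    return [{col: row[col] for col in columns} for row in rows if predicate(row)]


# SELECT customer_name FROM orders WHERE amount > 200
result = smart_scan(rows, lambda row: row["amount"] > 200, ["customer_name"])
print(result)
```

Only two of the four rows, and one of the four columns, would need to leave the cell in this example.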
3. Join Offloading via Bloom Filters
When the optimizer determines a query involves a join between a large fact table and a smaller dimension table, it can pass a Bloom filter to the storage cell along with the Smart Scan request. The Bloom filter encodes the join key values from the dimension table side. The storage cell applies this filter while scanning the fact table — rows that cannot possibly satisfy the join condition are eliminated at the storage layer, before they reach the network.
This means even the join elimination work — traditionally done at the database server — is pushed down to the storage cell where the data lives. The result set that travels across the network is already pre-filtered for the join.
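The following sketch shows the general Bloom filter technique behind join offloading. It is a textbook-style filter built with SHA-256. The sizes, hash scheme, and names (`BloomFilter`, `might_contain`, `cust_id`) are assumptions for illustration and say nothing about how Oracle implements its filters internally.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Database server side: build the filter from the dimension table's join keys.
dimension_keys = {10, 20, 30}
bloom = BloomFilter()
for key in dimension_keys:
    bloom.add(key)

# Storage cell side: drop fact rows whose join key cannot match.
fact_rows = [{"cust_id": k, "amount": 100 + k} for k in range(0, 100, 5)]
survivors = [row for row in fact_rows if bloom.might_contain(row["cust_id"])]

# Database server side again: the real join runs on the survivors and
# removes any false positives the filter let through.
joined = [row for row in survivors if row["cust_id"] in dimension_keys]
print(len(fact_rows), len(survivors), len(joined))
```

A Bloom filter can let a non-matching row through, but it never rejects a matching one, which is why the database server still completes the join on whatever the cell returns.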
```sql
-- This query qualifies for Smart Scan on Exadata:
SELECT customer_name FROM orders WHERE amount > 200;

-- What the cell receives via iDB:
--   Predicate:  amount > 200
--   Projection: only customer_name column
--
-- What the cell does:
--   1. Reads blocks from disk (after Storage Index eliminates regions)
--   2. Evaluates amount > 200 on every row in the block
--   3. For matching rows: extracts only customer_name
--   4. Returns the filtered, projected result row set
--
-- What does NOT travel the network:
--   - Rows where amount <= 200
--   - All columns other than customer_name
--   - Any storage region Storage Index eliminated

-- Verify Smart Scan is active for your session:
SELECT sn.name, ms.value
FROM   v$mystat ms, v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name LIKE 'cell%smart%';
```
The RDMA Path — Bypassing the OS for Ultra-Low Latency
Not all Exadata I/O goes through cellsrv and the iDB protocol. For certain workloads — particularly OLTP reads that hit data in the storage cell's XRMEM (Exadata RDMA Memory) — the database server uses RDMA to read directly from storage cell memory without any software involvement on the cell side.
The RDMA read path
- The database server's network card issues an RDMA read request for a specific memory address in the storage cell's XRMEM
- The storage cell's network card services the request directly from DRAM — no cellsrv involvement, no OS call, no CPU interrupt on the storage cell
- The data arrives at the database server's network card and is placed directly into database buffer cache memory — again without OS involvement on the DB node side
- The entire round trip — from DB node requesting data to DB node receiving it — completes in approximately 17 microseconds
This is the lowest-latency data access path Exadata supports. It is used automatically for frequently accessed hot data that Exadata System Software has placed in XRMEM. The decision about what to place in XRMEM is made automatically by Exadata System Software based on I/O access patterns — no DBA configuration is required.
The three storage tiers and how data moves between them
| Tier | Media | Access Latency | Access Path | Managed By |
|---|---|---|---|---|
| XRMEM | DDR5 DRAM (1.25 TB per cell) | 17 microseconds | Direct RDMA, no cellsrv | Auto (Exadata System Software) |
| Smart Flash Cache | NVMe PCIe flash (4 x 6.8 TB per HC cell) | Sub-millisecond | iDB via cellsrv | Auto (Exadata System Software) |
| Hard Disk | 12 x 22 TB HDD (HC cells) | Milliseconds | iDB via cellsrv | Auto (ASM manages placement) |
Exadata System Software automatically places data in these three tiers based on access patterns. The hottest data is cached in XRMEM, warm data is cached in Smart Flash Cache, and cold data is read from disk. The DBA does not manage this tiering manually; it is automatic and transparent.
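As a rough mental model of the tiers, a read is served by the fastest tier that holds a copy of the block. The sketch below is a toy lookup under that assumption. The function `read_block` and the dictionaries are hypothetical, and the real placement and eviction policies of Exadata System Software are not modelled.

```python
def read_block(block_id, xrmem, flash_cache, disk):
    """Return (data, access_path) from the fastest tier that holds the block."""
    if block_id in xrmem:
        return xrmem[block_id], "XRMEM via direct RDMA"
    if block_id in flash_cache:
        return flash_cache[block_id], "Smart Flash Cache via cellsrv"
    return disk[block_id], "hard disk via cellsrv"


# Disk holds every block; the faster tiers hold copies of a subset.
disk = {n: f"block-{n}" for n in range(6)}
flash_cache = {n: disk[n] for n in (0, 1, 2)}
xrmem = {0: disk[0]}

for block_id in (0, 2, 5):
    data, path = read_block(block_id, xrmem, flash_cache, disk)
    print(block_id, data, path)
```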
Inside cellsrv — The Heart of the Storage Cell
cellsrv is the primary daemon running on every Exadata storage cell. It is a multi-threaded program that services all I/O requests from database servers. Every Smart Scan, every conventional I/O request, every RMAN backup offload flows through cellsrv.
What cellsrv does with an incoming iDB request
- Receives the iDB message from the database server via LIBCELL over the RoCE network
- Unpacks the message — identifies the operation type, predicates, column list, bloom filters
- Consults the Storage Index for the relevant data regions — eliminates regions where predicates cannot be satisfied
- Issues disk read requests for the remaining regions — reads from Smart Flash Cache if available, otherwise from disk
- Invokes CELLOFLSRVn for the Oracle Database version in use — this sub-process applies predicate filtering and column projection to each block read
- Assembles the filtered result rows and returns them to the database server via the iDB response message
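The request flow above can be sketched as a small pipeline: check a min/max summary for each region, read only the regions that might contain matches, then filter and project. This is a simplified model with made-up data. `Region`, `make_region`, and `scan_amount_greater_than` are hypothetical names, and the real Storage Index structure is internal to the cell.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """A storage region plus the min/max summary a Storage Index would keep."""
    rows: List[Dict]
    min_amount: int
    max_amount: int


def make_region(rows: List[Dict]) -> Region:
    amounts = [row["amount"] for row in rows]
    return Region(rows, min(amounts), max(amounts))


def scan_amount_greater_than(regions, threshold, columns):
    """Toy version of the request flow for the predicate amount > threshold."""
    result, regions_read = [], 0
    for region in regions:
        # Storage Index check: if even the largest value fails the
        # predicate, no row in the region can match, so skip the read.
        if region.max_amount <= threshold:
            continue
        regions_read += 1
        for row in region.rows:
            if row["amount"] > threshold:                    # predicate filtering
                result.append({c: row[c] for c in columns})  # column projection
    return result, regions_read


regions = [
    make_region([{"customer_name": "Asha", "amount": 50},
                 {"customer_name": "Ben", "amount": 120}]),
    make_region([{"customer_name": "Chen", "amount": 180},
                 {"customer_name": "Dara", "amount": 450}]),
    make_region([{"customer_name": "Eli", "amount": 300},
                 {"customer_name": "Fay", "amount": 900}]),
]
rows_out, regions_read = scan_amount_greater_than(regions, 200, ["customer_name"])
print(rows_out, regions_read)
```

The first region is never read because its maximum amount is 120, which is the sense in which a Storage Index avoids I/O rather than speeding it up.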
The three cell daemons and their roles
| Daemon | Full Name | Role |
|---|---|---|
| cellsrv | Cell Server | Handles all I/O processing — Smart Scan, conventional I/O, RMAN offload. The primary work process on every cell. Heavy CPU usage during Smart Scan is normal. |
| MS | Management Server | Provides the management interface — handles cellcli commands, monitoring queries, and out-of-band management communication. Does not handle data I/O. |
| RS | Restart Server | Monitors cellsrv and MS. If either process fails, RS automatically restarts it. RS is the watchdog process — it starts first when the cell boots and is the last to stop. |
```shell
# Check all three daemons on every storage cell simultaneously
dcli -g /opt/oracle.SupportTools/onecommand/cell_group \
  "service celld status"

# Typical output for each cell (exact wording can vary by release):
# cell01: rsStatus: running
# cell01: msStatus: running
# cell01: cellsrvStatus: running

# If a daemon is not running on a specific cell, start it with CellCLI:
ssh root@cell01 "cellcli -e alter cell startup services cellsrv"
ssh root@cell01 "cellcli -e alter cell startup services ms"
```
What Else Gets Offloaded Beyond Smart Scan
Smart Scan is the most well-known offload operation but it is not the only one. iDB and cellsrv support several additional offload operations that reduce database server I/O and CPU for non-query workloads.
| Offload Operation | What It Does | Benefit |
|---|---|---|
| Smart Scan (SQL Offload) | Predicate and column filtering for large table and index scans | Reduces I/O by up to 99% for selective analytic queries |
| RMAN Incremental Backup Offload | The storage cell identifies changed blocks for incremental backups rather than the database server | Backup time and I/O reduced significantly — the cell does the changed-block detection work |
| Fast File Initialization | When Oracle creates a new datafile, the cell initialises the space at storage speed rather than database server speed | New datafile creation that takes minutes on standard storage takes seconds on Exadata |
| Join Offloading (Bloom Filter) | Join pre-filtering using Bloom filter vectors sent from the DB optimizer to the storage cell | Fact table rows that cannot survive a join are eliminated at the cell before network transmission |
| HCC Decompression | Hybrid Columnar Compressed data is decompressed on the storage cell's CPUs before Smart Scan filtering | Compression and Smart Scan work together — HCC reduces disk I/O, Smart Scan reduces network I/O |
| Storage Index Elimination | Data regions are eliminated before disk reads begin based on in-memory min/max statistics | Zero disk I/O for eliminated regions — the fastest possible I/O avoidance |
How ASM Fits Into the Exadata Architecture
Automatic Storage Management (ASM) runs on the database servers as part of Oracle Grid Infrastructure. ASM manages the disk groups that Exadata storage presents to Oracle Database — it handles striping, mirroring, and disk group management.
In the Exadata architecture, ASM sees the storage cells as disk providers. Each storage cell presents its physical disks or flash as logical units called griddisks to ASM. ASM assembles griddisks from multiple cells into disk groups and stripes data across all cells automatically.
This striping is what enables full parallelism during Smart Scan. When a table is stored across 14 storage cells, a Smart Scan runs simultaneously on all 14 cells — each cell processes its share of the table independently and in parallel. The result sets from all 14 cells are assembled at the database server. The larger the number of storage cells, the more parallelism available for large scans.
```sql
-- Connect to the ASM instance first (from the shell):
--   sqlplus / as sysasm

-- List disk groups and their redundancy
SELECT name, type, state, total_mb, free_mb
FROM   v$asm_diskgroup
ORDER BY name;

-- List griddisks in a disk group and see which cells contribute
SELECT dg.name AS disk_group,
       d.name  AS disk_name,
       d.path  AS cell_path
FROM   v$asm_disk d, v$asm_diskgroup dg
WHERE  d.group_number = dg.group_number
ORDER BY dg.name, d.name;
```
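The way striping enables parallel scanning can be sketched with a thread pool standing in for the storage cells. This is only an illustration: round-robin placement is a simplification of ASM striping, the data is made up, and `scan_cell` is a hypothetical function.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CELLS = 14

# 140 made-up order amounts, spread round-robin across the cells.
amounts = list(range(0, 1400, 10))
cells = [amounts[i::NUM_CELLS] for i in range(NUM_CELLS)]


def scan_cell(cell_rows, threshold=200):
    """Each cell filters only its own share of the table."""
    return [amount for amount in cell_rows if amount > threshold]


# Every cell scans its share concurrently.
with ThreadPoolExecutor(max_workers=NUM_CELLS) as pool:
    partial_results = list(pool.map(scan_cell, cells))

# The database server assembles the partial results.
combined = sorted(a for part in partial_results for a in part)
print(len(cells), len(combined))
```

Because each cell holds an equal share of the table, adding cells shrinks the amount of data any single cell has to scan.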
Why Exadata Is Faster Than Traditional Storage — The Five Reasons
The architecture described above delivers performance advantages through five distinct mechanisms. Understanding all five helps you explain Exadata's value clearly to management and architects.
- I/O elimination via Storage Indexes: data regions are skipped before the disk is read. Zero I/O is the fastest I/O. A traditional storage system cannot eliminate I/O this way because it has no knowledge of the data's min/max values.
- I/O reduction via Smart Scan: only matching rows and requested columns cross the network. A 1 TB scan that returns 1 GB of results moves 1 GB, not 1 TB. Traditional storage always moves the full dataset.
- Processing parallelism across all cells: every storage cell processes its portion of the scan simultaneously. 14 cells scanning in parallel can deliver up to 14x the throughput of a single storage path. Traditional SAN I/O is not inherently parallelised at the intelligence level.
- Database server CPU freed for real work: because the storage cells handle predicate filtering, column projection, bloom filter joins, and decompression, the database server CPUs are free for other work. In a consolidation environment this means more concurrent workloads can run simultaneously.
- Ultra-low latency RDMA for hot data: OLTP reads that hit XRMEM complete in about 17 microseconds via direct RDMA. Traditional Fibre Channel SAN latency is measured in hundreds of microseconds to milliseconds. The difference is at least an order of magnitude.
Architecture in Numbers — Exadata X10M
| Metric | Value |
|---|---|
| XRMEM read latency (RDMA) | 17 microseconds end-to-end |
| Internal network bandwidth per server | 200 Gb/sec (2 x 100 Gb/sec RoCE active-active) |
| Storage cell read IOPS per cell | 2.8 million 8K read IOPS per storage cell |
| Total read IOPS — full rack | Up to 25.2 million IOPS per rack |
| Storage throughput | Nearly 1 TB/sec per rack |
| I/O reduction for selective analytics | Up to 99% with Smart Scan and Storage Indexes combined |
| OLTP throughput vs prior gen (X9M) | Up to 3x higher with X10M |
| Analytic query speed vs prior gen | Up to 3.6x faster with X10M |
Summary — The Architecture in Five Points
- Exadata has three layers — Compute (DB nodes), Network (RoCE + iDB), and Storage (cells + cellsrv). Every I/O passes through all three.
- LIBCELL bridges Oracle Database to the Exadata network — bypassing the OS I/O stack entirely. Only Oracle Database can use LIBCELL.
- iDB is not a block protocol — it carries SQL predicates, column lists, and bloom filters so the storage cell can perform intelligent offloaded processing.
- cellsrv on each storage cell receives iDB messages, consults Storage Indexes, reads from disk or flash, applies CELLOFLSRVn offload processing, and returns row sets — not raw blocks.
- RDMA bypasses cellsrv entirely for hot data in XRMEM — delivering 17 microsecond end-to-end latency for the most time-sensitive OLTP workloads.