Introduction to Exadata
Let's begin with a whirlwind tour of the Oracle Exadata Database Machine. It comes in a rack with the components that make up a database infrastructure: disks, servers, networking gear, and so on. Three configuration types are available: full rack, half rack, or quarter rack. The architecture is identical across all three types but the number of components differs.
Now let's dive into each
of these components and the role they play. The following list applies to a
full rack; you can also view them contextually via a really neat 3D
demo.
· Database nodes – The Exadata Database Machine runs Oracle Database 11g Real Application Clusters. The cluster and the database run on the servers known as database nodes or compute nodes (or simply “nodes”). A full rack has 8 nodes running Oracle Linux or Oracle Solaris.
· Storage cells – The disks are not attached to the database compute nodes, as is normally the case with direct attached storage, but rather to a different server known as the storage cell (or just “cell”; there are 14 of them in a full rack). The Oracle Exadata Storage Server software runs in these cells on top of the OS.
· Disks – Each cell has 12 disks. Depending on the configuration, these disks are either 600GB high performance or 2TB high capacity (GB here means 1 billion bytes, not 1024MB). You choose the disk type when making the purchase.
· Flash disks – Each cell also has 384GB of flash disks. These disks can be presented to the compute nodes as storage (to be used by the database) or used as a secondary cache for the database cluster (called smart cache).
· InfiniBand circuitry – The cells and nodes are connected through InfiniBand for speed and low latency. There are 3 InfiniBand switches for redundancy and throughput. Note: there are no Fibre Channel switches since there is no Fibre Channel component.
· Ethernet switch – The outside world can communicate via InfiniBand or by Ethernet. There is a set of Ethernet switches with ports open to the outside. The clients may connect to the nodes using Ethernet. DMAs and others connect to the nodes and cells using Ethernet as well. Backups are preferably done via InfiniBand but they can be done through the Ethernet network as well.
· KVM switch – There is a keyboard, video, and mouse switch to get direct physical access to the nodes and cells. This is used initially while setting up and when the network to the system is not available. In a normal environment you will not need to go near the rack and access this KVM, not even for powering the cells and nodes on and off. Why not? You’ll learn why in the next installment. (Not all models have a KVM switch.)
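To put the full-rack numbers above in perspective, here is a rough back-of-the-envelope calculation in Python. It only multiplies figures quoted in the list (14 cells, 12 disks per cell, 600GB or 2TB per disk, 384GB of flash per cell). The names are mine, and the results are raw capacity, before ASM mirroring and before the OS area is taken out.

```python
# Raw capacity of a full rack, using only the figures quoted in the list above.
# GB and TB are decimal here (1 GB = 10**9 bytes), as the list points out.
CELLS_PER_FULL_RACK = 14
DISKS_PER_CELL = 12
FLASH_GB_PER_CELL = 384


def raw_disk_tb(disk_size_gb):
    """Raw (unmirrored) disk capacity of a full rack, in TB."""
    total_gb = CELLS_PER_FULL_RACK * DISKS_PER_CELL * disk_size_gb
    return total_gb / 1000


high_performance_tb = raw_disk_tb(600)   # 600GB high performance disks
high_capacity_tb = raw_disk_tb(2000)     # 2TB high capacity disks
flash_tb = CELLS_PER_FULL_RACK * FLASH_GB_PER_CELL / 1000

print(high_performance_tb, high_capacity_tb, flash_tb)
```

That works out to roughly 100.8TB (high performance) or 336TB (high capacity) of raw disk, plus about 5.4TB of flash. Usable space will be considerably lower once redundancy is applied.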
The nodes run the Oracle Clusterware, the ASM instances, and the database instances. You may decide to create just one cluster or multiple ones. Similarly you may decide to create a single database on the cluster or multiple ones. If you were to create three databases – dev, int and QA - you would have two choices:
· One cluster – create one cluster and create the three databases
· Three clusters – create three different clusters and one database in each of them
The first option allows you to add and remove instances of a database easily. For instance, with 8 nodes in a full rack, you may assign 2 nodes to dev, 2 to int, and 4 to QA. Suppose a full-fledged production stress test is planned and that temporarily needs all 8 nodes in QA to match 8 nodes in production. In this configuration, all you have to do is shut down the dev and int instances and start the other four instances of QA on those nodes. Once the stress test is complete, you can shut down those 4 QA instances and restart the dev and int instances on them.
If you run multiple production databases on a single rack of Exadata, you can still take advantage of this technique. If a specific database needs additional computing power temporarily to ride out a seasonal high demand, just shut down one instance of a different database and restart the instance of the more demanding one in that node. After the demand has waned, you can reverse the situation. You can also run two instances in the same node but they will compete for the resources – something you may not want. At the I/O level, you can control the resource usage by the instances using the IO Resource Manager (IORM).
On the other hand, with this option, you are still on just one cluster. When you upgrade the cluster, all the databases will need to be upgraded. The second option obviates that; there are individual clusters for each database – a complete separation. You can upgrade them or manipulate them any way you want without affecting the others. However, when one database needs additional computational power, you can’t just start up another instance. You need to remove a node from one cluster and add it to the cluster where it is needed – an activity more complex than the simple shutdown and startup of instances.
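The single-cluster stress-test scenario can be pictured with a small Python model. This is only an illustration of the bookkeeping: the node names and the `reassign` helper are hypothetical, and the "stop" and "start" strings stand in for whatever tool you actually use to stop and start instances.

```python
# Hypothetical model of one cluster on a full rack: which database's
# instance runs on each of the 8 nodes.
NODES = [f"node{i}" for i in range(1, 9)]


def normal_layout():
    """2 nodes for dev, 2 for int and 4 for QA, as in the example above."""
    databases = ["dev"] * 2 + ["int"] * 2 + ["qa"] * 4
    return dict(zip(NODES, databases))


def reassign(layout, donors, target):
    """Give every node currently used by a donor database to the target."""
    new_layout = dict(layout)
    actions = []
    for node, database in layout.items():
        if database in donors:
            actions.append(f"stop {database} instance on {node}")
            actions.append(f"start {target} instance on {node}")
            new_layout[node] = target
    return new_layout, actions


layout = normal_layout()
during_test, actions = reassign(layout, donors={"dev", "int"}, target="qa")
for step in actions:
    print(step)
```

After the stress test, calling `normal_layout()` again represents restarting the dev and int instances on their original nodes. With three separate clusters, the same change would mean moving nodes between clusters rather than stopping and starting instances.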
Since the cells have the disks, how do the database compute nodes access them - or more specifically, how do the ASM instances running on the compute nodes access the disks? Well, the disks are presented to cells only, not to the compute nodes. The compute nodes see the disks through the cells. For the lack of a better analogy, this is akin to network-attached storage. (Please note, the cell disks are not presented as NAS; this is just an analogy.)
The flash disks are presented to the cell as storage devices as well, just like the normal disks. As a result they can be added to the pool of ASM disks to be used in the database for ultra fast access, or they can be used to create the smart flash cache layer, which is a secondary cache between database buffer cache and the storage. This layer caches the most used objects but does not follow the same algorithm as the database buffer cache, where everything is cached first before sending to the end user. Smart flash cache caches only those data items which are accessed frequently – hence the term “smart” in the name. The request for data not found in the smart flash cache is routed to disks automatically.
The Secret Sauce: Exadata Storage Server
So, you may be
wondering, what’s the “secret sauce” for the Exadata Database Machine’s amazing
performance? A suite of software known as Exadata Storage Server, which runs on
the storage cells, is the primary reason behind that performance. In this
section we will go over the components of the storage server very briefly (not
a substitute for documentation!).
Cell Offloading
The storage in the
Exadata Database Machine is not just dumb storage. The storage cells are
intelligent enough to process some workload inside them, saving the database
nodes from that work. This process is referred to as cell offloading.
The exact nature of the offloaded activity is discussed in the following
section.
Smart Scan
In a traditional Oracle database, when a user selects a row or even a single column in a row, the entire block containing that row is fetched from the disk to the buffer cache, and the selected row (or column, as the case may be) is then extracted from the block and presented to the user’s session. In the Exadata Database Machine, this process holds true for most types of access, except a very important few. Direct path accesses – for instance, full table scans and fast full index scans – are done differently. The Exadata Database Machine can pull the specific rows (or columns) from the disks directly and send them to the database nodes. This functionality is known as Smart Scan. It results in huge savings in I/O.
For instance, your query might be satisfied by only 1,000 rows out of 1 billion, but a full table scan in a traditional database retrieves all the blocks and filters the rows from them. Smart Scan, on the other hand, will extract only those 1,000 rows (or even specific columns from those rows, if those are requested) – potentially cutting I/O by a factor of 1 million! Cell offloading enables the cells to accomplish this.
Not all queries can take advantage of Smart Scan; only those that use direct path reads can. A full table scan is an example. A regular index scan looks into index blocks first and then the table blocks – so Smart Scan is not used.
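A toy Python model makes the difference in where the filtering happens more concrete. It is a sketch under simplifying assumptions: rows stand in for blocks, the table is tiny, and the function names are mine rather than anything in Exadata.

```python
# Toy comparison: filter on the database node vs. filter in the storage cell.
def traditional_scan(rows, predicate):
    """Storage ships every row; the database node does the filtering."""
    shipped = list(rows)
    result = [row for row in shipped if predicate(row)]
    return result, len(shipped)


def smart_scan(rows, predicate):
    """The cell applies the predicate and ships only the matching rows."""
    shipped = [row for row in rows if predicate(row)]
    return shipped, len(shipped)


def rating_is_3(row):
    return row["rating"] == 3


table = [{"id": i, "rating": i % 1000} for i in range(100_000)]

node_rows, node_shipped = traditional_scan(table, rating_is_3)
cell_rows, cell_shipped = smart_scan(table, rating_is_3)
print(node_shipped, cell_shipped)

# The same ratio with the numbers used in the text: 1,000 rows out of 1 billion.
article_factor = 1_000_000_000 // 1_000
print(article_factor)
```

Both scans return the same rows; only the amount of data crossing the network differs. With the article's numbers the reduction is a factor of one million.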
iDB
How can storage cells know what columns and rows to filter from the data? This is done by another component built into the storage software. The communication between nodes and cells employs a specially developed protocol called iDB (short for Intelligent Database). This protocol not only requests the blocks (as happens in an I/O call in a traditional database) but can optionally send other relevant information. In those cases where Smart Scan is possible, iDB sends the names of the table and columns, the predicates, and other relevant information about the query. This information allows the cell to learn a lot more about the query than just the addresses of the blocks to retrieve. Similarly, the cells can use iDB to send back row and column data instead of traditional Oracle blocks.
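The contrast between a plain block request and a request that carries query context can be sketched in Python. The actual iDB message format is not described here, so the two classes and their fields below are purely hypothetical stand-ins for the idea.

```python
from dataclasses import dataclass, field


@dataclass
class BlockRequest:
    """Traditional I/O call: only the block addresses are sent."""
    block_addresses: list


@dataclass
class SmartScanRequest:
    """Hypothetical iDB-style request that also carries query context."""
    table: str
    columns: list
    predicate: str
    block_addresses: list = field(default_factory=list)


def describe(request):
    """Summarize what the cell is able to do with each kind of request."""
    if isinstance(request, SmartScanRequest):
        return (f"filter {request.table} where {request.predicate}, "
                f"return only columns {request.columns}")
    return f"return {len(request.block_addresses)} whole blocks"


print(describe(BlockRequest(block_addresses=[101, 102, 103])))
print(describe(SmartScanRequest("RATINGS", ["ID", "RATING"], "rating = 3")))
```

The point is only that the second kind of request gives the cell enough information to filter rows and columns itself.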
Storage Indexes
How does Smart Scan
achieve sending only those relevant rows and columns instead of blocks? A
special data structure built on the pattern of the data within the storage
cells enables this. For a specific segment, it stores the minimum, maximum, and
whether nulls are present for all the columns of that segment in a specified
region of the disk, usually 1MB in size. This data structure is called a
storage index. When a cell gets a Smart Scan-enabled query from the database
node via iDB, it checks which regions of the storage will not contain the data.
For instance if the query predicate states where rating = 3, a region on the
disk where the minimum and maximum values of the column RATING are 4 and 10
respectively will definitely not have any row that will match the predicate.
Therefore the cell skips reading that portion of the disk. Checking the storage
index, the cell excludes a lot of regions that will not contain that value and
therefore saves a lot of I/O.
Although it has the word “index” in its name, a storage index is nothing like a normal index. Normal indexes are used to zero in on the locations where the rows are most likely to be found; storage indexes are used just for the opposite reason – where the rows are most likely not to be found. Also, unlike other segments, these are not stored on the disks; they reside in memory.
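The region-skipping logic can be sketched in a few lines of Python. This is a simplified model, not the real in-memory structure: it covers a single column and an equality predicate, and the class, function, and region names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionSummary:
    """Minimum, maximum and null flag of one column in one storage region."""
    minimum: int
    maximum: int
    has_nulls: bool


def can_skip(summary, value):
    """True when `column = value` cannot match any row in the region."""
    # NULLs never satisfy an equality predicate, so has_nulls is not consulted
    # here; it would matter for a predicate such as `column IS NULL`.
    return value < summary.minimum or value > summary.maximum


# Summaries of the RATING column for three regions (about 1MB each).
regions = {
    "region_0": RegionSummary(minimum=1, maximum=5, has_nulls=False),
    "region_1": RegionSummary(minimum=4, maximum=10, has_nulls=False),
    "region_2": RegionSummary(minimum=2, maximum=3, has_nulls=True),
}

# where rating = 3
to_read = [name for name, summary in regions.items() if not can_skip(summary, 3)]
print(to_read)
```

Here `region_1` (minimum 4, maximum 10) is skipped, exactly as in the example in the text. The other two regions still have to be read, because a matching range only means the value might be present.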
Smart Cache
The database buffer cache is where data blocks arrive before being shipped to the end user. If the data is found there, a trip to the storage is saved. However, if it is not found, which is often the case with large databases, the I/O is unavoidable. In Exadata Database Machine, a secondary cache called Smart Cache can sit between the database buffer cache and the storage. The Smart Cache holds frequently accessed data and may satisfy a request from the database node from this cache instead of going to the disks – improving performance.
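To illustrate the idea of caching only frequently accessed data, here is a minimal Python sketch. The admission rule (cache a block once it has been read a given number of times) is my own assumption for the sake of the example. The real caching algorithm is not described here and is certainly more sophisticated.

```python
from collections import Counter


class FrequencyCache:
    """Toy cache that admits a block only after `min_hits` reads (assumed rule)."""

    def __init__(self, min_hits=2):
        self.min_hits = min_hits
        self.hits = Counter()
        self.cached = set()

    def read(self, block_id):
        """Return where the read was served from: 'flash' or 'disk'."""
        if block_id in self.cached:
            return "flash"
        self.hits[block_id] += 1
        if self.hits[block_id] >= self.min_hits:
            self.cached.add(block_id)   # admitted; later reads come from flash
        return "disk"


cache = FrequencyCache(min_hits=2)
served = [cache.read(block) for block in ["a", "b", "a", "a", "b", "c"]]
print(served)
```

Block "a" is served from flash on its third read, while the one-off read of block "c" never enters the cache. A buffer cache, by contrast, holds every block on its way to the user.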
Infiniband Network
This is the network inside the Exadata Database Machine – the nervous system of the machine, through which the different components such as database nodes and storage cells communicate. InfiniBand is a hardware medium running a protocol called RDS (Reliable Datagram Sockets), which has high bandwidth and low latency – making the transfer of data extremely fast.
Disk Layout
The disk layout needs some
additional explanation because that’s where most of the activities occur. As I
mentioned previously, the disks are attached to the storage cells and presented
as logical units (LUNs), on which physical volumes are built.
Each cell has 12 physical disks. In a high capacity configuration they are about 2TB each, and in a high performance configuration they are about 600GB each. The disks are used for the database storage. Two of the 12 disks are also used for the home directory and other Linux operating system files. These two disks are divided into different partitions.
The physical disks are divided into multiple partitions. Each partition is
then presented as a LUN to the cell. Some LUNs are used to create a filesystem
for the OS. The others are presented as storage to the cell. These are
called cell disks. The cell disks are further divided as grid
disks, ostensibly referencing the grid infrastructure the disks are used
inside. These grid disks are used to build ASM Diskgroups, so they are used as
ASM disks. An ASM diskgroup is made up of several ASM disks from multiple
storage cells. If the diskgroup is built with normal or high redundancy (which
is the usual case), the failure groups are placed in different cells. As a
result, if one cell fails, the data is still available on other cells. Finally
the database is built on these diskgroups.
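The effect of placing failure groups in different cells can be shown with a short Python sketch. It is a simplified model of normal redundancy (two copies of every extent). The round-robin placement and the grid disk names are illustrative assumptions, not the way ASM actually chooses disks.

```python
# Each storage cell acts as one failure group in this model.
CELLS = ["cell01", "cell02", "cell03"]


def grid_disks(diskgroup, cell, disks_per_cell=12):
    """Illustrative names for the grid disks one cell contributes."""
    return [f"{diskgroup}_CD_{n:02d}_{cell}" for n in range(disks_per_cell)]


def place_extent(extent_no, cells):
    """Normal redundancy: primary and mirror copies go to different cells."""
    primary = cells[extent_no % len(cells)]
    mirror = cells[(extent_no + 1) % len(cells)]
    return primary, mirror


def readable_after_failure(extent_no, cells, failed_cell):
    """An extent survives if at least one copy sits outside the failed cell."""
    return any(cell != failed_cell for cell in place_extent(extent_no, cells))


all_survive = all(
    readable_after_failure(extent, CELLS, failed)
    for failed in CELLS
    for extent in range(100)
)
print(len(grid_disks("DATA", "cell01")), place_extent(0, CELLS), all_survive)
```

Because the two copies of an extent never share a cell, losing any single cell leaves every extent readable from another cell.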
These diskgroups are
created with the following attributes by default:
Parameter | Description | Value
_._DIRVERSION | The minimum allowed version for directories | 11.2.0.2.0
COMPATIBLE.ASM | The maximum ASM version whose features can use this diskgroup. For instance, ASM Volume Management is available in 11.2 only. If this parameter is set to 11.1, then this diskgroup can’t be used for an ASM volume. | 11.2.0.2.0
IDP.TYPE | Intelligent Data Placement, a feature of ASM that allows placing data in such a way that more frequently accessed data is located close to the periphery of the disk, where the access is faster. | dynamic
CELL.SMART_SCAN_CAPABLE | Can this diskgroup be enabled for Exadata Storage Server’s Smart Scan capability? | TRUE
COMPATIBLE.RDBMS | The minimum version of the database that can be created on this diskgroup. The further back you go in version number, the more message passing there is between the RDBMS and ASM instances, causing performance issues. So, unless you plan to create a pre-11.2 database here (which you most likely do not), leave it as it is. | 11.2.0.2
AU Size | The size of the Allocation Unit on this disk. The AU is the least addressable unit on the diskgroup. |
On two of the 12 disks, the operating system, Oracle Exadata Storage Server software, and other OS related filesystems such as /home are located. They occupy about 29GB on a disk. For protection, this area is mirrored as RAID1 on another disk. The filesystems are mounted on that RAID1 volume.
However, this leaves two cell disks with less space for data than the other ten. If we create an ASM diskgroup on these 12 disks, it will have an imbalance on those two disks. Therefore, you (or whoever is doing the installation) should create another diskgroup with 29GB from each of the other 10 cell disks. This leaves same-sized ASM disks for the other diskgroups. This “compensatory” diskgroup is usually named DBFS_DG. Since this diskgroup is built on the inner tracks of the disk, its performance is low compared to the outer tracks. Therefore, instead of creating database files here, you may want to use it for some other purpose such as ETL files. ETL files need a filesystem. You can create a database filesystem (DBFS) on this diskgroup – hence the name DBFS_DG. Of course, you can use it for anything you want, even for database files, especially for less accessed objects.
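The balancing arithmetic can be checked with a few lines of Python. The sketch assumes that the slice set aside on each of the other ten disks matches the roughly 29GB system area on the first two. The names are mine, and the figures are per cell and before mirroring.

```python
# Per-cell sizing of the "compensatory" diskgroup, in decimal GB.
OS_AREA_GB = 29        # system area on each of the two OS disks
DISKS_PER_CELL = 12
SYSTEM_DISKS = 2


def main_grid_disk_gb(disk_gb):
    """Space left on every disk for the main diskgroups once 29GB is set aside."""
    return disk_gb - OS_AREA_GB


def dbfs_dg_gb_per_cell():
    """Raw size of DBFS_DG per cell: 29GB from each of the 10 non-OS disks."""
    return (DISKS_PER_CELL - SYSTEM_DISKS) * OS_AREA_GB


print(main_grid_disk_gb(600), main_grid_disk_gb(2000), dbfs_dg_gb_per_cell())
```

Every disk therefore offers the same amount to the main diskgroups (about 571GB on a 600GB disk, 1,971GB on a 2TB disk), and DBFS_DG gets about 290GB of raw space per cell.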
Now that you know the components, look at the next section to get a detailed description of these components.
Detailed Specifications
As of this writing, the
current (third) generation of Exadata Database Machine comes in two models
(X2-2 and X2-8); various sizes (full rack, half rack, and quarter rack); and
three classes of storage (high performance, high capacity SAS, and high
capacity SATA). For detailed specifications, please see the configuration specs
on the Oracle website: X2-2, X2-8, X2-2
Storage Server.