Based on a presentation delivered by Carlos Carballo, Senior Sales Consultant at Oracle
I worked with Verizon for about 10 years and had a junior DBA who left to become a main DBA. His first month on the job, he calls me up and says, “I don’t think that the backups have been working and it just went down.”
They get a hold of tech support and the person says, “First thing . . . update your resume. Second, pray…”
In a survey of 250 IT global decision makers, 45% invest less than $500,000 a year in database recovery. And while 91% have some kind of disaster recovery plan, only 40% actually test it.
Traditional Database High Availability
Typical solution is use storage mirroring to an idle disaster recovery server. The problem is that databases used for Disaster Recovery just sit around doing nothing all day, until there is an update.
You can go crazy with Disaster Recovery. You can even achieve 99.999% availability but you are going to pay a lot for it. Oracle tries to provide a mid-point through Real Application Clusters (RAC) and Data Guard.
Do you need a grid?
A grid is a pool of resources where you have a Real Application Cluster. ASM can make all the hard drives look like a single pool of storage. Using ASM can allow RAC to be used as a scalability solution. It gives you an easy way to grow your number of available servers.
RAC is a new way of providing High Availability. RAC will allow you to have those servers double as a Disaster Recovery system.
Below is a breakdown of Oracle Maximum Availability Architecture (MAA) Best Practices.
- System Failures – RAC
- Data Failures – ASM, Flashback, RMAN, Data Guard, Streams
- System Changes – Online Reconfiguration, Rolling Upgrades
- Data Changes – Online Redefinition
Oracle has a geographical RAC or a long distance RAC available. RAC gives you the immediate jump from one server to another
Data Guard – Standby database that takes the archive logs and applies them so that it is available.
There are a number of resources available online at http://otn.oracle.com/deploy/availability
Data-aware High Availability and Data Protection
RMAN is the only solution that knows at a block level what has been changed and what needs to be backed up automatically. With RMAN, data is always transitionally consistent and data corruption is detected automatically.
Best Practices for High Availability / Disaster Recovery
There has been an ongoing debate about which provides the best bang for the buck – many small servers vs. a single large server. Using many lower cost servers is the perfect setup for RAC. You get similar performance to large single server machines and still have redundancy.
On the other hand with virtualization, you can partition a big machine; enjoy cost benefits from less hardware, fewer racks, and less cooling costs.
High Speed Switch or Interconnect
Where do I begin?
Nodes communicate via a private network (Interconnect) – which has to be pretty fast.
70% of RAC users use Oracle Clusterware
60% of RAC users use Oracle Clusterware and ASM
What is fast enough for Interconnect?
Has to be fiber, ethernet is not fast enough for the interconnect. Gigabit switch works fine for communication but outside traffic can quickly fill up the channel. Infiniband is recommended. Have dedicated switches and set up a VPN. Don’t share the network with other traffic. Nodes will work less and CPU utilization will go down.
How do I monitor performance?
Tuning is done automatically in 10g. You can think of ADDM for RAC as your performance expert in the box. It sees the entire infrastructure across the RAC environment and identifies the most “Globally Significant” performance problems across the entire RAC cluster database. ADDM runs proactively every hour.
How do I manage the workload?
Load balancing using load balancing advisory and connection pools
ASM simplifies the operational stack, giving you a volume manager that abstracts the physical files. So you don’t have to worry about the distribution of files across multiple hard drives. ASM will automatically start distributing the data with a new hard drive.
If one of the nodes starts getting out of sync with the rest of the nodes and does not connect to the interconnect very well. Grid control plugs into both the agent and the enterprise manager. Everything is managed by the Enterprise Manager. It allows you to see in one interface, the configuration changes side-by-side on all grid components.
Oracle 11g has a new SYSADM Privilege that separates ASM administration from the RDBMS. It is suggested to create an ‘asmadmin’ OS group that allows for systems administration to interact with Oracle without granting DBA rights.
- OSADM – Manage disk group
- OSDBA – Manage files
- OSOPER – Startup / shutdown
What are your Recovery Time Objectives (RTO) – how quickly will you need to recover?
RTO may vary according to database and/or server.
Oracle 11g has Active Data Guard. The distinction between physical standby databases and logical standby databases diminishes in 11g. In Oracle 10g, you cannot add other stuff to the standby databases, at least not in the original schema. Oracle 11g addresses this.
Logical Standby – Redo apply standby, DMLs are shipped across the network.
Can use this as a reporting database
Physical Standby – Can have synchronous replication, apply all updates at the end of the day.
Maximum Availability – Synchronous availability will cause both systems to go down if one of them is unavailable.
Backup and Recovery Design
- Disk – Fast, random I/O access
- Virtual Disk Library (VTL)
- Tape Library – Well known interface, tapes are easily added, can be moved easily off-site.
The flash recovery area provides a common repository for datafiles, redo logs, and archived logs to facilitate instantaneous restore. It keeps datafiles online during needed repair and disk space is automatically managed. It is recommended to keep control file backups and archived logs in the flash recovery area. Be sure to estimate the size of your archive logs stored in the Flash Recovery Area to avoid any possible sizing issues.
Oracle Recovery Manager (RMAN)
RMAN provides fully automatic disk-based backup and recovery. Corrupt block detection during backup and restore with the ability to validate backups. Changed blocks are tracked using a very efficient algorithm that can be 20 times faster and backs up only changed blocks. RMAN operates online and in parallel for fast processing and can catalog all backup and recovery activities. Users can create and maintain backup policies. RMAN is also integrated with Oracle HA solutions such as Oracle Secure Backup, Data Guard, ASM, Enterprise Manager, and other third party backup products.
RMAN Enhancements with Oracle 11g
Enterprise Manager has a new Data Recovery Advisor for RMAN. It also boast better performance and compression with Intrafile parallel backup and restore of single data files. Better security control with the new Virtual Private Catalog. This allows the catalog administrator to grant visibility of a subset or registered databases in the catalog to specific RMAN users. On Oracle 11g, you can now easily create a standby database over the network, avoiding intermediate staging areas.
RMAN offers integration with Windows Volume Shadow Copy services (VSS). Database is automatically recovered upon snapshot restore via RMAN
Very Large Databases (VLDB)
Backup and recovery of very large databases (VLDB) requires more preparation than standard backup and recovery operations. However, a few tips can help make VLDBs a lot more manageable.
Exploit partitioning and read-only tablespaces. Older partitions can be moved to read-only tablespaces. Take an incremental backup when NOLOGGING operations finish to ensure recoverability. Also, divide full backup workload across multiple days.
On VLDBs, you can save time with tablespace level backups. Other time saving tips include backing up index tablespaces and scarcely used tablespaces less frequently than regular data tablespaces .
What’s the difference between DBA 1.0 and DBA 2.0?
DBA 1.0 has a script that they have carried around and used forever
DBA 2.0 uses Grid control