Ensuring Database Availability
With visions of Hurricane Katrina and New Orleans’ failing levees in mind, how do database administrators (DBAs) relax away from work? Obviously, a full recovery plan, including a business continuity plan (BCP), possible hot or cold sites and other preparation steps are critical. Many commercial sites offer general guidance such as http://www.drj.com/new2dr/samples.htm, which lists many sample plans and templates and has other references. Usually, this would be part of your company’s disaster-recover plan, which would include high-level details such as employees’ contact information, replacement servers and bare metal restore plans. Bear in mind that no high-availability solution is 100 percent, so even a manual business model should be in your BCP.
What causes downtime? In a TechNet article on SQL availability, http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/sql2kres.mspx?mfr=true, Microsoft states that up to 80 percent of failure to contact the database can be network outages of one sort or another and not intrinsic database failure. Much of the 80 percent can be split between outright network failures and application connectivity, which now requires the DBA to understand an alphabet soup of connectivity: open database connectivity (ODBC), Java database connectivity (JDBC), ActiveX data object (ADO), object-linking and -embedding database (OLE-DB) and so on.
Assuming network operations are handled by a different team, you still need to understand how connectivity works in order to hand off responsibility. Aside from the traditional backups, DBAs also are concerned with replication, which can take two forms. Asynchronous replication is the method used by SQL server replication and Microsoft Message Queuing, and synchronous replication is that used by disk duplexing or, less commonly, software two-phase commit.
Are your databases backed up often enough? Typical, operations back up once or twice a day, sending these backups off-site but in some businesses, that’s not good enough. You might need to replicate data to another server or site to enable full recovery. Alternatively, consider log shipping, wherein transaction logs are sent to a standby database server, so that you can make that server the new primary database server. MS SQL log shipping set up is here: http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/logship1.mspx. As in simpler times, make sure appropriate backup agents are installed for whichever database used, that the disk storage system is redundant (RAID greater than zero or SAN) and depending on line of business, servers are clustered. Rather than this high-level, business- or system-level approach, the rest of this article will examine what steps DBAs can take at the database level.
One aspect of availability that gets short shrift in most discussions is database design. Does the design allow fast access for multiple users but prevent collisions among multiple writes? At one former employer, multiple network users would attempt to use a large FoxPro database, whose indices would become corrupt regularly if more than one person accessed the database. The fix — to reindex nightly — became worse than the solution, which was to migrate to a more robust SQL database. This was because that reindexing process required its own server!
In another case, in a consultant role, a company had multiple sites on SharePoint Portal server, which were created directly at the same highest-tree level. Because of only moderate use, no performance problems were noted, but restore would have been a nightmare. Again, only the composite SharePoint portal was backed up. Appropriate design would have allowed individual SharePoint sites (and subsites) to be backed up — and restored — as the SharePoint Portal Server Data Backup and Restore tool only backs up the whole SharePoint site. Even that wasn’t implemented. Rather, the company was using standard SQL external backup.
After site redesign, following http://www.microsoft.com/technet/technetmag/issues/2005/11/BePrepared/default.aspx, we had a consultant write a script using “Smigrate” and “StsAdmin” to back up each site, allowing granular-level item recovery. Much of that method can be found at http://support.microsoft.com/kb/889236. Don’t forget to run the SharePoint Configuration Analyzer to verify your server is configured the same way as you and your documentation think.
Pay attention to the obvious. For example, while examining a SQL server at work, we found all databases on that server were set to back up but as a collection, not individually. It is much better to set each database to be backed up, so in the event of corruption or other calamity, individual databases can be restored rather than the entire server or partition. In MS SQL, the wizard easily accomplishes this task. For MySQL, use “Mysqldump” or “Mysqlhotcopy” script with appropriate arguments. Note that “Mysqlhotcopy” works for ISAM and MyISAM tables only, not InnoDB tables, which stores data separately from the table structure. MyISAM and ISAM tables are stored as files, so they also can be backed up individually, using “lock tables” and “flush Tables.” The specific steps you will take to ensure availability depend on the database you are using, as well as the operating system.
Maintenance of tables, indices and other database components is also overlooked. Check for corruption regularly. For MyISAM tables, use “Myisamchk,” which also repairs and optimizes databases. MS SQL uses Database Consistency Checker (DBCC) with DBCC Checkdb. See http://www.sql-server-performance.com/dbcc_commands.asp for documented and undocumented uses of DBCC.
Before backup, you can check and repair (minor) database integrity directly with the CkDBRepair switch but make sure you are in single-user mode with no connections. Other things to check for MS SQL include fragmentation, also done with DBCC arguments, explained at http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx. Various housekeeping functions, including log shipping and integrity checking, in addition to the backup setup mentioned earlier, can be done either by scripting or — more easily — by the Database Maintenance Plan Wizard (see http://www.sqlservercentral.com/columnists/awarren/sqlmaintenanceplans.asp).
Restating the obvious, the server hosting the database should have not only a RAID array, a mirror or server as a front end for a storage area network (SAN), but it should also have redundant controllers, power supplies, fans and multiple CPUs, as most servers will fail over to a working CPU. Microsoft recommends multiple small arrays rather than one big one so that log files, database files and the operating system can be written to different arrays, facilitating recovery, as well as providing increased design flexibility.
To protect against application or total server failure, Microsoft’s server operating system can use Microsoft Cluster Server (MSCS) or Network Load Balancing as SQL 2000 and above with Windows 2000 Advanced or Datacenter or above supported. MSCS is fairly flexible, as two servers (which must be on the same subnet) can be active/active or active/passive. The simplest setup would be shared-disk backup. Using duplicate fiber channels and disk d