Wednesday, 14 March 2012

How Resilient Is Your Recovery Plan?


In this modern age of technology DBA’s face an ever increasing demand from businesses; our databases must perform, be secure, highly available and scalable and equally as important recoverable.  There is an obsession with the 5 9’s within the industry, and rightly so, but just how resilient is your recovery plan?  Can you honestly say hand on heart you have covered everything?  I can’t, but it isn’t through want of trying.

Performance and Security
Let’s say then that you are happy with the performance of your databases, you have normalised to 3rd normal form in the vast majority of cases without compromising performance.  All the right security measures are in place, Active Directory Groups with access available only for production DBA’s, you also have a controlled staged release procedure through Development, UAT and Live.   You monitor a wide variety of metrics and analyse them on a weekly basis for auditing and capacity planning.   You have a solid index strategy to squeeze out every possible ounce of performance.  Performance and Security you have covered pretty well but being a DBA you are always looking for ways to make things better.

Highly Available
You have a 2 node active passive Windows Failover Cluster, have tested all the possible failover scenarios and are happy that failovers and failbacks occur successfully.  Again you are happy with the high availability of your environment.

Scalability
You have plenty of room for “growth” both in size and load on the environment, there is also a contingency for additional resources (for ITIL people I mean storage, CPU and memory ;) if required.  Performance and utilisation is monitored and reviewed weekly as part of your standards.  You are comfortable that the environment scales.

Recoverable
You have configured Database Mirroring in high safety with manual failover to prevent false failovers as the DR site in a different country.  Failover has been tested and the databases are served to the applications successfully failback also works a treat.  You take full backups on a Sunday differentials every night and transaction log backups every 15 minutes.  The backups are then backed up to tape and sent to an offsite storage facility.

Good Times (GT’s), so far so good you have ticked all the required boxes up to this point.  Now let’s throw in some scenarios and see how this environment would work.

Performance problems reported
                Your solid release process highlighted a potential risk and the affected processes are rolled back to guarantee performance while the issues are being addressed.

Potential Security Breach
                This was found proactively as during your weekly review you noticed login failures and raised the issue with your security team to resolve.

Running low on disk space
Again this is proactive, based on current growth rates you predict that a volume will run out of space in 6 months, a change request is made and the volume expanded.

A Server in the cluster fails
                No problem here, failover is automatic you resolve the error with the problematic node and add it back into the cluster.

Entire cluster / primary site failure
                Again no problem you manually failover from the principal to the failover partner, the application and operation teams follow their DR plans and hey presto with little downtime your environment is up and serving customers again.

Data Loss
                At 12:00 reports of errors in the application reach your team, Investigation finds that a disgruntled DBA who was working their notice period deleted 100,000 records from the order table at 11:05 on a Friday morning.  A decision is made to recover from backup, the backups are kept on disk for one week so you look at restoring the affected database.  This is where the problem arises.

Your backup script consists of restoring the full backup from Sunday The differential from Thursday night and then all transaction logs from Friday with a STOP AT command at the last transaction before the delete.  The full backup restores fine but the differential backup from Thursday night fails as it is incomplete.  You investigate further and have to result to restoring the full backup from Sunday the differential backup from Wednesday and all transaction logs from Thursday up until 23:45 which is the last transaction log before the incomplete differential.

Yes that’s right you have lost just over 12 hours’ worth of data!  Things didn’t have to be this way though.  The solution is to verify your backups; if you would have verified the consistency of your backups this could have been avoided.

Summary
I know from experience trying to justify the cost of a server can be a painstaking task but justifying one that does not serve clients is even more difficult.  The spin I always use is;

“How much would it cost the company if we lost a day’s worth of data?” 

As a techie it is a no brainer, but persuading someone to part with their money that they see no day to day benefit from having is a different matter.  From their standpoint it is not revenue generating and the risk is worth taking.

In an ideal world all backups would be verified by restoring them but this is simply not practical for everyone as the budget may simply not available.  Take into consideration also that most environments I have worked in have had 40+ SQL Servers running different versions; this would potentially require more than one server to verify the backups.

Recommendations
What I would recommend is;

1 – Get buy in from other team members and line managers, the more people you have that agree with you on this the stronger the case will be when it is presented to the person with the cheque book!

2 - Prioritise your SQL Servers and verify the backups of the most critical databases as often as possible.  If you lose data in some databases you can get by, for example IT Internal databases (SQLSentry, Solarwinds).

Believe me it may one day save you and quite possibly your job!

No comments:

Post a Comment