In this modern age of technology DBAs face ever-increasing demands from businesses: our databases must perform, be secure, be highly available and scalable and, equally importantly, be recoverable. The industry has an obsession with the five nines, and rightly so, but just how resilient is your recovery plan? Can you honestly say, hand on heart, that you have covered everything? I can't, but it isn't for want of trying.
Performance and Security
Let's say, then, that you are happy with the performance of your databases; you have normalised to third normal form in the vast majority of cases without compromising performance. All the right security measures are in place: Active Directory groups with access available only to production DBAs, and a controlled, staged release procedure through Development, UAT and Live. You monitor a wide variety of metrics and analyse them on a weekly basis for auditing and capacity planning. You have a solid index strategy to squeeze out every possible ounce of performance. Performance and security you have covered pretty well, but being a DBA you are always looking for ways to make things better.
Highly Available
You have a two-node active/passive Windows failover cluster, have tested all the possible failover scenarios and are happy that failovers and failbacks occur successfully. Again, you are happy with the high availability of your environment.
Scalability
You have plenty of room for growth, both in the size of and the load on the environment, and there is a contingency for additional resources (for the ITIL people, I mean storage, CPU and memory ;) if required. Performance and utilisation are monitored and reviewed weekly as part of your standards. You are comfortable that the environment scales.
Recoverable
You have configured database mirroring in high-safety mode with manual failover, to prevent false failovers, with the DR site in a different country. Failover has been tested and the databases are served to the applications successfully; failback also works a treat. You take full backups on a Sunday, differentials every night and transaction log backups every 15 minutes. The backups are then backed up to tape and sent to an offsite storage facility.
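That schedule could be scripted along these lines. This is a sketch only: the database name and paths are placeholders, the scheduling itself would live in SQL Agent jobs, and in practice each file would be uniquely named (e.g. timestamped) rather than overwritten. The CHECKSUM option is my addition, not part of the scenario; it makes page-level corruption surface at backup time.

```sql
-- Sunday: full backup
BACKUP DATABASE [Orders]
    TO DISK = N'X:\Backups\Orders_Full.bak'
    WITH INIT, CHECKSUM;

-- Every night: differential backup
BACKUP DATABASE [Orders]
    TO DISK = N'X:\Backups\Orders_Diff.bak'
    WITH DIFFERENTIAL, INIT, CHECKSUM;

-- Every 15 minutes: transaction log backup
BACKUP LOG [Orders]
    TO DISK = N'X:\Backups\Orders_Log.trn'
    WITH INIT, CHECKSUM;
```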
Good times: so far, so good, you have ticked all the required boxes up to this point. Now let's throw in some scenarios and see how this environment would work.
Performance problems reported
Your solid release process highlighted a potential risk, and the affected processes are rolled back to guarantee performance while the issues are being addressed.
Potential Security Breach
This was caught proactively: during your weekly review you noticed login failures and raised the issue with your security team to resolve.
Running low on disk space
Again this is handled proactively: based on current growth rates you predict that a volume will run out of space in six months, so a change request is raised and the volume is expanded.
A server in the cluster fails
No problem here: failover is automatic, you resolve the error with the problematic node and add it back into the cluster.
Entire cluster / primary site failure
Again no problem: you manually fail over from the principal to the mirror, the application and operations teams follow their DR plans and, hey presto, with little downtime your environment is up and serving customers again.
Data Loss
At 12:00, reports of errors in the application reach your team. Investigation finds that a disgruntled DBA who was working their notice period deleted 100,000 records from the order table at 11:05 on a Friday morning. A decision is made to recover from backup; the backups are kept on disk for one week, so you look at restoring the affected database. This is where the problem arises.
Your restore plan consists of restoring the full backup from Sunday, the differential from Thursday night and then all transaction logs from Friday, with a STOPAT clause targeting the last transaction before the delete. The full backup restores fine, but the differential backup from Thursday night fails as it is incomplete. You investigate further and have to resort to restoring the full backup from Sunday, the differential backup from Wednesday and all transaction logs from Thursday up until 23:45, which is the last transaction log before the incomplete differential.
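The intended recovery, had the Thursday differential been sound, would have looked something like this; the database name, file paths and the STOPAT timestamp are illustrative:

```sql
-- Every step is restored WITH NORECOVERY until the final log,
-- which stops just short of the 11:05 delete.
RESTORE DATABASE [Orders]
    FROM DISK = N'X:\Backups\Orders_Full_Sunday.bak'
    WITH NORECOVERY, REPLACE;

RESTORE DATABASE [Orders]
    FROM DISK = N'X:\Backups\Orders_Diff_Thursday.bak'
    WITH NORECOVERY;

-- ...restore each Friday log backup in sequence WITH NORECOVERY, then on
-- the log backup covering 11:05 stop just before the delete:
RESTORE LOG [Orders]
    FROM DISK = N'X:\Backups\Orders_Log_Friday_1115.trn'
    WITH STOPAT = N'2013-01-11 11:04:59', RECOVERY;
```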
Yes, that's right: you have lost just over 12 hours' worth of data! Things didn't have to be this way, though. The solution is to verify your backups; if you had verified the consistency of your backups, this could have been avoided.
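As a minimal first line of defence, a quick check along these lines after each backup (the path here is a placeholder) would have flagged the incomplete differential long before it was needed:

```sql
-- Confirms the backup set is complete and readable without restoring it;
-- WITH CHECKSUM also validates page checksums if the backup was taken with
-- the CHECKSUM option. Not a substitute for an actual test restore.
RESTORE VERIFYONLY
    FROM DISK = N'X:\Backups\Orders_Diff.bak'
    WITH CHECKSUM;
```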
Summary
I know from experience that trying to justify the cost of a server can be a painstaking task, but justifying one that does not serve clients is even more difficult. The spin I always use is:
“How much would it cost the company if we lost a day’s
worth of data?”
As a techie it is a no-brainer, but persuading someone to part with money for something they see no day-to-day benefit from is a different matter. From their standpoint it is not revenue generating, and the risk is worth taking.
In an ideal world all backups would be verified by restoring them, but this is simply not practical for everyone, as the budget may simply not be available. Take into consideration also that most environments I have worked in have had 40+ SQL Servers running different versions; this would potentially require more than one server to verify the backups.
Recommendations
What I would recommend is:
1 – Get buy-in from other team members and line managers; the more people you have that agree with you on this, the stronger the case will be when it is presented to the person with the cheque book!
2 – Prioritise your SQL Servers and verify the backups of the most critical databases as often as possible. If you lose data in some databases you can get by; internal IT databases (SQLSentry, Solarwinds), for example.
Believe me, it may one day save you and quite possibly your job!