ART: Automated Restore Testing
Making backups is half the job.
Experienced storage administrators know that nothing matters but restores.
All the other jobs - correcting failed backups, performance tuning, reporting - only matter because they support restorability of failed systems. As people often say,
"We're not in the backup business... we're in the restore business."
But testing restorability is hard. When we ask people why they don't do it, they say:
- "I have 2,000 clients. Where do I start?"
- "We have change control rules here. Restore a file to a running production system? Forget it!"
- "I'd have to write a whole reporting system. Another one."
- "We're supposed to do more with less. My day's full already."
Close the circle with ART.
ART makes testing easy.
- It comes packaged as an appliance. Getting started is easy.
- It discovers all your client nodes. There's almost no setup.
- It never changes the actual client files. It's a perfectly safe way to test live production servers.
- It puts minimal load on your TSM server.
- Its web-based dashboard shows percent success and failure at a glance.
- You can drill down to the corrupt volume or other failure cause quickly.
- And it's affordable: $3,500 for a 100-node site.
ART can test as many TSM servers as you want. ART auto-discovers the client nodes on each TSM server, and tests them one by one. During each test cycle, ART:
- contacts the TSM server on behalf of that client.
- requests a list of files backed up from that client.
- randomly selects a few files (you set how many, and the maximum size).
- tries to restore those files. The files come to ART, not to the real client, which remains untouched.
- logs success or failure of this test to its local database.
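The test cycle above can be sketched in Python. This is only an illustration of the steps, not ART's actual implementation: `list_backed_up_files` and `restore_to_scratch` are hypothetical callables standing in for the TSM query and restore, and the `results` table layout is an assumption.

```python
import random
import sqlite3
import time


def pick_sample(files, max_files, max_size, rng=random):
    """Randomly choose up to max_files entries no larger than max_size bytes.

    files is a list of (path, size_in_bytes) tuples.
    """
    eligible = [f for f in files if f[1] <= max_size]
    return rng.sample(eligible, min(max_files, len(eligible)))


def run_test_cycle(node, list_backed_up_files, restore_to_scratch, db,
                   max_files=3, max_size=10 * 1024 * 1024):
    """Run one restore test for a node and log each result to db."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(node TEXT, path TEXT, ok INTEGER, seconds REAL, tested_at REAL)"
    )
    results = []
    sample = pick_sample(list_backed_up_files(node), max_files, max_size)
    for path, _size in sample:
        start = time.monotonic()
        try:
            # Files are restored to a scratch area, never to the real client.
            ok = bool(restore_to_scratch(node, path))
        except Exception:
            ok = False  # any error during the restore counts as a failed test
        elapsed = time.monotonic() - start
        db.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)",
                   (node, path, int(ok), elapsed, time.time()))
        results.append((path, ok))
    db.commit()
    return results
```

Passing the query and restore steps in as callables keeps the sketch independent of any particular TSM client interface, so it can be exercised with stub functions.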
ART's dashboard uses its local database to show you what's happening right now, and what happened in the past. Drill down to see detailed logs and find the root cause!
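As a rough illustration of how a percent-success figure could be derived from a local database of test results, here is a sketch using SQLite. The `results` table and its `node` and `ok` columns are assumptions made for the example; ART's real schema is not described here.

```python
import sqlite3


def success_summary(db):
    """Return {node: percent_success} from a results(node, ok) table."""
    rows = db.execute(
        "SELECT node, 100.0 * SUM(ok) / COUNT(*) FROM results GROUP BY node"
    ).fetchall()
    return {node: round(pct, 1) for node, pct in rows}


# Example with made-up data: three tests for web01, one for db01.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (node TEXT, ok INTEGER)")
db.executemany("INSERT INTO results VALUES (?, ?)",
               [("web01", 1), ("web01", 1), ("web01", 0), ("db01", 1)])
print(success_summary(db))
```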
ART has tested dozens of customer sites, and uncovered issues like these:
- Tapes not in the library: Tape volumes were removed for library maintenance. Most were checked back in, but some were not - until ART needed one for a restore. The admins then corrected the problem for all the missing tapes.
- Nodes not on a schedule: ART flags nodes that have not been backed up in months. One administrator realized he had installed TSM, done a manual backup, but had never put the node on a schedule!
- Wasted storage: A server was taken out of service, but nobody remembered to delete its storage from TSM after 90 days had passed. ART flagged it, prompting a review of all such nodes. The customer reclaimed 4 terabytes of wasted storage!
- Broken Include/Exclude lists: If exclude or domain statements in the client's configuration accidentally skip an entire filespace, ART will show you that.
- Restores too slow: If a restore takes more than 10 minutes, your users will start complaining. ART flags these nodes as "Failed" if they take longer than you allow.
- Not enough tape drives: When ART finds that there are no free drives, it does not try to restore the file, but marks the test Failed. If this had been an actual user's restore, it would have pre-empted your Migration and Backup Stgpool jobs. This can help make the case for buying more tape drives.
- ... and more: The examples above are from our current customer base. But every site is different. ART continues to sweep the cobwebs from TSM installations!
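Some of the checks above come down to simple threshold rules. The sketch below is illustrative only: the function name, the flag strings, the "OK" status, and the default limits (10 minutes for a restore, 60 days since the last backup) are assumptions, not ART's real settings.

```python
from datetime import datetime, timedelta


def classify(restore_ok, restore_seconds, last_backup, now,
             max_restore_seconds=600, max_backup_age=timedelta(days=60)):
    """Return (status, flags) for one node's restore test."""
    flags = []
    if not restore_ok:
        flags.append("restore failed")
    elif restore_seconds > max_restore_seconds:
        # A restore that works but takes too long still counts as Failed.
        flags.append("restore too slow")
    status = "Failed" if flags else "OK"
    if now - last_backup > max_backup_age:
        # A stale backup is reported as a flag without changing the status.
        flags.append("no recent backup")
    return status, flags


now = datetime(2024, 1, 1)
print(classify(True, 700, now - timedelta(days=1), now))
```

Keeping the limits as parameters mirrors the idea that you decide how long a restore is allowed to take.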
ART makes your TSM site more reliable!