Note: The contents of this post are my own personal views on backing up and restoring TFS. This is not a how-to guide, nor a recommendation in any way as to which methods of backing up and restoring TFS should be used. I have published this in the hope that it may save you time or help you with your own research around the subject. I strongly suggest you test your backup, restore and disaster recovery strategies in your own lab, based on your TFS environment, before using them in production.
A few days ago, I was in the unfortunate position of having corrupted TFS databases on my hands. The actual problem was outside of my control: a glitch on the virtualisation host of the database VM, combined with a VM snapshot creation and coinciding with a server reconfiguration, led to an “unhandled error” that caused the database server to just power off.
So, first step – turn the server back on. The server successfully booted to Windows – great! However, on hitting the TFS website, it appeared all was not well. Still returning errors. Ok – so next stop, the databases.
I opened the database server in SSMS and was met with a list of “suspect” databases. Great. I have backups, so no problem, I should be up and running quickly! Like a good TFS admin, I am using the backup tool from the TFS admin console, taking a full backup weekly, a differential every night, and transaction log backups every 15 minutes. So I shouldn’t lose more than 15 minutes of data, right? This is when it became apparent that things were more complicated than that.
This was issue number one. Two months ago I went on a TFS 2015 (RTM) Administration and Configuration course, and one of the topics covered extensively was backup and restore. During the module, we took a full backup, then went to town deleting databases. When we wanted to restore, we simply opened the TFS Admin Console, waited a few seconds for it to time out hunting for the database and finally open, then used the restore tool in the Admin Console to restore all the databases. This was the process I was fully expecting to happen at this point. No such luck.
I was greeted with this error when I tried to open the TFS Admin Console on the App Tier:
“There was an exception while launching the Team Foundation Administration Console: TF246017: Team Foundation Server could not connect to the database. Verify that the instance is specified correctly, that the server that is hosting the database is operational, and that network problems are not blocking communication with the server.”
After which the console did not open. Ah. Wasn’t expecting that. So, under pressure to resume normal service, I was left with an unexpected catch-22. I needed the console to run the restore, but I needed the database I wanted to restore in order to run the console to run the restore…
So, I decided to restore the latest diff backup of the Tfs_Configuration database manually via SSMS in order to get the console running. This went without a hitch and allowed me to open the console, and begin the restore.
Second catch-22. Now that I had the console, I picked my latest backup and selected all databases so that they would stay in sync, only to find that I needed to delete the configuration database before the restore would run.
Getting somewhere! Using this method seems to work OK. In testing after the event I ran it through a couple of times, and on one occasion (which I can’t seem to reproduce) the wizard came up with the TF246017 error at the “Saving Settings” stage. The restore nevertheless appeared to work fine, and closing and reopening the admin console cleared the error. Every time I ran it, however, the console would then report that scheduled backups were no longer enabled. Again, closing and reopening the console populated the correct schedule information.
So, what will I do in the future? Couple of options here:
The tfsRestore tool exe can be found under C:\Program Files\Microsoft Team Foundation Server 14.0\Tools on any server that has TFS installed (it doesn’t even have to be a configured TFS instance). To use it, simply double-click the exe.
It will then ask you for your target SQL instance, followed by the location of your backups.
The limitation of this tool is it does not appear to work against diff or transaction log backups, only full backups. On the pro side, there is an option to overwrite existing databases, rather than having to drop them manually first.
Installing TFS on another server which can see the database server is an alternative. You don’t need to configure TFS on the new server, just install it, then exit the configuration wizard. You can then load the admin console, and use the “Restore Databases” tool via the Scheduled Backups tab.
This seems to be a fairly hassle-free way of being able to restore up to the minute log backups (provided your log backup chain is uninterrupted! See lesson 2). The downside of using the console restore is that you have to drop all databases that will be restored (if you are restoring to the original database server).
Provided you can connect to the database server via SSMS, you have the option of restoring each database via SQL directly. Right-click the database > Tasks > Restore. You can then pick a point in time, or the latest backup.
SQL knows where and when your backups were taken because, whenever the TFS console runs a backup, it is logged against the database in SQL. Running the restore through SQL also gives you more flexibility: you can use the GUI or SQL to build up your restore command with whichever parameters you want to include.
This is the method I ended up using, but it feels rather messy. Having to restore a database to open the console, only to have to drop it again in order to run the restore, doesn’t quite sit right in my mind.
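As a rough illustration of building the restore command yourself, here is a minimal Python sketch that turns a list of backup files into an ordered set of T-SQL RESTORE statements: the latest full, then the latest differential taken after it, then every later log backup, with recovery only at the very end. The `Backup` class, the function names and any file paths are hypothetical and not part of TFS or SQL Server. Selecting backups by finish time is a simplification I am assuming for the example; SQL Server itself decides what is valid based on LSNs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Backup:
    kind: str           # "full", "diff" or "log" (labels assumed for this sketch)
    finished: datetime  # when the backup completed
    path: str           # backup file as seen from the SQL Server machine


def _quote_path(path: str) -> str:
    # Double up single quotes so the path is safe inside a T-SQL string literal.
    return path.replace("'", "''")


def build_restore_script(database: str, backups: List[Backup]) -> List[str]:
    """Return RESTORE statements: latest full, latest later diff, later logs."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    full = max(fulls, key=lambda b: b.finished)

    # Only a differential taken after the chosen full can be applied on top of it.
    diffs = [b for b in backups if b.kind == "diff" and b.finished > full.finished]
    diff = max(diffs, key=lambda b: b.finished) if diffs else None

    # Log backups are applied in order, starting after the last data backup.
    base_time = diff.finished if diff else full.finished
    logs = sorted(
        (b for b in backups if b.kind == "log" and b.finished > base_time),
        key=lambda b: b.finished,
    )

    name = database.replace("]", "]]")  # escape for [bracketed] identifiers
    statements = [
        f"RESTORE DATABASE [{name}] FROM DISK = N'{_quote_path(full.path)}' "
        "WITH NORECOVERY, REPLACE;"
    ]
    if diff:
        statements.append(
            f"RESTORE DATABASE [{name}] FROM DISK = N'{_quote_path(diff.path)}' "
            "WITH NORECOVERY;"
        )
    for log in logs:
        statements.append(
            f"RESTORE LOG [{name}] FROM DISK = N'{_quote_path(log.path)}' "
            "WITH NORECOVERY;"
        )
    # Bring the database online only once everything has been applied.
    statements.append(f"RESTORE DATABASE [{name}] WITH RECOVERY;")
    return statements
```

Every step except the last uses NORECOVERY so that further backups can still be applied, and REPLACE on the first step overwrites the existing (suspect) database. I would treat the output as a starting point to review in SSMS rather than something to run blind.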
My second complication became apparent when the restore kept failing. Trying to use the restore option through the TFS console meant that only the transaction log backups that TFS knew about were being used for the restore (based on the BackupSets.xml file).
It turns out that I had a SQL Agent Job set to back up transaction logs on all databases on the server every couple of hours. Each time it ran, it took a slice of the log chain that the TFS backup tool knew nothing about, leaving gaps in the sequence of transaction log backups that TFS was taking. Unfortunately, I was dealing with a very large TFS TPC database and my restores were taking a long time, so I had to make the decision to revert to the last known good backup (a differential from 2am). What I did try before reverting was to edit BackupSets.xml to include the log backup files taken by my other tool. I had no joy with this (probably something to do with the backupset IDs, as I have no idea how they are generated).
Easy one, this: ensure that only one backup tool is running on the TFS databases! If you do happen to find yourself in this situation, your best bet is to restore your latest full backup, roll forward with the latest differential, and then reapply the full sequence of log backups manually, including those taken by the other tool.
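One way to spot this kind of gap before you need the backups is to compare the log sequence numbers (LSNs) of consecutive log backups. SQL Server records `first_lsn` and `last_lsn` for every backup in `msdb.dbo.backupset`, and in an unbroken chain each log backup starts where the previous one ended. The sketch below assumes you have already pulled those values for one tool's log backups into Python; the `LogBackup` class and `find_chain_gaps` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LogBackup:
    path: str       # backup file name, used only for reporting
    first_lsn: int  # first_lsn value from msdb.dbo.backupset
    last_lsn: int   # last_lsn value from msdb.dbo.backupset


def find_chain_gaps(logs: List[LogBackup]) -> List[Tuple[str, str]]:
    """Return (previous, next) file pairs where the log chain is broken."""
    ordered = sorted(logs, key=lambda b: b.first_lsn)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        # A later backup that starts beyond the end of the previous one means
        # some other log backup was taken in between and is missing here.
        if cur.first_lsn > prev.last_lsn:
            gaps.append((prev.path, cur.path))
    return gaps
```

If this returns anything for the set of files listed by a single backup tool, another job is almost certainly taking log backups on the same databases, and a restore that only uses that tool's files will fail at the first gap.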
If nothing else, this experience has taught me that it is not enough to know the “best practice” way of backing up and restoring TFS. You need to have practised it, ideally many times, with any scenario you can plausibly think of. For example, I now know that I’d use a different restore approach if I still had access to the existing corrupted databases than if I had lost the database server altogether. There are also things that you may never think to check, such as other backup tools interfering with log backups. There is little you can do about your unknown unknowns, but planning ahead for your known scenarios means you have done everything you can to prepare, and the unknowns that do occur will not have as much impact.