In most traditional migrations to a new environment, you will probably come up with the usual ideas to get the job done, but what about migrating a multi-terabyte database with minimal downtime?
First, we considered setting up HADR, but that would imply a second downtime, apart from the actual switch of environment, just to get HADR up and running. Another method could be to set up a replication tool to copy the data for this particular database, but that method wouldn’t be reusable for the many other databases we have to migrate as well. This left us with one feasible option: continuous rolling forward. You might come up with different ways of working as well, but read through the rest of the story before settling on your own approach.
The idea behind it is that the production database keeps on being used up until the last minute. To do this, we’ll roughly follow these steps:
A source server might have multiple NFS mounts when delivering to multiple targets and each target server exports one single local drive. On the source side the NFS drives get mounted as
A scheduled task is created on the source server per NFS mount; it checks which logs are ready to be copied onto the NFS. Similarly, on the target server a scheduled job is created to pick up those logs and use them for the rollforward.
When relying on the quality of the logs, we need to be assured that no-one is capable of performing actions that are not logged. Non-recoverable loads, initially not-logged table creations and whatever else comes to mind … need to be forbidden. Additionally, each index (re)build done on the original database also needs to be replayed automatically on the target database while processing the logs. This way your indexes are ready to go when the new database gets started up. You do not want an application hitting a large table whose index has to be rebuilt the first time it is addressed.
Two registry settings are set to deal with the situations described above; luckily both settings can be changed online:
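The post does not name the settings at this point. Based on the description, likely candidates are the `blocknonlogged` and `logindexbuild` database configuration parameters (strictly speaking db cfg parameters rather than registry variables), both updatable online; `MYDB` is a placeholder name:

```shell
# Reject operations that would not be fully logged
# (non-recoverable loads, NOT LOGGED INITIALLY tables, ...):
db2 "update db cfg for MYDB using BLOCKNONLOGGED YES"

# Log index (re)builds so they can be replayed on the rollforward target:
db2 "update db cfg for MYDB using LOGINDEXBUILD ON"
```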
On each NFS drive we create a configuration file describing the link between the source and target. This information is used to know where on the NFS log files need to be stored (by the source) or are searched for (by the target). One line describes the link between one source-target pair.
On the NFS mount we create a directory structure which assures us we do not have a mix up between databases and environments:
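The actual structure isn’t shown here; purely as an illustration (all names invented), it could look like this, with one branch per environment, instance and database:

```
/nfs_mount/
└── PROD/                  <- environment
    └── db2inst1/          <- source instance
        └── MYDB/          <- database
            ├── S0006542.LOG
            └── S0006543.LOG
```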
The source figures out whether it has to copy log files onto a particular mount by searching on
Likewise, the target searches through the same configuration file by using the pattern
to know where to find log files that are available for a database existing in one of the instances located on that particular target.
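The file format and search patterns are not shown in the post; purely as an illustration (hostnames, instances and databases all invented), each line could pair a source with a target, so that each side can grep for its own hostname in its own fields:

```shell
#!/bin/sh
# hypothetical contents of the configuration file on the NFS mount:
# source-host:source-instance:database:target-host:target-instance:database
CFG=/tmp/nfs_links.cfg
cat > "$CFG" <<'EOF'
srchost1:db2instA:MYDB:tgthost1:db2instB:MYDB
srchost1:db2instA:OTHERDB:tgthost2:db2instC:OTHERDB
EOF

# the source greps for its own hostname in the leading (source) field ...
grep "^srchost1:" "$CFG"
# ... while the target matches on its hostname in the target field
grep ":tgthost1:" "$CFG"
```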
Make an online backup
The database in scope is due for online backups on a regular basis, so for us there is no need for any additional work to create a backup image. In our case, the backup is made to a large set of disks as well as to tape.
To facilitate the possible path changes and the conversion from DMS to automatic storage, we generate the restore script not by using the “generate restore script” facility of DB2, but with a home-grown script. Too many changes in the paths had to be made, so building up such a script from scratch was the better choice. The script contained the creation of directories
! mkdir -p /
, the setting of the storage groups and tablespaces, and the restore command itself.
! Leave out the rollforward command, it’ll have to be done at a later time in this process.
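As a sketch of what such a generated script might contain (database name, backup timestamp, storage group and paths are all made up, not the authors’ real script):

```sql
-- create the container paths up front
! mkdir -p /db2data/MYDB/sg_data ;

-- redirected restore so paths and storage groups can be adjusted
RESTORE DATABASE MYDB FROM /backup/MYDB TAKEN AT 20150101120000 REDIRECT ;
SET STOGROUP PATHS FOR SG_DATA ON '/db2data/MYDB/sg_data' ;
RESTORE DATABASE MYDB CONTINUE ;

-- deliberately NO rollforward here: the database must stay
-- in rollforward-pending state
```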
The restore script is kept as general as possible so that we can reuse it in the future, e.g. when we need to restore a Production database to a Test environment, but it also implies that some actions have to be taken before it is ready for real use.
Restore it into the new server while moving to automatic storage
This phase in the procedure is the one I find the most difficult: make the generated restore script ready for use and restore the database.
A ready to use restore script
Setting up a new environment also gives you the advantage of fixing what went wrong in the past. You should not come up with too many things to change, but some you can fix right on the spot. There are a ton of conceivable reasons why you shouldn’t … but we did.
Adapting the restore script, we corrected the historically wrong paths of data containers, added storage groups and moved the database managed (DMS) tablespaces to automatic storage. Keep in mind that user temporary tablespaces most probably need to be dropped and recreated to have them use automatic storage.
To limit the number of retries of running the script, we added the creation of all container paths to the restore script, plus `! date ;` statements to measure the elapsed time of each important part we wanted to gather timings for.
Restore the database
Using the script, the database can be restored. Stay far away from the ROLLFORWARD command at this point, as well as from any statements that might follow it.
Start copying all logs since the backup, rollforward and wait for a new load
How to copy the new log files is probably different for every shop. You could have one server pushing, a server pulling, performing some kind of ftp and so on.
We opt to use a shared NFS drive between the two servers. The source database figures out which log files are new and copies those to a specific directory on the NFS drive; the target server picks up these logs, uses them to roll forward and removes them from the directory.
To avoid using files for rolling forward while they are still being copied from the source, we’ve scheduled a script on both servers with a time difference of 20 minutes.
We might increase the frequency later on, though, so we also use a ‘lock file’:
- When the source is searching for log files to copy onto the NFS a lock file prevents the target from rolling forward
- When the target is rolling forward, the source skips a round before trying again
It isn’t unthinkable that a script stops in the middle of processing, leaving its lock file behind. The lock file, however, holds the process id of the running script. Should something drastic have happened in a previous run, the next run will find the lock file and check whether a process with that process id is still running. The procedure probably isn’t airtight, but it does the job it needs to do.
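A minimal sketch of that lock-file handling, assuming a path shared via the NFS mount (all names illustrative, not the authors’ actual script):

```shell
#!/bin/sh
# Take a lock before copying / rolling forward; clean up stale locks.
LOCKFILE=/tmp/logship.lock

run_allowed=yes
if [ -f "$LOCKFILE" ]; then
    oldpid=$(cat "$LOCKFILE")
    if kill -0 "$oldpid" 2>/dev/null; then
        # the other script (or a previous run) is still busy: skip this round
        echo "pid $oldpid still active - skipping this round"
        run_allowed=no
    else
        # stale lock left behind by a crashed run: remove it
        rm -f "$LOCKFILE"
    fi
fi

if [ "$run_allowed" = yes ]; then
    echo $$ > "$LOCKFILE"
    # ... copy log files / roll forward here ...
    rm -f "$LOCKFILE"
fi
```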
As soon as the script on the target side has rolled forward through the available log files, the log files get deleted. This saves us disk space. On the other hand, since we are copying from the archived logs onto the NFS mount, how do we keep track of which log files have already been copied and which ones are still in the running to be copied?
To keep track which log was already shipped, we saved the value of the last file in a “progression” file, e.g.
All log files found that are newer than this value are candidates for shipping. The name of the last log shipped then replaces the previous value (in the example: S0006542.LOG) in the “progression” file.
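The bookkeeping can be sketched like this; directories and file names are invented, and we rely on the fact that DB2 log file names sort lexically:

```shell
#!/bin/sh
# Simulate an archive directory and a "progression" file.
LOGDIR=/tmp/archlogs
PROGFILE=/tmp/progression
mkdir -p "$LOGDIR"
touch "$LOGDIR/S0006541.LOG" "$LOGDIR/S0006542.LOG" "$LOGDIR/S0006543.LOG"
echo "S0006542.LOG" > "$PROGFILE"        # last log shipped so far

last=$(cat "$PROGFILE")
# "newer than the last shipped log" is a simple string comparison
candidates=$(ls "$LOGDIR" | awk -v last="$last" '$0 > last')
echo "to ship: $candidates"

# after shipping, remember the newest log for the next run
newlast=$(echo "$candidates" | tail -n 1)
[ -n "$newlast" ] && echo "$newlast" > "$PROGFILE"
```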
Decide when a log is due to get shipped
The source databases archive their log files to tape and to disk. We consider a log file ready for shipping as soon as it is archived to disk.
First find out which LOGARCHMETH is dumping to disk:
Secondly find the last LOG which was archived:
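The exact commands are not shown in the post; one way to answer both questions (with `MYDB` as a placeholder database name, run on the source server) could be:

```shell
# which archive method writes to disk?
db2 get db cfg for MYDB | grep -i logarchmeth
#  e.g.  First log archive method (LOGARCHMETH1) = DISK:/db2/archlogs/

# db2pd -logs shows the log chain, including which logs are archived
db2pd -db MYDB -logs
```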
All logs between the last run (see the previous point) and the log just found are the ones that need to be copied to the NFS mount.
When is a log due to get used for rolling forward?
At first the time gap between copying files from the source and fetching files on the target is large enough, but we want to make that window smaller. How do we know when a log file is completely copied and ready for consumption?
This topic is actually addressed by “a. Locking”: the rolling-forward script cannot kick in before the script copying the logs is done. Once the path is clear, all log files available on the NFS drive are cleared for usage.
Applying the logs … wait a minute!
Quite some copying has been done with the log files: DB2 archived them and then the script picked them up to move them to the NFS mount. Something might have gone wrong along the way! We’d better check the condition of each log file before using it!
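DB2 ships a utility for exactly this, `db2cklog`, which validates whether an archived log file is usable for rollforward. A sketch (log numbers and path are illustrative):

```shell
# verify log files 6543 through 6545 on the NFS mount before the
# rollforward script is allowed to consume them
db2cklog CHECK 6543 TO 6545 ARCHLOGPATH /nfs_mount/PROD/db2inst1/MYDB
```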
Applying the logs … rolling forward
It is key to perform the rollforward like this:
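The command itself is not quoted above; a typical form (database name and overflow path are placeholders) would be:

```shell
db2 "rollforward database MYDB to end of logs overflow log path ('/nfs_mount/PROD/db2inst1/MYDB')"
```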
Leave out the ‘and stop’ or ‘and complete’!
Both scripts are fitted with tons of checks, and whenever something goes wrong a mail is sent out to the migration team. We do this to make sure the database in continuous-rollforward mode stays healthy and in great shape.
Lifting a tip of the veil: restoring our largest database will take us, assuming the absence of glitches, between 27 and 30 hours. We’ll start this process on the weekend before the actual migration of the database. During one week we’ll feed the new database with logs. If all goes well, we’ll just need to execute the next bullet points.
Should our new database get damaged during the week before the actual migration, we can redo the restore, hoping we’ll be ready in time and no new hiccups will pop up.
Stop the database on the old environment
At some point in time it is necessary to stop the old database. When doing so, the last set of log files is dumped. Once the database is up ‘n running in the new environment, the aftercare can be started.
Do whatever is necessary to complete the setup on the new environment
This is a good time to perform the last steps we’ve foreseen during the move:
- Upgrade DB2 if necessary / Apply a fix-pack
- Apply a new security set; we’ve revised the way privileges are given to application and technical users
- Transfer ownerships where applicable
- Remove unused residuals, e.g. wrappers, servers, nicknames, …
Start the database on the new environment
We are almost there! Open up the database for the applications, make sure all connections point to the correct server / port / database and we’re off!
Thanks to Cristian Molaro and Dirk Coppieters, this methodology has grown into what it is now. Soon it will have to prove its robustness during the real-life migration we are facing in the near future.