Building a Better Data Center Move
by Josh Williams on Feb. 22, 2010, under Project Management
Or, Four Things I Learned As A Project Manager.
A little over a week ago, I headed a project that was small in scale but large in importance: we picked up a sizeable chunk of hardware from one colocation facility and moved it across town to another. Small in scale, as it was only a dozen or so pieces of equipment; large in importance, as it involved high-end switches, a SAN, a VMware cluster, and a couple of dedicated web servers. We host entire networks, so with multiple businesses riding on this, being careful was priority one.
The underlying concept of the move was a phased, pipelined approach, wherein as Equipment Group X is being pulled from the rack, Group Y is in transit and Group Z is being installed. In other words, about the time one set of gear was finished being installed, the next set would be arriving.
It actually worked fairly well; I think it ultimately was the correct approach. But there were a couple of bumps in the road which, as PM, I could have planned around. These are the lessons I learned…
Lesson 1: Employ Automation. Under the faulty assumption that it would give us more control over virtual machine startup, and thus avoid, among other things, a boot storm on the SAN, we had disabled the hosts' automatic power on/off of virtual machines. In truth this feature could have saved us considerable time, especially considering the Virtual Center Server is itself running in a VM. If that doesn't start on its own, you have an annoying chicken-and-egg situation to deal with.
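For the curious, this is scriptable, too. Here's a rough sketch of what re-enabling autostart might look like with pyVmomi, VMware's Python SDK; the host name, credentials, and VM names are placeholders, and the start order is just an example that brings the Virtual Center VM up first:

    # Sketch: turn host autostart back on and give VMs an explicit start order.
    # The host name, credentials, and VM names below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # fine for a lab; use real certs otherwise
    si = SmartConnect(host='esx01.example.com', user='root', pwd='secret', sslContext=ctx)

    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        host = view.view[0]  # assume a single host, for brevity

        # Bring the Virtual Center VM up first, then the web servers.
        start_order = ['vcenter01', 'web01', 'web02']
        vms_by_name = {vm.name: vm for vm in host.vm}

        power_info = [
            vim.host.AutoStartManager.AutoPowerInfo(
                key=vms_by_name[name],
                startOrder=order,
                startAction='powerOn',
                startDelay=-1,                   # -1 = use the host default delay
                waitForHeartbeat='systemDefault',
                stopAction='systemDefault',
                stopDelay=-1)
            for order, name in enumerate(start_order, start=1)]

        spec = vim.host.AutoStartManager.Config(
            defaults=vim.host.AutoStartManager.AutoStartDefaults(enabled=True),
            powerInfo=power_info)
        host.configManager.autoStartManager.ReconfigureAutostart(spec=spec)
    finally:
        Disconnect(si)

With something like that in place, the hosts bring the VMs back up in a known order after a power event, rather than waiting on someone to log in and click through them one by one.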
Lesson 2: If It Can Be Done Early, Do It Early. We pulled the two big, redundant switches and put in their place a single smaller switch for the equipment staying at the colocation facility. Rather than installing that switch at the tail end of the project, it could have been installed, cabled up, and ready to go in the days leading up to the primary project execution. Then we wouldn't have had anything to do that day but unplug the equipment that was moving. That would have avoided some understandable confusion when engineers less familiar with the environment, and with what could plug in where, ran out of switch ports.
It additionally would have made it much, much more likely that we'd have found the bad cable early, rather than having it add another 25% to the move day in finding and diagnosing it after everyone had closed up and left the old facility, not anticipating a need to go back there that day.
Lesson 3: Tear Down ≠ Install Time. In planning, I'd assumed that the time required to pull the equipment from the rack would roughly equate to the time required to install it at the new place. However, when it came down to it, most of the equipment was pulled quickly, in part because the gear being pulled last had been some of the first to power down and was thus ready to go. I'm not too sure how this would change the plan, apart from perhaps doing more up front so that Lesson #2 above could happen to a greater degree.
Lesson 4: Enable Testing. In the software design world testing is very important, to the extent that automated suites and dedicated test harnesses have become essential. An explicitly written set of steps to perform, with the expected outcomes, is now the norm. We could have taken the same approach for the last people to leave the old colo facility, and defined some test procedures for which VLANs should be able to communicate with which equipment, even if everything wasn't up and running at the new facility yet.
And again we would have had a much better chance of finding the not-quite-dead, but flaky and eventually very problematic cable.
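To make that concrete, the checklist could be as simple as a little script like this sketch, run from a laptop on the colo network before anyone leaves; the VLANs, addresses, and reachability expectations below are made up for illustration:

    # Sketch of a "last one out" connectivity check: ping a known host on each
    # VLAN and compare against what should (or should not) still be reachable.
    # The VLANs, addresses, and expectations are made up for illustration.
    import subprocess

    CHECKS = [
        # (description,                      target,       expect_reachable)
        ('VLAN 10 gateway',                  '10.0.10.1',  True),
        ('VLAN 20 SAN management',           '10.0.20.5',  True),
        ('VLAN 30 gear that already moved',  '10.0.30.1',  False),
    ]

    def reachable(target):
        """Linux ping: one packet, two-second reply timeout; True if answered."""
        return subprocess.call(
            ['ping', '-c', '1', '-W', '2', target],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

    failures = 0
    for description, target, expected in CHECKS:
        ok = reachable(target) == expected
        print('%-35s %-12s %s' % (description, target, 'OK' if ok else 'FAIL'))
        failures += 0 if ok else 1

    raise SystemExit(1 if failures else 0)

A failing line on a link that's supposed to be up points straight at a cable or port worth checking before the truck pulls away.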