Analyzing the current environment and making recommendations to the customer

I had a long-term project at a customer site where I was to analyze, design, and architect a solution based on the equipment, environment, and requirements. Before I rolled in to the customer site as the new VMware SME, a junior, recently certified VCP had recommended implementing distributed switching, linked vCenters, and a few other VMware and NetApp feature sets. The on-site staff had no experience with distributed switching and their exposure to VMware was minimal, although many thought of themselves as experts after a few weeks with the product. I kept hearing the comment that VMware was easy. I recommended a hybrid solution, with the MC on standard switching and VM network/storage on distributed switching, as a compromise to a fully distributed solution. They decided against this even after I presented them with the advantages.
A few weeks later they had an unplanned outage and lost visibility to the vSphere hosts, so they went to a hybrid solution. About this time, I recommended going to a fully standard switch configuration across the entire environment. My reasoning was that it was less complex and easier to manage and troubleshoot in the event of problems, outages, or vCenter loss. In reality, they did not have the skill sets required to manage a distributed switch solution, as staff rotated out every 6 to 12 months and had little to no experience with the product. Yes, I know, VMware is easy. Once again, they decided not to follow my recommendations, and once again there was an outage, this time from a planned 30-minute network change that turned into a 6-hour outage. I was not informed of the planned change, nor was the server team that managed the VMware environment.
The network went offline, the server team was not aware of the outage, and the hosts lost communication with the storage and each other. HA isolation response was initiated, VMs started to power off, admins started to freak out, and as any good admin knows, the first step in troubleshooting is to reboot the hosts. It’s easy. During this time the storage was also rebooted, as there was no communication between the server, storage, and network folks. Basically, everyone was rebooting systems as a troubleshooting step and things were getting ugly.
They decide to call me and the NetApp SME after they have gone through this initial phase of troubleshooting and we get everything sorted out, but the vSphere infrastructure is not happy.
We have the same VMs existing on multiple hosts, the distributed switch configuration on a few hosts is corrupt, and they are looking at the system like a hog looking at a wristwatch. I got it all sorted out and life was good once again.
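For what it's worth, spotting the "same VM on multiple hosts" condition is simple once you have a per-host list of registered VMs. The sketch below is not what I ran that day; it assumes you have already exported the inventory as (host, VM identifier) pairs by whatever means you prefer, and the host names, VM identifiers, and the function name `find_duplicate_registrations` are all made up for illustration.

```python
from collections import defaultdict


def find_duplicate_registrations(inventory):
    """Return {vm_id: sorted list of hosts} for VMs registered on more than one host.

    `inventory` is an iterable of (host, vm_id) pairs. How you export it
    from your environment is up to you; the format is an assumption.
    """
    hosts_by_vm = defaultdict(set)
    for host, vm_id in inventory:
        hosts_by_vm[vm_id].add(host)
    # Keep only the VMs that show up on two or more hosts.
    return {vm: sorted(hosts) for vm, hosts in hosts_by_vm.items() if len(hosts) > 1}


# Hypothetical inventory export: (host, vm_id) pairs.
inventory = [
    ("esx01", "vm-a"),
    ("esx02", "vm-a"),
    ("esx02", "vm-b"),
    ("esx03", "vm-c"),
    ("esx01", "vm-c"),
]

duplicates = find_duplicate_registrations(inventory)
for vm, hosts in sorted(duplicates.items()):
    print(f"{vm} is registered on: {', '.join(hosts)}")
```

Anything this reports still needs a human to decide which registration is the real one before unregistering the others.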
They did end up going to standard switching, and they did develop a communications plan as part of their change management process. They also realized that maybe VMware was not easy, at least for a few weeks.
So the moral of the story is to take into account the current and future skill sets of the admins as part of the analysis, and to analyze the change management process as well. Remember, VMware is easy:)

Another Virtualization Blog?

Another Virtualization Blog? Yep, but this one is for me. I run into a lot of nuances when dealing with virtualization and storage. I find the solutions, and sometimes I run into them again, but I can’t always remember off the top of my head how I skinned that chicken, so I figure this is a good way for me to post my solutions, thoughts and questions for future use.

I will deal with virtualization and storage, as well as any other issues I run up against. I have dealt with the commercial sector for five years, and am currently in the federal sector. Same animal, different color:)