Analyzing the current environment and making recommendations to the customer

I had a long-term project at a customer site where I was to analyze, design, and architect a solution based on the equipment, environment, and requirements. Before I rolled in to the customer site as the new VMware SME, a junior and recently certified VCP had recommended implementing distributed switching, linked vCenters, and a few other VMware and NetApp feature sets. The on-site staff had no experience with distributed switching and their exposure to VMware was minimal, although many thought of themselves as experts after a few weeks with the product. I kept hearing the comment that VMware was easy. As a compromise to a fully distributed solution, I recommended a hybrid solution with the MC on standard switching and the VM network/storage on distributed switching. They decided against this even after I presented them with the advantages.
A few weeks later they had an unplanned outage and lost visibility to the vSphere hosts, so they went to a hybrid solution. About this time, I recommended going to a fully standard switch configuration across the entire environment. My reasoning was that it was less complex and easier to manage and troubleshoot in the event of problems, outages, or vCenter loss. In reality, they did not have the skill sets required to manage a distributed switch solution, as staff rotated out every 6 to 12 months and had little to no experience with the product. Yes, I know, VMware is easy. Once again, they decided not to follow my recommendations, and once again there was an outage, this time from a planned 30-minute network change that turned into a 6-hour outage. I was not informed of the planned change, nor was the server team that managed the VMware environment.
The network went offline, the server team was not aware of the outage, and the hosts lost communication with the storage and each other. HA isolation response was initiated, VMs started to power off, admins started to freak out, and as any good admin knows, the first step in troubleshooting is to reboot the hosts. It's easy. During this time the storage was also rebooted, as there was no communication between the server, storage, and network folks. Basically, everyone is rebooting systems as a troubleshooting step and things are getting ugly.
They decide to call me and the NetApp SME after they have gone through this initial phase of troubleshooting and we get everything sorted out, but the vSphere infrastructure is not happy.
We have the same VMs existing on multiple hosts, the distributed switching on a few hosts is corrupt, and they are looking at the system like a hog looking at a wristwatch. I got it all sorted out and life was good once again.
They did end up going to standard switching and they did develop a communications plan regarding their change management process. They also realized that maybe VMware was not easy, at least for a few weeks.
So the moral of the story is to take into account the current and future skill sets of the admins as part of the analysis, and to analyze the change management process. Remember, VMware is easy :)

One response to “Analyzing the current environment and making recommendations to the customer”

  1. Reblogged this on VirtuallyMikeBrown and commented:
    There are several good points made by my new blogging buddy, Miguel. Number one, you don’t include features in your design for the sake of features. This may seem obvious, but it can be lost on a passionate (maybe overzealous!) VMware Architect: implementing features that on-site staff are not proficient in or can’t manage is not a benefit. As Miguel shares in this “palm-to-face” anecdote, such features in the hands of untrained staff can have the opposite effect of the one for which they’re designed. So take into account the staff’s abilities before including advanced features in your design. Number two, communication is key in any environment. That means communicating to the customer the gravity of the decisions they make about what’s included in the design, and certainly sharing planned maintenance times with all stakeholders. A communication strategy and change control process are key to making this work. And number three, as Miguel shared with me, if an admin is looking at virtual infrastructure like a hog looks at a wristwatch, well, things are pretty bad.
