Feb 18, 2018

Cloud Transformation from the Trenches


I’m deep in the middle of managing my second transition to AWS. Back in 2012, right before the first re:Invent, I moved to put my startup’s entire infrastructure on AWS. By then Amazon had moved well past its initial S3 offering with key pieces such as VPC, EC2, and EBS, but it was still early in the technology adoption cycle. And the story from Netflix was a confidence builder. My company had hosted infrastructure at Rackspace that mostly scaled vertically, and ordering hardware was painful. I knew the stack well and I could sense the opportunity in AWS to scale out, on demand, and most importantly to use hosted pieces of infrastructure instead of building our own. In short, IT could now be a strategic opportunity to react faster instead of simply being a cost center.

At the time I was new to management and I loved solving problems. After 10 years of software development, I’d finally learned that people are at the heart of solving tech problems. Leading a team was the best way to Get Shit Done with legacy systems. My biggest priorities were cultivating motivation in my team, hiring top talent, building complementary teams, and managing out cancerous attitudes. Now that I’m on my second IT transformation go-around some 5 years later, albeit at a much larger company, the playbook is becoming clearer. I’m dusting the cobwebs off that experience and felt compelled to put my learnings on paper.

Transitioning to the cloud exposes practices and approaches your org should be doing, but isn’t. Do you have a CI/CD pipeline? Do you practice a “build it/run it” mindset? Are you releasing often? How often do engineers have to tinker with production? How much system visibility do you have? How mature is your monitoring? Chances are, if you haven’t moved to AWS yet, you don’t have great answers to many of those questions. Moving to the cloud and practicing devops go hand in hand because infrastructure is code in the new cloud model. Software is eating the world. Linux admins now need to work with VPCs, AMIs, Puppet, CloudFormation, Terraform, Docker, and Kubernetes. Software developers need to think about risk mitigation, operational posture, and monitoring. Committing to a migration plan means exposing things you should be doing, but aren’t. Moving to the cloud is an opportunity to leverage new technical capabilities, but to get there you need to understand two things: what to stop doing so you can focus on building the future, and where players fit on your team: leading the way, following along, or holding you back.

Prepare for automation
Go read The Phoenix Project if you haven’t already. While sophomoric and repetitive at times, the principles driven through narrative have a greater permanence than a dry and prescriptive checklist. I’ve worked with devops teams that have problems on opposite ends of the devops spectrum: from “throw it over the wall to ops” to developers running roughshod in production. Both types of teams can benefit from visualizing the work. What does the team do that isn’t written down? Do we have a SPOF (Single Point of Failure) on the team, like the infamous Brent from The Phoenix Project? How much of our time is spent on unplanned work? Is the team operating with an “interrupt-driven” mindset? These are the questions you need to answer first. If your site goes down often and the team is afraid to release software, you need to put nebulous processes and system coddling on paper. A good rule of thumb: anything that runs on production with some manual intervention needs a procedure. If you have a SPOF, this is a great task to give them. These procedures become the requirements document for the long pole of automation.
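To put a number on the unplanned-work question, a rough tally is usually enough. Here is a minimal sketch, assuming you can export a week of cards from your board with hours and a planned/unplanned flag. The `WorkItem` structure, the field names, and the sample cards are all hypothetical, not something from a particular tool.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    """One card from the team's board (hypothetical structure)."""
    title: str
    hours: float
    planned: bool  # False for interrupts, fires, and other unplanned work


def unplanned_share(items):
    """Return the fraction of total hours spent on unplanned work."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    unplanned = sum(item.hours for item in items if not item.planned)
    return unplanned / total


# Made-up sample week, for illustration only.
week = [
    WorkItem("Ship billing feature", 20, planned=True),
    WorkItem("Restart stuck job queue", 6, planned=False),
    WorkItem("Restore full disk on db host", 10, planned=False),
    WorkItem("Write deploy procedure", 4, planned=True),
]

print(f"Unplanned work: {unplanned_share(week):.0%}")
```

Tracking this week over week matters more than the exact figure; the trend tells you whether the procedures and automation are paying off.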

As problems arise on the system, be sure to quickly run them to ground. The team must value that the site is available and reliable. This is the Give a Shit factor. If that’s true, then you can have effective retros and accountability. If that’s not true, you need to start thinking about organizational debt, and it’s time to manage out. The Give a Shit factor is non-negotiable for any team; if you don’t have it, you can’t create it.

Find the hidden work
I mentioned earlier making your work visible. My preferred method is Kanban to start. I don’t want to go into much more detail, as plenty has been written about Agile, but suffice it to say that minimizing Work In Progress, daily standups, and timely retros go a long way. Getting company and team buy-in can be the hardest part here, but once you spin this flywheel, you are on the happy path.

Finding hidden work is important because it exposes a value misalignment. If you have team members who are constant firefighters, they have built their self-worth on saving the day. Two values are important here: teamwork and kaizen. Firefighters solve for fixing issues, which makes it difficult to solve for long-term system improvement. Work to find a person you can build the team around, and praise their diligence at digging out of the operational debt you may have. Other team members might have their pet projects. They’ve become disillusioned by constant fires, and have resigned their sanity to a sandbox of unproductivity. The great thing about these people is that you can give them a number of pet projects that help solve for the long pole of automation or building the future.

Right people in the right roles
If you’re new to the team, it pays to listen first. As Stephen Covey says, “Seek first to understand, then to be understood.” Don’t develop a prescription without sitting with your team during fires, listening to their thoughts on what they do well and what they don’t do well, and seeing how they normally get work done. So many times I’m tempted to jump in and start solving, but I catch myself and practice patience. Likely there is quite a bit I don’t understand early on. Maybe the devs are burnt out from too many fires. Maybe people have a “we’ve always done it that way” attitude. Maybe you’ve got the brilliant jerk on the team. Maybe people don’t care. Those last two are deal breakers and are cancers on the team. You need to squash those immediately. Turning the narrative around can be hard and it can seem like you don’t know where to start. Slow down, listen to the team, and have the hard conversations with individuals and teams. It will pay off in spades when the team starts running itself, freeing you up to look ahead to future business strategies or get to those side projects you’ve been wanting to tackle. The hard part comes when you realize that you are missing critical functions on your team or that you have team members who need to go. If you are missing a critical function, there might be creative ways to reach outside of your team for help. Have a coach or consultant come in, if it’s coachable. You may also be able to tap resources on other teams, if you frame the problem to the business well.

Do more faster
Once you’ve cut your unplanned work down below 50% and once you’ve level-set teams and individuals on some key issues, you can start diving into technical tactics. That is, you’ve got some traction on your people and process problems, and now you can focus on technology. In the last 6 years, devops has seen an explosion in productivity tooling. I come from the startup world where it’s all about “do more faster”. A key piece of this philosophy has been APM (application performance monitoring) tools. In general I hate operations, so the more I can focus on building or analysis the better. A good APM is to monitoring what the iPhone was to flip phones. New Relic, Dynatrace, and AppDynamics all have good offerings here. If you’re using Nagios or something similar, you are missing out on a tool that gives you system, database, JVM, CLR, web server, and other metrics out of the box, along with basic alerting for when the system has breached its baseline on any metric. Alerts for a full disk, a memory spike, database queries grinding to a halt, or spiking average HTTP response times all work without a ton of custom tooling. Once you have your arms around how your system is behaving, you might need more granular custom tooling for subsystems. Sometimes you can build that custom tooling right into your APM. I’ve had great success deploying simple metrics in a key code pipeline to visualize where the system spends most of its time and resources. For JVM and .NET you can get code-level hotspots that can unlock critical, thorny bottlenecks in your system. Code hotspots alone can justify the APM cost when you consider how much time your team has lost to troubleshooting enigmatic issues.
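The “simple metrics in a key code pipeline” idea can be sketched in a few lines. This is an illustration under assumptions rather than any vendor’s API: the `timed` helper, the in-memory `stage_seconds` store, and the three-stage `process_order` pipeline are all hypothetical, and the `time.sleep` calls stand in for real work. In practice you would forward the timings to whatever custom-metric mechanism your APM provides instead of keeping them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage
# (an in-memory stand-in for an APM's custom metrics).
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wrapped block's elapsed time to the given stage's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def process_order(order_id):
    """Hypothetical pipeline; the sleeps simulate work in each stage."""
    with timed("validate"):
        time.sleep(0.005)
    with timed("price"):
        time.sleep(0.05)
    with timed("persist"):
        time.sleep(0.01)


def report():
    """Return (stage, seconds) pairs sorted slowest first."""
    return sorted(stage_seconds.items(), key=lambda kv: kv[1], reverse=True)


for order_id in range(3):
    process_order(order_id)

for stage, seconds in report():
    print(f"{stage:10s} {seconds:.3f}s")
```

Even a crude breakdown like this tends to show quickly which stage deserves attention before you reach for code-level profiling.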

Bi-what?
So far these basic dynamics have little to do with the cloud. That’s not an accident. These are the key things that prepare your team for the transition, but here is the dark secret: I’ve had my best success taking teams toward a bimodal IT approach. One team is responsible for stabilizing and automating the old system; one team is responsible for the new go-forward paradigms. That is a contentious opinion. The team responsible for building the new system is less interrupt-driven and able to keep steady progress, so you don’t have to trade critical unplanned work for forward progress. The constraints are baked into the split teams. Now you can draw clear accountability. The key to making this transition is having the teams’ trust. They must trust that you’ll develop their career paths and give them interesting problems. The biggest concern comes from the senior talent that runs the old system, who expect to be rewarded with new projects and greenfield work. The reality is, that’s not where the complexity lies. The hard part is managing the old system, where the value lies, and creating space for the new system to develop. It’s not as if these two teams are in silos; there are key conversations happening back and forth, but again, accountability is clear. The eventual goal is to bring the teams back together, or have some other team reshuffle. This split won’t last forever, but it’s typically the best thing for the business. Anything else will take up too much time.

The other key benefit is maker vs. manager time. In a heavy ops team that is responding to system issues, customer requests, and bug investigations, it’s very difficult to get the headspace required to construct a new paradigm. I’ve read of teams that do a lift and shift, but that’s never been my approach. I’ve often leveraged new services and patterns that required writing software in conjunction with new AWS system architecture.

Moving to AWS is much more than changing infrastructure; it’s a catalyst to look critically at the way you’ve always done things and make your company more competitive. In this regard I find more value in transformation than in running operations. Beyond the transition to the cloud and the operational evolution that follows, what I’m most passionate about is using product and operational data to provide key insight into where the real business value is. By leveraging the organizational change from the AWS transition, it’s easier to build out a data strategy that will change the narrative from the bottom up. Perhaps another blog post.