AWS Well-Architected: achieving operational excellence

Zen Feb 3 2022

As the first pillar in the AWS Well-Architected Framework, Operational Excellence is critical to business success. But what is it, why is it important and how can you ensure operational excellence within your business?

AWS defines the Operational Excellence pillar as “the ability to support development and run workloads effectively, gain insight into their operation, and continuously improve supporting processes and procedures to deliver business value.”

Proper consideration and implementation of Operational Excellence is of vital importance, especially to organisations running business-critical workloads.

Where failure of those workloads would have a significant negative impact on the business, running your IT infrastructure and technology stack efficiently and effectively is a must. It can help you to keep up with or even surpass your competitors and can take significant stress away from you and your teams, enabling consistency, continuous improvement, and smoother, more effective crisis management.

The design principles

We can better understand Operational Excellence by examining it in the context of the five AWS design principles (plus our own ‘sixth’ principle).

1) Perform operations as code

Within your AWS environment, there is a surprising amount that can be done using API calls. They can be templated and scripted – from the creation of your infrastructure to tasks like patching, security updates and testing. In fact, many of the processes and operations that may have required manual intervention in a traditional environment can now be performed as code.

One of the most obvious advantages of this approach is that ‘code will do what code does’. In other words, it is consistent and reliable. Whenever a person carries out a manual task, the prospect of variation and human error is introduced. As long as your code is set up correctly, it will run the same way every time.
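As a minimal sketch of what ‘operations as code’ can look like, the Python below builds a small CloudFormation template as plain data and serialises it deterministically, so the same inputs always produce the same definition. The function names, the logical ID LogBucket, the bucket name and the tag values are all assumptions made up for illustration; a real environment would have its own naming and deployment tooling.

```python
import json


def build_bucket_template(bucket_name, environment, owner):
    """Return a CloudFormation template (as a dict) describing one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Log bucket for the {environment} environment",
        "Resources": {
            "LogBucket": {  # hypothetical logical ID, chosen for this example
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketName": bucket_name,
                    "Tags": [
                        {"Key": "Environment", "Value": environment},
                        {"Key": "Owner", "Value": owner},
                    ],
                },
            }
        },
    }


def render(template):
    """Serialise with sorted keys so identical inputs give identical text."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical values; nothing is sent to AWS, the template is only printed.
    print(render(build_bucket_template("example-logs-dev", "dev", "platform-team")))
```

Because the template is generated rather than typed by hand, it can be reviewed, version-controlled and reproduced for each environment by changing only the arguments.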

2) Make frequent, small, reversible changes

A common mistake that many organisations make is to attempt to batch up a large number of changes and push them all into production at once.

The problem with this approach is that you’re much more likely to find that something gets ‘stuck’. And if a single change gets stuck, nothing gets through.

It is far more prudent (and even more efficient) to make small, manageable, bite-size changes instead.

With small pieces, it’s easier to quickly change direction should you encounter difficulty or even change your mind. And even if you encounter a problem with one change, you’ll still be able to push the others through, delivering value into production and providing business benefit.
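The idea can be illustrated with a toy sketch, assuming each change is packaged with its own revert step. Here every change is applied independently, and a failure reverts only the change that failed while the rest still go through. The Change class, the roll_out function and the example settings are hypothetical and stand in for whatever deployment mechanism you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Change:
    """One small change, bundled with the step that undoes it."""
    name: str
    apply: Callable[[State], None]
    revert: Callable[[State], None]


def roll_out(changes: List[Change], state: State) -> Dict[str, bool]:
    """Apply each change on its own; revert only the one that fails."""
    results: Dict[str, bool] = {}
    for change in changes:
        try:
            change.apply(state)
            results[change.name] = True
        except Exception:
            change.revert(state)
            results[change.name] = False
    return results


def raise_timeout(state: State) -> None:
    state["timeout"] = "60"


def restore_timeout(state: State) -> None:
    state["timeout"] = "30"


def enable_flag(state: State) -> None:
    state["new_flag"] = "on"
    raise RuntimeError("validation failed")  # simulated failure


def remove_flag(state: State) -> None:
    state.pop("new_flag", None)


def debug_logging(state: State) -> None:
    state["log_level"] = "debug"


def info_logging(state: State) -> None:
    state["log_level"] = "info"


def demo():
    """Run three small changes, the second of which fails."""
    state = {"timeout": "30", "log_level": "info"}
    changes = [
        Change("raise-timeout", raise_timeout, restore_timeout),
        Change("enable-flag", enable_flag, remove_flag),
        Change("debug-logging", debug_logging, info_logging),
    ]
    return roll_out(changes, state), state


if __name__ == "__main__":
    results, state = demo()
    print(results)
    print(state)
```

In the demo, the failed flag change is rolled back cleanly while the timeout and logging changes still reach the final state, which is the behaviour a single large batch cannot offer.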

3) Refine operations procedures frequently

A common pitfall that many businesses fall into is to write their documentation and runbooks and then simply forget about them.

Rather than the ‘set and forget’ approach, you should be looking to continually test your procedures. AWS often discusses ‘GameDays’ in its best practice guidance: simulated (or even deliberately induced real) events and issues give your team an environment in which to test their mettle.

A quick study of the field of chaos engineering will reveal the lengths that successful businesses often go to in order to ensure that their processes and procedures remain as robust as possible.

4) Anticipate failure

It really does help to think about what could go wrong.

When you envisage a worst-case scenario, you’re then able to put appropriate steps in place to pre-empt it or better deal with it when it arrives.

For example, you might use an auto-scaling group so that if a machine fails a new machine will quickly spin up to replace it. But that’s just a single example, and the measures you actually put in place are only limited by your ability to imagine what might possibly go wrong, your tolerance to failure, and your available resources.
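To make the replacement behaviour concrete, here is a toy model in Python. It is not how AWS implements Auto Scaling; it is a sketch, under the assumption of a fixed desired capacity, of the general idea that failed machines are removed and new ones launched until the target count is met again. The ToyGroup class and the instance ID format are invented for this example.

```python
class ToyGroup:
    """Toy model of a group that keeps a fixed number of healthy instances."""

    def __init__(self, desired_capacity):
        self.desired_capacity = desired_capacity
        self._next_id = 1
        self.instances = {}  # instance ID -> "healthy" or "failed"
        self.reconcile()     # launch the initial instances

    def _launch(self):
        instance_id = f"i-{self._next_id:04d}"  # made-up ID format
        self._next_id += 1
        self.instances[instance_id] = "healthy"
        return instance_id

    def mark_failed(self, instance_id):
        """Simulate a machine failing its health check."""
        if instance_id not in self.instances:
            raise KeyError(instance_id)
        self.instances[instance_id] = "failed"

    def reconcile(self):
        """Remove failed instances, then launch replacements up to capacity.

        Returns the IDs of any instances launched during this pass.
        """
        failed = [i for i, status in self.instances.items() if status != "healthy"]
        for instance_id in failed:
            del self.instances[instance_id]
        launched = []
        while len(self.instances) < self.desired_capacity:
            launched.append(self._launch())
        return launched


if __name__ == "__main__":
    group = ToyGroup(desired_capacity=3)
    group.mark_failed("i-0002")
    print(group.reconcile())
    print(sorted(group.instances))
```

The point of the model is that nobody has to notice the failure and react by hand: the desired state is declared once, and the reconcile step keeps restoring it.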

5) Learn from all operational failures

We’ve kind of already said it, but things will definitely go wrong.

The key is to learn when things do go wrong. Why did it happen? What could you have done differently? What worked well?

If your team is pulling in the right direction, failure will most likely not be the result of any deliberate malice or ill intent. In that case, punishing mistakes, or doing all you can to avoid them, can have the dangerous effect of stifling creativity and innovation. Genuinely learning from your errors without fear of reprisal can be a very healthy exercise.

6) Annotate documentation

In a traditional environment, filled with manual processes, you can often find that your documentation doesn’t match the reality of your environment. One advantage of AWS is the ability to tag resources. When you need to make changes, the information you need is right there on the resource. You can see who is responsible for what, and who has changed what and when.

You can also, for example, use security tags on resources if security is important. You can easily differentiate between production and development and testing environments. On a physical server in your datacentre such differentiation is difficult, but if you’re tagging and defining your infrastructure in code it’s much easier to keep track of things.
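A short sketch shows how tags can be put to work once resources carry them. It assumes an inventory of resources with tags in the Key/Value list shape that AWS uses, and a hypothetical policy that every resource must carry Environment and Owner tags. The resource IDs, tag values and function names are invented for illustration.

```python
REQUIRED_TAGS = ("Environment", "Owner")  # hypothetical tagging policy


def tags_to_dict(tag_list):
    """Convert [{'Key': ..., 'Value': ...}, ...] into a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_list}


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that a resource does not carry."""
    present = tags_to_dict(resource.get("Tags", []))
    return [key for key in required if key not in present]


def by_environment(resources, environment):
    """Return the IDs of resources tagged with the given environment."""
    return [
        resource["Id"]
        for resource in resources
        if tags_to_dict(resource.get("Tags", [])).get("Environment") == environment
    ]


# Made-up inventory; in practice this would come from your own tooling.
INVENTORY = [
    {"Id": "web-1", "Tags": [{"Key": "Environment", "Value": "production"},
                             {"Key": "Owner", "Value": "platform-team"}]},
    {"Id": "web-2", "Tags": [{"Key": "Environment", "Value": "development"},
                             {"Key": "Owner", "Value": "platform-team"}]},
    {"Id": "db-1", "Tags": [{"Key": "Environment", "Value": "production"}]},
    {"Id": "tmp-1", "Tags": []},
]

if __name__ == "__main__":
    print(by_environment(INVENTORY, "production"))
    for resource in INVENTORY:
        gaps = missing_tags(resource)
        if gaps:
            print(resource["Id"], "is missing", gaps)
```

A check like this can run on a schedule, so gaps in the ‘documentation’ are flagged as they appear rather than discovered during an incident.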

Best practices

In addition to the already identified design principles, we also recommend three best practice steps.

1) Prepare

The most significant best practice for operational excellence is proper preparation. Putting good practice in place upfront makes everything easier once you’re in-life (that is, running in production).

So what does good preparation look like?

First of all, it is important to understand your priorities. Define them early in the process and resist the temptation of putting a solution in place simply because it is available.

Questions to ask include: What are your internal needs? What do your teams need to support your environments and deliver business value? What are your external needs (what are your customers asking for, and are you in a regulated industry, for example)? Finally, are there any acceptable trade-offs? For example, if speed to market is most important, you might compromise on availability or cost. If security is paramount, you might accept a slower rollout.

Other key considerations in the preparation stage include understanding the current state of your workloads, defining how you will mitigate risk, and implementing feedback and continuous improvement into your flow.

2) Operate

Once things are up and running, how do you ensure they keep running effectively?

Important practices in this context include understanding your workload health (at both a hardware and application level), understanding your team health (both their ability to deliver and your employee net promoter score), and being able to effectively manage events that occur while your workloads are in-life.

This stage requires regular QA, careful consideration of logs and metrics, and an understanding of the value of your people.

3) Evolve

Part three of Operational Excellence best practice involves a commitment to constant evolution and progress – both in your operational practices and the way you work.

You’ve no doubt heard this many times, but all it takes to fall behind in the IT world is to stand still.

Important practices here include implementing proper feedback loops, defining the drivers for improvement, and validating your insights.