We have seen a spate of cloud service provider outages with severe economic impact. The AWS outage in December 2021 disrupted Disney+, Ticketmaster, Slack and Netflix, among a host of others. The Facebook (now Meta) outage in October 2021 was particularly painful.
It took down not only WhatsApp and Messenger but also the livelihoods of many small businesses in the developing world. Those businesses were offline for almost six hours because they depended on Facebook for authentication and on WhatsApp and Messenger for taking orders.
The main culprit behind these outages? Service misconfiguration driven by human error. In fact, most misconfigurations are introduced through inadvertent human mistakes, although some outages are the result of natural disasters. Regardless of the cause, the way to recover from such outages and business disasters is to automate the right actions.
When does automation become a necessity?
There are many reasons to automate. Some of the more common reasons include offsetting resource constraints as well as adding safety guards and creating more efficiencies around configuration management, compliance, or monitoring processes.
For instance, let’s say you have a development engineer with an application that requires testing with a load balancer before release. The developer may not have the time or motivation to learn the networking and security language.
In a traditional process, a help-desk ticket would be opened. Then a network administrator would reach out to the developer, get information about the use case, and create a custom load balancer for the developer to use for testing. This process is time-consuming and expensive and creates inefficiencies around testing and deployment. To further complicate the process, if security needs to be tested and provisioned, another ticket is required.
In an automated process, on the other hand, the objective is to provide a self-service application in a language the developer can understand. The developer would test the application against the load balancer without involving IT in the process. If multiple devices, environments (on-premise and multi-cloud) or required skill sets (networking vs security) are involved, then an orchestration workflow could be created.
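As a rough illustration of what such a self-service layer might look like, the sketch below turns a request written in the developer's terms (application name, backend hosts, port) into a full load balancer definition, with the networking defaults filled in on the developer's behalf. Everything here is hypothetical: `LoadBalancerRequest`, `build_lb_config`, the `DEFAULTS` values and the shape of the resulting dictionary are assumptions made for the example, not the format of any particular load balancer or orchestration product.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical defaults that a network team might bake into the
# self-service layer so developers never have to choose them.
DEFAULTS = {
    "algorithm": "round_robin",
    "health_check_path": "/health",
    "health_check_interval_s": 10,
}


@dataclass
class LoadBalancerRequest:
    """What the developer fills in; no networking vocabulary required."""
    app_name: str
    backend_hosts: List[str]
    port: int = 443


def build_lb_config(req: LoadBalancerRequest) -> dict:
    """Validate the request and expand it into a full (illustrative) config."""
    if not req.backend_hosts:
        raise ValueError("at least one backend host is required")
    if not 1 <= req.port <= 65535:
        raise ValueError(f"invalid port: {req.port}")
    return {
        "name": f"{req.app_name}-test-lb",
        "listener": {"port": req.port},
        "pool": [{"host": h, "port": req.port} for h in req.backend_hosts],
        **DEFAULTS,
    }


if __name__ == "__main__":
    request = LoadBalancerRequest("shop", ["10.0.0.1", "10.0.0.2"], port=8080)
    print(build_lb_config(request))
```

The validation step is where the safety guard lives: a malformed request is rejected before anything is provisioned, instead of being caught later in a help-desk ticket.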
The benefit of automation and self-service is not just shaving time off provisioning, testing and deployment processes, but also reducing the amount of expertise required to provision networking and security components that developers may not be familiar with. In many cases where services are being provisioned and tested on new platforms, automation and self-service also save IT teams from having to quickly become experts in new domains.
What to look for in automation tools?
There are several basic considerations that go into purchasing automation tools. First and foremost, do a proof of concept before you buy to ensure the tools meet your business needs. It’s important to assess how easy the tools are to use and whether they can actually automate the tasks that you require.
Make sure you pick tools that simplify individual tasks, remove complexity, and are easily consumable by your users. Many workflow and automation tools exist, from LAMP stack, OpenStack, and Ansible, to tools built into Microsoft Azure, AWS and Google Cloud, and other orchestration engines, such as VMware vRealize Suite.
Second, it’s important to determine whether the automation tools integrate with other tools in your portfolio, including your orchestration engine as well as SIEM, configuration management, incident response, and logging tools.
Finally, the automation tools should allow you to schedule tasks and support deployments across a variety of environments, including on-premise and cloud.
Steps to automation success
Automation success is made up of several incremental steps that build upon each other. To start the process of automation, it’s important to identify all use cases that are prone to introducing errors in configuration. Below are four common use cases:
- Reduction in operational costs: Bringing applications up or down depending on user connections to better control infrastructure costs in the cloud
- Multi-cloud management: Provisioning of applications and devices across multiple cloud environments
- Timely backups: Scheduling regular application and network configuration backups
- Error-prone repetitive tasks: Automating upgrade processes and ensuring successful completion
Next, script the processes using CLI commands or Python scripts. Then, you’re ready to automate.
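To make the scripting step concrete, here is a minimal sketch of the "timely backups" use case in Python. It copies a configuration file to a timestamped backup and verifies the copy by checksum, so an unsuccessful backup fails loudly instead of going unnoticed. The function names and the paths in the usage example are assumptions chosen for illustration; a real device or application would more likely export its configuration through its own CLI or API first.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_config(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` under a timestamped name and verify it."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.name}.{stamp}.bak"
    shutil.copy2(source, target)

    # Verify the copy so a corrupt or partial backup is never trusted.
    if sha256_of(source) != sha256_of(target):
        target.unlink()
        raise RuntimeError(f"backup verification failed for {source}")
    return target


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    saved = backup_config(Path("/etc/myapp/app.conf"), Path("/var/backups/myapp"))
    print(f"backup written to {saved}")
```

Running a script like this from a scheduler (cron, or the scheduling feature of whichever automation tool you choose) is what turns a manual step into a regular, verified backup.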
Ready for anything
Unplanned service outages or slow responses to user requests can stem from network, service and application misconfigurations, a lack of application resources, missed compliance and security posture checks, or unsuccessful backups, and all of them can hurt a business. The key benefit of automating tasks that have manual steps is to reduce service downtime and avert disasters like the ones seen at AWS and Facebook, which can be expensive and damage a brand.