Software complexity has grown dramatically over the past decade, and enterprises are looking to hybrid cloud technologies to help power their applications and critical DevOps pipelines. But with so many moving pieces, how can you gain confidence in your hybrid cloud investment?
The hybrid cloud is not a new concept. Way back in 2010, AppDynamics founder Jyoti Bansal had an interesting take on hybrid cloud. The issues Jyoti discussed more than eight years ago are just as challenging today, particularly with architectures becoming more distributed and complex. Today’s enterprises must run myriad open source and commercial products. And new projects — some game-changers — keep sprouting up for companies to adopt. Vertical technologies like container orchestrators are going through rapid evolution as well. As they garner momentum, new software platforms are emerging to take advantage of these capabilities, requiring enterprises to double down on container management strategies.
All businesses need to be prepared for a major technological incident in order to minimize their losses and protect their client information. Experts don’t always agree on what constitutes a major incident, but you can create a definition for your company.
A major incident interrupts your company’s ability to function, sometimes completely shutting you down. Most often, the problems are man-made. They can include hackers stealing data, emails infecting your system with ransomware, or employee errors that introduce catastrophic failures. When these major incidents occur, you need to immediately set your team in motion.
Automation in IT, especially in the age of cloud and complex, highly distributed computing systems, is absolutely a no-brainer. Whenever I think about the speed required (and pushed) by most businesses and the ways to make IT organizations deliver consistent and predictable high-quality services, automation comes up as one of the cornerstones of a sustainable execution model. Therefore, to automate or not to automate is not a question we should ask ourselves. Rather, we should ask what to automate and how to enable effective and reproducible automation at scale.
On the pro-automation side of things, automation brings a number of advantages. The following list is far from exhaustive, but in the context of IT, important items to consider are:
Automation can eliminate the need to hire new administrative employees, which yields significant cost savings.
Automation eliminates human error and fosters standardization, which leads to better overall quality at scale.
Automation enables faster action and faster repair, which is mandatory as we know how systems can quickly degrade or become unusable due to secondary cascading failures if immediate action is not taken upon first failure data capture.
Automation can also remove most tedious busywork tasks, freeing up employees to do tasks that humans would do better than computers.
On the anti-automation side of things, automation should not be seen as a panacea. The expression “Automate Yourself Out of a Job: Automate ALL the Things” (Site Reliability Engineering – How Google Runs Production Systems) is a great motto to drive people to look for opportunities to improve. In real-life complex organizations, however, some concerns should be taken into careful consideration:
Automation implementation is usually a technical endeavor, which requires people with skills and specialized tools. Before automating, we need to ask if the task is worth automating and how much value will be captured with it.
Automation is change, from both a human behavior and an IT operations perspective. Change creates entropy, so it needs to be managed before chaos takes over.
Automation can potentially hide systemic deficiencies or even worse, it can exponentiate bad processes, which can result in a catastrophe if deployed on a large scale.
Automation decouples the operation from the operator. If managed with great care, it can preserve enterprise knowledge over time as people either retire or leave the company. If not, organizations might become so dependent on tools and automation that they lose control of their environment and also lose the ability to rethink their own processes.
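The question of whether a task is worth automating can be roughed out with a simple break-even estimate. The sketch below is only an illustration: the `AutomationCandidate` class, its fields, and the sample numbers are all hypothetical, and it counts only labor hours, which is just one part of the value automation can capture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutomationCandidate:
    """A task being considered for automation (hypothetical model)."""
    name: str
    minutes_per_run: float          # manual effort each time the task is done
    runs_per_month: float           # how often the task is done
    build_hours: float              # one-time effort to build the automation
    upkeep_hours_per_month: float   # ongoing effort to maintain the automation

    def monthly_hours_saved(self) -> float:
        """Manual hours avoided per month, net of maintenance effort."""
        manual_hours = self.minutes_per_run * self.runs_per_month / 60
        return manual_hours - self.upkeep_hours_per_month

    def break_even_months(self) -> Optional[float]:
        """Months until the build effort is repaid, or None if it never is."""
        saved = self.monthly_hours_saved()
        if saved <= 0:
            return None
        return self.build_hours / saved


# Made-up sample tasks, for illustration only.
frequent = AutomationCandidate("restart stuck service", 15, 40, 20, 1)
rare = AutomationCandidate("monthly license report", 30, 1, 20, 1)

for task in (frequent, rare):
    months = task.break_even_months()
    if months is None:
        verdict = "never pays back on labor hours alone"
    else:
        verdict = f"pays back in about {months:.1f} months"
    print(f"{task.name}: {verdict}")
```

A task that never breaks even on hours may still be worth automating for consistency or speed of repair, so an estimate like this is a starting point for the conversation rather than a verdict.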
All things said, and knowing that automation is not a matter of IF but a matter of WHAT and HOW, where should I start? What approaches would allow enterprises to place their bets to enable sustainable and fruitful automation?
Although this should be intuitive enough, the temptation to find low-hanging fruit and start automating as many tasks and processes as one can possibly identify might create a bigger problem down the road. I’m absolutely not saying enterprises should not experiment and “get their feet wet” before they establish a structured program to automate at scale. In fact, I really think they should, in order to understand the challenges this endeavor entails on a large scale. When it’s time to really take things seriously, I believe the first and possibly the most important step is to understand the existing processes and create the “Treasure Map for Automation.”
A valid approach to understanding and documenting current processes is value stream mapping. This approach lends itself to analyzing “the current state and designing a future state for the series of events that take a product or service from its beginning through to the customer with reduced lean wastes as compared to the current map.” (Value Stream Mapping, Wikipedia). The purpose of value stream mapping is to identify and remove the wastes in the processes, increasing efficiency and productivity. When evaluating a value stream, the following types of waste can be analyzed:
“Daniel T. Jones (1995) identifies seven commonly accepted types of waste. These terms are updated from the Toyota production system (TPS)’s original nomenclature:
Faster-than-necessary pace: creating too much of a good or service that damages production flow, quality, and productivity. Previously referred to as overproduction, and leads to storage and lead time waste.
Waiting: any time goods are not being transported or worked on.
Conveyance: the process by which goods are moved around. Previously referred to as transport, and includes double-handling and excessive movement.
Processing: an overly complex solution for a simple procedure. Previously referred to as inappropriate processing, and includes unsafe production. This typically leads to poor layout and communication, and unnecessary motion.
Excess Stock: an overabundance of inventory which results in greater lead times, increased difficulty identifying problems, and significant storage costs. Previously referred to as unnecessary inventory.
Unnecessary motion: ergonomic waste that requires employees to use excess energy such as picking up objects, bending, or stretching. Previously referred to as unnecessary movements, and usually avoidable.
Correction of mistakes: any cost associated with defects or the resources required to correct them.”
For each and every one of them, at first glance, there seems to be room to leverage automation to take waste out of the picture. But try not to be so optimistic that you start automating everything. Instead, for each step of your value stream-mapped processes, with special attention to the hand-offs (where most of the waste resides), apply the traditional 5-whys technique to determine whether the step could be removed from the process altogether (the utmost automation) or whether it should be partially or totally automated.
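To make a value stream map easier to reason about, one option is to record, for each step, how long the work takes and how long the item waits at the hand-off before that step begins. The sketch below assumes exactly that simplified model. The `Step` class, the function names, and the sample server-provisioning process are hypothetical, and the numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One step of a mapped process (hypothetical, simplified model)."""
    name: str
    work_minutes: float   # time actively spent working on the item
    wait_minutes: float   # idle time at the hand-off before this step starts


def lead_time(steps: List[Step]) -> float:
    """Total elapsed minutes from request to delivery."""
    return sum(s.work_minutes + s.wait_minutes for s in steps)


def flow_efficiency(steps: List[Step]) -> float:
    """Share of the lead time spent on actual work (0.0 to 1.0)."""
    total = lead_time(steps)
    if total == 0:
        return 0.0
    return sum(s.work_minutes for s in steps) / total


def worst_handoffs(steps: List[Step], top: int = 3) -> List[Step]:
    """Steps with the longest waits, the first places to ask 'why?'."""
    return sorted(steps, key=lambda s: s.wait_minutes, reverse=True)[:top]


# Invented sample data: a manual server provisioning request.
provisioning = [
    Step("submit request", work_minutes=10, wait_minutes=0),
    Step("manager approval", work_minutes=5, wait_minutes=480),
    Step("provision VM", work_minutes=30, wait_minutes=240),
    Step("configure monitoring", work_minutes=20, wait_minutes=120),
]

print(f"Lead time: {lead_time(provisioning):.0f} minutes")
print(f"Flow efficiency: {flow_efficiency(provisioning):.1%}")
for step in worst_handoffs(provisioning, top=2):
    print(f"Long wait before '{step.name}': {step.wait_minutes:.0f} minutes")
```

In this made-up example almost all of the elapsed time is waiting, which suggests questioning the approval hand-off itself before investing in faster provisioning scripts.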
In my humble opinion and experience, when it comes to IT, the most common wastes are associated with waiting, processing, and correction of mistakes. The benefits of removing them go well beyond the savings in labor hours. Delivery excellence through standardized and “always available” products and services also brings tangible value to the business and needs to be tracked with new indicators, like Net Promoter Score (NPS). What do you think?
With your treasure map at hand, your organization would have all the information available to determine WHAT to automate. HOW to actually execute it is something for the next article. Any suggestions?
Want to learn more about value stream maps? Here’s a helpful article: Identify bottlenecks with value-stream mapping.
I’ve always been a great enthusiast for anything related to technology, and one aspect of it that has always intrigued me is how to balance the need to deliver new apps and functionality sooner rather than later with the need to provide a unique, high-quality experience to users. In short, how does one maintain speed and still deliver higher-quality products capable of differentiating a company from the competition?
From the speed perspective, Agile seems to be the well-accepted answer: build teams in such a fashion that they gain autonomy and continuously learn how to improve speed through the delivery of incremental, value-added products. On the software quality side of things, enterprises have understood for quite a while now how to benefit from the “Shift-Left” approach to testing and, as a matter of fact, have largely embraced it in all its shapes and forms (Traditional, Incremental, Agile/DevOps, and Model-Based). A brief search on the web will bring up lots and lots of results, covering both the challenges faced by those who adopted it early and the benefits seen by those who were able to reduce the number of errors in production with a better, earlier, and more frequent testing strategy.
Although delivering flawless applications, with unique engaging features, is what everybody, obviously, seems to be going for, the promised “Prime Experience” also requires a well-designed IT operational model, one able to deliver a reliable experience to the users, including stable end-to-end performance and availability that is noticeable from the client’s perspective.
For that matter, instilling IT operations best practices into the early stages of systems design and development seems very appropriate. As a matter of fact, my understanding is that getting ITOps right during systems development isn’t a revolutionary approach, but simply a more mature state of the very same Shift-Left testing already in place, now with a new set of automated operational test cases.
For modern cloud applications, DevOps pipelines for CI/CD should address most of the operational concerns, ensuring apps are consistently deployed following known and well-established ITOps best practices. For legacy applications, however, the existing tools and processes don’t usually enforce best practices. How could shift left be applied to legacy applications in the context of IT Operations? Approved templates and patterns, automated tests that verify applications comply with ITOps readiness checklists, and (what I consider the most important of all) full automation of the entire application lifecycle are examples of approaches that promote effective, consistent, and reliable IT Operations execution. Shifting ITOps left seems to be in its early stages for most traditional enterprises, and we should see great improvements in the near future as more companies embrace the concept.
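As a rough illustration of what an automated ITOps readiness check might look like, the sketch below evaluates a plain dictionary describing an application against a small checklist. Everything in it is an assumption made for the example: the descriptor keys, the check names, and the `readiness_gaps` function are hypothetical, and a real checklist would be defined by each organization's own operations team.

```python
from typing import Callable, Dict, List

# Hypothetical application descriptor: a plain dict of operational metadata.
AppDescriptor = Dict[str, object]

# Hypothetical readiness checklist: each entry maps a human-readable
# requirement to a function that returns True when the app satisfies it.
READINESS_CHECKS: Dict[str, Callable[[AppDescriptor], bool]] = {
    "monitoring configured": lambda app: bool(app.get("monitoring_endpoint")),
    "backups scheduled": lambda app: app.get("backup_schedule") is not None,
    "runbook published": lambda app: bool(app.get("runbook_url")),
    "on-call owner assigned": lambda app: bool(app.get("owner_team")),
}


def readiness_gaps(app: AppDescriptor) -> List[str]:
    """Return the checklist items the application does not yet satisfy."""
    return [name for name, check in READINESS_CHECKS.items() if not check(app)]


# Invented sample descriptor for a legacy application.
legacy_billing = {
    "name": "legacy-billing",
    "monitoring_endpoint": "/health",
    "backup_schedule": None,
    "runbook_url": "",
    "owner_team": "billing-ops",
}

gaps = readiness_gaps(legacy_billing)
if gaps:
    print(f"{legacy_billing['name']} is not ready: " + ", ".join(gaps))
else:
    print(f"{legacy_billing['name']} passes the readiness checklist")
```

Run early in the lifecycle, for example as a gate before a release is handed to operations, a check like this surfaces the gaps while they are still cheap to fix.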