What data center operators can learn from their cars
In my everyday life, I trust that if I make a panic stop, my car's antilock brake system will work, with hardware, software, and servos working together to ensure that my wheels don't lock up—helping me avoid an accident. If that's not sufficient, I trust that the impact sensors embedded behind the front bumper will fire the airbag actuators with the correct force to protect me from harm, even though that particular airbag has never been fired. I trust that the bolts holding the seat in its proper place won't shear. I trust the seat belts will hold me tight, and that cargo in the trunk won't smash through the rear seats into the passenger cabin.
Engineers working on nearly every automobile sold worldwide ensure that their work practices conform to ISO 26262, a standard that describes how to manage the functional safety of the electrical and electronic systems in passenger cars. A significant portion of ISO 26262 involves ensuring that software embedded into cars—whether in the emissions system, the antilock braking systems, the security systems, or the entertainment system—is architected, coded, and tested to be as reliable as possible.
I've worked with ISO 26262 and related standards on a variety of automotive software security projects, and we're not going to get into the hairy bits of those standards because unless you are personally designing embedded real-time software for use in automobile components, they don't really apply. Also, ISO 26262 is focused on the real-world safety of two-ton machines hurtling at 60-plus miles per hour—that is, things that will kill or hurt people if they don't work as expected.
Instead, here are five IT systems management ideas that are inspired by ISO 26262, to help you ensure your systems are designed to be Reliable, with a capital R, and Safe, with a capital S.
1. Understand where failure can happen
The ISO 26262 standards aren't about the vehicle's normal, everyday functioning. They are about the moments when something goes wrong—when the antilock brakes must respond to a panic stop, when the adaptive cruise control detects a possible collision, or when the impact sensors in the bumper register that collision.
The beauty of ISO 26262 is that it forces us to think about functional safety for each and every part of the project, in addition to its normal functions. In the car domain, it might be a fuel pump that shuts off (or doesn't) in an accident or an airbag that should inflate (but might not). In the IT domain, think about failures of every part of your infrastructure, either for innocent reasons (a power supply failure, user stupidity, a flawed firmware upgrade, or a backhoe through a fiber bundle) or malicious reasons (malware or hacking attempts).
Say a mobile app fails. Failure doesn't necessarily mean a blue screen of death; it could be a dropped connection to a database in the data center or an attempt to call an API that was deprecated in an operating system upgrade. The failure might be visible or invisible. What happens? Does it crash the phone? Lose data? Expose data to hackers? Corrupt data on a server? Lock up key cloud services? Think about all the possibilities. If you don't, you can't prevent those possibilities from happening.
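One way to force that thinking is to make every failure mode an explicit, planned outcome in code rather than an unhandled crash. Here is a minimal sketch; the exception classes and status labels are hypothetical, not from any real mobile SDK:

```python
# Hypothetical failure-mode handling for a mobile app's backend call.
# The exception names and status categories below are illustrative.

class ConnectionDropped(Exception):
    """The link to the database in the data center was lost."""

class ApiDeprecatedError(Exception):
    """The app called an API that was removed in an OS upgrade."""

def call_with_failure_plan(api_call):
    """Run a call and map each failure to a planned, explicit outcome,
    instead of crashing the app or silently losing data."""
    try:
        return {"status": "ok", "data": api_call()}
    except ConnectionDropped:
        # Visible failure: tell the user, queue the work for retry.
        return {"status": "retry_later", "data": None}
    except ApiDeprecatedError:
        # Invisible to the user unless surfaced: degrade and log it.
        return {"status": "degraded", "data": None}
    except Exception:
        # Unknown failure: fail safe—never return partial or corrupt data.
        return {"status": "failed_safe", "data": None}
```

The point isn't these particular categories; it's that every branch exists because someone asked "what happens then?" before the failure occurred.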
2. Consider the safety lifecycle
In the automotive world, ISO 26262 defines a specific lifecycle for automotive electric/electronic systems: management, development, production, operation, service, and decommissioning. Those stages are remarkably similar to the ones we talk about in the enterprise IT department, though we typically use words like "requirements" or "conception" instead of "management."
Each stage of a product's life has its own safety concerns. For example, putting a new Wi-Fi access point online could have the unexpected side effect of blocking or jamming signals from another access point. Decommissioning a server can unmask hidden dependencies, and suddenly other applications fail. That's a particular challenge when upgrading critical infrastructure servers, such as a DNS server or directory server: Just when you think you've nailed everything down, you find yourself stuck in a never-ending game of Whack-A-Mole.
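Those hidden dependencies can often be surfaced before decommissioning rather than after. A simple sketch, assuming you have connection or flow logs in some (timestamp, client, server) form—the log format and host names here are hypothetical:

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative pre-decommissioning check: anything that connected to the
# target host recently is a dependency that could break, unannounced,
# when the server goes away. Adapt the log format to what your flow
# logs or connection logs actually record.
def dependents_seen(connection_log, target_host, within_days=30, now=None):
    """Count recent clients of target_host from (timestamp, client, server)
    log entries; each distinct client is a dependency to investigate."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=within_days)
    hits = Counter()
    for ts, client, server in connection_log:
        if server == target_host and ts >= cutoff:
            hits[client] += 1
    return hits
```

An empty result doesn't prove there are no dependents (think quarterly batch jobs), but a non-empty one is a clear signal to pause the decommissioning plan.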
Automotive engineers typically have to create extensive upfront plans for each phase of a hardware or software subsystem's lifecycle. We are luckier—we don't have to think 20 years down the road about when brake hoses might start to degrade. However, we should plan at least one step ahead so we aren't surprised by failure.
3. Define "harm" in your context
When something fails on a passenger car, we know the worst-case scenario: the personal injury or death of occupants of the car or bystanders. When we talk about IT harm, sometimes we are dealing with possibly fatal consequences—imagine the infrastructure that handles a building's fire suppression systems. Usually, the harm is more business-related, but we still have to manage that risk.
Consider opening up a customer order-tracking system to external access by your company's mobile app—think apps from Amazon or Starbucks or eBay. The lesson of ISO 26262 is to figure out what could go wrong as the result of a planning or execution failure. How is security implemented? Are user-entered inputs (via the mobile app) run through the same parsers as the traditional sales system, to ensure that bad characters can't get in? What if hackers steal the encryption keys or learn how to access the APIs directly? What if there's so much traffic that the middleware server starts dropping packets? What if a transaction is interrupted before it's completed? What if the load balancer fails and all the traffic hits one machine?
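That "same parsers" question has a simple structural answer: one shared validator that every channel must call. A sketch, with a hypothetical order-ID format—the point is the shared path, not the particular rule:

```python
import re

# One validator for every channel. The order-ID format below is made up;
# what matters is that the legacy sales system and the new mobile API
# run user input through the same rules, so a bad character can't sneak
# in through the newer, less-scrutinized path.
ORDER_ID = re.compile(r"^[A-Z0-9-]{6,20}$")

def validate_order_id(raw: str) -> str:
    """Normalize and validate an order ID, or raise ValueError."""
    candidate = raw.strip().upper()
    if not ORDER_ID.fullmatch(candidate):
        raise ValueError("malformed order ID")
    return candidate

def handle_web_lookup(order_id: str) -> str:
    return validate_order_id(order_id)   # legacy sales-system path

def handle_mobile_lookup(order_id: str) -> str:
    return validate_order_id(order_id)   # new mobile-app path
```

If the mobile path had its own, slightly different validation, the two would inevitably drift apart—and the gap between them is exactly where an attacker would look.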
There are lots of ways these bad things could harm an organization. Undiscovered data corruption can seriously disrupt business operations. Loss of the database to hackers can reveal customer information. Lack of access to the database can cause angry users and bad publicity. Crashed infrastructure could affect other systems or even bring down the entire data center network.
The lesson here is to spend time brainstorming, planning, and documenting what could go wrong and what will be the practical results (i.e., the harm) if something goes wrong. Learn from the well-publicized mistakes of others—for example, Google's 2010 data center outage that was caused, in part, by the lack of proper documentation allowing on-site staff to fix the problem. If you don't plan, you can't be prepared to attenuate the damage.
4. Look at safety integrity levels
Let's talk about Automotive Safety Integrity Levels (ASILs), which are an integral part of ISO 26262. ASILs quantify the risks from faults in the automotive environment. Remember, here we are talking about functional safety requirements, not normal operation—that is, it's not how comfortable the seat is; it's about the potential failure of the seat in an accident. For each possible failure, three factors combine to determine the ASIL:
- Exposure: How frequently or likely a failure event will occur
- Severity: The safety impact if the failure event occurs
- Controllability: What control you have to avoid those events
Each of those factors—Exposure, Severity, and Controllability—is expressed as a number, with higher numbers meaning worse. Exposure runs from E0 (incredibly unlikely) to E4 (highly probable); Severity from S0 (no injuries) to S3 (life-threatening or fatal injuries); and Controllability from C0 (controllable in general) to C3 (difficult to control or uncontrollable). So, for example, E0 means that an event is extremely unlikely, S2 means that the harm will be severe but survivable, and C3 means that an event is nearly impossible to guard against.
Based on those values, an overall ASIL risk score is calculated for the event, on a scale of A to D, with A meaning "no big deal" and D meaning "yikes, we need to do something about this before we sign off on the seat rail subsystem."
While this risk scoring system is specific to ISO 26262 (and other related standards), you could adapt it for use when evaluating the risk of "something going wrong" events in the enterprise IT context. Sure, it may be somewhat subjective, but if thought goes into the E, S, and C values, you may get a handle on what can go wrong and where you need to rethink plans for implementing technology or pre-think responses to a failure in technology you already have.
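Here is one way such an adaptation could look. Note the hedging: ISO 26262 determines the ASIL from a lookup table over the E, S, and C classes, while this sketch uses a simple additive model with made-up thresholds—both the scores and the cutoffs are assumptions to tune for your own environment:

```python
# A loose, IT-flavored adaptation of ASIL-style risk scoring.
# Not ISO 26262 itself: the standard uses a lookup table over E, S, and C;
# this sketch sums the scores and applies illustrative thresholds.

def it_risk_level(exposure: int, severity: int, controllability: int) -> str:
    """Map E/S/C scores (each 0-4 here, higher = worse) to an A-D grade,
    where A is 'no big deal' and D is 'fix this before sign-off'."""
    for score in (exposure, severity, controllability):
        if not 0 <= score <= 4:
            raise ValueError("each score must be between 0 and 4")
    total = exposure + severity + controllability
    if total <= 3:
        return "A"
    if total <= 6:
        return "B"
    if total <= 9:
        return "C"
    return "D"
```

Even a crude model like this does its main job: it forces the team to write down, and argue about, how likely, how damaging, and how controllable each failure event really is.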
5. Reuse existing, known components
For decades, software development gurus have preached the gospel of component reuse. The main reason: It's faster and cheaper to reuse software than to build it fresh. Also, why maintain 20 data-sorting routines if you can maintain just one? In the car-design world, there's an additional reason to reuse components: Parts that have already passed rigorous safety compliance testing and survived out in the field don't need to be requalified from scratch.
A seat-belt assembly that's already been installed in 20,000 cars with no known failures is considered to be less risky than a brand-new seat-belt assembly. Maybe it's not perfect, but at least you know it—and can understand the risk. ISO 26262 recognizes this "proven in use" argument as part of its risk assessment.
In the IT world, the same logic can apply. If you already have 200 servers of a specific type and need 10 more servers, go ahead and buy 10 more. You've installed a lot of fiber from one vendor? If it's working out, stick with it.
The whole idea of ISO 26262 is to develop and stick to processes that manage functional safety requirements and minimize risk by improving safety. That's necessarily a conservative outlook: If something is well designed, if something is well tested, and if something has proved to be safe in the field, stick with it. Only make a change if strictly necessary to accommodate changed requirements.
We are the same way in IT. In many ways, we are conservative when deploying new technologies or when asked to embrace change. Still, let's never forget that our business requires constant innovation: We simply cannot say no to new ideas that will drive revenue or improve competitiveness merely because they introduce risk. But whenever feasible, let's stick with tried-and-true technologies, including hardware, software, and services that have demonstrated real-world reliability.
Think about failure, not just functionality
When something goes wrong with a car, people can die. When something goes wrong with enterprise IT, people usually don't die, but harm is done. We can be inspired by ISO 26262, which teaches us to routinely plan for when things go wrong with hardware, software, infrastructure, and services, and to consider the risk of failure when we evaluate new technologies. We can use a scale like ASIL to guesstimate the effects of failure events at every stage of an IT system's lifecycle, and focus our efforts on avoiding or minimizing harm.
At the end of the day, it's our responsibility to deliver services that connect users to applications and help them get their job done—safely, securely, and with the least risk possible.