Effectively communicating system redundancy is important because redundancy touches system performance, risk management, disaster recovery, regulatory compliance, and customer & owner confidence. Getting the redundancy communication wrong produces blind spots and surprises. Getting it right produces a well-oiled, predictable machine.
Redundancy and How to Apply It
Redundancy is the existence of more than one means for accomplishing a given function. Each means of accomplishing the function need not necessarily be identical.
The FINESSE Fishbone Diagram as One Approach
Strategic, operational, and emergency situations require different communication approaches. FINESSE is a strategic communication approach for getting senior management and decision makers to understand.
Trial and error is not advisable for big decisions.
Confirm Redundancy (Don’t Assume It)
If we rely on redundancy to minimize risk, we must confirm that it works. In other words, tolerating failure only works when the redundancy itself has been verified. These are some of the key reasons I have seen redundancy fail throughout my career:
Poor switching
Missing equipment
Oversold designs
Non-documented modifications
Poorly understood systems
Is there an alternative to redundancy? Yes, fault avoidance is the other branch of fault management. Fault avoidance includes simplicity, better parts, lower stresses, and training.
The reality is that good fault tolerance (redundancy or robustness) starts with good fault avoidance.
Communicating with The Four Horsemen of Redundancy
The original Four Horsemen are:
Conquest, who rides a white horse and carries a bow.
War, who rides a red horse and wields a large sword.
Famine, who rides a black horse and holds a pair of scales.
Death, who rides a pale horse and is followed by Hades.
The Four Horsemen of Redundancy are:
Complexity
Independence
Propagation
Human Error
1. Complexity
Extra elements require further managerial systems to determine, indicate, and mediate failures. Redundancy can increase to the point where it is the primary source of unreliability.
Key Points for Communicating Complexity
More elements equal more points of failure.
Too little or too much redundancy can make a system more fragile and unreliable.
More training, support systems, and management processes come with redundancy.
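The trade-off in the points above can be put into rough numbers. What follows is a minimal Python sketch, not a validated reliability model. The function name system_reliability, the unit reliability of 0.95, and the overhead reliability of 0.99 are illustrative assumptions, as is the simplification that every unit beyond the first adds one switching or monitoring element in series with the whole system.

```python
def system_reliability(n_units: int, unit_rel: float, overhead_rel: float) -> float:
    """Reliability of n parallel units with management overhead in series.

    Assumptions (illustrative only): units fail independently, and each unit
    beyond the first adds one switching/monitoring element that must also work.
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    # The function is delivered if at least one of the n units works.
    parallel = 1.0 - (1.0 - unit_rel) ** n_units
    # Added switching/monitoring elements sit in series with the whole group.
    overhead = overhead_rel ** (n_units - 1)
    return parallel * overhead


# Compare one to five units under the assumed numbers.
for n in range(1, 6):
    print(n, round(system_reliability(n, unit_rel=0.95, overhead_rel=0.99), 4))
```

Under these assumed numbers, system reliability peaks at two units and declines with each unit added after that, because the added overhead costs more than the extra unit contributes. Real systems will have different numbers; the shape of the curve is the message for decision makers.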
Key Approach
Show them the visuals!
2. Independence
"Independent" means that the chances of one failing are not linked in any way to the chances of the other failing. Most things do not work or fail independently. Independence is a simplifying assumption in analysis (breaking things into parts) and design.
Key Points for Communicating Independence
Many redundancy calculations assume that redundant elements behave independently.
Identical elements will likely wear in similar ways.
Identical elements will likely fail at similar times.
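A short calculation can show why the independence assumption matters. The sketch below uses the beta-factor approximation, a common simplification for common-cause failures in which a fraction beta of each unit's failures is assumed to take out both units at once. The function name pair_failure_probability and the values of 0.01 and 0.1 are illustrative assumptions, not figures from any particular system.

```python
def pair_failure_probability(unit_fail_prob: float, beta: float) -> float:
    """Approximate probability that both units of a redundant pair fail.

    Beta-factor approximation: a fraction `beta` of each unit's failures is
    shared (common cause); the remainder is treated as independent.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    independent_part = ((1.0 - beta) * unit_fail_prob) ** 2
    common_cause_part = beta * unit_fail_prob
    return independent_part + common_cause_part


# Assumed 1% failure probability per unit.
fully_independent = pair_failure_probability(0.01, beta=0.0)
partly_linked = pair_failure_probability(0.01, beta=0.1)
print(fully_independent, partly_linked, partly_linked / fully_independent)
```

With the assumed 1% unit failure probability, a 10% common-cause fraction makes the pair roughly ten times more likely to fail than the fully independent calculation suggests.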
Key Approaches for Separation and Single Points of Failure
Use a simple visual.
Have a concise message.
Block diagrams and line drawings are good tools.
Key Approaches for Diversification
Keep discussion as simple as possible (diversification is not easily accepted).
Use an example from the relevant industry.
Make the logical case that if you are going to use redundancy, you should use it to the fullest extent possible.
3. Propagation
An unexpected failure mode or effect in an upstream system may unexpectedly impact (over-stress) the performance of downstream systems. The unexpected catastrophic failure of an upstream system may wipe out a downstream system. This is commonly called a cascading failure.
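A cascading failure can be sketched with a toy load-sharing model. The Python example below is a simplified illustration under stated assumptions: the function name cascade is hypothetical, the total load is assumed to be split equally among the surviving units, and any unit loaded beyond its capacity is assumed to fail immediately.

```python
def cascade(total_load: float, capacities: list[float], first_failure: int) -> set[int]:
    """Return the indices of failed units once load redistribution settles.

    Assumptions (illustrative only): surviving units share the total load
    equally, and a unit fails as soon as its share exceeds its capacity.
    """
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(capacities)) if i not in failed]
        if not survivors:
            return failed  # Nothing left to carry the load.
        share = total_load / len(survivors)
        overloaded = {i for i in survivors if share > capacities[i]}
        if not overloaded:
            return failed  # The remaining units can carry the load.
        failed |= overloaded


# Same initial failure, different downstream margins.
print(cascade(150, [80, 70, 100], first_failure=0))
print(cascade(150, [80, 80, 160], first_failure=0))
```

In the first case the loss of one unit overloads the next, and the failure runs through every unit. In the second case the survivors have enough margin and the failure stops where it started.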
Another failure mode involves an adjacent system that is not otherwise connected to the main system, yet whose partial or complete failure causes the main system to fail. This behavior is referred to as system-of-systems (SoS) failure.
Key Points for Communicating Propagation (cascading failures)
Redundancy and robustness are not the same.
Robustness is a system's ability to handle a wide range of inputs, stresses, and unexpected conditions.
Capacity erosion (loss of design capacity) is a real consequence of using redundancy to offset stresses from overloaded upstream systems.
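Capacity erosion is easy to show with simple arithmetic. The following Python sketch is a hypothetical example: the function name spare_units, the three installed units, the 100-unit capacity, and the demand values are all assumptions chosen for illustration.

```python
import math


def spare_units(installed: int, unit_capacity: float, demand: float) -> int:
    """Number of installed units left over after meeting demand.

    A negative result means demand cannot be met even with every unit running.
    """
    units_needed = math.ceil(demand / unit_capacity)
    return installed - units_needed


# Three installed units of 100 each, designed as two duty plus one standby.
for demand in (180, 230, 310):
    print(demand, spare_units(installed=3, unit_capacity=100, demand=demand))
```

At the assumed design demand of 180, one unit is truly spare. Once demand grows to 230, the "standby" unit is running to meet normal load and the redundancy exists only on paper.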
Key Approach
The problem frame, including system boundaries and definitions, must be clear.
4. Human Error (Human Performance)
Failures of all kinds instigate human action. When redundancy is present, human intervention comes easily as a response to the designed-in potential for a primary unit to fail.
Technological failures in monitoring or control systems open the door to human error, even if the system is actually functioning as designed.
Errors are not automatically detected in many cases. Even worse, latent human errors can become ‘normalized’ over time, resulting in an unconscious reliance on redundancy.
Key Points for Communicating Human Error
Systems include people, equipment, processes, inputs, outputs, and feedback loops.
Communicating that people are part of systems, especially those that tolerate failure, is part of an upfront message related to systems thinking.
More often than not, chaos follows when redundancy is called upon. That's largely because humans intervene when something fails catastrophically, whether it's part of the plan or not. People are an integral part of any system.
Frontline staff are often incorrectly blamed in crises and poor failure evaluations. That's unfortunate because more standards, training, and management systems are needed when the plan is to tolerate failure.
Key Approach
Communicate systems, not blame. The root causes usually relate to management systems, training, and standards. Those aren’t ever perfect, but neither are humans.
Tying Tips to a Communication Approach for Redundancy
Communicating to Senior Management
Communicating with FINESSE focuses on getting the boss's boss to understand. In this case, tying tips into a communication approach is about communicating up to decision makers.
FINESSE and the Seven Bones
Effective communication requires doing all seven bones well, but not necessarily perfectly. Here, we boil it down to one tip per bone for communicating system redundancy. Obviously, there is more than one tip per bone.
Frame: Explain the frame, including the system boundaries and key definitions.
Illustrate: Use block diagrams and line diagrams.
Noise reduction: Too many reliability calculations produce noise; keep discussions on interfaces and switches.
Empathy: No one wants failure, yet we design to have it. Senior management must understand that redundancy is a form of failure tolerance.
Structure: Discuss the weak points first (single points of failure or need for separation).
Synergy: Have one-on-one discussions before having a group meeting.
Ethics: Redundancy still requires high levels of safety, specification, and rigorous testing (validation).
Communicating System Redundancy
There you have it! How to communicate system redundancy in less than 1200 words! You could make a whole webinar or short course out of this (wait a minute, we do).
Why 1200 words? Because that’s about as much time (less than 10 minutes) as senior management has for you.
Are you Communicating with FINESSE?
Founded by JD Solomon, Communicating with FINESSE is a not-for-profit community of technical professionals dedicated to being highly effective communicators as trusted advisors to senior management. Learn more about our publications, webinars, and workshops. Join the community for free.