Failure mode and effects analysis
Failure mode and effects analysis (FMEA) is a structured, proactive methodology used to identify potential failure modes within a system, design, process, or service, assess their causes and effects, and prioritize risks so they can be mitigated before they occur, thereby enhancing reliability, safety, and quality.[1]
FMEA originated in the U.S. military in the late 1940s as a technique to evaluate and reduce sources of variation and improve the reliability of complex systems, particularly in response to malfunctions in munitions and equipment.[2] The approach was first applied in aerospace and defense contexts, where it was formalized through standards such as MIL-P-1629 (1949), the precursor to MIL-STD-1629A, which established procedures for failure mode, effects, and criticality analysis (FMECA) to systematically evaluate design reliability from concept through production.[3] By the 1960s, NASA had adopted and refined FMEA for space missions such as the Apollo program to identify and address potential failures in critical hardware and software.[4] In the 1970s and 1980s, FMEA spread to the automotive and manufacturing industries, driven by quality improvement initiatives; Ford Motor Company, for instance, published guidelines for design and process FMEAs in 1988, influencing broader adoption.[5]
The Society of Automotive Engineers (SAE) standardized the practice with J1739 in 1994, providing a framework for potential failure mode and effects analysis in design (DFMEA) and manufacturing processes (PFMEA). In 2019, the Automotive Industry Action Group (AIAG) and Verband der Automobilindustrie (VDA) published a harmonized FMEA handbook that introduced a seven-step process and supplemental analyses such as FMEA for monitoring and system response (FMEA-MSR); SAE J1739 itself was most recently revised in 2021. Today, FMEA is integral to industries including healthcare, where the Veterans Affairs National Center for Patient Safety adapted it as Healthcare FMEA (HFMEA) in 2001 to proactively identify risks in patient care processes.[6]
At its core, FMEA involves assembling a multidisciplinary team to brainstorm failure modes; rate each mode's severity (impact on the end user or system), occurrence (likelihood of the cause arising), and detection (likelihood of identifying the failure before it reaches the end user); and calculate a risk priority number (RPN = severity × occurrence × detection) to guide mitigation actions.[1] Common variants include system-level FMEA for overall architecture, DFMEA for product design, PFMEA for manufacturing, and FMECA, which extends FMEA by quantifying criticality through probability assessments.[7] Because FMEA is a bottom-up technique, it is often paired with top-down tools such as fault tree analysis for comprehensive reliability engineering.[8]
Overview
Introduction
Failure mode and effects analysis (FMEA) is a structured, systematic technique for identifying potential failure modes within a system, design, process, or service and evaluating the effects of those failures on overall performance.[1] The methodology enables teams to anticipate issues by examining how individual components or steps might fail and the resulting impacts at local, subsystem, and system levels.[9] The core objective of FMEA is to prioritize risks through the assessment of three key factors (severity of the effect, likelihood of occurrence, and probability of detection), allowing organizations to focus mitigation efforts on the highest-priority failures before they manifest.[10] A common output is the risk priority number (RPN), calculated as the product of these three ratings, which quantifies and ranks failure modes for targeted interventions.[1] Originating in reliability engineering in the late 1940s as a U.S. military technique later extended to aerospace applications, FMEA has evolved into a widely adopted standard across diverse industries, including the manufacturing, healthcare, and automotive sectors, where it supports proactive quality and safety improvements.[1] Unlike reactive approaches such as root cause analysis, which investigate failures after they occur to identify their underlying reasons, FMEA emphasizes prevention by analyzing potential weaknesses in advance.[11]
History
The origins of failure mode and effects analysis trace back to the late 1940s, shortly after the end of World War II, when the United States military developed it as a systematic method to evaluate equipment reliability and reduce sources of variation in response to malfunctions observed in munitions and other complex systems.[2] The U.S. Department of Defense formalized the approach in Military Procedure MIL-P-1629, published in 1949, which outlined procedures for performing failure mode, effects, and criticality analysis (FMECA) to identify potential malfunctions in military systems and prioritize corrective actions.[2]
In the 1960s, NASA adopted and refined FMEA to ensure the reliability of mission-critical hardware in the Apollo space program, extending its application to complex aerospace systems where failure could have catastrophic consequences.[12] This formalization emphasized proactive risk identification, building on military foundations to support the high-stakes demands of space exploration.[13]
By the 1970s, FMEA gained traction in the aerospace and automotive industries, with Ford Motor Company notably integrating it following the safety controversies surrounding the Ford Pinto, which highlighted the need for rigorous failure prevention in vehicle design.[14] Aerospace firms, influenced by NASA's success, also began widespread use of the technique to enhance aircraft safety and reliability.[15]
Standardization accelerated in the 1980s and 1990s. The U.S. Department of Defense issued MIL-STD-1629A in 1980 to provide updated guidelines for FMECA in defense applications.[16] The Society of Automotive Engineers (SAE) introduced J1739 in 1994 to tailor FMEA to automotive design and processes, while the Automotive Industry Action Group (AIAG) released its first FMEA manual in 1993, followed by revisions such as the third edition in 2001 and the fourth in 2008, establishing it as a core quality tool.
Post-2000 developments integrated FMEA into broader quality management standards, including ISO 9001 for general quality systems, AS9100 for aerospace (revised in 2009 and 2016 to require risk-based thinking, for which analyses such as FMEA are commonly used), and IATF 16949 for automotive suppliers (updated in 2016 to emphasize preventive action via FMEA).[17] By the 2020s, adaptations extended FMEA to software and artificial intelligence systems, with frameworks such as AI-supported FMEA emerging to assess algorithmic failures and ethical risks in machine learning applications.[18]
Fundamentals
Basic Terms
In failure mode and effects analysis (FMEA), key terminology revolves around the identification and assessment of potential breakdowns in systems, components, or processes. These terms provide the foundational vocabulary for analyzing reliability and risk, distinguishing how failures manifest, propagate, and are detected.
Failure mode refers to the specific manner or way in which a component, subsystem, or system could fail to perform its intended function, such as cracking, short-circuiting, or excessive wear.[1] This concept emphasizes the observable or physical manifestation of failure, often tied to defects or errors that could impact performance or safety.[9]
Failure effect describes the consequences or outcomes resulting from a failure mode, which can occur at multiple levels: local effects on the immediate component, next-level effects on upstream or downstream elements, or end effects on the overall system, user, or mission.[19] For instance, a short circuit in an electrical component might cause local overheating, disrupt subsystem operation, and ultimately lead to system shutdown.[9] These effects are evaluated to understand their scope and severity in the context of risk assessment.[20]
Failure cause identifies the underlying root reasons or sources that lead to a particular failure mode, such as material fatigue, design deficiencies, environmental stressors, or manufacturing variations. In FMEA, causes are traced to enable preventive actions, distinguishing them from symptoms by focusing on origins like improper assembly or inadequate specifications.[9]
Indication denotes the detectable signals, symptoms, or methods by which a failure mode becomes apparent, such as alarms, visual anomalies, performance degradation, or diagnostic outputs.[21] This term highlights observable cues that allow for timely identification, often integrated with detection controls in the analysis.[3]
The dormancy or latency period is the elapsed time between the initiation of a failure mode and its detectable effects or manifestation, during which the failure remains hidden.[22] For example, a latent crack in a structural component might propagate undetected for hours or months before causing visible effects, influencing the urgency of monitoring strategies.[12]
FMEA distinguishes between functional failure, which occurs when a system or subsystem fails to fulfill its overall intended purpose (e.g., a pump not delivering fluid at required pressure), and component failure, which involves the breakdown of an individual part or element (e.g., a seal cracking under stress).[9] Component failures often contribute to functional failures, but not all do, allowing analysts to prioritize at different hierarchical levels.[23] These distinctions ensure comprehensive coverage from granular parts to holistic system performance.
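These terms map naturally onto a simple per-failure-mode record. The sketch below is a minimal Python illustration, assuming a plain dataclass; the field names and the brake-pad example values are hypothetical and are not drawn from any FMEA standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FailureMode:
    """Illustrative record tying together the basic FMEA terms."""
    item: str                       # component or function under analysis
    mode: str                       # the way the item could fail (e.g., cracking)
    local_effect: str               # effect on the immediate component
    next_level_effect: str          # effect on the surrounding subsystem
    end_effect: str                 # effect on the overall system, user, or mission
    causes: List[str] = field(default_factory=list)       # root causes (e.g., material fatigue)
    indications: List[str] = field(default_factory=list)  # detectable signals (alarms, degradation)
    dormancy: Optional[str] = None  # latency between initiation and detectable effects
    level: str = "component"        # "component" or "functional" failure

pad_wear = FailureMode(
    item="Brake pad",
    mode="Excessive wear",
    local_effect="Friction material thins below specification",
    next_level_effect="Longer caliper travel and uneven braking",
    end_effect="Reduced stopping power for the driver",
    causes=["Heat and friction degradation", "Inadequate maintenance"],
    indications=["Audible wear indicator", "Longer stopping distances"],
    dormancy="Weeks to months before effects become noticeable",
)
```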
Ground Rules
Conducting a failure mode and effects analysis (FMEA) relies on the assumption of complete system knowledge, necessitating detailed design or process data to identify potential failure modes effectively; incomplete information can result in significant gaps in the analysis, such as overlooked failure causes or effects.[24] This prerequisite ensures that the analysis is grounded in comprehensive technical specifications, including hardware configurations, operational procedures, and interface details, allowing for a bottom-up evaluation of foreseeable failure modes.[25]
A key prerequisite for effective FMEA is the formation of a multidisciplinary team, comprising experts such as designers, operators, manufacturing personnel, and reliability analysts, to provide diverse perspectives and comprehensive input on potential failures.[1] This cross-functional approach, often documented in team rosters, facilitates the identification of failure modes from multiple viewpoints, reducing biases and enhancing the thoroughness of the assessment.[26]
The scope of an FMEA must be clearly defined by establishing system boundaries, focusing exclusively on foreseeable failures within those limits while typically excluding external factors like user misuse unless explicitly included in the ground rules.[27] These boundaries, specified at the outset, guide the analysis to concentrate on internal system elements and probable operational scenarios, ensuring relevance and manageability.[28]
FMEA is inherently iterative, serving as a living document that evolves and is updated throughout the system lifecycle in response to new design changes, test results, or operational feedback.[29] This ongoing refinement maintains the analysis's accuracy and utility across phases from design to deployment.[30]
Documentation in FMEA requires the use of standardized worksheets to record failure modes, effects, causes, and controls, with traceability ensured to system requirements and prior analyses for consistency and auditability.[12] Such worksheets, often formatted per industry standards, provide a structured format that links each failure mode to its indenture level and supports subsequent reviews or criticality assessments.[28]
Methodology
Step-by-Step Process
The step-by-step process for conducting a failure mode and effects analysis (FMEA) is a systematic methodology designed to proactively identify and address potential failures in a system, product, or process. The sequence emphasizes team collaboration, thorough analysis, and iterative improvement, as detailed in established standards such as SAE J1739, which describes a six-step process, and the AIAG & VDA FMEA Handbook (2019), which expands it to seven steps with more detailed structure analysis, function analysis, and documentation.[31][32] The description below follows the six main phases of SAE J1739, covering the analysis from initial planning to final documentation.
Step 1: Define the scope and assemble the team. The process begins by clearly defining the boundaries of the analysis, such as the specific system, subsystem, or process to be examined, and reviewing relevant documentation such as system designs, process flowcharts, or boundary diagrams to establish context. A cross-functional team is assembled, including experts from design, engineering, operations, and quality, to bring diverse perspectives and ensure balanced input. This foundational step aligns the analysis with project goals and customer requirements, as recommended in the AIAG & VDA approach for planning and preparation.[31][1]
Step 2: Brainstorm potential failure modes. With the scope set, the team identifies possible ways each function or component could fail to perform its intended role, often breaking the system down into hierarchical elements such as subsystems and parts. Techniques such as brainstorming sessions or complementary tools like fault tree analysis are employed to systematically explore failure scenarios, focusing on how elements might deviate from expected behavior. This phase draws on structure and function analysis to map out all conceivable modes without initial judgment.[31][9][33]
Step 3: Determine causes and effects for each failure mode. For every identified failure mode, the team analyzes its root causes, such as material defects, environmental factors, or human error, and traces the resulting effects through the system's hierarchy, from local component impacts to broader system-level or end-user consequences. Effects are classified by severity and scope, ensuring traceability from the failure mode to potential safety, performance, or regulatory issues. This failure analysis step promotes a holistic view, highlighting interdependencies across the system.[1][31][34]
Step 4: Evaluate and rank risks. The team then assesses the significance of each failure mode by considering factors such as likelihood of occurrence, detectability, and potential impact. Risks are ranked using a qualitative or semi-quantitative method, such as the risk priority number (RPN), to focus efforts on the highest-priority areas. This evaluation integrates insights from the prior steps to guide resource allocation.[1]
Step 5: Recommend and prioritize actions. Based on the risk ranking, the team develops targeted recommendations to mitigate identified failures, such as design modifications, process controls, redundant features, or enhanced monitoring. Actions are prioritized by their potential to reduce risk most effectively, with responsibilities assigned to team members or departments and timelines established for implementation. This optimization phase ensures practical solutions aligned with cost and feasibility constraints.[31][34]
Step 6: Implement actions and reassess. Recommended actions are executed, followed by verification through testing, simulation, or monitoring to confirm risk reduction. The FMEA is updated to reflect new ratings and residual risks, with full documentation of the process, findings, and outcomes maintained for audits, continuous improvement, and compliance purposes. This iterative reassessment ensures the analysis remains relevant as the system evolves.[1][31]
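As a rough illustration of how the six phases fit together, the following Python sketch walks a single hypothetical failure mode through the sequence; the function names, dictionary layout, and ratings are assumptions for illustration only and are not prescribed by SAE J1739 or the AIAG & VDA handbook.

```python
# Minimal sketch of the six-step flow; names and values are illustrative only.

def define_scope():
    """Step 1: record analysis boundaries and the cross-functional team."""
    return {"system": "Automotive brake system", "team": ["design", "quality", "manufacturing"]}

def brainstorm_failure_modes(scope):
    """Step 2: list ways each element could fail to perform its function."""
    return [{"item": "Brake pad", "mode": "Excessive wear"}]

def analyze_causes_and_effects(modes):
    """Step 3: attach root causes and local/system-level effects to each mode."""
    for m in modes:
        m["causes"] = ["Heat and friction degradation"]
        m["end_effect"] = "Reduced stopping power"
    return modes

def evaluate_and_rank(modes):
    """Step 4: rate severity, occurrence, and detection (1-10) and rank by RPN."""
    for m in modes:
        m.update(severity=9, occurrence=4, detection=3)
        m["rpn"] = m["severity"] * m["occurrence"] * m["detection"]
    return sorted(modes, key=lambda m: m["rpn"], reverse=True)

def recommend_actions(ranked):
    """Step 5: assign mitigations to the highest-priority modes."""
    ranked[0]["action"] = "Add electronic wear sensor with dashboard alert"
    return ranked

def implement_and_reassess(ranked):
    """Step 6: apply actions, re-rate, and keep the worksheet as a living document."""
    ranked[0]["detection"] = 1  # sensor makes the failure far easier to detect
    ranked[0]["rpn"] = ranked[0]["severity"] * ranked[0]["occurrence"] * ranked[0]["detection"]
    return ranked

worksheet = implement_and_reassess(
    recommend_actions(evaluate_and_rank(analyze_causes_and_effects(
        brainstorm_failure_modes(define_scope())))))
print(worksheet[0]["rpn"])  # 36 after the recommended action
```

In practice each step populates a shared worksheet reviewed by the team rather than a list of dictionaries, but the order of activities is the same.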
Risk Assessment Metrics
In failure mode and effects analysis (FMEA), risk assessment relies on three primary quantitative metrics: Severity (S), Occurrence (O), and Detection (D), each rated on a standardized 1-10 scale to evaluate and prioritize potential failure modes. These metrics enable teams to systematically quantify risks by assessing the impact, likelihood, and detectability of failures, facilitating focused mitigation efforts. The scales are designed to be consistent across analyses, drawing from established industry guidelines to ensure objectivity and comparability.[35]
Severity (S) measures the seriousness of the effects resulting from a failure mode, rated from 1 (negligible impact, such as a minor inconvenience with no safety or performance issues) to 10 (catastrophic consequences, including hazards to human safety without warning, regulatory non-compliance, or total loss of primary function). For instance, a failure that could cause injury or death scores a 10, while one resulting in only cosmetic damage scores a 4 or lower. This scale emphasizes customer and end-user impacts, prioritizing safety-related effects.[36]
Occurrence (O), sometimes referred to as Probability (P), evaluates the likelihood of a failure cause occurring, scaled from 1 (extremely unlikely) to 10 (highly probable). Ratings are informed by empirical data, such as failure rates from testing, production history, or similar systems (e.g., a rating of 1 corresponds to failure rates below 1 in 1,500,000 opportunities per AIAG guidelines), with lower scores reflecting robust preventive controls.[37]
Detection (D) assesses the probability of identifying the failure mode or its cause before the effect reaches the end user, rated from 1 (almost certain detection through current controls, such as automatic sensors) to 10 (undetectable, with no effective monitoring or testing in place). This metric focuses on the adequacy of inspection, testing, and preventive measures; higher scores indicate gaps in detection capability.[38]
The core output of these metrics is the Risk Priority Number (RPN), calculated as RPN = S × O × D, which yields a value from 1 to 1,000 and ranks failure modes by overall risk. Failure modes with RPN values exceeding a predefined threshold, such as 100, typically require immediate action, though prioritization also considers high severity regardless of RPN. This multiplicative approach highlights risks where even moderate individual ratings combine to indicate significant concern.[35][36]
In safety-critical applications, an alternative to RPN is criticality analysis, which uses only the product S × O, excluding detection so that detectability does not dilute the ranking of severe hazards. This method is particularly emphasized in aerospace and medical device FMEAs, where human safety overrides detectability.[36]
To ensure consistency, ratings should follow industry standards such as those from the Automotive Industry Action Group (AIAG), which provide detailed criteria tables for S, O, and D tailored to automotive contexts but adaptable to other sectors. Teams are encouraged to calibrate scales using historical data or cross-functional consensus, avoiding subjective bias by referencing quantitative benchmarks where possible.[39]
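Because the metrics reduce to simple arithmetic, they are easy to express directly. The Python sketch below computes the RPN and the S × O criticality variant with basic range checks; the action threshold of 100 echoes the illustrative value above and is not a universal standard.

```python
def check_rating(value: int) -> int:
    """Validate a Severity, Occurrence, or Detection rating on the 1-10 scale."""
    if not 1 <= value <= 10:
        raise ValueError("FMEA ratings must be integers between 1 and 10")
    return value

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: S x O x D, giving a value between 1 and 1,000."""
    return check_rating(severity) * check_rating(occurrence) * check_rating(detection)

def criticality(severity: int, occurrence: int) -> int:
    """S x O variant used where detection should not dilute safety-critical risks."""
    return check_rating(severity) * check_rating(occurrence)

ACTION_THRESHOLD = 100  # illustrative threshold; organizations set their own

severity, occurrence, detection = 9, 4, 3
score = rpn(severity, occurrence, detection)              # 108
needs_action = score > ACTION_THRESHOLD or severity >= 9  # act on high severity regardless of RPN
print(score, criticality(severity, occurrence), needs_action)  # 108 36 True
```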
Example Worksheet
To illustrate the application of failure mode and effects analysis (FMEA) in practice, consider a hypothetical Design FMEA for key components of an automotive brake system, following the standard worksheet format outlined in SAE J1739. This example focuses on the brake pads and caliper assembly, identifying potential failure modes, assessing risks using Severity (S), Occurrence (O), and Detection (D) ratings on a 1-10 scale (where 10 indicates the highest severity, likelihood, or difficulty of detection), and calculating the Risk Priority Number (RPN = S × O × D). High RPN values prioritize actions to mitigate risks, such as redesigns or enhanced monitoring. The worksheet below presents sample entries for three failure modes.
For instance, in the case of brake pad wear, the high initial Severity (S=9) reflects a critical safety risk of sudden loss of braking, the moderate Occurrence (O=4) reflects typical vehicle mileage, and the Detection rating (D=3) reflects reliance on periodic inspections. The resulting RPN of 108 indicates priority for intervention. Recommended actions, such as integrating an electronic wear sensor, reduce Detection to 1 by enabling real-time alerts, lowering the revised RPN to 36 while leaving S and O unchanged. This demonstrates how targeted improvements in detection can significantly lower overall risk without altering the failure's inherent severity or frequency.

| Item/Function | Failure Mode | Effects | Causes | S | O | D | RPN | Recommended Actions | Revised S | Revised O | Revised D | Revised RPN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brake pads: Provide friction to decelerate wheel | Excessive wear | Reduced stopping power; potential vehicle instability or collision | Material degradation from heat/friction; inadequate maintenance | 9 | 4 | 3 | 108 | Install electronic wear sensors with dashboard alerts; specify higher-durability pad material | 9 | 4 | 1 | 36 |
| Brake caliper: Apply force to pads | Piston seizure | Uneven braking; pulling to one side; increased stopping distance | Corrosion or contamination in piston seals | 8 | 3 | 5 | 120 | Enhance seal design with corrosion-resistant materials; add routine flush procedures in service manual | 8 | 3 | 2 | 48 |
| Brake fluid line: Transmit hydraulic pressure | Leak | Loss of brake pressure; total brake failure | Fatigue crack from vibration; improper installation | 10 | 2 | 4 | 80 | Reinforce lines with braided steel; implement torque checks during assembly | 10 | 2 | 2 | 40 |
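
As a quick arithmetic check, the RPN columns in the worksheet can be reproduced with a few lines of Python; the tuples below simply restate the table's S, O, and D ratings together with the revised Detection values.

```python
# Recompute the worksheet's RPN columns; rows mirror the example table above.
rows = [
    # (item and failure mode, S, O, initial D, revised D)
    ("Brake pads: excessive wear",    9, 4, 3, 1),
    ("Brake caliper: piston seizure", 8, 3, 5, 2),
    ("Brake fluid line: leak",       10, 2, 4, 2),
]

for name, s, o, d, d_revised in rows:
    print(f"{name}: RPN = {s * o * d}, revised RPN = {s * o * d_revised}")
# Brake pads: excessive wear: RPN = 108, revised RPN = 36
# Brake caliper: piston seizure: RPN = 120, revised RPN = 48
# Brake fluid line: leak: RPN = 80, revised RPN = 40
```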