Atoposyz's personal blog

Complete Collection of Fault Injection Analysis Papers from CCF-A Architecture Conferences (2021-2025)

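Data source: CCF 2026 recommended list (7th edition) → the 10 class-A conferences under Computer Architecture / Parallel and Distributed Computing / Storage Systems
Retrieval method: keyword search through a self-built CCF API + LLM-assisted manual scan of 6,800 titles + supplementary searches from multiple angles + manual review and exclusion
Result: 100 core papers + 8 borderline papers, each listed with its full title and a direct DOI link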
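
The keyword-search and de-duplication steps of the retrieval method can be sketched roughly as follows. This is a minimal illustration, not the script used for this post: the keyword list, the `matches`, `normalize`, and `screen` helpers, and the record layout are all assumptions, and the self-built CCF API is not modeled at all.

```python
import re

# Hypothetical keyword stems; the post does not state the actual search terms.
KEYWORDS = ["fault injection", "soft error", "silent data corruption",
            "fault toleran", "resilien", "rowhammer"]


def matches(title: str) -> bool:
    """Return True if any keyword stem occurs in the lower-cased title."""
    t = title.lower()
    return any(k in t for k in KEYWORDS)


def normalize(title: str) -> str:
    """Collapse case and punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def screen(records):
    """Keep keyword hits and drop repeated titles (first occurrence wins)."""
    seen = set()
    kept = []
    for venue, year, title in records:
        key = normalize(title)
        if matches(title) and key not in seen:
            seen.add(key)
            kept.append((venue, year, title))
    return kept


# Sample records taken from the lists in this post.
records = [
    ("MICRO", 2021, "Characterizing and Mitigating Soft Errors in GPU DRAM"),
    ("MICRO", 2022, "Characterizing and Mitigating Soft Errors in GPU DRAM."),
    ("HPCA", 2022, "Reliability-Aware Runahead"),
    ("SC", 2025, "FT-Transformer: Fault Tolerant Transformer with ABFT Attention"),
]
candidates = screen(records)
```

With these sample records, the repeated GPU DRAM title is dropped, and "Reliability-Aware Runahead" is missed by the keyword stems, which is the kind of gap the title scan and supplementary searches are meant to cover.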



1. Hardware Error Analysis

Analyzes the impact of soft errors, transient faults, and permanent faults on hardware circuits and microarchitecture.

Conference | Year | Paper | Link
ISCA | 2021 | Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers | DOI
ISCA | 2021 | Failure Sentinels: Ubiquitous Just-in-time Intermittent Computation via Low-cost Hardware Support for Voltage Monitoring | DOI
ISCA | 2024 | Harpocrates: Breaking the Silence of CPU Faults through Hardware-in-the-Loop Program Generation | DOI
MICRO | 2021 | ExHero: Execution History-Aware Error-Rate Estimation in Pipelined Designs | DOI
MICRO | 2021 | HARP: Practically and Effectively Identifying Uncorrectable Errors in Memory Chips That Use On-Die Error-Correcting Codes | DOI
MICRO | 2021 | Characterizing and Mitigating Soft Errors in GPU DRAM | DOI
MICRO | 2022 | RemembERR: Leveraging Microprocessor Errata for Design Testing and Validation | DOI
MICRO | 2023 | Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs | DOI
MICRO | 2023 | Predicting Future-System Reliability with a Component-Level DRAM Fault Model | DOI
MICRO | 2025 | DRAM Fault Classification through Large-Scale Field Monitoring for Robust Memory RAS Management | DOI
MICRO | 2025 | Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering | DOI
MICRO | 2024 | DelayAVF: Calculating Architectural Vulnerability Factors for Delay Faults | DOI
HPCA | 2021 | Operating Liquid-Cooled Large-Scale Systems: Long-Term Monitoring, Reliability Analysis, and Efficiency Measures | DOI
ISCA | 2021 | Dvé: Improving DRAM Reliability and Performance On-Demand via Coherent Replication | DOI
HPCA | 2023 | A Systematic Study of DDR4 DRAM Faults in the Field | DOI
HPCA | 2023 | AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment | DOI
HPCA | 2025 | Veritas - Demystifying Silent Data Corruptions: μArch-Level Modeling and Fleet Data of Modern x86 CPUs | DOI
HPCA | 2025 | Variable Read Disturbance: An Experimental Analysis of Temporal Variation in DRAM Read Disturbance | DOI
ASPLOS | 2024 | Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations | DOI
ASPLOS | 2024 | MulBERRY: Enabling Bit-Error Robustness for Energy-Efficient Multi-Agent Autonomous Systems | DOI
ASPLOS | 2024 | Proactive Runtime Detection of Aging-Related Silent Data Corruptions: A Bottom-Up Approach | DOI
ASPLOS | 2025 | Hardware Sentinel: Protecting Software Applications from Hardware Silent Data Corruptions | DOI
SC | 2022 | From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell | DOI
SC | 2023 | Understanding the Effects of Permanent Faults in GPU’s Parallelism Management and Control Units | DOI
SC | 2023 | Demystifying Cross-Layer Deficiencies of Soft Error Protection | DOI
SC | 2025 | Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs | DOI
DAC | 2024 | How accurately can soft error impact be estimated in black-box/white-box cases? - a case study with an edge AI SoC - | DOI
DAC | 2024 | Graph Learning-based Fault Criticality Analysis for Enhancing Functional Safety of E/E Systems | DOI

2. Fault Tolerance Techniques

Conference | Year | Paper | Link
ISCA | 2024 | On Error Correction for Nonvolatile Processing-In-Memory | DOI
MICRO | 2021 | Turnpike: Lightweight Soft Error Resilience for In-Order Cores | DOI
MICRO | 2022 | Featherweight Soft Error Resilience for GPUs | DOI
MICRO | 2022 | Revisiting Residue Codes for Modern Memories | DOI
MICRO | 2025 | Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks | DOI
HPCA | 2021 | ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance | DOI
HPCA | 2021 | CARE: Coordinated Augmentation for Elastic Resilience on DRAM Errors in Data Centers | DOI
HPCA | 2022 | Reliability-Aware Runahead | DOI
HPCA | 2023 | Realizing Extreme Endurance Through Fault-aware Wear Leveling and Improved Tolerance | DOI
HPCA | 2024 | Gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures | DOI
HPCA | 2025 | ER-DCIM: Error-Resilient Digital CIM Architecture with Run-Time MAC-Cell Error Correction | DOI
HPCA | 2025 | Revisiting Reliability in Large-Scale Machine Learning Research Clusters | DOI
ASPLOS | 2024 | TAROT: A CXL SmartNIC-Based Defense Against Multi-bit Errors by Row-Hammer Attacks | DOI
ASPLOS | 2024 | Lightweight Fault Isolation: Practical, Efficient, and Secure Software Sandboxing | DOI
ASPLOS | 2025 | MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training | DOI
ASPLOS | 2025 | Fault Escaping: Improving Robustness of DPU Enhanced Platform with Mutual Assisted VM Recovery | DOI
ASPLOS | 2025 | PCcheck: Persistent Concurrent Checkpointing for ML | DOI
ASPLOS | 2025 | Robustness Verification for Checking Crash Consistency of Non-volatile Memory | DOI
SC | 2021 | Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs | DOI
SC | 2023 | Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts | DOI
SC | 2023 | Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults | DOI
SC | 2023 | Unity ECC: Unified Memory Protection Against Bit and Chip Errors | DOI
SC | 2024 | Versatile Datapath Soft Error Detection on the Cheap for HPC Applications | DOI
SC | 2024 | Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks | DOI
SC | 2025 | FT-Transformer: Fault Tolerant Transformer with ABFT Attention | DOI
SC | 2025 | Demystifying the Resilience of Large Language Model Inference: An End-to-End Perspective | DOI
DAC | 2021 | Fault-free: Fault-resilient DNN Accelerator on ReRAM | DOI
DAC | 2021 | BayesFT: Bayesian Optimization for Fault Tolerant Neural Network Architecture | DOI
DAC | 2021 | Analyzing and Improving Fault Tolerance of Learning-Based Navigation Systems | DOI
DAC | 2021 | Low-Cost FT Enhancement for Emerging Memories-Based DNN | DOI
DAC | 2023 | HBP: Hierarchically Balanced Pruning and Accelerator Co-Design for Efficient DNN Inference | DOI
DAC | 2022 | SoftSNN: low-cost fault tolerance for spiking neural network accelerators under soft errors | DOI
DAC | 2022 | Winograd convolution: a perspective from fault tolerance | DOI
DAC | 2022 | SEM-latch: Low-Cost Latch for Mitigating Soft Errors in CMOS | DOI
DAC | 2023 | STRIVE: Enabling Choke Point Detection and Timing Error Resilience in a Low-Power Tensor Processing Unit | DOI
DAC | 2023 | UpTime: Towards Flow-based In-Memory Computing with High Fault-Tolerance | DOI
DAC | 2023 | TFix: Exploiting the Natural Redundancy of Ternary Neural Networks for Fault Tolerant In-Memory Vector Matrix Multiplication | DOI
DAC | 2023 | BERRY: Bit Error Robustness for Energy-Efficient Reinforcement Learning-Based Autonomous Systems | DOI
DAC | 2023 | RQ-DNN: Reliable Quantization for Fault-tolerant Deep Neural Networks | DOI
DAC | 2024 | Maintaining Sanity: Algorithm-based Comprehensive Fault Tolerance for CNNs | DOI
DAC | 2024 | MENDNet: Just-in-time Fault Detection and Mitigation in AI Systems with Uncertainty Quantification and Multi-Exit Networks | DOI
DAC | 2024 | Cross-Layer Reliability Evaluation and Efficient Hardening of Large Vision Transformers Models | DOI
DAC | 2025 | ReaLM: Reliable and Efficient Large Language Model Inference with Statistical Algorithm-Based Fault Tolerance | DOI
DAC | 2025 | PoP-ECC: Error Correction against Multi-Bit Upsets in DNN Accel | DOI
DAC | 2025 | CXL-ECC: ECC for CXL DRAM Error Correction | DOI
DAC | 2025 | EPIC: Error PredIction and Correction for Power-Efficient Voltage Underscaling Multiply-Accumulate Unit | DOI
PPoPP | 2025 | TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs | DOI
PPoPP | 2025 | ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training | DOI
HPDC | 2025 | FT2: First-Token-Inspired Online Fault Tolerance on Critical Layers for Generative Large Language Models | DOI
HPDC | 2024 | RL-based Adaptive Mitigation of Uncorrected DRAM Errors | DOI
HPDC | 2023 | FT-GEMM: Fault Tolerant GEMM on x86 CPUs | DOI
HPDC | 2022 | Automated Pipeline for Advanced FT in Edge Computing | DOI
EuroSys | 2024 | Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures | DOI
EuroSys | 2024 | SplitFT: Fault Tolerance for Disaggregated Datacenters via Remote Memory Logging | DOI
EuroSys | 2025 | Achilles: Efficient TEE-Assisted BFT Consensus via Rollback Resilient Recovery | DOI

3. Software-Level Error Analysis

Conference | Year | Paper | Link
PPoPP | 2021 | Understanding a program’s resiliency through error propagation | DOI
SC | 2021 | PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications | DOI
SC | 2021 | G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs | DOI
SC | 2022 | Mitigating SDCs in HPC across Multiple Program Inputs | DOI
SC | 2023 | Recovering Detectable Uncorrectable Errors via Spatial Data Prediction | DOI
SC | 2025 | Reproducible Performance Evaluation of OpenMP and SYCL Workloads under Noise Injection | DOI
SC | 2025 | Compression Error Sensitivity Analysis for Different Experts in MoE Model Inference | DOI
DAC | 2023 | SEPE-SQED: Symbolic Quick Error Detection | DOI
EuroSys | 2023 | An Empirical Study of Resource-Stressing Faults in Edge-Computing Applications | DOI
EuroSys | 2023 | Causal fault localisation in dataflow systems | DOI

4. Fault Injection Attacks

Conference | Year | Paper | Link
MICRO | 2021 | ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition | DOI
ISCA | 2023 | RowPress: Amplifying Read Disturbance in Modern DRAM Chips | DOI
ISCA | 2024 | PrIDE: Achieving Secure Rowhammer Mitigation with Low-Cost In-DRAM Trackers | DOI
ISCA | 2025 | DREAM: Enabling Low-Overhead Rowhammer Mitigation via Directed Refresh Management | DOI
ISCA | 2025 | MoPAC: Efficiently Mitigating Rowhammer with Probabilistic Activation Counting | DOI
ISCA | 2025 | PuDHammer: Experimental Analysis of Read Disturbance Effects of Processing-using-DRAM in Real DRAM Chips | DOI
HPCA | 2023 | Multi-Granularity Shadow Paging with NVM Write Optimization for Crash-Consistent Memory-Mapped I/O | DOI
HPCA | 2024 | CoMeT: Count-Min-Sketch-based Row Tracking to Mitigate RowHammer at Low Cost | DOI
HPCA | 2025 | Chronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance | DOI
DAC | 2021 | DeepStrike: Remotely-Guided Fault Injection Attacks on DNN Accelerator in Cloud-FPGA | DOI
DAC | 2021 | Rewrite to Reinforce: Binary Countermeasures against Fault Injection | DOI
DAC | 2021 | SACReD: An Attack Framework on SAC Resistant Delay-PUFs leveraging Bias and Reliability Factors | DOI
DAC | 2023 | NNTesting: Neural Network Fault Attacks Detection Using Gradient-Based Test Vector Generation | DOI
DAC | 2023 | ExploreFault: Identifying Exploitable Fault Models in Block Ciphers with Reinforcement Learning | DOI
DAC | 2023 | Stalker: A Framework to Analyze Fragility of Cryptographic Libraries under Hardware Fault Models | DOI
DAC | 2023 | Pre-silicon Side Channel and Fault Analysis | DOI
DAC | 2023 | VideoFlip: Adversarial Bit Flips for Reducing Video Service Quality | DOI
DAC | 2024 | CDS: An Anti-Aging Calibratable Digital Sensor for Detecting Multiple Types of Fault Injection Attacks | DOI
DAC | 2024 | Plug Your Volt: Protecting Intel Processors against Dynamic Voltage Frequency Scaling based Fault Attacks | DOI

5. Fault Injection Tools/Frameworks

Conference | Year | Paper | Link
DAC | 2023 | Fault Injection in Native Logic-in-Memory Computation on Neuromorphic Hardware | DOI
DAC | 2025 | GraphFI: An Efficient Fault Injection Framework for Graph Processing on GPGPUs | DOI
DAC | 2025 | EPIC: Error PredIction and Correction for Power-Efficient Voltage Underscaling Multiply-Accumulate Unit | DOI
DAC | 2025 | MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors | DOI
DAC | 2025 | FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems | DOI
DAC | 2025 | FT-MUX: Fault-Tolerant Microfluidic Multiplexer Design | DOI

6. Storage/Memory Reliability

Conference | Year | Paper | Link
ISCA | 2021 | CryoGuard: A Near Refresh-Free Robust DRAM Design for Cryogenic Computing | DOI
ISCA | 2023 | On Endurance of Processing in (Nonvolatile) Memory | DOI
DAC | 2021 | Efficient ECC Mechanism for Memristive PIM | DOI
FAST | 2025 | AWUPF Rediscovered: Atomic Writes to Unleash Pivotal Fault-Tolerance in SSDs | DOI
EuroSys | 2021 | Understanding and dealing with hard faults in persistent memory systems | DOI

7. Borderline Papers (System Level)

These are somewhat removed from architectural fault injection; include them or not depending on research needs.

Conference | Year | Paper | Link
EuroSys | 2024 | Dashing and Star: Byzantine Fault Tolerance with Weak Certificates | DOI
PPoPP | 2024 | OsirisBFT: Say No to Task Replication for Scalable Byzantine Fault Tolerant Analytics | DOI
EuroSys | 2024 | SplitFT: Fault Tolerance for Disaggregated Datacenters via Remote Memory Logging | DOI
HPDC | 2022 | Heterogeneous Systems Resilience: From Research to Industry Standards | DOI
HPDC | 2024 | RL-based Adaptive Mitigation of Uncorrected DRAM Errors | DOI
HPDC | 2023 | FT-GEMM: Fault Tolerant GEMM on x86 CPUs | DOI
HPDC | 2022 | Automated Pipeline for Advanced FT in Edge Computing | DOI
SC | 2024 | Fault-Tolerant Deep Learning Cache with Hash Ring for Load Balancing in HPC Systems | DOI

8. Special Topic: Fault Space Reduction

Papers involving fault space reduction, selected from the 100 (~25 papers)

Vulnerability-Driven Pruning

Conference | Year | Paper | Link
ISCA | 2021 | Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers | DOI
MICRO | 2024 | DelayAVF: Calculating Architectural Vulnerability Factors for Delay Faults | DOI
SC | 2021 | PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications | DOI
PPoPP | 2021 | Understanding a program’s resiliency through error propagation | DOI
DAC | 2024 | Graph Learning-based Fault Criticality Analysis for Enhancing Functional Safety of E/E Systems | DOI
HPCA | 2023 | AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment | DOI

Statistical Sampling and Fault Classification

Conference | Year | Paper | Link
MICRO | 2025 | DRAM Fault Classification through Large-Scale Field Monitoring for Robust Memory RAS Management | DOI
SC | 2021 | G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs | DOI
MICRO | 2025 | Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering | DOI
HPCA | 2023 | A Systematic Study of DDR4 DRAM Faults in the Field | DOI
HPCA | 2025 | Veritas - Demystifying Silent Data Corruptions: μArch-Level Modeling and Fleet Data of Modern x86 CPUs | DOI

Selective Protection

Conference | Year | Paper | Link
SC | 2023 | Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts | DOI
PPoPP | 2022 | Hardening selective protection across multiple program inputs for HPC applications | DOI
HPCA | 2021 | CARE: Coordinated Augmentation for Elastic Resilience on DRAM Errors in Data Centers | DOI
DAC | 2023 | STRIVE: Enabling Choke Point Detection and Timing Error Resilience in a Low-Power Tensor Processing Unit | DOI
DAC | 2025 | EPIC: Error PredIction and Correction for Power-Efficient Voltage Underscaling Multiply-Accumulate Unit | DOI

Fault Simulation Acceleration

Conference | Year | Paper | Link
DAC | 2025 | EPIC: Error PredIction and Correction for Power-Efficient Voltage Underscaling Multiply-Accumulate Unit | DOI
DAC | 2025 | GraphFI: An Efficient Fault Injection Framework for Graph Processing on GPGPUs | DOI
DAC | 2023 | Fault Injection in Native Logic-in-Memory Computation on Neuromorphic Hardware | DOI
MICRO | 2021 | ExHero: Execution History-Aware Error-Rate Estimation in Pipelined Designs | DOI

Security-Oriented Fault Model Reduction

Conference | Year | Paper | Link
DAC | 2023 | ExploreFault: Identifying Exploitable Fault Models in Block Ciphers with Reinforcement Learning | DOI
DAC | 2023 | Stalker: A Framework to Analyze Fragility of Cryptographic Libraries under Hardware Fault Models | DOI
EuroSys | 2023 | Causal fault localisation in dataflow systems | DOI

9. Special Topic: Fault Injection Acceleration

Papers involving acceleration, selected from the 100 (~35 papers)

Hardware-Accelerated FI

Conference | Year | Paper | Link
DAC | 2025 | GraphFI: An Efficient Fault Injection Framework for Graph Processing on GPGPUs | DOI
DAC | 2025 | EPIC: Error PredIction and Correction for Power-Efficient Voltage Underscaling Multiply-Accumulate Unit | DOI
DAC | 2023 | Fault Injection in Native Logic-in-Memory Computation on Neuromorphic Hardware | DOI
DAC | 2021 | DeepStrike: Remotely-Guided Fault Injection Attacks on DNN Accelerator in Cloud-FPGA | DOI

Lightweight Methods

Conference | Year | Paper | Link
MICRO | 2021 | Turnpike: Lightweight Soft Error Resilience for In-Order Cores | DOI
MICRO | 2022 | Featherweight Soft Error Resilience for GPUs | DOI
ASPLOS | 2024 | Lightweight Fault Isolation: Practical, Efficient, and Secure Software Sandboxing | DOI
SC | 2024 | Versatile Datapath Soft Error Detection on the Cheap for HPC Applications | DOI
SC | 2023 | Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults | DOI
DAC | 2022 | SoftSNN: low-cost fault tolerance for spiking neural network accelerators under soft errors | DOI
DAC | 2022 | SEM-latch: a lost-cost and high-performance latch design for mitigating soft errors in nanoscale CMOS process | DOI
HPCA | 2021 | ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance | DOI

Statistical/ML Prediction Acceleration

Conference | Year | Paper | Link
SC | 2021 | G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs | DOI
SC | 2023 | Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts | DOI
MICRO | 2025 | DRAM Fault Classification through Large-Scale Field Monitoring for Robust Memory RAS Management | DOI
MICRO | 2025 | Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering | DOI
MICRO | 2024 | DelayAVF: Calculating Architectural Vulnerability Factors for Delay Faults | DOI
MICRO | 2021 | ExHero: Execution History-Aware Error-Rate Estimation in Pipelined Designs | DOI
DAC | 2024 | Graph Learning-based Fault Criticality Analysis for Enhancing Functional Safety of E/E Systems | DOI
DAC | 2025 | EPIC: Error PredIction and Correction for Power-Efficient Voltage Underscaling Multiply-Accumulate Unit | DOI
SC | 2021 | PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications | DOI
HPCA | 2021 | CARE: Coordinated Augmentation for Elastic Resilience on DRAM Errors in Data Centers | DOI

Parallel Error Detection

Conference | Year | Paper | Link
DAC | 2025 | MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors | DOI
DAC | 2025 | FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems | DOI
DAC | 2023 | SEPE-SQED: Symbolic Quick Error Detection | DOI

Fast Fault-Tolerant Recovery

Conference | Year | Paper | Link
PPoPP | 2025 | TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs | DOI
PPoPP | 2025 | ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training | DOI
HPDC | 2025 | FT2: First-Token-Inspired Online Fault Tolerance on Critical Layers for Generative Large Language Models | DOI
SC | 2021 | Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs | DOI
EuroSys | 2024 | Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures | DOI
ASPLOS | 2025 | PCcheck: Persistent Concurrent Checkpointing for ML | DOI

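10. Statistics

By Category

Category | In core 100 | Reduction-related | Acceleration-related
Hardware error analysis | 30 | 5
Fault tolerance techniques | 56 | 8 | 16
Software error analysis | 10 | 4 | 5
Fault injection attacks | 19 | 4 | 3
Tools/frameworks | 6 | 4 | 6
Storage/memory | 5 | 2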

The reduction and acceleration lists overlap with the main list, and some papers belong to more than one category.
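
Because of this overlap, adding up per-list counts overstates the number of distinct papers. A minimal sketch of the tally, assuming each paper is identified by a short name (an assumption made here for illustration) and using only a handful of entries from the lists above:

```python
from collections import Counter

# A few entries copied from the lists in this post (short names only).
# The mapping short name -> venue is an illustrative layout, not the
# data format used to build the post.
main_list = {
    "ExHero": "MICRO",
    "DelayAVF": "MICRO",
    "GraphFI": "DAC",
    "EPIC": "DAC",
    "PCcheck": "ASPLOS",
}
reduction = {"ExHero", "DelayAVF", "GraphFI", "EPIC"}
acceleration = {"ExHero", "DelayAVF", "GraphFI", "EPIC", "PCcheck"}

# Papers per venue, counted once each from the main list.
per_venue = Counter(main_list.values())

# Papers that appear in both topic lists.
in_both = reduction & acceleration

# Distinct papers across the two topic lists versus the naive sum.
distinct = reduction | acceleration
naive_total = len(reduction) + len(acceleration)

print(per_venue)
print(len(distinct), naive_total, sorted(in_both))
```

Counting by set union rather than by summing list lengths is what keeps a paper such as EPIC, which is listed under several headings, from being counted more than once.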

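By Conference

Conference | Papers | Focus
DAC | 40 | Highest concentration of fault injection attacks and fault-tolerant circuit design
SC | 18 | Fault tolerance for large-scale systems + SDC analysis
HPCA | 16 | DRAM faults + reliability (increased substantially after the second round of deeper searching)
ASPLOS | 13 | Microarchitectural fault tolerance + SDC detection
MICRO | 14 | Soft error analysis + lightweight resilience
ISCA | 12 | Vulnerability analysis + RowHammer (increased substantially after the second round of deeper searching)
EuroSys | 7 | System fault tolerance + empirical studies of faults
HPDC | 5 | Online fault tolerance + DRAM fault mitigation
PPoPP | 4 | Fault-tolerant parallel computing
FAST | 1 | SSD fault tolerance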