The Explainable AI Tradeoff

When companies explain how their AI makes decisions, they may be handing hackers an easy way in, one that goes completely undetected.

A New Standard Is Emerging

Across financial services, healthcare, and government, regulation increasingly requires AI systems to be explainable. The EU AI Act, U.S. banking regulators, and consumer protection agencies are all pushing organizations to adopt tools that reveal why an AI reached a particular decision.

For fraud detection specifically, this means banks and fintech companies are increasingly deploying tools like SHAP (SHapley Additive exPlanations), which assign each transaction feature a score reflecting its contribution to the model's prediction. These explanations are shared with auditors, regulators, and increasingly, with customers themselves.

The Emerging Risk

If interpretability becomes the public standard, and companies are required to expose their model explanations, attackers gain access to a precise map of how those models make decisions. This research examines what happens next.

This Is Already Happening

XAI is not a future concern. It is being deployed at scale today, across the financial infrastructure that processes trillions of dollars in transactions.

  • FICO Explainability Suite: Used by major banks to explain credit decisions. Surfaces SHAP-style feature attribution directly to loan officers and compliance teams.
  • Google Vertex AI: Built-in Explainable AI outputs Shapley-based feature attributions for supported deployed models. Available to any organization running models on Google Cloud.
  • Microsoft Azure: The Responsible AI dashboard exposes SHAP explanations as a standard feature of the ML platform, used across banking and insurance clients.
  • AWS SageMaker Clarify: Generates SHAP values for bias detection and model explainability, integrated into the world's largest cloud ML infrastructure.
  • Salesforce Einstein: Surfaces feature-level explanations to sales and fraud teams, making model reasoning visible to non-technical end users.
  • Stripe / Adyen: Payment processors increasingly require explanation trails for disputed fraud decisions, driving demand for customer-facing SHAP outputs.

The Regulatory Trajectory

Most of these deployments are currently internal, serving auditors and compliance teams. But the EU AI Act (2024) and fair lending regulations are pushing toward customer-facing explanations: "your transaction was flagged because field X and field Y were unusual." That is precisely the exposure this research warns about.

The Research Question

This project investigates one specific consequence of public interpretability: does access to SHAP explanations allow adversarial attackers to evade fraud detection more stealthily, with fewer feature changes, smaller perturbations, and less chance of triggering detection systems?

If companies make interpretable AI the public standard and expose explanation data, do hackers gain the ability to conduct surgical, near-invisible attacks on fraud detectors?

The hypothesis: an attacker with access to SHAP values knows exactly which features to manipulate. Rather than scattering changes across many transaction fields, they can subtly modify only the features that matter, making the attack far harder to detect for the monitoring systems that sit behind the ML model.

The Experiment

Train a robust fraud detector

XGBoost classifier on 100,000 synthetic transactions. 30 features, 10 truly informative, 20 noise. Regularized to prevent fragile decision boundaries.

Generate SHAP explanations

Compute SHAP values for every transaction. Each value reveals how much a feature contributes to the "fraud" prediction, and in which direction.
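To make "contribution and direction" concrete, here is a brute-force Shapley computation on a three-feature toy scorer. This is only an illustration of the underlying definition: the SHAP library uses much faster algorithms for tree models, and the scorer `toy_fraud_logit`, its feature names, and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(score, x, baseline):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                # Features outside the coalition are set to the baseline value.
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (score(with_i) - score(without_i))
    return phi


def toy_fraud_logit(v):
    """Hypothetical scorer: higher output means 'more fraud-like'."""
    amount, account_age, velocity = v
    return 0.8 * amount - 0.5 * account_age + 0.3 * amount * velocity


x = [1.0, 0.4, 0.5]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_fraud_logit, x, baseline)
for name, value in zip(["amount", "account_age", "velocity"], phi):
    direction = "toward fraud" if value > 0 else "away from fraud"
    print(f"{name}: {value:+.3f} ({direction})")
```

The attributions sum to the difference between the score for `x` and the score for the baseline, which is what lets a reader (or an attacker) see which features carry the decision and in which direction.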

Run two attacks with identical resources

Both attackers get the same budget: 50 iterations, step size 0.05, same dataset. The only difference: Attacker A has full SHAP access. Attacker B is completely blind.
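The sketch below shows one way two attackers can share an identical budget while differing only in which features they are allowed to touch. It is not the notebooks' attack code. A toy logistic scorer stands in for the XGBoost detector, the attribution formula is the one that holds for a linear scorer, and the acceptance rule (keep a step unless it raises the score) is an assumption.

```python
import math
import random

N_FEATURES, N_ITER, STEP = 30, 50, 0.05  # budget stated in the experiment
BASELINE = 0.5

# Assumption: toy detector with 10 informative weights and 20 zero (noise) weights.
_setup = random.Random(0)
WEIGHTS = [_setup.gauss(0, 3) for _ in range(10)] + [0.0] * 20


def fraud_score(x):
    """Toy stand-in for the model's fraud probability."""
    logit = sum(w * (v - BASELINE) for w, v in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-logit))


def run_attack(x, candidates, rng, threshold=0.5):
    """Random-step evasion restricted to the given candidate features."""
    x_adv = list(x)
    score, queries = fraud_score(x_adv), 1
    for _ in range(N_ITER):
        if score < threshold:
            break  # the transaction is no longer flagged
        j = rng.choice(candidates)
        trial = list(x_adv)
        # Take one step in a random direction, clipped to [0, 1].
        trial[j] = min(1.0, max(0.0, trial[j] + rng.choice((-STEP, STEP))))
        trial_score = fraud_score(trial)
        queries += 1
        if trial_score <= score:  # keep the step unless it raises the score
            x_adv, score = trial, trial_score
    changed = sum(abs(a - b) > 1e-12 for a, b in zip(x_adv, x))
    return {"x_adv": x_adv, "evaded": score < threshold,
            "features_changed": changed, "queries": queries}


# A flagged starting transaction: informative features pushed toward "fraud".
x0 = [0.6 if w > 0 else 0.4 for w in WEIGHTS[:10]] + [_setup.random() for _ in range(20)]

# For a linear scorer, the attribution of feature j is w_j * (x_j - baseline).
attribution = [w * (v - BASELINE) for w, v in zip(WEIGHTS, x0)]
top5 = sorted(range(N_FEATURES), key=lambda j: -abs(attribution[j]))[:5]

guided = run_attack(x0, top5, random.Random(1))                    # Attacker A
blind = run_attack(x0, list(range(N_FEATURES)), random.Random(1))  # Attacker B
print("guided:", guided["features_changed"], "features,", guided["queries"], "queries")
print("blind: ", blind["features_changed"], "features,", blind["queries"], "queries")
```

By construction the guided attacker can never touch more than five features, while the blind attacker spreads its steps across all 30. The toy is not expected to reproduce the reported numbers.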

Measure stealth, not just success

How many features did each attack need to modify? How large were the perturbations? Which attack would be harder for secondary detection systems to catch?
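These questions translate into a few simple per-transaction measurements. The helper below is a sketch: the write-up does not say which norm the reported "perturbation size" uses, so the function returns several common choices, and the name `stealth_metrics` and the tolerance are assumptions.

```python
import math


def stealth_metrics(original, adversarial, tol=1e-9):
    """Compare an original transaction with its adversarial version."""
    deltas = [abs(a - o) for o, a in zip(original, adversarial)]
    return {
        # Number of fields that moved by more than the tolerance.
        "features_modified": sum(d > tol for d in deltas),
        # Average absolute change over all fields, including untouched ones.
        "mean_abs_change": sum(deltas) / len(deltas),
        # Largest change applied to any single field.
        "max_abs_change": max(deltas),
        # Euclidean size of the whole perturbation.
        "l2_norm": math.sqrt(sum(d * d for d in deltas)),
    }


original = [0.20, 0.50, 0.90, 0.10]
adversarial = [0.20, 0.45, 0.90, 0.20]
print(stealth_metrics(original, adversarial))
```

Averaging these values over every attacked transaction gives figures comparable in kind to the ones reported in the results table.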

The Finding: Stealth, Not Speed

With equal resources and comparable success rates, the SHAP-guided attack modified about half as many features as the blind attack (4.78 versus 9.63 on average). Its perturbations were also slightly smaller (0.0189 versus 0.0198), so the fraudulent transaction looks more like a legitimate one.

50% fewer features modified by SHAP-guided attacks
4.78 average features changed with SHAP access

Metric | With XAI (SHAP) | Without XAI (Blind)
Features Modified | 4.78 | 9.63
Perturbation Size | 0.0189 | 0.0198
Model Queries | 19,821 | 20,871

The Core Finding

SHAP access does not dramatically increase attack success rates, but it makes attacks significantly stealthier. The attacker modifies half as many features, producing transactions that look far more plausible to secondary anomaly detection systems.
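The relative gaps can be checked directly from the reported figures. The short calculation below uses only the three rows of the results table; the helper name `relative_reduction` is illustrative.

```python
# (with SHAP, blind) pairs taken from the results table.
results = {
    "features_modified": (4.78, 9.63),
    "perturbation_size": (0.0189, 0.0198),
    "model_queries": (19821, 20871),
}


def relative_reduction(with_shap, blind):
    """Fraction by which the SHAP-guided value is lower than the blind value."""
    return (blind - with_shap) / blind


reductions = {name: relative_reduction(*pair) for name, pair in results.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower with SHAP access")
```

The feature count drops by roughly half, while perturbation size and query count each differ by only about 5%, which is why the finding is framed as stealth and not speed.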

Why Stealth Is the Real Threat

Real-world fraud detection is not a single model. It is a layered system: the machine learning classifier sits alongside rule-based monitors, behavioral analytics, and anomaly detectors. Many of these secondary systems work by flagging transactions that change too many features simultaneously, or that deviate too far from typical behavior.

SHAP-Guided Attack

5 of 30 features touched, surgical and hard to flag

Blind Random Attack

10 of 30 features touched, scattered and easier to detect

A SHAP-guided attacker crafts a fraudulent transaction that looks, to secondary monitors, like a slightly unusual but plausible purchase. A blind attacker produces a transaction with erratic changes across many unrelated fields, far more likely to raise flags before the transaction is even processed.
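A minimal sketch of the kind of secondary monitor described above, assuming it compares each transaction against a per-customer baseline and flags it when too many fields look unusual at once. The function name, the tolerance, and the threshold of eight fields are hypothetical values, not parameters from the experiment.

```python
def breadth_monitor(baseline, transaction, per_feature_tol=0.03, max_unusual=8):
    """Flag a transaction when too many fields deviate from the baseline."""
    unusual = [
        i for i, (b, t) in enumerate(zip(baseline, transaction))
        if abs(t - b) > per_feature_tol
    ]
    return {"unusual_features": unusual, "flagged": len(unusual) > max_unusual}


baseline = [0.5] * 30

# Guided-style change: 5 fields nudged by 0.05.
surgical = list(baseline)
for i in range(5):
    surgical[i] += 0.05

# Blind-style change: 10 fields nudged by the same amount.
scattered = list(baseline)
for i in range(10):
    scattered[i] += 0.05

print("surgical flagged: ", breadth_monitor(baseline, surgical)["flagged"])
print("scattered flagged:", breadth_monitor(baseline, scattered)["flagged"])
```

Both transactions use the same per-field change. Only the number of touched fields differs, and that alone decides whether this style of monitor fires.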

The Implication

If interpretability becomes the public standard and SHAP values become widely accessible, the fraud detection ecosystem does not just face smarter attacks. It faces attacks specifically engineered to be invisible to the monitoring infrastructure built around ML models.

The Broader Problem

This finding is not specific to fraud detection. Any domain where explainable AI is deployed publicly (loan approvals, insurance underwriting, content moderation) faces the same structural vulnerability. Publishing explanations is publishing a partial attack manual.

The tension is real and not easily resolved:

  • Regulators require explainability for fairness and accountability
  • Customers deserve to understand decisions that affect them
  • Auditors need transparency to catch bias and errors
  • But each of these legitimate uses also creates an attack surface

The answer is not to abandon transparency. It is to build transparency systems that are privacy-preserving by design, so that explanations serve their legitimate purpose without revealing the model's full decision logic.

Mitigation Strategies

  • Differential privacy on explanations: Add calibrated noise to SHAP values before exposing them. Preserves the general narrative without revealing exact feature weights that attackers can exploit.
  • Rate-limit explanation APIs: Treat explanation endpoints as sensitive data. Monitor for systematic probing patterns that suggest an attacker is mapping the model's decision logic.
  • Retune anomaly detection: Existing monitors look for many large changes. Add rules that also flag transactions with suspiciously few but precisely targeted feature changes.
  • Adversarial training: Augment training data with SHAP-guided adversarial examples so the model learns to resist precisely targeted perturbations.
  • Selective transparency: Provide full SHAP values only to vetted internal auditors. Serve truncated or aggregated explanations to external-facing systems.
  • Ensemble diversity: Use models with diverse architectures so that SHAP values from one model do not expose the full ensemble's decision boundary.
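As an illustration of the first and fifth strategies combined, the sketch below clips each SHAP value, adds Laplace noise, and returns only the top few features with a direction and no magnitudes. It follows the spirit of the Laplace mechanism but is not a verified privacy guarantee: the function name, the sensitivity bound of `2 * clip`, the default `epsilon`, and the feature names are all assumptions.

```python
import random


def public_explanation(shap_values, feature_names, rng, epsilon=5.0, clip=1.0, top_k=3):
    """Return a noisy, truncated explanation suitable for external display."""
    # Assumption: after clipping, one value can change by at most 2 * clip.
    scale = 2.0 * clip / epsilon
    noisy = {}
    for name, value in zip(feature_names, shap_values):
        clipped = max(-clip, min(clip, value))
        # The difference of two exponential draws is Laplace noise with this scale.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[name] = clipped + noise
    # Expose only the top-k features, and only their direction.
    ranked = sorted(noisy, key=lambda name: abs(noisy[name]), reverse=True)[:top_k]
    return [(name, "raised risk" if noisy[name] > 0 else "lowered risk") for name in ranked]


# Hypothetical feature names and SHAP values for one flagged transaction.
feature_names = ["amount", "merchant_category", "hour_of_day", "device_age", "distance_from_home"]
shap_values = [0.9, -0.1, 0.4, 0.05, -0.6]
explanation = public_explanation(shap_values, feature_names, random.Random(0))
print(explanation)
```

A smaller `epsilon` adds more noise and hides more of the model, at the cost of explanations that may name the wrong features. Vetted internal auditors would still receive the exact values.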

Conclusion

Explainable AI does not just help attackers understand a model. It helps them craft attacks that stay invisible to the systems built around it. The stealth advantage conferred by SHAP access is a significant structural vulnerability in any explainable AI deployment.

As interpretability standards evolve from best practice to regulatory requirement, we must develop structured standards for how explanations are generated, stored, and shared. Transparency and robustness can work together to create stronger AI.

About This Research

Dataset: Synthetic, generated using scikit-learn make_classification. 100,000 transactions, 30 features (10 informative, 20 noise), 0.5% fraud rate, 1% label noise. Chosen to simulate realistic class overlap that credit card datasets lack.
Model: XGBoost with L1/L2 regularization, max_depth=4, MinMaxScaler normalization
XAI Method: SHAP (SHapley Additive exPlanations), game-theoretic feature attribution
Attack Budget: Identical for both methods: 50 iterations, step_size=0.05, features clipped to [0,1]
Key Differentiator: SHAP attack targets top-5 features by importance. Random attack selects blindly from all 30.
Reproducibility: Full implementation across 6 Jupyter notebooks, open and reproducible

Explore the Code

The full experiment is implemented across six Jupyter notebooks. Open any notebook directly in Google Colab.