ARKA ML model card

Model Card — ARKA Guideline-Concordance EBM

Model card format follows Model Cards for Model Reporting (Mitchell et al., 2019). This document accompanies the ARKA CDS Hooks ML service (ml-service/) and the Feature Rationale Catalogue (lib/cds-platform/ml/feature-catalog.ts).

Model details

Field	Value
Model name	ARKA Guideline-Concordance Calibrator
Version	2.0.0-ebm
Type	InterpretML Explainable Boosting Machine (ExplainableBoostingClassifier)
Output	Calibrated guideline-concordance probability ∈ [0, 1] only
Not an output	Appropriateness score (1–9) — that comes from the transparent AIIE 2.0 core (`lib/aiie-v2`)
Feature vector	`aiie_raw_posterior` (transparent-core posterior) + 23 structured case features
Explainability	Exact per-feature shape functions exported to `model/ebm_shape_functions.json` (Criterion 4)
Maintainer	ARKA Health — arkahealth.com
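
Because the model is additive, a single prediction can in principle be reproduced by hand from the exported shape functions. The sketch below is a minimal illustration, not the service's loader. The layout (`edges`, `scores`), the feature name `prior_imaging_90d`, the intercept, and every numeric value are hypothetical, and it assumes the usual EBM form in which per-feature terms sum on the log-odds scale.

```python
import math
from bisect import bisect_right

# Hypothetical, simplified layout for exported shape functions:
# `edges` are interior bin cut points; `scores` holds len(edges) + 1
# additive log-odds contributions, one per bin.
SHAPES = {
    "aiie_raw_posterior": {"edges": [0.3, 0.7], "scores": [-1.2, 0.1, 1.4]},
    "prior_imaging_90d": {"edges": [0.5], "scores": [0.2, -0.6]},
}
INTERCEPT = 0.25  # illustrative value only


def feature_contributions(case: dict) -> dict:
    """Look up each feature's additive log-odds term from its shape function."""
    contributions = {}
    for name, shape in SHAPES.items():
        bin_index = bisect_right(shape["edges"], case[name])
        contributions[name] = shape["scores"][bin_index]
    return contributions


def concordance_probability(case: dict) -> float:
    """Sigmoid of the intercept plus the sum of main-effect terms."""
    logit = INTERCEPT + sum(feature_contributions(case).values())
    return 1.0 / (1.0 + math.exp(-logit))


case = {"aiie_raw_posterior": 0.82, "prior_imaging_90d": 1.0}
print(feature_contributions(case))
print(round(concordance_probability(case), 3))
```

The per-feature dictionary is what a reviewer would inspect: each entry is the exact amount that feature moved the log-odds for this case.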

Retired: model/trained_model.json (XGBoost appropriateness regressor) is deprecated. It was a self-referential oracle trained to reproduce hand rules. It is not loaded for scoring. The rule-based fallback is the faithful reference when the EBM is unavailable.

Intended use

Primary use: Map the AIIE 2.0 raw posterior plus structured case features to a calibrated concordance (or, later, denial) probability for operational thresholds and glass-box review.
Users: Licensed clinicians and clinical staff; the probability is advisory context alongside the transparent-core appropriateness rating and guideline citations.
Integration: Consumed by ARKA CDS Hooks services. The appropriateness number shown to users is produced by lib/aiie-v2 (or the documented rule-based fallback), not by this EBM.

Out-of-scope use (do not use for)

Appropriateness oracle — do not treat EBM output as a 1–9 appropriateness score.
Clinical outcome prediction, or any presentation of synthetic self-consistency metrics as validated clinical performance.
Time-critical alerts (stroke code, trauma activation, sepsis) without human review.
Image interpretation or CAD on pixel data (FDA Criterion 1: no signal processing).
Autonomous ordering without clinician review.

Training data

Attribute	Description
Source (current)	Guideline-concordance labels derived from signed-off / ACR-aligned stewardship heuristics (bootstrap). Synthetic feature vectors smoke-test plumbing only.
Source (planned / AIIE 3.0)	Federated clinician overrides and downstream outcomes (PHI-free hashes) under change-control — see `docs/AIIE_CHANGE_CONTROL.md`. Until then, pretest/LR/threshold calibration remains concordance-anchored.
Sample size	~5,000 bootstrap examples per training run (80% / 20% split)
Labels	Binary guideline concordance (not appropriateness score; not adjudicated outcomes)
PHI	None in bootstrap; future outcome data uses hashed identifiers only
Real-world data	Not yet used for weights; will refit only this calibration layer when available (change control)

Explicit non-claim: Synthetic / self-consistency / concordance figures are never evidence of clinical validity. Do not cite bootstrap concordance metrics as clinical-performance evidence. Concordance ≠ outcomes until real outcomes exist.

Evaluation data and metrics

Dataset	Role
Bootstrap held-out set	Smoke-test of EBM plumbing (`model/evaluation/metrics.json`) — guideline concordance only
Guideline-concordance calibration cohort	Built by `model/calibrate.py` from the human-signed-off knowledge matrix + signed-off ingest seed + CDR literature anchors (`model/evaluation/calibration.json`)

EBM metrics (bootstrap held-out — NOT clinical validity):

Metric	Description
AUC / Brier / log loss	Discrimination and calibration of P(concordance)
Accuracy @ 0.5	Thresholded concordance label agreement (plumbing check)

Every metrics artifact includes a caveat field stating these figures are not clinical validity.
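
For readers who want to see what the four plumbing metrics compute, here is a dependency-free sketch. The labels and probabilities are made up for illustration, and the dictionary keys other than `caveat` are assumptions rather than the actual schema of `metrics.json`.

```python
import math


def auc(y_true, p):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, p):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(p)


def log_loss(y_true, p, eps=1e-12):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(p)


def accuracy_at(y_true, p, threshold=0.5):
    """Fraction of cases where the thresholded probability matches the label."""
    return sum((pi >= threshold) == bool(yi) for yi, pi in zip(y_true, p)) / len(p)


# Illustrative data only; not drawn from any ARKA artifact.
y_true = [1, 0, 1, 1, 0, 0]
p_hat = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
metrics = {
    "auc": auc(y_true, p_hat),
    "brier": brier(y_true, p_hat),
    "log_loss": log_loss(y_true, p_hat),
    "accuracy_at_0_5": accuracy_at(y_true, p_hat),
    "caveat": "Bootstrap guideline concordance only; not clinical validity.",
}
print(metrics)
```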

AIIE 2.0 conformal calibration (guideline concordance — not outcomes)

Run: ml-service/.venv/bin/python ml-service/model/calibrate.py

Caveat: These figures measure guideline concordance (agreement with the human-signed-off matrix / signed-off ingest / CDR literature references) and interval coverage. They are not clinical outcome validity, denial-prediction accuracy, or patient-outcome claims.

Metric	Value (held-out, representative run)	Notes
Concordance (|ŷ − ref| ≤ 1.5)	0.875	Guideline concordance rate
90% credible-interval coverage	0.865	Bayesian posterior CI from `scoreOrderV2` (in target [0.86, 0.94])
Split-conformal set coverage (ref ∈ [ŷ ± q])	0.914	Distribution-free coverage at nominal level 1 − α
ECE	0.136	Expected calibration error on confidence vs concordance
Calibration slope	0.880	OLS slope of reference ~ predicted (ideal = 1.0)
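
The concordance and split-conformal rows can be illustrated with a short sketch. It assumes the standard split-conformal recipe (absolute residuals on a calibration split, then the ⌈(n+1)(1−α)⌉-th smallest residual as the half-width q); whether `calibrate.py` uses exactly this score is not confirmed here, and all numbers are invented.

```python
import math


def conformal_quantile(residuals, alpha=0.1):
    """Half-width q: the ceil((n + 1)(1 - alpha))-th smallest calibration residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]


def coverage(y_hat, ref, half_width):
    """Fraction of cases whose reference lies within y_hat +/- half_width."""
    return sum(abs(r - y) <= half_width for y, r in zip(y_hat, ref)) / len(ref)


def concordance(y_hat, ref, tolerance=1.5):
    """Guideline concordance rate: |y_hat - ref| <= tolerance."""
    return coverage(y_hat, ref, tolerance)


# Illustrative calibration residuals |ref - y_hat| (n = 19).
cal_residuals = [0.0, 0.5, 1.0, 0.5, 2.0, 1.0, 0.0, 1.5, 0.5, 1.0,
                 3.0, 0.5, 1.0, 0.0, 2.5, 1.0, 0.5, 1.5, 1.0]
q = conformal_quantile(cal_residuals, alpha=0.1)

# Illustrative held-out predictions and references on the 1-9 scale.
test_pred = [7, 3, 5, 8, 2]
test_ref = [8, 3, 8, 7, 5]
print(q, coverage(test_pred, test_ref, q), concordance(test_pred, test_ref))
```

With only five held-out cases the empirical coverage is noisy; the guarantee is marginal and holds in expectation under exchangeability, not for any particular small sample.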

Artifacts: model/evaluation/calibration.json, model/evaluation/conformal_table.json, model/evaluation/reliability_diagram.png. Runtime table: lib/aiie-v2/data/conformal_table.json.
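
The ECE and calibration-slope rows in the table above can be sketched as follows. This assumes equal-width confidence bins for ECE and a plain least-squares slope; the binning scheme actually used by `calibrate.py` may differ, and the inputs are invented.

```python
def expected_calibration_error(confidence, concordant, n_bins=10):
    """Bin-size-weighted mean of |observed concordance - mean confidence| per bin."""
    n = len(confidence)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidence, concordant):
        index = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[index].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        observed = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(observed - mean_conf)
    return ece


def ols_slope(predicted, reference):
    """Least-squares slope of reference regressed on predicted (ideal = 1.0)."""
    mean_x = sum(predicted) / len(predicted)
    mean_y = sum(reference) / len(reference)
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, reference))
    return sxy / sxx


# Illustrative values only.
ece = expected_calibration_error([0.95, 0.85, 0.75, 0.65], [1, 1, 0, 1])
slope = ols_slope([2, 4, 6, 8], [3, 4, 7, 8])
print(round(ece, 3), round(slope, 3))
```

A slope below 1.0, as in the reported 0.880, means references vary less than the predictions do, so extreme predictions are somewhat overstated.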

AIIE 3.0 conformal risk control (important-miss rate — not outcomes)

Run: python ml-service/model/crc_calibrate.py [--cohort path] [--alpha 0.1]

Loss definition (explicit — no hidden objective):

Symbol	Meaning
`s(x)`	Retention-risk score ∈ [0, 1] (`retentionRiskScore`: CI width + band-boundary proximity)
`M`	Important-miss label (reviewer flag, or concordance proxy: band disagree AND |ŷ − ref| > 2)
`λ`	Abstain threshold: retain if `s ≤ λ`, defer if `s > λ`
`L_i(λ)`	`M_i · 1{s_i ≤ λ}` — loss only when a retained case is an important miss (bounded by `B=1`, monotone in `λ`)

CRC chooses λ̂ = max{λ : (n/(n+1)) R̂_n(λ) + B/(n+1) ≤ α} so that under exchangeability E[L_{n+1}(λ̂)] ≤ α (Angelopoulos, Bates, Fisch, Lei, Schuster, ICLR 2024).

Caveat: Bootstrap / concordance-proxy figures are not clinical outcome validity. Final abstention is v2.abstain OR crc.abstain (never less cautious than AIIE 2.0 interval-coverage abstain).
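
The selection rule and the OR-combined abstention above can be sketched in a few lines. This is a simplified illustration, not the contents of `crc_calibrate.py`: it searches only the observed scores as candidate thresholds (a conservative simplification), returns `None` when no threshold satisfies the bound, and uses synthetic scores and labels.

```python
def empirical_risk(scores, misses, lam):
    """R_hat_n(lambda): mean of M_i * 1{s_i <= lambda} over the calibration set."""
    return sum(m for s, m in zip(scores, misses) if s <= lam) / len(scores)


def crc_threshold(scores, misses, alpha=0.1, bound=1.0):
    """Largest observed score lambda with n/(n+1) * R_hat(lambda) + B/(n+1) <= alpha."""
    n = len(scores)
    best = None
    for lam in sorted(set(scores)):  # the risk only changes at observed scores
        adjusted = (n / (n + 1)) * empirical_risk(scores, misses, lam) + bound / (n + 1)
        if adjusted <= alpha:
            best = lam
        else:
            break  # loss is monotone in lambda, so larger values cannot pass
    return best


def final_abstain(v2_abstain, score, lam_hat):
    """Abstain if either layer abstains; defer everything when no threshold was found."""
    crc_abstain = lam_hat is None or score > lam_hat
    return v2_abstain or crc_abstain


# Synthetic calibration set: 24 evenly spaced scores, two important misses.
scores = [i / 25 for i in range(1, 25)]
misses = [1 if i in (15, 22) else 0 for i in range(1, 25)]
lam_hat = crc_threshold(scores, misses, alpha=0.1)
print(lam_hat, final_abstain(False, 0.9, lam_hat), final_abstain(False, 0.5, lam_hat))
```

Here retaining one miss gives an adjusted risk of 2/25 = 0.08, which passes, while retaining both gives 3/25 = 0.12, which does not, so the threshold stops just below the second miss.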

Artifact	Path
Runtime table	`lib/aiie-v3/risk_control_table.json`
Evaluation copy	`ml-service/model/evaluation/risk_control_table.json`
TS API	`lib/aiie-v3/risk-control.ts`

AIIE 3.0 decision-impact validation (P12 — not outcomes)

Run: python ml-service/model/validate_v3.py [--cohort path]

Training / calibration data: guideline-concordance labels from the human-signed-off knowledge matrix + signed-off ingest + CDR literature anchors; later federated clinician overrides and real-world outcomes under change-control. Until outcomes exist, every number below is concordance / decision-impact plumbing — not clinical validity.

What is measured (honest labels — §9 plan):

#	Metric	Meaning
1	Decision-impact	Fraction of holdout cases where A3 lands in a different 1–9 band than A2, with direction (up/down) and reason buckets (e.g. `low_p_mgmt_plus_radiation_harm`, `high_pretest_decisive_posttest`)
2	Net benefit	Vickers decision-curve net benefit of the A3 imaging recommendation vs image-all / image-none defaults (concordance-proxy positives: reference ≥ 7)
3	Risk-control coverage	Empirical CRC important-miss risk under fitted `λ` — must be ≤ α (+ finite-sample tolerance)
4	Interval coverage & calibration	Carried from AIIE 2.0 `calibration.json` (90% CI / conformal set, ECE, reliability diagram, calibration slope)
5	Pretest / post-test calibration	ECE of glass-box `p0` / `p1` vs concordance-proxy labels, overall and per target condition
6	Selective-prediction	Risk–coverage curve: deferred cases are higher retention-risk; retained reasoned-impact / concordance reported as coverage drops
7	Subgroup equity	Concordance + net benefit + decision-impact by age band, sex, modality, region (multi-calibration closes gaps >5 pts at runtime — P11)
8	Ablations	v2-only → +pretest → +post-test → +decision → +harm → full 3.0; `v2_only` impact = 0; graceful-degrade `A3 == A2` when uninformative
9	Prospective / outcomes	Deferred until federated overrides/outcomes exist — management-change concordance, denial-overturn lift, AUC vs approve/deny are not claimed
—	Degrade-to-v2 rate	Fraction of cases where the decision layer is uninformative and A3 mirrors A2
—	Latency	Core = table arithmetic (µs); only network cost is cached/bounded P10 external context; CDS p95 budget 800 ms
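
As an illustration of metric 2, the decision-curve net benefit at threshold probability p_t is TP/n − (FP/n) · p_t/(1 − p_t). The sketch below follows the table in treating reference ≥ 7 as the concordance-proxy positive; treating an A3 rating ≥ 7 as "recommend imaging" is an assumption made for this example, and the ratings are invented.

```python
def net_benefit(recommend, positive, threshold_prob):
    """Decision-curve net benefit: TP/n - FP/n * p_t / (1 - p_t)."""
    n = len(positive)
    true_pos = sum(1 for r, p in zip(recommend, positive) if r and p)
    false_pos = sum(1 for r, p in zip(recommend, positive) if r and not p)
    weight = threshold_prob / (1 - threshold_prob)
    return true_pos / n - weight * false_pos / n


# Illustrative 1-9 ratings only.
reference = [8, 7, 3, 2, 9, 5, 1, 7]
a3_rating = [8, 6, 2, 3, 9, 7, 1, 8]

positive = [r >= 7 for r in reference]    # concordance-proxy positives
recommend = [a >= 7 for a in a3_rating]   # assumed imaging-recommendation rule
image_all = [True] * len(reference)
image_none = [False] * len(reference)

for p_t in (0.2, 0.5):
    print(
        p_t,
        net_benefit(recommend, positive, p_t),
        net_benefit(image_all, positive, p_t),
        net_benefit(image_none, positive, p_t),
    )
```

In this toy data the image-all default wins at p_t = 0.2 and the A3 rule wins at p_t = 0.5, which is why net benefit is read as a curve over thresholds rather than at a single point.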

Artifact: model/evaluation/v3_validation.json. Promote gates (POST /api/ins/aiie/promote with engine: "aiie-3" / v3Validation) reject challengers that regress risk coverage, net benefit, subgroup concordance (>5 pts), degrade-to-v2 rate, or shadow divergence (409 + failingCheck). Health: GET /api/ins/aiie/health returns v3.championVersion, v3.decisionImpact, v3.riskCoverage, and v3.degradeRate in one call.
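
The gate logic described above might look roughly like the following. This is a hypothetical sketch, not the promote endpoint's code: the metric keys, the check names, the direction of each comparison, and the 200 success status are all assumptions; only the 409 status, the idea of a failing-check name, and the 5-point subgroup tolerance come from the text.

```python
# (check name, metric key, True if larger values are better); all names hypothetical.
GATES = [
    ("risk_coverage", "important_miss_risk", False),
    ("net_benefit", "net_benefit", True),
    ("degrade_rate", "degrade_to_v2_rate", False),
    ("shadow_divergence", "shadow_divergence", False),
]
SUBGROUP_TOLERANCE = 0.05  # 5 absolute percentage points


def evaluate_promotion(champion, challenger):
    """Return (status, failing_check); 409 names the first gate the challenger fails."""
    for check, key, higher_is_better in GATES:
        delta = challenger[key] - champion[key]
        regressed = delta < 0 if higher_is_better else delta > 0
        if regressed:
            return 409, check
    for group, baseline in champion["subgroup_concordance"].items():
        drop = baseline - challenger["subgroup_concordance"][group]
        if drop > SUBGROUP_TOLERANCE:
            return 409, f"subgroup_concordance:{group}"
    return 200, None


# Illustrative metrics only.
champion = {
    "important_miss_risk": 0.08,
    "net_benefit": 0.30,
    "degrade_to_v2_rate": 0.40,
    "shadow_divergence": 0.02,
    "subgroup_concordance": {"age:65+": 0.88, "sex:female": 0.87},
}
better = dict(champion, net_benefit=0.32)
worse = dict(champion, net_benefit=0.25)
skewed = dict(champion, subgroup_concordance={"age:65+": 0.80, "sex:female": 0.87})
print(evaluate_promotion(champion, better))
print(evaluate_promotion(champion, worse))
print(evaluate_promotion(champion, skewed))
```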

Pending real-world outcomes: management-change concordance, denial-overturn lift, and AUC vs approve/deny are not claimed until federated outcomes accrue. Concordance ≠ clinical validity.

Fairness and subgroup analysis

Subgroup performance is reported in calibration.json for the guideline-concordance holdout:

Subgroup	Stratification
Age	Pediatric (<18), adult (18–64), older adult (65+)
Sex	Male / female
Modality	CT, MRI, radiograph, ultrasound, …
Region	Knowledge-matrix body region

Target: Flag any subgroup >5 absolute percentage points off overall concordance. Flagged rows are listed under flagged_subgroups in calibration.json (investigate before treating concordance gaps as outcome disparities).
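
A minimal sketch of the flagging rule follows. The overall rate reuses the 0.875 concordance reported earlier; the subgroup names, subgroup rates, and the per-row fields are invented for illustration and are not the actual layout of `calibration.json`.

```python
def flag_subgroups(overall, by_subgroup, tolerance=0.05):
    """List subgroups whose concordance differs from overall by more than `tolerance`."""
    flagged = []
    for name, rate in by_subgroup.items():
        gap = rate - overall
        if abs(gap) > tolerance:
            flagged.append({"subgroup": name, "concordance": rate, "gap": round(gap, 3)})
    return flagged


# Illustrative subgroup rates only.
report = {
    "overall_concordance": 0.875,
    "flagged_subgroups": flag_subgroups(
        0.875,
        {"age:pediatric": 0.80, "age:adult": 0.89, "modality:MRI": 0.93},
    ),
}
print(report)
```

Gaps are flagged in both directions, since a subgroup scoring well above overall concordance can also indicate a labeling or sampling artifact worth investigating.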

Known limitations

Bootstrap labels are guideline-concordance, not adjudicated real-world outcomes.
23-feature vector plus posterior may omit site-specific pathways or free-text nuance.
English-language, US guideline framing; not validated for non-US practice.
Fallback mode: If ebm_model.joblib is absent, the service uses the rule-based scorer as the faithful reference (more faithful than the retired XGBoost oracle) with lower stated confidence.
Shape functions are main-effect additive terms; interaction terms are disabled by default for independent reviewability.
Clinical sign-off for catalogue rationales is tracked in docs/CLINICAL_SIGN_OFF_LOG.md.

Regulatory posture

ARKA Imaging Intelligence Engine is positioned as an FDA Non-Device Clinical Decision Support tool under FD&C Act §520(o)(1)(E) (21st Century Cures Act). This model:

Does not process medical images.
Surfaces exact shape-function contributions and peer-reviewed / guideline-linked rationales (Criterion 4).
Requires clinician responsibility for the final order decision.

Ethics and safety

No autonomous patient-facing decisions.
Predictions are advisory; overrides must remain available in the EHR workflow.
Report safety or appropriateness concerns through the contact below before relying on the model in production.

Contact and issue reporting

Issues / safety reports: https://arkahealth.com (contact form / support channel)
Repository path: ml-service/MODEL_CARD.md
Change control (PCCP-style): docs/AIIE_CHANGE_CONTROL.md — what may auto-recalibrate vs clinician sign-off, shadow-before-promote, rollback
Catalogue source of truth: lib/cds-platform/ml/feature-catalog.ts → npm run export:feature-catalog
Train: python ml-service/model/train_ebm.py
Recalibrate (v2 conformal / EBM): python ml-service/model/calibrate.py [--feedback-cohort path] [--refit-ebm]
Recalibrate (v3 CRC abstain): python ml-service/model/crc_calibrate.py [--cohort path] [--alpha 0.1]
Validate (v3 decision-impact / NB / risk-coverage): python ml-service/model/validate_v3.py [--cohort path]

*Last updated: 2026-07-20. Regenerate EBM artifacts after any catalogue or feature-engineering change. Re-run model/calibrate.py after matrix or CDR pack changes. Re-run model/crc_calibrate.py when important-miss labels accrue. Re-run model/validate_v3.py before any AIIE 3.0 promote. Feedback-driven refits follow docs/AIIE_CHANGE_CONTROL.md.*

This recommendation is intended to support, not replace, clinical judgment. It is generated by ARKA, software designed to meet the four criteria for Non-Device Clinical Decision Support under FD&C Act §520(o)(1)(E) and FDA's final guidance on Clinical Decision Support Software (January 2026). The clinician is responsible for the final decision.

