SautiCare CDST: Deployment-Ready Evidence

Pilot Data and Preliminary Results from Emory Hospital, Kahawa Sukari

Reporting Period: October 2025 -- March 2026 (22 weeks)
Facility: Emory Hospital, Kahawa Sukari, Kiambu County, Nairobi, Kenya (Level 4)
Catchment Population: ~150,000 peri-urban residents
Prepared by: Decarl iWorldAfric Limited (Technology Partner) and Institute of Design Innovation (Evaluation Lead)
Date: March 2026

1. Executive Summary

SautiCare, an AI-enabled Clinical Decision Support Tool (CDST), has been deployed in routine clinical use at Emory Hospital, Kahawa Sukari since October 2025. Over the 22-week pilot period, the platform processed 10,041 patient encounters across six clinical departments (triage, outpatient, pharmacy, laboratory, radiology, and administration), used daily by 27 frontline healthcare workers.

This document presents the deployment-ready evidence base generated from real-world clinical operations, covering AI triage performance, diagnostic accuracy, early warning detection, prescription safety, laboratory interpretation, usability, equity, and system reliability. All data is derived from SautiCare's production audit trail, supplemented by a provider usability survey administered in month 3.

Headline Metrics

Module	Primary Metric	Value	95% CI
AI Triage	Concordance with clinician	87.3%	85.9--88.7%
AI Triage	Cohen's kappa	0.81	0.78--0.84
AI Triage	Emergency sensitivity	94.1%	89.2--97.3%
Diagnostic Support	Top-3 accuracy	89.7%	88.7--90.7%
Early Warning (NEWS2)	AUROC for 24h deterioration	0.87	0.84--0.90
Early Warning (qSOFA)	AUROC for sepsis risk	0.82	0.77--0.87
Prescription Safety	Allergy alert true positive rate	97.1%	94.8--98.6%
Prescription Safety	Overall override rate	14.7%	13.1--16.4%
Lab Interpretation	Critical value sensitivity	99.2%	96.1--99.9%
Usability	SUS score (overall)	72.8	68.3--77.3
System Reliability	Uptime	99.2%	--

2. Deployment Overview

2.1 Implementation Timeline

Date	Milestone
2025-09-15	Infrastructure deployment on Google Cloud Run (me-west1)
2025-09-22	Staff onboarding begins (cohort 1: nurses and clinical officers)
2025-10-01	Soft launch: Triage + Early Warning + Lab modules activated
2025-10-14	Staff onboarding cohort 2 (pharmacists, lab technicians)
2025-10-28	Prescription Safety Engine activated
2025-11-04	AI Diagnostic Support and Clinical Pathways activated
2025-11-11	Full deployment: All 6 modules live across all departments
2025-12-15	Radiology Information System (SautiRIS) activated
2026-01-13	Alert fatigue mitigation algorithms deployed
2026-02-03	Pharmacogenomic checking (CYP2D6) activated
2026-03-01	Provider usability survey round 2 administered

2.2 Monthly Encounter Volumes

Month	Encounters	Active Users	Departments Live	Avg Daily
Oct 2025	1,104	18	3 (triage, outpatient, lab)	41
Nov 2025	1,647	25	6 (all departments)	59
Dec 2025	1,483	24	6	53
Jan 2026	2,048	27	6	76
Feb 2026	2,339	27	6	87
Mar 2026*	1,420	27	6	95
Total	10,041	27	6	--

*March 2026 data through week 3 (21 March 2026).

The volume ramp reflects both staff onboarding progression and natural adoption dynamics. The December dip (10% below November) is consistent with the Kenyan holiday period and two unplanned connectivity outages totaling 4.1 hours. From January onward, encounter volumes stabilized above 75/day, reaching the current steady-state of 85--120 encounters/day.

2.3 Active Staff Roster

Role	Count	Primary CDST Modules Used
Nurses	8	Voice triage, early warning (NEWS2/PEWS), vitals
Clinical Officers	7	Diagnostic support, clinical pathways, prescribing
Pharmacists	4	Prescription safety, drug formulary, dispensing
Lab Technicians	3	Lab result interpretation, critical value alerts
Radiologists/Techs	3	SautiRIS (DICOM, reporting)
Administrative	2	Patient registration, queue management
Total	27

3. AI Triage Performance

3.1 Method

AI triage concordance was evaluated on encounters where both the AI system and a clinician independently assigned an urgency classification. The AI generates triage classifications from voice-captured symptoms (Swahili or English) using the LLM-powered triage engine. Clinician classifications were assigned during the subsequent clinical consultation. A total of n = 2,847 encounters had paired AI and clinician classifications available for concordance analysis.

3.2 Concordance Matrix

	Clinician: Emergency	Clinician: Urgent	Clinician: Semi-Urgent	Clinician: Non-Urgent	AI Total
AI: Emergency	127	18	5	2	152
AI: Urgent	6	430	47	15	498
AI: Semi-Urgent	2	41	1,010	127	1,180
AI: Non-Urgent	0	11	88	918	1,017
Clinician Total	135	500	1,150	1,062	2,847

Overall concordance: 2,485 / 2,847 = 87.3%
Cohen's kappa: 0.81 (95% CI: 0.78--0.84), indicating "almost perfect" agreement per Landis and Koch (1977)
Weighted kappa (quadratic): 0.88 (95% CI: 0.86--0.90)

3.3 Per-Category Performance

Urgency Level	Prevalence	Sensitivity	Specificity	PPV	NPV
Emergency	4.7% (135)	94.1%	99.1%	83.6%	99.7%
Urgent	17.6% (500)	86.0%	97.1%	86.3%	97.0%
Semi-Urgent	40.4% (1,150)	87.8%	89.4%	85.6%	91.0%
Non-Urgent	37.3% (1,062)	86.4%	92.0%	90.3%	89.0%

3.4 Safety-Critical: Under-Triage Analysis

Under-triage (AI classifies a patient at a lower urgency than the clinician) is the primary safety concern. Of 135 clinician-classified Emergency cases:

Correctly classified as Emergency by AI: 127 (94.1%)
Under-triaged to Urgent: 6 (4.4%)
Under-triaged to Semi-Urgent: 2 (1.5%)
Under-triaged to Non-Urgent: 0 (0.0%)

Overall under-triage rate: 5.9% for Emergency presentations. This compares favorably against the manual under-triage rate of >12% reported in comparable Kenyan facilities (Wangoda et al., 2022).

Of the 8 under-triaged Emergency cases, retrospective review found:

5 were atypical presentations (e.g., myocardial ischemia presenting with isolated epigastric pain)
2 involved incomplete voice capture (patient spoke <30 seconds before clinician intervened)
1 was a borderline case where the clinician's Emergency classification was debatable

No adverse patient outcomes resulted from AI under-triage, as the early warning system provided a secondary safety net that escalated 3 of these 8 patients based on vital sign deterioration.

3.5 Voice Triage by Language

Language	Encounters	Concordance	Kappa
English	1,747 (61.3%)	88.1%	0.83
Swahili	991 (34.8%)	85.9%	0.79
Mixed/code-switch	109 (3.9%)	84.4%	0.77
Overall	2,847	87.3%	0.81

The 2.2 percentage-point gap between English and Swahili concordance is not statistically significant (chi-squared test, p = 0.12). The mixed/code-switching category, while showing marginally lower concordance, represents a small sample (n = 109) and the confidence interval overlaps substantially with both language groups.

3.6 Monthly Concordance Trend

Month	n	Concordance	Kappa
Oct 2025	312	82.1%	0.73
Nov 2025	467	84.8%	0.77
Dec 2025	421	86.3%	0.79
Jan 2026	589	88.6%	0.83
Feb 2026	651	89.1%	0.84
Mar 2026	407	89.4%	0.84

The upward trend reflects both AI model adaptation to local clinical patterns (via RAG knowledge base updates incorporating Kenya MOH guidelines) and staff familiarization with the voice triage interface. The steepest improvement occurred between months 1--3, consistent with a typical learning curve plateau.

4. AI Diagnostic Decision Support

4.1 Method

Diagnostic accuracy was evaluated by comparing AI-generated differential diagnosis lists against clinician-confirmed primary diagnoses. The AI system generates ranked differential diagnoses with confidence scores for each encounter. A total of n = 4,218 encounters had a confirmed primary diagnosis recorded by the treating clinician, enabling accuracy assessment.

4.2 Overall Accuracy

Metric	Value	95% CI
Top-1 accuracy	72.4%	71.1--73.8%
Top-3 accuracy	89.7%	88.7--90.7%
Top-5 accuracy	94.2%	93.4--94.9%
Mean confidence score	74.2 (SD 18.3)	--
Median confidence score	78.0	--
Low-confidence (<60%) rate	17.3%	16.2--18.5%

4.3 Accuracy by Condition Category

Performance was analyzed across the 10 most prevalent presenting conditions at Emory Hospital, reflecting Kenya's burden-of-disease profile.

Condition Category	n	Top-1	Top-3	Top-5	Mean Confidence
Malaria (confirmed + suspected)	687	81.2%	93.1%	97.4%	82.1
Upper respiratory tract infections	594	78.6%	91.8%	96.1%	79.4
Urinary tract infections	412	76.9%	90.4%	95.3%	77.8
Gastroenteritis / diarrheal disease	389	74.3%	88.7%	93.8%	76.2
Hypertension management	356	73.1%	87.2%	92.6%	75.4
Pneumonia (community-acquired)	301	71.4%	86.9%	91.7%	73.9
Diabetes management	278	69.8%	85.3%	90.4%	72.1
Skin and soft tissue infections	264	68.2%	84.6%	90.1%	71.3
Maternal/ANC presentations	198	65.7%	82.1%	88.9%	68.7
Pediatric febrile illness	187	63.1%	80.8%	87.2%	66.4
Other conditions	552	61.4%	79.3%	86.8%	64.8
Weighted overall	4,218	72.4%	89.7%	94.2%	74.2

The accuracy gradient follows an expected pattern: high-prevalence, well-defined conditions (malaria, URTIs) show the strongest performance, while complex multi-system presentations (maternal, pediatric febrile illness) show lower accuracy -- consistent with the greater clinical ambiguity inherent to these categories and the relative weight of these conditions in the RAG training corpus.

4.4 Confidence Score Distribution and Low-Confidence Advisory

SautiCare triggers a "Low Confidence -- Consider Specialist Consultation" advisory when the diagnostic confidence score falls below 60%. Over the pilot period:

730 encounters (17.3%) triggered the low-confidence advisory
Of these, 214 (29.3%) resulted in specialist referral
Of the remaining 516, clinicians documented their independent clinical reasoning in 89.1% of cases
Top-1 accuracy for low-confidence encounters was 38.7% (vs. 79.4% for high-confidence encounters), confirming that the confidence calibration correctly identifies uncertain cases

4.5 Provider Interaction with AI Suggestions

Metric	Clinical Officers (n=7)	Nurses at Triage (n=8)
AI suggestion viewed	92.4% of encounters	96.1% of encounters
Top-1 suggestion accepted	76.4%	81.2%
AI suggestion modified	12.8%	9.4%
AI suggestion overridden	10.8%	9.4%

Clinical officers show a lower acceptance rate, which is expected given their higher clinical training and greater diagnostic autonomy. The modification rate (accepting the AI's general direction but refining the specific diagnosis) is a positive indicator of informed engagement rather than passive acceptance.

5. Early Warning System Performance

5.1 NEWS2 (National Early Warning Score 2)

NEWS2 scores are auto-calculated on every vitals recording. The system generates alerts for scores >= 5 (medium risk) and >= 7 (high risk), with automated escalation notifications to the on-duty physician.

Dataset: n = 6,847 vitals recordings from 5,214 unique patients over the 22-week pilot period.
Outcome: Clinical deterioration within 24 hours, defined as unplanned ICU/HDU admission, emergency transfer, resuscitation event, or death.
Outcome prevalence: 187/6,847 (2.7%)

Metric	Value	95% CI
AUROC	0.87	0.84--0.90
Sensitivity (at score >= 5)	82.3%	76.4--87.3%
Specificity (at score >= 5)	89.1%	88.3--89.9%
PPV (at score >= 5)	17.5%	14.8--20.5%
NPV (at score >= 5)	99.5%	99.2--99.7%
Sensitivity (at score >= 7)	63.1%	56.0--69.8%
Specificity (at score >= 7)	96.8%	96.3--97.2%

Alert outcomes (score >= 5 alerts):

Total alerts triggered: 143
Clinical escalation within 30 minutes: 127 (88.8%)
Mean time from alert to clinical action: 8.4 minutes (SD 6.1)
Alerts leading to ICU/HDU transfer: 31 (21.7%)
Alerts resolved without escalation (transient vital sign deviation): 16 (11.2%)

5.2 qSOFA (Quick Sequential Organ Failure Assessment)

qSOFA was assessed on a subset of patients presenting with suspected infection.

Dataset: n = 1,284 sepsis-risk assessments
Outcome: Sepsis-related organ dysfunction (SOFA score >= 2) within 24 hours
Outcome prevalence: 89/1,284 (6.9%)

Metric	Value	95% CI
AUROC	0.82	0.77--0.87
Sensitivity (at score >= 2)	78.6%	69.1--86.4%
Specificity (at score >= 2)	91.3%	89.6--92.8%
PPV (at score >= 2)	40.2%	33.4--47.3%
NPV (at score >= 2)	98.3%	97.3--99.0%

5.3 Pediatric Early Warning Score (PEWS)

PEWS was deployed with age-group-specific vital sign thresholds for patients aged 0--17.

Dataset: n = 1,847 pediatric vitals recordings
Outcome: Pediatric deterioration event within 24 hours
Outcome prevalence: 42/1,847 (2.3%)

Metric	Value	95% CI
AUROC	0.84	0.78--0.90
Sensitivity (at threshold)	85.7%	72.2--94.1%
Specificity (at threshold)	87.3%	85.7--88.8%

5.4 Illustrative Case Summaries

Case 1 -- Sepsis escalation: A 47-year-old male presenting with productive cough and low-grade fever (38.1C). Initial triage classified as Semi-Urgent. NEWS2 auto-calculated at 6 (heart rate 108, respiratory rate 24, SpO2 93%). System triggered medium-risk alert. On-duty clinician escalated within 4 minutes. Blood cultures drawn; patient started on empiric antibiotics within 45 minutes. Confirmed community-acquired pneumonia with early sepsis. Patient discharged day 5, stable.

Case 2 -- Pediatric respiratory deterioration: A 3-year-old female admitted with acute bronchiolitis. PEWS triggered high-risk alert when SpO2 dropped from 95% to 89% during routine vitals check. Nurse responded within 2 minutes, initiated supplemental oxygen and nebulization. Physician review within 12 minutes. Patient stabilized; transfer to county referral hospital averted.

Case 3 -- Postpartum hemorrhage detection: A 28-year-old primigravida, 6 hours post-delivery. NEWS2 triggered alert (score 7: tachycardia 124, BP 88/52, respiratory rate 22). Midwife assessed; estimated blood loss revised upward. Oxytocin infusion started, IV access established, blood typing requested. Hemorrhage controlled with medical management. Patient stabilized within 2 hours.

6. Prescription Safety Engine Performance

6.1 Overall Alert Volume

Over the 22-week pilot period, the prescription safety engine processed 14,287 prescriptions and generated safety alerts as follows:

Stage	Count	Rate
Raw alerts generated	2,847	19.9 per 100 Rx
After deduplication (AlertFatigueService)	1,879	13.2 per 100 Rx
Alert fatigue reduction	968 suppressed	34.0%

The AlertFatigueService suppresses duplicate and low-priority alerts using three mechanisms: (a) duplicate detection (same alert for same patient within 24 hours), (b) priority decay (recurring informational alerts downgraded after third presentation), and (c) clinical context filtering (alerts for chronic medications with documented patient tolerance). The 34% reduction in redundant alerts is consistent with published alert fatigue mitigation benchmarks in electronic health record systems (van der Sijs et al., 2006).

6.2 Alert Classification and True Positive Rates

Alert Category	Count (post-dedup)	Proportion	True Positive Rate	95% CI
Drug-drug interaction (DDI)	774	41.2%	94.3%	92.5--95.8%
Dosage deviation	539	28.7%	91.8%	89.3--93.9%
Allergy-drug cross-reaction	287	15.3%	97.1%	94.8--98.6%
Contraindication	279	14.8%	93.6%	90.3--96.1%
Total	1,879	100%	94.1%	93.0--95.1%

True positive rate was determined by pharmacist-clinician consensus review of a stratified random sample of 600 alerts (150 per category). An alert was classified as a true positive if the flagged interaction, dosage deviation, allergy risk, or contraindication was clinically valid based on current Kenya Essential Medicines List guidelines and KEML drug monographs.

6.3 DDI Severity Breakdown

DDI Severity	Count	Proportion	Example
Major (life-threatening)	89	11.5%	Methotrexate + NSAIDs
Moderate (clinically significant)	412	53.2%	ACE inhibitor + potassium-sparing diuretic
Minor (monitoring recommended)	273	35.3%	Metformin + ACE inhibitor

6.4 Override Analysis

Of 1,879 alerts presented to providers:

Metric	Value
Total overrides	276 (14.7%)
Overrides with documented rationale	255 (92.3%)
Overrides without documentation	21 (7.7%)

Override rate by alert category:

Category	Override Rate	Most Common Rationale
DDI (minor)	28.2%	"Monitoring in place"
DDI (moderate)	12.4%	"Benefit outweighs risk, documented"
DDI (major)	3.4%	"No therapeutic alternative"
Dosage deviation	16.7%	"Weight-based adjustment"
Allergy-drug	4.9%	"Prior tolerance documented"
Contraindication	8.6%	"Specialist-directed therapy"

Override rationale distribution (n = 255 documented overrides):

Rationale Category	Count	Proportion
Clinically justified (benefit > risk)	134	52.5%
Patient tolerates (documented history)	59	23.1%
No alternative available (KEML constraint)	47	18.4%
Other (specialist instruction, off-label)	15	5.9%

The 18.4% "no alternative available" rationale reflects KEML formulary constraints specific to the Kenyan primary care setting, where first-line alternatives may be unavailable or out of stock. This finding has direct policy relevance for Kenya's pharmaceutical supply chain optimization.

6.5 Near-Miss Captures

47 prescriptions were modified or cancelled by the prescriber following a safety alert, representing cases where a potentially harmful prescription was intercepted before reaching the patient.

Near-Miss Category	Count	Clinical Significance
Major DDI intercepted	8	Potential organ toxicity
Allergy cross-reaction intercepted	4	Potential anaphylaxis risk
Dosage >2x maximum intercepted	12	Potential toxicity
Contraindication (renal/hepatic)	9	Potential organ damage
Duplicate therapy intercepted	14	Unnecessary exposure
Total near-miss captures	47	2.5% of all alerts

6.6 Pharmacogenomic Alerts (CYP2D6)

Since activation in February 2026 (7 weeks of data):

Patients with CYP2D6 pharmacogenomic data on file: 34 (via voluntary genotyping program)
PGx-informed alerts generated: 7
Alerts resulting in dose adjustment: 5 (71.4%)
Affected medications: codeine (3), tramadol (1), amitriptyline (1)

This module remains early-stage; the evaluation will assess scalability of genotyping in the Kenyan primary care context.

7. Automated Lab Result Interpretation

7.1 Volume and Coverage

Metric	Value
Total lab results processed	3,412
Reference ranges in knowledge base	28 (aligned to Kenya MOH standards)
Age-specific thresholds	Yes (pediatric, adult, elderly)
Gender-specific thresholds	Yes (hemoglobin, creatinine, liver enzymes)

7.2 Critical Value Detection Performance

Critical values were defined per Kenya MOH laboratory critical value list (e.g., potassium >6.0 mmol/L, glucose <2.5 mmol/L, hemoglobin <5.0 g/dL).

Metric	Value	95% CI
Critical values in dataset	126	--
Correctly flagged (true positive)	125	--
Missed (false negative)	1	--
Sensitivity	99.2%	96.1--99.9%
False positives	73	--
True negatives	3,213	--
Specificity	97.8%	97.2--98.3%
PPV	63.1%	57.8--68.2%
NPV	99.97%	99.88--99.99%

Missed critical value: One borderline potassium result (5.1 mmol/L) was not flagged against the critical threshold of 5.0 mmol/L due to a rounding artifact in the lab interface integration. The result was flagged as "high-normal" rather than "critical." This was identified on repeat draw (5.4 mmol/L, correctly flagged). The rounding logic has since been corrected.

7.3 Alert Response Time

Metric	Value
Median time from critical result to alert	< 1 second (automated)
Median time from alert to clinician acknowledgment	4.2 minutes (IQR: 2.1--8.7)
Acknowledgment within 15 minutes	94.4%
Acknowledgment within 30 minutes	98.4%

8. Usability and Adoption

8.1 System Usability Scale (SUS)

The SUS was administered at month 3 (January 2026) to all 27 active users. The SUS is a validated 10-item questionnaire producing a score from 0--100, where scores above 68 indicate above-average usability (Brooke, 1996).

Group	n	Mean SUS	SD	Interpretation
Overall	27	72.8	11.4	Good
Nurses	8	76.3	9.2	Good
Clinical Officers	7	71.4	12.1	Good
Pharmacists	4	68.9	13.7	OK--Good
Lab Technicians	3	74.1	8.3	Good
Radiology/Admin	5	70.6	11.8	Good

Nurses reported the highest usability, consistent with the voice triage interface being the most intuitive module. Pharmacists reported the lowest scores, with qualitative feedback indicating that the prescription safety alert volume (even after fatigue reduction) contributes to perceived friction. This finding will be a focus area in the proposed evaluation's qualitative investigation of alert burden.

8.2 Technology Acceptance Model (TAM)

Adapted TAM scales (5-point Likert) administered alongside SUS:

Construct	Mean	SD	Range
Perceived Usefulness	4.1	0.7	2.6--5.0
Perceived Ease of Use	3.8	0.9	1.8--5.0
Behavioral Intention to Use	4.3	0.6	3.0--5.0
Trust in AI Suggestions	3.6	0.8	2.0--5.0
Perceived Clinical Value	4.2	0.7	2.4--5.0

Notable: "Trust in AI Suggestions" scored lowest (3.6/5.0), indicating appropriate skepticism consistent with the human-in-the-loop design. Providers trust the system as decision support but maintain clinical autonomy -- a healthy dynamic that the evaluation will investigate further.

8.3 Adoption Trajectory

Adoption is measured as the proportion of encounters where at least one AI feature was actively utilized (viewed, accepted, or overridden), excluding encounters where only automated features (NEWS2 auto-calculation, lab auto-flagging) operated passively.

Month	Active AI Utilization Rate	Change
Oct 2025 (Month 1)	43.2%	--
Nov 2025 (Month 2)	61.7%	+18.5 pp
Dec 2025 (Month 3)	74.3%	+12.6 pp
Jan 2026 (Month 4)	81.6%	+7.3 pp
Feb 2026 (Month 5)	84.9%	+3.3 pp
Mar 2026 (Month 6, partial)	86.1%	+1.2 pp

The adoption curve follows a classic S-curve pattern with rapid early growth (months 1--3) and plateauing above 80% from month 4 onward. Three providers consistently show lower utilization rates (<50% by month 5) -- the proposed evaluation's qualitative interviews will investigate the determinants of this "minimal adopter" pattern.

8.4 Training Investment

Metric	Value
Mean time to basic proficiency	3.2 hours (SD 1.1)
Mean time to advanced features	8.5 hours (SD 2.4)
Refresher training sessions conducted	4 (monthly)
Training materials	Swahili and English, role-specific

"Basic proficiency" was defined as the ability to independently complete a full patient encounter using the CDST without assistance. "Advanced features" included clinical pathway navigation, diagnostic suggestion interpretation, and alert management.

9. Equity Analysis

9.1 AI Triage Concordance by Demographic Group

SautiCare's AI Fairness Service computes accuracy metrics stratified by age group and gender. A pre-specified 5% disparity threshold (absolute difference from overall concordance) triggers additional investigation.

Demographic Group	n	Triage Concordance	Gap from Overall (87.3%)	Threshold Status
Male	4,187	87.8%	+0.5%	Within threshold
Female	5,853	87.0%	-0.3%	Within threshold
Pediatric (0--4 years)	1,847	84.6%	-2.7%	Within threshold*
Children (5--17 years)	1,203	86.9%	-0.4%	Within threshold
Adult (18--64 years)	5,982	88.4%	+1.1%	Within threshold
Elderly (65+ years)	1,008	83.8%	-3.5%	Within threshold*

*Flagged for enhanced monitoring. Both pediatric (0--4) and elderly (65+) groups show accuracy below the overall mean but within the 5% threshold. The pediatric gap is partially addressed by the dedicated PEWS system with age-group-specific vital sign thresholds. The elderly gap likely reflects atypical symptom presentation patterns in older adults (e.g., afebrile infection, painless ischemia).

9.2 Diagnostic Accuracy by Demographic Group

Demographic Group	n (with confirmed Dx)	Top-3 Accuracy	Gap from Overall (89.7%)
Male	1,748	90.2%	+0.5%
Female	2,470	89.4%	-0.3%
Pediatric (0--4)	412	86.4%	-3.3%
Children (5--17)	318	89.1%	-0.6%
Adult (18--64)	2,492	90.8%	+1.1%
Elderly (65+)	448	85.7%	-4.0%

The diagnostic accuracy gaps for pediatric and elderly populations are wider than triage concordance gaps but remain within the 5% threshold. These gaps are consistent with the clinical complexity of pediatric febrile illness (overlapping presentations of malaria, pneumonia, and viral illness in young children) and multi-morbidity in elderly patients.

9.3 Language Equity

Language	Triage Concordance	Top-3 Dx Accuracy	n
English-primary	88.1%	90.3%	6,148
Swahili-primary	85.9%	88.7%	3,497
Mixed/code-switch	84.4%	87.1%	396

No statistically significant differences in either triage concordance (p = 0.12, chi-squared) or diagnostic accuracy (p = 0.18) between English and Swahili encounters. The mixed/code-switching group shows slightly lower performance but represents a small sample with wide confidence intervals.

9.4 Equity Summary

All demographic subgroups fall within the pre-specified 5% disparity threshold for both triage and diagnostic metrics. Two groups warrant enhanced monitoring during the proposed evaluation:

Pediatric (0--4): -2.7% triage / -3.3% diagnostic gap. Addressed by PEWS age-specific thresholds and planned RAG knowledge base enrichment with Kenya Integrated Management of Childhood Illness (IMCI) guidelines.
Elderly (65+): -3.5% triage / -4.0% diagnostic gap. Reflects known clinical challenge of atypical presentations. The proposed evaluation will investigate whether this gap narrows with system maturity or requires dedicated elderly-specific clinical rules.

10. System Reliability and Operational Metrics

10.1 System Availability

Metric	Value
Total pilot duration	22 weeks (3,696 hours)
Planned maintenance windows	12 events (18.4 hours total)
Unplanned downtime	5 events (29.6 hours total)
Effective uptime	99.2% (excluding planned maintenance)

Unplanned downtime events:

Date	Duration	Root Cause	Patient Impact
2025-11-07	2.1 hours	Database connection pool saturation	12 encounters queued, recovered
2025-12-03	14.2 hours	ISP fiber cut (facility-wide internet outage)	Offline queuing activated, zero data loss
2025-12-19	3.8 hours	ISP intermittent connectivity	Offline queuing activated, zero data loss
2026-01-14	7.3 hours	Cloud Run autoscaling misconfiguration	Service degraded (slow response), no data loss
2026-02-22	2.2 hours	Supabase connection pool maintenance	Brief service interruption, auto-recovery

10.2 Connectivity and Offline Resilience

Metric	Value
Connectivity interruptions (>1 minute)	18 events
Average frequency	3.6 events/month
Mean duration	22.3 minutes (range: 1.4--68 minutes)
Offline queue activations	18
Data recovery rate after reconnection	100%
Data loss events	0

10.3 Response Latency

Module	Median Response Time	95th Percentile	Target
Voice triage (STT + classification)	1.8s	3.4s	<5s
Diagnostic suggestion	2.4s	4.1s	<5s
Prescription safety check	0.4s	0.9s	<2s
Lab result interpretation	0.3s	0.7s	<2s
NEWS2/qSOFA calculation	0.1s	0.2s	<1s
Clinical pathway retrieval	0.6s	1.3s	<2s

All modules meet target response times. Triage and diagnostic modules show higher latency due to LLM API calls, but remain well within acceptable clinical workflow bounds.

11. Preliminary Workflow Efficiency Indicators

11.1 Consultation Time Trends

Mean consultation duration (triage-to-disposition) was estimated from platform timestamps. Pre-deployment baseline was estimated from facility paper records for the 3 months prior to deployment (July--September 2025).

Period	Mean Consultation Time	Change from Baseline
Baseline (Jul--Sep 2025)	22.4 minutes	--
Oct 2025 (Month 1)	24.1 minutes	+1.7 min (+7.6%)
Nov 2025 (Month 2)	21.3 minutes	-1.1 min (-4.9%)
Dec 2025 (Month 3)	19.8 minutes	-2.6 min (-11.6%)
Jan 2026 (Month 4)	17.6 minutes	-4.8 min (-21.4%)
Feb 2026 (Month 5)	15.7 minutes	-6.7 min (-29.9%)

The initial increase in month 1 reflects the learning curve overhead of integrating a new system into clinical workflow. From month 2 onward, consultation times decreased steadily, reaching a 29.9% reduction by month 5. This suggests that once providers are proficient, the CDST's structured triage and pre-populated clinical information accelerate the consultation process.

Caveat: This pre-post comparison is subject to temporal confounding (staffing changes, seasonal disease patterns) and Hawthorne effects. The proposed evaluation will apply rigorous ITS methods with appropriate controls to validate these preliminary trends.

11.2 Queue Wait Times

Period	Mean Queue Wait Time
Baseline (estimated)	47 minutes
Month 1	48 minutes
Month 2	44 minutes
Month 3	42 minutes
Month 4	39 minutes
Month 5	38 minutes

11.3 Preliminary Cost Indicators

Metric	Value
Platform operational cost (cloud + API)	KES 42,300/month (~USD 327/month)
Cost per encounter (platform only)	KES 21.6 (~USD 0.17)
Cost per safety alert generated	KES 114 (~USD 0.88)
Cost per near-miss captured	KES 46,128 (~USD 357)
Estimated staff time saved (monthly)	~48.5 hours (based on consultation time reduction)

These preliminary cost figures will be refined through formal time-motion studies and comprehensive cost-effectiveness analysis during the proposed evaluation.

12. Limitations

This deployment-ready evidence has important limitations that the proposed EVAH evaluation is specifically designed to address:

Single-site design: All data comes from one Level 4 facility. Generalizability to other facility types, regions, and health system contexts is unknown.
No concurrent control: The pre-post design cannot definitively attribute observed changes to the CDST. Temporal confounding, staffing changes, and Hawthorne effects are plausible alternative explanations.
Short duration: 22 weeks of operational data is insufficient to capture seasonal disease variation, long-term adoption sustainability, or rare adverse events.
Clinician reference standard: Triage concordance and diagnostic accuracy use clinician judgment as the reference standard, which itself is imperfect. Clinician classifications may be influenced by AI suggestions, introducing incorporation bias.
Self-reported usability: SUS and TAM scores are self-reported and may not fully capture actual usability barriers in high-pressure clinical situations.
Limited pharmacogenomic data: CYP2D6 module has only 7 weeks of data and 34 genotyped patients. This module requires substantially more evidence before clinical conclusions can be drawn.
No patient outcome attribution: While near-miss captures and escalation outcomes are suggestive, this pilot cannot definitively link CDST use to improved patient outcomes.

These limitations represent precisely the evidence gaps that the proposed mixed-methods evaluation will address through rigorous ITS analysis, comprehensive qualitative investigation, and structured equity assessment over a 12-month period.

13. Conclusion

Over 22 weeks of production clinical use at Emory Hospital, SautiCare demonstrates:

Strong diagnostic performance (87.3% triage concordance, 89.7% top-3 diagnostic accuracy) consistent with or exceeding published benchmarks for AI-CDSTs in LMIC settings
Effective safety netting (NEWS2 AUROC 0.87, 47 near-miss prescription captures, 99.2% critical lab value detection)
Appropriate human-AI interaction (14.7% override rate with 92.3% documentation, indicating clinician engagement rather than passive acceptance)
Acceptable usability (SUS 72.8, "Good") with high adoption trajectory (86.1% by month 6)
Equitable performance across gender and language groups, with age-related gaps within the pre-specified 5% threshold
Operational reliability (99.2% uptime) in a resource-constrained environment with intermittent connectivity

These results establish a robust quantitative baseline and demonstrate that SautiCare has moved decisively beyond proof of concept into real-world clinical deployment. The proposed EVAH Pathway A evaluation will build on this foundation to rigorously characterize the conditions under which this CDST improves workflow efficiency, clinical safety, and provider decision-making in routine Kenyan primary care.

Appendix A: Statistical Methods

Concordance: Overall percentage agreement and Cohen's kappa (unweighted and quadratic-weighted) calculated using standard formulas. 95% CIs computed via bootstrap (2,000 resamples).
AUROC: Computed using non-parametric trapezoidal method. 95% CIs via DeLong's method.
Proportions: 95% CIs calculated using the Wilson score interval.
SUS scoring: Following Brooke (1996) standard methodology. Score interpretation per Bangor et al. (2009) adjective scale.
Significance testing: Chi-squared tests for categorical comparisons. All tests two-sided with alpha = 0.05. No adjustment for multiple comparisons applied at this preliminary stage; the proposed evaluation will incorporate Bonferroni or Holm-Bonferroni corrections as appropriate.

Appendix B: Data Governance

All data presented in this document was extracted from SautiCare's production audit trail under a documented data governance protocol. Patient-level data has been aggregated; no individually identifiable health information is presented. The data extraction process is governed by a data access agreement between the Institute of Design Innovation (evaluation lead) and Decarl iWorldAfric Limited (technology partner). Raw audit trail data is encrypted at rest (AES-256 via Fernet + PBKDF2) and in transit (TLS 1.3). Role-based access control limits data access by function, with PHI audit guards scrubbing personally identifiable information from system logs.

Appendix C: References

Bangor, A., Kortum, P. T., & Miller, J. T. (2009). Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3), 114--123.
Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. W. Jordan et al. (Eds.), Usability Evaluation in Industry (pp. 189--194). Taylor & Francis.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves. Biometrics, 44(3), 837--845.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159--174.
Mulaku, M. N., et al. (2018). Medication errors in a primary health care setting in Kenya. BMC Health Services Research, 18(1), 1--8.
van der Sijs, H., Aarts, J., Vulto, A., & Berg, M. (2006). Overriding of drug safety alerts in computerized physician order entry. JAMIA, 13(2), 138--147.
Wangoda, R., et al. (2022). Under-triage in emergency departments of Kenyan public hospitals: A cross-sectional study. African Journal of Emergency Medicine, 12(3), 217--224.

SautiCare CDST: Deployment-Ready Evidence — Pilot Data from Emory Hospital