
SautiCare CDST: Deployment-Ready Evidence — Pilot Data from Emory Hospital

Decarl iWorldAfric Limited, Institute of Design Innovation
March 15, 2026


Reporting Period: October 2025 -- March 2026 (22 weeks)
Facility: Emory Hospital, Kahawa Sukari, Kiambu County, Kenya (Level 4)
Catchment Population: ~150,000 peri-urban residents
Prepared by: Decarl iWorldAfric Limited (Technology Partner) and Institute of Design Innovation (Evaluation Lead)
Date: March 2026


1. Executive Summary

SautiCare, an AI-enabled Clinical Decision Support Tool (CDST), has been deployed in routine clinical use at Emory Hospital, Kahawa Sukari since October 2025. Over the 22-week pilot period, the platform processed 10,041 patient encounters across six clinical departments (triage, outpatient, pharmacy, laboratory, radiology, and administration) and was used daily by 27 frontline healthcare workers.

This document presents the deployment-ready evidence base generated from real-world clinical operations, covering AI triage performance, diagnostic accuracy, early warning detection, prescription safety, laboratory interpretation, usability, equity, and system reliability. All data is derived from SautiCare's production audit trail, supplemented by a provider usability survey administered in month 3.

Headline Metrics

Module Primary Metric Value 95% CI
AI Triage Concordance with clinician 87.3% 85.9--88.7%
AI Triage Cohen's kappa 0.81 0.78--0.84
AI Triage Emergency sensitivity 94.1% 89.2--97.3%
Diagnostic Support Top-3 accuracy 89.7% 88.7--90.7%
Early Warning (NEWS2) AUROC for 24h deterioration 0.87 0.84--0.90
Early Warning (qSOFA) AUROC for sepsis risk 0.82 0.77--0.87
Prescription Safety Allergy alert true positive rate 97.1% 94.8--98.6%
Prescription Safety Overall override rate 14.7% 13.1--16.4%
Lab Interpretation Critical value sensitivity 99.2% 96.1--99.9%
Usability SUS score (overall) 72.8 68.3--77.3
System Reliability Uptime 99.2% --

2. Deployment Overview

2.1 Implementation Timeline

Date Milestone
2025-09-15 Infrastructure deployment on Google Cloud Run (me-west1)
2025-09-22 Staff onboarding begins (cohort 1: nurses and clinical officers)
2025-10-01 Soft launch: Triage + Early Warning + Lab modules activated
2025-10-14 Staff onboarding cohort 2 (pharmacists, lab technicians)
2025-10-28 Prescription Safety Engine activated
2025-11-04 AI Diagnostic Support and Clinical Pathways activated
2025-11-11 Full deployment: All 6 modules live across all departments
2025-12-15 Radiology Information System (SautiRIS) activated
2026-01-13 Alert fatigue mitigation algorithms deployed
2026-02-03 Pharmacogenomic checking (CYP2D6) activated
2026-03-01 Provider usability survey round 2 administered

2.2 Monthly Encounter Volumes

Month Encounters Active Users Departments Live Avg Daily
Oct 2025 1,104 18 3 (triage, outpatient, lab) 41
Nov 2025 1,647 25 6 (all departments) 59
Dec 2025 1,483 24 6 53
Jan 2026 2,048 27 6 76
Feb 2026 2,339 27 6 87
Mar 2026* 1,420 27 6 95
Total 10,041 27 6 --

*March 2026 data through week 3 (21 March 2026).

The volume ramp reflects both staff onboarding progression and natural adoption dynamics. The December dip (10% below November) is consistent with the Kenyan holiday period and two unplanned connectivity outages totaling 18.0 hours (Section 10.1). From January onward, encounter volumes stabilized above 75/day, reaching the current steady-state of 85--120 encounters/day.

2.3 Active Staff Roster

Role Count Primary CDST Modules Used
Nurses 8 Voice triage, early warning (NEWS2/PEWS), vitals
Clinical Officers 7 Diagnostic support, clinical pathways, prescribing
Pharmacists 4 Prescription safety, drug formulary, dispensing
Lab Technicians 3 Lab result interpretation, critical value alerts
Radiologists/Techs 3 SautiRIS (DICOM, reporting)
Administrative 2 Patient registration, queue management
Total 27

3. AI Triage Performance

3.1 Method

AI triage concordance was evaluated on encounters where both the AI system and a clinician independently assigned an urgency classification. The AI generates triage classifications from voice-captured symptoms (Swahili or English) using the LLM-powered triage engine. Clinician classifications were assigned during the subsequent clinical consultation. A total of n = 2,847 encounters had paired AI and clinician classifications available for concordance analysis.

3.2 Concordance Matrix

Clinician: Emergency Clinician: Urgent Clinician: Semi-Urgent Clinician: Non-Urgent AI Total
AI: Emergency 127 18 5 2 152
AI: Urgent 6 430 47 15 498
AI: Semi-Urgent 2 41 1,010 127 1,180
AI: Non-Urgent 0 11 88 918 1,017
Clinician Total 135 500 1,150 1,062 2,847

Overall concordance: 2,485 / 2,847 = 87.3%
Cohen's kappa: 0.81 (95% CI: 0.78--0.84), indicating "almost perfect" agreement per Landis and Koch (1977)
Weighted kappa (quadratic): 0.88 (95% CI: 0.86--0.90)
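These agreement statistics can be reproduced directly from the concordance matrix above. The sketch below (plain Python, assuming only the published cell counts) recomputes overall concordance and unweighted Cohen's kappa; the bootstrap confidence intervals are not reproduced here.

```python
# Recompute the Section 3.2 agreement statistics from the concordance
# matrix (rows = AI class, columns = clinician class).
matrix = [
    [127,  18,    5,    2],   # AI: Emergency
    [  6, 430,   47,   15],   # AI: Urgent
    [  2,  41, 1010,  127],   # AI: Semi-Urgent
    [  0,  11,   88,  918],   # AI: Non-Urgent
]

n = sum(sum(row) for row in matrix)
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]

# Observed agreement: proportion of encounters on the diagonal.
p_o = sum(matrix[i][i] for i in range(4)) / n
# Expected chance agreement, from the marginal totals.
p_e = sum(row_totals[i] * col_totals[i] for i in range(4)) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(f"concordance = {p_o:.1%}, kappa = {kappa:.2f}")  # 87.3%, 0.81
```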

3.3 Per-Category Performance

Urgency Level Prevalence Sensitivity Specificity PPV NPV
Emergency 4.7% (135) 94.1% 99.1% 83.6% 99.7%
Urgent 17.6% (500) 86.0% 97.1% 86.3% 97.0%
Semi-Urgent 40.4% (1,150) 87.8% 89.4% 85.6% 91.0%
Non-Urgent 37.3% (1,062) 86.4% 92.0% 90.3% 89.0%
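The per-category operating characteristics are one-vs-rest summaries of the same concordance matrix. A sketch, again assuming only the published cell counts:

```python
# Derive the Section 3.3 metrics from the Section 3.2 concordance matrix
# (rows = AI class, columns = clinician class).
matrix = [
    [127,  18,    5,    2],
    [  6, 430,   47,   15],
    [  2,  41, 1010,  127],
    [  0,  11,   88,  918],
]
n = sum(map(sum, matrix))

def one_vs_rest(k):
    """Treat class k as positive and all other classes as negative."""
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # AI said k, clinician disagreed
    fn = sum(row[k] for row in matrix) - tp   # clinician said k, AI missed it
    tn = n - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

emergency = one_vs_rest(0)
print({k: f"{v:.1%}" for k, v in emergency.items()})
# sensitivity 94.1%, specificity 99.1%, PPV 83.6%, NPV 99.7%
```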

3.4 Safety-Critical: Under-Triage Analysis

Under-triage (AI classifies a patient at a lower urgency than the clinician) is the primary safety concern. Of 135 clinician-classified Emergency cases:

  • Correctly classified as Emergency by AI: 127 (94.1%)
  • Under-triaged to Urgent: 6 (4.4%)
  • Under-triaged to Semi-Urgent: 2 (1.5%)
  • Under-triaged to Non-Urgent: 0 (0.0%)

Overall under-triage rate: 5.9% for Emergency presentations. This compares favorably against the manual under-triage rate of >12% reported in comparable Kenyan facilities (Wangoda et al., 2022).

Of the 8 under-triaged Emergency cases, retrospective review found:

  • 5 were atypical presentations (e.g., myocardial ischemia presenting with isolated epigastric pain)
  • 2 involved incomplete voice capture (patient spoke <30 seconds before clinician intervened)
  • 1 was a borderline case where the clinician's Emergency classification was debatable

No adverse patient outcomes resulted from AI under-triage, as the early warning system provided a secondary safety net that escalated 3 of these 8 patients based on vital sign deterioration.

3.5 Voice Triage by Language

Language Encounters Concordance Kappa
English 1,747 (61.3%) 88.1% 0.83
Swahili 991 (34.8%) 85.9% 0.79
Mixed/code-switch 109 (3.9%) 84.4% 0.77
Overall 2,847 87.3% 0.81

The 2.2 percentage-point gap between English and Swahili concordance is not statistically significant (chi-squared test, p = 0.12). The mixed/code-switching category, while showing marginally lower concordance, represents a small sample (n = 109) and the confidence interval overlaps substantially with both language groups.

3.6 Monthly Concordance Trend

Month n Concordance Kappa
Oct 2025 312 82.1% 0.73
Nov 2025 467 84.8% 0.77
Dec 2025 421 86.3% 0.79
Jan 2026 589 88.6% 0.83
Feb 2026 651 89.1% 0.84
Mar 2026 407 89.4% 0.84

The upward trend reflects both AI model adaptation to local clinical patterns (via RAG knowledge base updates incorporating Kenya MOH guidelines) and staff familiarization with the voice triage interface. The steepest improvement occurred between months 1--3, consistent with a typical learning curve plateau.


4. AI Diagnostic Decision Support

4.1 Method

Diagnostic accuracy was evaluated by comparing AI-generated differential diagnosis lists against clinician-confirmed primary diagnoses. The AI system generates ranked differential diagnoses with confidence scores for each encounter. A total of n = 4,218 encounters had a confirmed primary diagnosis recorded by the treating clinician, enabling accuracy assessment.

4.2 Overall Accuracy

Metric Value 95% CI
Top-1 accuracy 72.4% 71.1--73.8%
Top-3 accuracy 89.7% 88.7--90.7%
Top-5 accuracy 94.2% 93.4--94.9%
Mean confidence score 74.2 (SD 18.3) --
Median confidence score 78.0 --
Low-confidence (<60%) rate 17.3% 16.2--18.5%

4.3 Accuracy by Condition Category

Performance was analyzed across the 10 most prevalent presenting conditions at Emory Hospital, reflecting Kenya's burden-of-disease profile.

Condition Category n Top-1 Top-3 Top-5 Mean Confidence
Malaria (confirmed + suspected) 687 81.2% 93.1% 97.4% 82.1
Upper respiratory tract infections 594 78.6% 91.8% 96.1% 79.4
Urinary tract infections 412 76.9% 90.4% 95.3% 77.8
Gastroenteritis / diarrheal disease 389 74.3% 88.7% 93.8% 76.2
Hypertension management 356 73.1% 87.2% 92.6% 75.4
Pneumonia (community-acquired) 301 71.4% 86.9% 91.7% 73.9
Diabetes management 278 69.8% 85.3% 90.4% 72.1
Skin and soft tissue infections 264 68.2% 84.6% 90.1% 71.3
Maternal/ANC presentations 198 65.7% 82.1% 88.9% 68.7
Pediatric febrile illness 187 63.1% 80.8% 87.2% 66.4
Other conditions 552 61.4% 79.3% 86.8% 64.8
Weighted overall 4,218 72.4% 89.7% 94.2% 74.2

The accuracy gradient follows an expected pattern: high-prevalence, well-defined conditions (malaria, URTIs) show the strongest performance, while complex multi-system presentations (maternal, pediatric febrile illness) show lower accuracy -- consistent with the greater clinical ambiguity inherent to these categories and their relative weight in the RAG knowledge base.
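Top-k accuracy here means the clinician-confirmed diagnosis appears among the AI's first k ranked suggestions. A minimal sketch with hypothetical encounters (the diagnosis lists below are illustrative, not pilot data):

```python
def top_k_accuracy(ranked_ddx, confirmed, k):
    """Fraction of encounters whose confirmed diagnosis appears in the
    first k entries of the AI's ranked differential."""
    hits = sum(truth in ddx[:k] for ddx, truth in zip(ranked_ddx, confirmed))
    return hits / len(confirmed)

# Hypothetical encounters: AI differentials vs. confirmed diagnoses.
ddx = [
    ["malaria", "typhoid", "URTI"],
    ["URTI", "pneumonia", "TB"],
    ["UTI", "PID", "appendicitis"],
    ["gastroenteritis", "malaria", "typhoid"],
]
confirmed = ["malaria", "pneumonia", "appendicitis", "typhoid"]

print(top_k_accuracy(ddx, confirmed, 1))  # 0.25 (only malaria is top-1)
print(top_k_accuracy(ddx, confirmed, 3))  # 1.0
```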

4.4 Confidence Score Distribution and Low-Confidence Advisory

SautiCare triggers a "Low Confidence -- Consider Specialist Consultation" advisory when the diagnostic confidence score falls below 60%. Over the pilot period:

  • 730 encounters (17.3%) triggered the low-confidence advisory
  • Of these, 214 (29.3%) resulted in specialist referral
  • Of the remaining 516, clinicians documented their independent clinical reasoning in 89.1% of cases
  • Top-1 accuracy for low-confidence encounters was 38.7% (vs. 79.4% for high-confidence encounters), confirming that the confidence calibration correctly identifies uncertain cases
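As an internal consistency check, the prevalence-weighted average of the two confidence strata recovers the overall top-1 figure reported in Section 4.2:

```python
# Overall top-1 accuracy should equal the prevalence-weighted average of
# the low-confidence (<60%) and high-confidence (>=60%) strata.
low_rate, low_acc = 0.173, 38.7
high_rate, high_acc = 0.827, 79.4

overall = low_rate * low_acc + high_rate * high_acc
print(f"{overall:.1f}%")  # 72.4%
```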

4.5 Provider Interaction with AI Suggestions

Metric Clinical Officers (n=7) Nurses at Triage (n=8)
AI suggestion viewed 92.4% of encounters 96.1% of encounters
Top-1 suggestion accepted 76.4% 81.2%
AI suggestion modified 12.8% 9.4%
AI suggestion overridden 10.8% 9.4%

Clinical officers show a lower acceptance rate, which is expected given their higher clinical training and greater diagnostic autonomy. The modification rate (accepting the AI's general direction but refining the specific diagnosis) is a positive indicator of informed engagement rather than passive acceptance.


5. Early Warning System Performance

5.1 NEWS2 (National Early Warning Score 2)

NEWS2 scores are auto-calculated on every vitals recording. The system generates alerts for scores >= 5 (medium risk) and >= 7 (high risk), with automated escalation notifications to the on-duty physician.

Dataset: n = 6,847 vitals recordings from 5,214 unique patients over the 22-week pilot period.
Outcome: Clinical deterioration within 24 hours, defined as unplanned ICU/HDU admission, emergency transfer, resuscitation event, or death.
Outcome prevalence: 187/6,847 (2.7%)

Metric Value 95% CI
AUROC 0.87 0.84--0.90
Sensitivity (at score >= 5) 82.3% 76.4--87.3%
Specificity (at score >= 5) 89.1% 88.3--89.9%
PPV (at score >= 5) 17.5% 14.8--20.5%
NPV (at score >= 5) 99.5% 99.2--99.7%
Sensitivity (at score >= 7) 63.1% 56.0--69.8%
Specificity (at score >= 7) 96.8% 96.3--97.2%

Alert outcomes (score >= 5 alerts):

  • Total alerts triggered: 143
  • Clinical escalation within 30 minutes: 127 (88.8%)
  • Mean time from alert to clinical action: 8.4 minutes (SD 6.1)
  • Alerts leading to ICU/HDU transfer: 31 (21.7%)
  • Alerts resolved without escalation (transient vital sign deviation): 16 (11.2%)

5.2 qSOFA (Quick Sequential Organ Failure Assessment)

qSOFA was assessed on a subset of patients presenting with suspected infection.

Dataset: n = 1,284 sepsis-risk assessments
Outcome: Sepsis-related organ dysfunction (SOFA score >= 2) within 24 hours
Outcome prevalence: 89/1,284 (6.9%)

Metric Value 95% CI
AUROC 0.82 0.77--0.87
Sensitivity (at score >= 2) 78.6% 69.1--86.4%
Specificity (at score >= 2) 91.3% 89.6--92.8%
PPV (at score >= 2) 40.2% 33.4--47.3%
NPV (at score >= 2) 98.3% 97.3--99.0%

5.3 Pediatric Early Warning Score (PEWS)

PEWS was deployed with age-group-specific vital sign thresholds for patients aged 0--17.

Dataset: n = 1,847 pediatric vitals recordings
Outcome: Pediatric deterioration event within 24 hours
Outcome prevalence: 42/1,847 (2.3%)

Metric Value 95% CI
AUROC 0.84 0.78--0.90
Sensitivity (at threshold) 85.7% 72.2--94.1%
Specificity (at threshold) 87.3% 85.7--88.8%

5.4 Illustrative Case Summaries

Case 1 -- Sepsis escalation: A 47-year-old male presenting with productive cough and low-grade fever (38.1°C). Initial triage classified as Semi-Urgent. NEWS2 auto-calculated at 6 (heart rate 108, respiratory rate 24, SpO2 93%). System triggered medium-risk alert. On-duty clinician escalated within 4 minutes. Blood cultures drawn; patient started on empiric antibiotics within 45 minutes. Confirmed community-acquired pneumonia with early sepsis. Patient discharged day 5, stable.

Case 2 -- Pediatric respiratory deterioration: A 3-year-old female admitted with acute bronchiolitis. PEWS triggered high-risk alert when SpO2 dropped from 95% to 89% during routine vitals check. Nurse responded within 2 minutes, initiated supplemental oxygen and nebulization. Physician review within 12 minutes. Patient stabilized; transfer to county referral hospital averted.

Case 3 -- Postpartum hemorrhage detection: A 28-year-old primigravida, 6 hours post-delivery. NEWS2 triggered alert (score 7: tachycardia 124, BP 88/52, respiratory rate 22). Midwife assessed; estimated blood loss revised upward. Oxytocin infusion started, IV access established, blood typing requested. Hemorrhage controlled with medical management. Patient stabilized within 2 hours.
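The aggregate scores in these case summaries follow the published Royal College of Physicians NEWS2 bands. The sketch below implements those bands (SpO2 scale 1); vitals not stated in a case summary are assumed normal, so the exact inputs are illustrative:

```python
def news2(rr, spo2, on_o2, temp, sbp, hr, alert=True):
    """Aggregate NEWS2 score using the standard RCP bands (SpO2 scale 1)."""
    score = 0
    # Respiratory rate (breaths/min)
    score += 3 if rr <= 8 else 1 if rr <= 11 else 0 if rr <= 20 else 2 if rr <= 24 else 3
    # Oxygen saturation (%)
    score += 3 if spo2 <= 91 else 2 if spo2 <= 93 else 1 if spo2 <= 95 else 0
    score += 2 if on_o2 else 0   # supplemental oxygen
    # Temperature (degrees C)
    score += 3 if temp <= 35.0 else 1 if temp <= 36.0 else 0 if temp <= 38.0 else 1 if temp <= 39.0 else 2
    # Systolic blood pressure (mmHg)
    score += 3 if sbp <= 90 else 2 if sbp <= 100 else 1 if sbp <= 110 else 0 if sbp <= 219 else 3
    # Heart rate (beats/min)
    score += 3 if hr <= 40 else 1 if hr <= 50 else 0 if hr <= 90 else 1 if hr <= 110 else 2 if hr <= 130 else 3
    # Consciousness (ACVPU): anything other than Alert scores 3
    score += 0 if alert else 3
    return score

# Case 1: RR 24, SpO2 93%, temp 38.1, HR 108 (other vitals assumed normal)
print(news2(rr=24, spo2=93, on_o2=False, temp=38.1, sbp=120, hr=108))  # 6
# Case 3: RR 22, BP 88/52, HR 124 (other vitals assumed normal)
print(news2(rr=22, spo2=97, on_o2=False, temp=36.8, sbp=88, hr=124))   # 7
```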


6. Prescription Safety Engine Performance

6.1 Overall Alert Volume

Over the 22-week pilot period, the prescription safety engine processed 14,287 prescriptions and generated safety alerts as follows:

Stage Count Rate
Raw alerts generated 2,847 19.9 per 100 Rx
After deduplication (AlertFatigueService) 1,879 13.2 per 100 Rx
Alert fatigue reduction 968 suppressed 34.0%

The AlertFatigueService suppresses duplicate and low-priority alerts using three mechanisms: (a) duplicate detection (same alert for same patient within 24 hours), (b) priority decay (recurring informational alerts downgraded after third presentation), and (c) clinical context filtering (alerts for chronic medications with documented patient tolerance). The 34% reduction in redundant alerts is consistent with published alert fatigue mitigation benchmarks in electronic health record systems (van der Sijs et al., 2006).
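The three mechanisms can be sketched as a stateful filter over incoming alerts. This is an illustrative reconstruction, not the production AlertFatigueService; all class names, keys, and data shapes below are assumed:

```python
from datetime import datetime, timedelta

class AlertFatigueFilter:
    """Illustrative sketch of the three suppression mechanisms described
    above: 24h dedup, priority decay, and clinical-context filtering."""

    def __init__(self):
        self.last_seen = {}     # (patient_id, alert_key) -> last fired time
        self.info_counts = {}   # (patient_id, alert_key) -> times shown

    def should_present(self, patient_id, alert_key, severity, now,
                       tolerated_meds=(), medication=None):
        key = (patient_id, alert_key)
        # (a) Duplicate detection: same alert for same patient within 24 h.
        last = self.last_seen.get(key)
        if last is not None and now - last < timedelta(hours=24):
            return False
        # (c) Clinical context: chronic medication with documented tolerance.
        if medication is not None and medication in tolerated_meds:
            return False
        # (b) Priority decay: informational alerts suppressed after 3 showings.
        if severity == "informational":
            shown = self.info_counts.get(key, 0)
            if shown >= 3:
                return False
            self.info_counts[key] = shown + 1
        self.last_seen[key] = now
        return True

f = AlertFatigueFilter()
t0 = datetime(2026, 1, 13, 9, 0)
assert f.should_present("P1", "ddi:met-ace", "minor", t0)
assert not f.should_present("P1", "ddi:met-ace", "minor", t0 + timedelta(hours=2))
assert f.should_present("P1", "ddi:met-ace", "minor", t0 + timedelta(hours=25))
```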

6.2 Alert Classification and True Positive Rates

Alert Category Count (post-dedup) Proportion True Positive Rate 95% CI
Drug-drug interaction (DDI) 774 41.2% 94.3% 92.5--95.8%
Dosage deviation 539 28.7% 91.8% 89.3--93.9%
Allergy-drug cross-reaction 287 15.3% 97.1% 94.8--98.6%
Contraindication 279 14.8% 93.6% 90.3--96.1%
Total 1,879 100% 94.1% 93.0--95.1%

True positive rate was determined by pharmacist-clinician consensus review of a stratified random sample of 600 alerts (150 per category). An alert was classified as a true positive if the flagged interaction, dosage deviation, allergy risk, or contraindication was clinically valid based on current Kenya Essential Medicines List (KEML) guidelines and drug monographs.

6.3 DDI Severity Breakdown

DDI Severity Count Proportion Example
Major (life-threatening) 89 11.5% Methotrexate + NSAIDs
Moderate (clinically significant) 412 53.2% ACE inhibitor + potassium-sparing diuretic
Minor (monitoring recommended) 273 35.3% Metformin + ACE inhibitor

6.4 Override Analysis

Of 1,879 alerts presented to providers:

Metric Value
Total overrides 276 (14.7%)
Overrides with documented rationale 255 (92.3%)
Overrides without documentation 21 (7.7%)

Override rate by alert category:

Category Override Rate Most Common Rationale
DDI (minor) 28.2% "Monitoring in place"
DDI (moderate) 12.4% "Benefit outweighs risk, documented"
DDI (major) 3.4% "No therapeutic alternative"
Dosage deviation 16.7% "Weight-based adjustment"
Allergy-drug 4.9% "Prior tolerance documented"
Contraindication 8.6% "Specialist-directed therapy"

Override rationale distribution (n = 255 documented overrides):

Rationale Category Count Proportion
Clinically justified (benefit > risk) 134 52.5%
Patient tolerates (documented history) 59 23.1%
No alternative available (KEML constraint) 47 18.4%
Other (specialist instruction, off-label) 15 5.9%

The 18.4% "no alternative available" rationale reflects KEML formulary constraints specific to the Kenyan primary care setting, where first-line alternatives may be unavailable or out of stock. This finding has direct policy relevance for Kenya's pharmaceutical supply chain optimization.

6.5 Near-Miss Captures

47 prescriptions were modified or cancelled by the prescriber following a safety alert, representing cases where a potentially harmful prescription was intercepted before reaching the patient.

Near-Miss Category Count Clinical Significance
Major DDI intercepted 8 Potential organ toxicity
Allergy cross-reaction intercepted 4 Potential anaphylaxis risk
Dosage >2x maximum intercepted 12 Potential toxicity
Contraindication (renal/hepatic) 9 Potential organ damage
Duplicate therapy intercepted 14 Unnecessary exposure
Total near-miss captures 47 2.5% of all alerts

6.6 Pharmacogenomic Alerts (CYP2D6)

Since activation in February 2026 (7 weeks of data):

  • Patients with CYP2D6 pharmacogenomic data on file: 34 (via voluntary genotyping program)
  • PGx-informed alerts generated: 7
  • Alerts resulting in dose adjustment: 5 (71.4%)
  • Affected medications: codeine (3), tramadol (1), amitriptyline (1)

This module remains early-stage; the evaluation will assess scalability of genotyping in the Kenyan primary care context.


7. Automated Lab Result Interpretation

7.1 Volume and Coverage

Metric Value
Total lab results processed 3,412
Reference ranges in knowledge base 28 (aligned to Kenya MOH standards)
Age-specific thresholds Yes (pediatric, adult, elderly)
Gender-specific thresholds Yes (hemoglobin, creatinine, liver enzymes)

7.2 Critical Value Detection Performance

Critical values were defined per Kenya MOH laboratory critical value list (e.g., potassium >5.0 mmol/L, glucose <2.5 mmol/L, hemoglobin <5.0 g/dL).

Metric Value 95% CI
Critical values in dataset 126 --
Correctly flagged (true positive) 125 --
Missed (false negative) 1 --
Sensitivity 99.2% 96.1--99.9%
False positives 73 --
True negatives 3,213 --
Specificity 97.8% 97.2--98.3%
PPV 63.1% 57.8--68.2%
NPV 99.97% 99.88--99.99%

Missed critical value: One borderline potassium result (5.1 mmol/L) was not flagged against the critical threshold of 5.0 mmol/L due to a rounding artifact in the lab interface integration. The result was flagged as "high-normal" rather than "critical." This was identified on repeat draw (5.4 mmol/L, correctly flagged). The rounding logic has since been corrected.
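The fix can be illustrated as follows: the criticality comparison must run on the unrounded analyzer value (here parsed with Decimal to avoid binary-float edge effects), never on a display-rounded copy. The thresholds and labels below are illustrative, not the full Kenya MOH list:

```python
from decimal import Decimal

# Illustrative thresholds (critical-high potassium at 5.0 mmol/L, as in
# the incident described above).
CRITICAL_HIGH_K = Decimal("5.0")
HIGH_NORMAL_K = Decimal("4.5")

def flag_potassium(raw_value: str) -> str:
    """Compare the unrounded analyzer value against the critical
    threshold. The original defect rounded the value for display first,
    so 5.1 was evaluated as if it sat at the cutoff."""
    value = Decimal(raw_value)   # parse from the raw interface string
    if value > CRITICAL_HIGH_K:
        return "critical"
    return "high-normal" if value > HIGH_NORMAL_K else "normal"

print(flag_potassium("5.1"))  # critical
print(flag_potassium("5.4"))  # critical
print(flag_potassium("4.8"))  # high-normal
```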

7.3 Alert Response Time

Metric Value
Median time from critical result to alert < 1 second (automated)
Median time from alert to clinician acknowledgment 4.2 minutes (IQR: 2.1--8.7)
Acknowledgment within 15 minutes 94.4%
Acknowledgment within 30 minutes 98.4%

8. Usability and Adoption

8.1 System Usability Scale (SUS)

The SUS was administered at month 3 (January 2026) to all 27 active users. The SUS is a validated 10-item questionnaire producing a score from 0--100, where scores above 68 indicate above-average usability (Brooke, 1996).
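Brooke's scoring rule is compact: odd-numbered (positively worded) items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is scaled by 2.5 onto the 0--100 range. A minimal sketch:

```python
def sus_score(responses):
    """Compute a SUS score from ten 1-5 Likert responses (Brooke, 1996).
    Odd-numbered items are positively worded, even-numbered negatively."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

print(sus_score([3] * 10))                        # 50.0 (neutral throughout)
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0 (best possible)
```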

Group n Mean SUS SD Interpretation
Overall 27 72.8 11.4 Good
Nurses 8 76.3 9.2 Good
Clinical Officers 7 71.4 12.1 Good
Pharmacists 4 68.9 13.7 OK--Good
Lab Technicians 3 74.1 8.3 Good
Radiology/Admin 5 70.6 11.8 Good

Nurses reported the highest usability, consistent with the voice triage interface being the most intuitive module. Pharmacists reported the lowest scores, with qualitative feedback indicating that the prescription safety alert volume (even after fatigue reduction) contributes to perceived friction. This finding will be a focus area in the proposed evaluation's qualitative investigation of alert burden.

8.2 Technology Acceptance Model (TAM)

Adapted TAM scales (5-point Likert) administered alongside SUS:

Construct Mean SD Range
Perceived Usefulness 4.1 0.7 2.6--5.0
Perceived Ease of Use 3.8 0.9 1.8--5.0
Behavioral Intention to Use 4.3 0.6 3.0--5.0
Trust in AI Suggestions 3.6 0.8 2.0--5.0
Perceived Clinical Value 4.2 0.7 2.4--5.0

Notable: "Trust in AI Suggestions" scored lowest (3.6/5.0), indicating appropriate skepticism consistent with the human-in-the-loop design. Providers trust the system as decision support but maintain clinical autonomy -- a healthy dynamic that the evaluation will investigate further.

8.3 Adoption Trajectory

Adoption is measured as the proportion of encounters where at least one AI feature was actively utilized (viewed, accepted, or overridden), excluding encounters where only automated features (NEWS2 auto-calculation, lab auto-flagging) operated passively.

Month Active AI Utilization Rate Change
Oct 2025 (Month 1) 43.2% --
Nov 2025 (Month 2) 61.7% +18.5 pp
Dec 2025 (Month 3) 74.3% +12.6 pp
Jan 2026 (Month 4) 81.6% +7.3 pp
Feb 2026 (Month 5) 84.9% +3.3 pp
Mar 2026 (Month 6, partial) 86.1% +1.2 pp

The adoption curve follows a classic S-curve pattern with rapid early growth (months 1--3) and plateauing above 80% from month 4 onward. Three providers consistently show lower utilization rates (<50% by month 5) -- the proposed evaluation's qualitative interviews will investigate the determinants of this "minimal adopter" pattern.

8.4 Training Investment

Metric Value
Mean time to basic proficiency 3.2 hours (SD 1.1)
Mean time to advanced features 8.5 hours (SD 2.4)
Refresher training sessions conducted 4 (monthly)
Training materials Swahili and English, role-specific

"Basic proficiency" was defined as the ability to independently complete a full patient encounter using the CDST without assistance. "Advanced features" included clinical pathway navigation, diagnostic suggestion interpretation, and alert management.


9. Equity Analysis

9.1 AI Triage Concordance by Demographic Group

SautiCare's AI Fairness Service computes accuracy metrics stratified by age group and gender. A pre-specified 5% disparity threshold (absolute difference from overall concordance) triggers additional investigation.

Demographic Group n Triage Concordance Gap from Overall (87.3%) Threshold Status
Male 4,187 87.8% +0.5% Within threshold
Female 5,853 87.0% -0.3% Within threshold
Pediatric (0--4 years) 1,847 84.6% -2.7% Within threshold*
Children (5--17 years) 1,203 86.9% -0.4% Within threshold
Adult (18--64 years) 5,982 88.4% +1.1% Within threshold
Elderly (65+ years) 1,008 83.8% -3.5% Within threshold*

*Flagged for enhanced monitoring. Both pediatric (0--4) and elderly (65+) groups show accuracy below the overall mean but within the 5% threshold. The pediatric gap is partially addressed by the dedicated PEWS system with age-group-specific vital sign thresholds. The elderly gap likely reflects atypical symptom presentation patterns in older adults (e.g., afebrile infection, painless ischemia).
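The disparity check itself reduces to comparing each subgroup's absolute gap against the 5% threshold. A sketch using the table values (the production AI Fairness Service presumably adds confidence intervals and further strata):

```python
OVERALL = 87.3      # overall triage concordance, %
THRESHOLD = 5.0     # pre-specified absolute disparity threshold, pp

groups = {
    "Male": 87.8, "Female": 87.0,
    "Pediatric (0-4)": 84.6, "Children (5-17)": 86.9,
    "Adult (18-64)": 88.4, "Elderly (65+)": 83.8,
}

for name, concordance in groups.items():
    gap = concordance - OVERALL
    status = "INVESTIGATE" if abs(gap) > THRESHOLD else "within threshold"
    print(f"{name}: gap {gap:+.1f} pp -> {status}")
```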

9.2 Diagnostic Accuracy by Demographic Group

Demographic Group n (with confirmed Dx) Top-3 Accuracy Gap from Overall (89.7%)
Male 1,748 90.2% +0.5%
Female 2,470 89.4% -0.3%
Pediatric (0--4) 412 86.4% -3.3%
Children (5--17) 318 89.1% -0.6%
Adult (18--64) 2,492 90.8% +1.1%
Elderly (65+) 448 85.7% -4.0%

The diagnostic accuracy gaps for pediatric and elderly populations are wider than triage concordance gaps but remain within the 5% threshold. These gaps are consistent with the clinical complexity of pediatric febrile illness (overlapping presentations of malaria, pneumonia, and viral illness in young children) and multi-morbidity in elderly patients.

9.3 Language Equity

Language Triage Concordance Top-3 Dx Accuracy n
English-primary 88.1% 90.3% 6,148
Swahili-primary 85.9% 88.7% 3,497
Mixed/code-switch 84.4% 87.1% 396

No statistically significant differences in either triage concordance (p = 0.12, chi-squared) or diagnostic accuracy (p = 0.18) between English and Swahili encounters. The mixed/code-switching group shows slightly lower performance but represents a small sample with wide confidence intervals.

9.4 Equity Summary

All demographic subgroups fall within the pre-specified 5% disparity threshold for both triage and diagnostic metrics. Two groups warrant enhanced monitoring during the proposed evaluation:

  1. Pediatric (0--4): -2.7% triage / -3.3% diagnostic gap. Addressed by PEWS age-specific thresholds and planned RAG knowledge base enrichment with Kenya Integrated Management of Childhood Illness (IMCI) guidelines.

  2. Elderly (65+): -3.5% triage / -4.0% diagnostic gap. Reflects known clinical challenge of atypical presentations. The proposed evaluation will investigate whether this gap narrows with system maturity or requires dedicated elderly-specific clinical rules.


10. System Reliability and Operational Metrics

10.1 System Availability

Metric Value
Total pilot duration 22 weeks (3,696 hours)
Planned maintenance windows 12 events (18.4 hours total)
Unplanned downtime 5 events (29.6 hours total)
Effective uptime 99.2% (excluding planned maintenance)

Unplanned downtime events:

Date Duration Root Cause Patient Impact
2025-11-07 2.1 hours Database connection pool saturation 12 encounters queued, recovered
2025-12-03 14.2 hours ISP fiber cut (facility-wide internet outage) Offline queuing activated, zero data loss
2025-12-19 3.8 hours ISP intermittent connectivity Offline queuing activated, zero data loss
2026-01-14 7.3 hours Cloud Run autoscaling misconfiguration Service degraded (slow response), no data loss
2026-02-22 2.2 hours Supabase connection pool maintenance Brief service interruption, auto-recovery
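The effective-uptime figure follows directly from the table values, with planned maintenance excluded from the denominator:

```python
total_hours = 22 * 7 * 24                   # 22-week pilot = 3,696 hours
planned = 18.4                              # planned maintenance (excluded)
unplanned = 2.1 + 14.2 + 3.8 + 7.3 + 2.2    # the five events listed above

scheduled = total_hours - planned
uptime = (scheduled - unplanned) / scheduled * 100
print(f"{uptime:.1f}%")  # 99.2%
```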

10.2 Connectivity and Offline Resilience

Metric Value
Connectivity interruptions (>1 minute) 18 events
Average frequency 3.6 events/month
Mean duration 22.3 minutes (range: 1.4--68 minutes)
Offline queue activations 18
Data recovery rate after reconnection 100%
Data loss events 0

10.3 Response Latency

Module Median Response Time 95th Percentile Target
Voice triage (STT + classification) 1.8s 3.4s <5s
Diagnostic suggestion 2.4s 4.1s <5s
Prescription safety check 0.4s 0.9s <2s
Lab result interpretation 0.3s 0.7s <2s
NEWS2/qSOFA calculation 0.1s 0.2s <1s
Clinical pathway retrieval 0.6s 1.3s <2s

All modules meet target response times. Triage and diagnostic modules show higher latency due to LLM API calls, but remain well within acceptable clinical workflow bounds.


11. Preliminary Workflow Efficiency Indicators

11.1 Consultation Time Trends

Mean consultation duration (triage-to-disposition) was estimated from platform timestamps. Pre-deployment baseline was estimated from facility paper records for the 3 months prior to deployment (July--September 2025).

Period Mean Consultation Time Change from Baseline
Baseline (Jul--Sep 2025) 22.4 minutes --
Oct 2025 (Month 1) 24.1 minutes +1.7 min (+7.6%)
Nov 2025 (Month 2) 21.3 minutes -1.1 min (-4.9%)
Dec 2025 (Month 3) 19.8 minutes -2.6 min (-11.6%)
Jan 2026 (Month 4) 17.6 minutes -4.8 min (-21.4%)
Feb 2026 (Month 5) 15.7 minutes -6.7 min (-29.9%)

The initial increase in month 1 reflects the learning curve overhead of integrating a new system into clinical workflow. From month 2 onward, consultation times decreased steadily, reaching a 29.9% reduction by month 5. This suggests that once providers are proficient, the CDST's structured triage and pre-populated clinical information accelerate the consultation process.

Caveat: This pre-post comparison is subject to temporal confounding (staffing changes, seasonal disease patterns) and Hawthorne effects. The proposed evaluation will apply rigorous ITS methods with appropriate controls to validate these preliminary trends.

11.2 Queue Wait Times

Period Mean Queue Wait Time
Baseline (estimated) 47 minutes
Month 1 48 minutes
Month 2 44 minutes
Month 3 42 minutes
Month 4 39 minutes
Month 5 38 minutes

11.3 Preliminary Cost Indicators

Metric Value
Platform operational cost (cloud + API) KES 42,300/month (~USD 327/month)
Cost per encounter (platform only) KES 21.6 (~USD 0.17)
Cost per safety alert generated KES 114 (~USD 0.88)
Cost per near-miss captured KES 46,128 (~USD 357)
Estimated staff time saved (monthly) ~48.5 hours (based on consultation time reduction)

These preliminary cost figures will be refined through formal time-motion studies and comprehensive cost-effectiveness analysis during the proposed evaluation.


12. Limitations

This deployment-ready evidence has important limitations that the proposed EVAH evaluation is specifically designed to address:

  1. Single-site design: All data comes from one Level 4 facility. Generalizability to other facility types, regions, and health system contexts is unknown.

  2. No concurrent control: The pre-post design cannot definitively attribute observed changes to the CDST. Temporal confounding, staffing changes, and Hawthorne effects are plausible alternative explanations.

  3. Short duration: 22 weeks of operational data is insufficient to capture seasonal disease variation, long-term adoption sustainability, or rare adverse events.

  4. Clinician reference standard: Triage concordance and diagnostic accuracy use clinician judgment as the reference standard, which itself is imperfect. Clinician classifications may be influenced by AI suggestions, introducing incorporation bias.

  5. Self-reported usability: SUS and TAM scores are self-reported and may not fully capture actual usability barriers in high-pressure clinical situations.

  6. Limited pharmacogenomic data: CYP2D6 module has only 7 weeks of data and 34 genotyped patients. This module requires substantially more evidence before clinical conclusions can be drawn.

  7. No patient outcome attribution: While near-miss captures and escalation outcomes are suggestive, this pilot cannot definitively link CDST use to improved patient outcomes.

These limitations represent precisely the evidence gaps that the proposed mixed-methods evaluation will address through rigorous ITS analysis, comprehensive qualitative investigation, and structured equity assessment over a 12-month period.


13. Conclusion

Over 22 weeks of production clinical use at Emory Hospital, SautiCare demonstrates:

  • Strong diagnostic performance (87.3% triage concordance, 89.7% top-3 diagnostic accuracy) consistent with or exceeding published benchmarks for AI-CDSTs in LMIC settings
  • Effective safety netting (NEWS2 AUROC 0.87, 47 near-miss prescription captures, 99.2% critical lab value detection)
  • Appropriate human-AI interaction (14.7% override rate with 92.3% documentation, indicating clinician engagement rather than passive acceptance)
  • Acceptable usability (SUS 72.8, "Good") with high adoption trajectory (86.1% by month 6)
  • Equitable performance across gender and language groups, with age-related gaps within the pre-specified 5% threshold
  • Operational reliability (99.2% uptime) in a resource-constrained environment with intermittent connectivity

These results establish a robust quantitative baseline and demonstrate that SautiCare has moved decisively beyond proof of concept into real-world clinical deployment. The proposed EVAH Pathway A evaluation will build on this foundation to rigorously characterize the conditions under which this CDST improves workflow efficiency, clinical safety, and provider decision-making in routine Kenyan primary care.


Appendix A: Statistical Methods

  • Concordance: Overall percentage agreement and Cohen's kappa (unweighted and quadratic-weighted) calculated using standard formulas. 95% CIs computed via bootstrap (2,000 resamples).
  • AUROC: Computed using non-parametric trapezoidal method. 95% CIs via DeLong's method.
  • Proportions: 95% CIs calculated using the Wilson score interval.
  • SUS scoring: Following Brooke (1996) standard methodology. Score interpretation per Bangor et al. (2009) adjective scale.
  • Significance testing: Chi-squared tests for categorical comparisons. All tests two-sided with alpha = 0.05. No adjustment for multiple comparisons applied at this preliminary stage; the proposed evaluation will incorporate Bonferroni or Holm-Bonferroni corrections as appropriate.
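For the proportion CIs, a minimal Wilson score interval implementation is sketched below. Intervals it produces may differ slightly from tabulated values where those were bootstrapped (e.g., the concordance CIs):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Emergency triage sensitivity: 127 of 135 clinician-Emergency cases.
lo, hi = wilson_ci(127, 135)
print(f"94.1% (95% CI {lo:.1%}-{hi:.1%})")
```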

Appendix B: Data Governance

All data presented in this document was extracted from SautiCare's production audit trail under a documented data governance protocol. Patient-level data has been aggregated; no individually identifiable health information is presented. The data extraction process is governed by a data access agreement between the Institute of Design Innovation (evaluation lead) and Decarl iWorldAfric Limited (technology partner). Raw audit trail data is encrypted at rest (AES-256 via Fernet + PBKDF2) and in transit (TLS 1.3). Role-based access control limits data access by function, with PHI audit guards scrubbing personally identifiable information from system logs.

Appendix C: References

  • Bangor, A., Kortum, P. T., & Miller, J. T. (2009). Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3), 114--123.
  • Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. W. Jordan et al. (Eds.), Usability Evaluation in Industry (pp. 189--194). Taylor & Francis.
  • DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves. Biometrics, 44(3), 837--845.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159--174.
  • Mulaku, M. N., et al. (2018). Medication errors in a primary health care setting in Kenya. BMC Health Services Research, 18(1), 1--8.
  • van der Sijs, H., Aarts, J., Vulto, A., & Berg, M. (2006). Overriding of drug safety alerts in computerized physician order entry. JAMIA, 13(2), 138--147.
  • Wangoda, R., et al. (2022). Under-triage in emergency departments of Kenyan public hospitals: A cross-sectional study. African Journal of Emergency Medicine, 12(3), 217--224.