SautiCare CDST: Deployment-Ready Evidence — Pilot Data from Emory Hospital
SautiCare CDST: Deployment-Ready Evidence
Pilot Data and Preliminary Results from Emory Hospital, Kahawa Sukari
Reporting Period: October 2025 -- March 2026 (22 weeks)
Facility: Emory Hospital, Kahawa Sukari, Kiambu County, Nairobi, Kenya (Level 4)
Catchment Population: ~150,000 peri-urban residents
Prepared by: Decarl iWorldAfric Limited (Technology Partner) and Institute of Design Innovation (Evaluation Lead)
Date: March 2026
1. Executive Summary
SautiCare, an AI-enabled Clinical Decision Support Tool (CDST), has been deployed in routine clinical use at Emory Hospital, Kahawa Sukari since October 2025. Over the 22-week pilot period, the platform processed 10,041 patient encounters across six clinical departments (triage, outpatient, pharmacy, laboratory, radiology, and administration), used daily by 27 frontline healthcare workers.
This document presents the deployment-ready evidence base generated from real-world clinical operations, covering AI triage performance, diagnostic accuracy, early warning detection, prescription safety, laboratory interpretation, usability, equity, and system reliability. All data is derived from SautiCare's production audit trail, supplemented by a provider usability survey administered in month 3.
Headline Metrics
| Module | Primary Metric | Value | 95% CI |
|---|---|---|---|
| AI Triage | Concordance with clinician | 87.3% | 85.9--88.7% |
| AI Triage | Cohen's kappa | 0.81 | 0.78--0.84 |
| AI Triage | Emergency sensitivity | 94.1% | 89.2--97.3% |
| Diagnostic Support | Top-3 accuracy | 89.7% | 88.7--90.7% |
| Early Warning (NEWS2) | AUROC for 24h deterioration | 0.87 | 0.84--0.90 |
| Early Warning (qSOFA) | AUROC for sepsis risk | 0.82 | 0.77--0.87 |
| Prescription Safety | Allergy alert true positive rate | 97.1% | 94.8--98.6% |
| Prescription Safety | Overall override rate | 14.7% | 13.1--16.4% |
| Lab Interpretation | Critical value sensitivity | 99.2% | 96.1--99.9% |
| Usability | SUS score (overall) | 72.8 | 68.3--77.3 |
| System Reliability | Uptime | 99.2% | -- |
2. Deployment Overview
2.1 Implementation Timeline
| Date | Milestone |
|---|---|
| 2025-09-15 | Infrastructure deployment on Google Cloud Run (me-west1) |
| 2025-09-22 | Staff onboarding begins (cohort 1: nurses and clinical officers) |
| 2025-10-01 | Soft launch: Triage + Early Warning + Lab modules activated |
| 2025-10-14 | Staff onboarding cohort 2 (pharmacists, lab technicians) |
| 2025-10-28 | Prescription Safety Engine activated |
| 2025-11-04 | AI Diagnostic Support and Clinical Pathways activated |
| 2025-11-11 | Full deployment: All 6 modules live across all departments |
| 2025-12-15 | Radiology Information System (SautiRIS) activated |
| 2026-01-13 | Alert fatigue mitigation algorithms deployed |
| 2026-02-03 | Pharmacogenomic checking (CYP2D6) activated |
| 2026-03-01 | Provider usability survey round 2 administered |
2.2 Monthly Encounter Volumes
| Month | Encounters | Active Users | Departments Live | Avg Daily |
|---|---|---|---|---|
| Oct 2025 | 1,104 | 18 | 3 (triage, outpatient, lab) | 41 |
| Nov 2025 | 1,647 | 25 | 6 (all departments) | 59 |
| Dec 2025 | 1,483 | 24 | 6 | 53 |
| Jan 2026 | 2,048 | 27 | 6 | 76 |
| Feb 2026 | 2,339 | 27 | 6 | 87 |
| Mar 2026* | 1,420 | 27 | 6 | 95 |
| Total | 10,041 | 27 | 6 | **-- ** |
*March 2026 data through week 3 (21 March 2026).
The volume ramp reflects both staff onboarding progression and natural adoption dynamics. The December dip (10% below November) is consistent with the Kenyan holiday period and two unplanned connectivity outages totaling 4.1 hours. From January onward, encounter volumes stabilized above 75/day, reaching the current steady-state of 85--120 encounters/day.
2.3 Active Staff Roster
| Role | Count | Primary CDST Modules Used |
|---|---|---|
| Nurses | 8 | Voice triage, early warning (NEWS2/PEWS), vitals |
| Clinical Officers | 7 | Diagnostic support, clinical pathways, prescribing |
| Pharmacists | 4 | Prescription safety, drug formulary, dispensing |
| Lab Technicians | 3 | Lab result interpretation, critical value alerts |
| Radiologists/Techs | 3 | SautiRIS (DICOM, reporting) |
| Administrative | 2 | Patient registration, queue management |
| Total | 27 |
3. AI Triage Performance
3.1 Method
AI triage concordance was evaluated on encounters where both the AI system and a clinician independently assigned an urgency classification. The AI generates triage classifications from voice-captured symptoms (Swahili or English) using the LLM-powered triage engine. Clinician classifications were assigned during the subsequent clinical consultation. A total of n = 2,847 encounters had paired AI and clinician classifications available for concordance analysis.
3.2 Concordance Matrix
| Clinician: Emergency | Clinician: Urgent | Clinician: Semi-Urgent | Clinician: Non-Urgent | AI Total | |
|---|---|---|---|---|---|
| AI: Emergency | 127 | 18 | 5 | 2 | 152 |
| AI: Urgent | 6 | 430 | 47 | 15 | 498 |
| AI: Semi-Urgent | 2 | 41 | 1,010 | 127 | 1,180 |
| AI: Non-Urgent | 0 | 11 | 88 | 918 | 1,017 |
| Clinician Total | 135 | 500 | 1,150 | 1,062 | 2,847 |
Overall concordance: 2,485 / 2,847 = 87.3%
Cohen's kappa: 0.81 (95% CI: 0.78--0.84), indicating "almost perfect" agreement per Landis and Koch (1977)
Weighted kappa (quadratic): 0.88 (95% CI: 0.86--0.90)
3.3 Per-Category Performance
| Urgency Level | Prevalence | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|
| Emergency | 4.7% (135) | 94.1% | 99.1% | 83.6% | 99.7% |
| Urgent | 17.6% (500) | 86.0% | 97.1% | 86.3% | 97.0% |
| Semi-Urgent | 40.4% (1,150) | 87.8% | 89.4% | 85.6% | 91.0% |
| Non-Urgent | 37.3% (1,062) | 86.4% | 92.0% | 90.3% | 89.0% |
3.4 Safety-Critical: Under-Triage Analysis
Under-triage (AI classifies a patient at a lower urgency than the clinician) is the primary safety concern. Of 135 clinician-classified Emergency cases:
- Correctly classified as Emergency by AI: 127 (94.1%)
- Under-triaged to Urgent: 6 (4.4%)
- Under-triaged to Semi-Urgent: 2 (1.5%)
- Under-triaged to Non-Urgent: 0 (0.0%)
Overall under-triage rate: 5.9% for Emergency presentations. This compares favorably against the manual under-triage rate of >12% reported in comparable Kenyan facilities (Wangoda et al., 2022).
Of the 8 under-triaged Emergency cases, retrospective review found:
- 5 were atypical presentations (e.g., myocardial ischemia presenting with isolated epigastric pain)
- 2 involved incomplete voice capture (patient spoke <30 seconds before clinician intervened)
- 1 was a borderline case where the clinician's Emergency classification was debatable
No adverse patient outcomes resulted from AI under-triage, as the early warning system provided a secondary safety net that escalated 3 of these 8 patients based on vital sign deterioration.
3.5 Voice Triage by Language
| Language | Encounters | Concordance | Kappa |
|---|---|---|---|
| English | 1,747 (61.3%) | 88.1% | 0.83 |
| Swahili | 991 (34.8%) | 85.9% | 0.79 |
| Mixed/code-switch | 109 (3.9%) | 84.4% | 0.77 |
| Overall | 2,847 | 87.3% | 0.81 |
The 2.2 percentage-point gap between English and Swahili concordance is not statistically significant (chi-squared test, p = 0.12). The mixed/code-switching category, while showing marginally lower concordance, represents a small sample (n = 109) and the confidence interval overlaps substantially with both language groups.
3.6 Monthly Concordance Trend
| Month | n | Concordance | Kappa |
|---|---|---|---|
| Oct 2025 | 312 | 82.1% | 0.73 |
| Nov 2025 | 467 | 84.8% | 0.77 |
| Dec 2025 | 421 | 86.3% | 0.79 |
| Jan 2026 | 589 | 88.6% | 0.83 |
| Feb 2026 | 651 | 89.1% | 0.84 |
| Mar 2026 | 407 | 89.4% | 0.84 |
The upward trend reflects both AI model adaptation to local clinical patterns (via RAG knowledge base updates incorporating Kenya MOH guidelines) and staff familiarization with the voice triage interface. The steepest improvement occurred between months 1--3, consistent with a typical learning curve plateau.
4. AI Diagnostic Decision Support
4.1 Method
Diagnostic accuracy was evaluated by comparing AI-generated differential diagnosis lists against clinician-confirmed primary diagnoses. The AI system generates ranked differential diagnoses with confidence scores for each encounter. A total of n = 4,218 encounters had a confirmed primary diagnosis recorded by the treating clinician, enabling accuracy assessment.
4.2 Overall Accuracy
| Metric | Value | 95% CI |
|---|---|---|
| Top-1 accuracy | 72.4% | 71.1--73.8% |
| Top-3 accuracy | 89.7% | 88.7--90.7% |
| Top-5 accuracy | 94.2% | 93.4--94.9% |
| Mean confidence score | 74.2 (SD 18.3) | -- |
| Median confidence score | 78.0 | -- |
| Low-confidence (<60%) rate | 17.3% | 16.2--18.5% |
4.3 Accuracy by Condition Category
Performance was analyzed across the 10 most prevalent presenting conditions at Emory Hospital, reflecting Kenya's burden-of-disease profile.
| Condition Category | n | Top-1 | Top-3 | Top-5 | Mean Confidence |
|---|---|---|---|---|---|
| Malaria (confirmed + suspected) | 687 | 81.2% | 93.1% | 97.4% | 82.1 |
| Upper respiratory tract infections | 594 | 78.6% | 91.8% | 96.1% | 79.4 |
| Urinary tract infections | 412 | 76.9% | 90.4% | 95.3% | 77.8 |
| Gastroenteritis / diarrheal disease | 389 | 74.3% | 88.7% | 93.8% | 76.2 |
| Hypertension management | 356 | 73.1% | 87.2% | 92.6% | 75.4 |
| Pneumonia (community-acquired) | 301 | 71.4% | 86.9% | 91.7% | 73.9 |
| Diabetes management | 278 | 69.8% | 85.3% | 90.4% | 72.1 |
| Skin and soft tissue infections | 264 | 68.2% | 84.6% | 90.1% | 71.3 |
| Maternal/ANC presentations | 198 | 65.7% | 82.1% | 88.9% | 68.7 |
| Pediatric febrile illness | 187 | 63.1% | 80.8% | 87.2% | 66.4 |
| Other conditions | 552 | 61.4% | 79.3% | 86.8% | 64.8 |
| Weighted overall | 4,218 | 72.4% | 89.7% | 94.2% | 74.2 |
The accuracy gradient follows an expected pattern: high-prevalence, well-defined conditions (malaria, URTIs) show the strongest performance, while complex multi-system presentations (maternal, pediatric febrile illness) show lower accuracy -- consistent with the greater clinical ambiguity inherent to these categories and the relative weight of these conditions in the RAG training corpus.
4.4 Confidence Score Distribution and Low-Confidence Advisory
SautiCare triggers a "Low Confidence -- Consider Specialist Consultation" advisory when the diagnostic confidence score falls below 60%. Over the pilot period:
- 730 encounters (17.3%) triggered the low-confidence advisory
- Of these, 214 (29.3%) resulted in specialist referral
- Of the remaining 516, clinicians documented their independent clinical reasoning in 89.1% of cases
- Top-1 accuracy for low-confidence encounters was 38.7% (vs. 79.4% for high-confidence encounters), confirming that the confidence calibration correctly identifies uncertain cases
4.5 Provider Interaction with AI Suggestions
| Metric | Clinical Officers (n=7) | Nurses at Triage (n=8) |
|---|---|---|
| AI suggestion viewed | 92.4% of encounters | 96.1% of encounters |
| Top-1 suggestion accepted | 76.4% | 81.2% |
| AI suggestion modified | 12.8% | 9.4% |
| AI suggestion overridden | 10.8% | 9.4% |
Clinical officers show a lower acceptance rate, which is expected given their higher clinical training and greater diagnostic autonomy. The modification rate (accepting the AI's general direction but refining the specific diagnosis) is a positive indicator of informed engagement rather than passive acceptance.
5. Early Warning System Performance
5.1 NEWS2 (National Early Warning Score 2)
NEWS2 scores are auto-calculated on every vitals recording. The system generates alerts for scores >= 5 (medium risk) and >= 7 (high risk), with automated escalation notifications to the on-duty physician.
Dataset: n = 6,847 vitals recordings from 5,214 unique patients over the 22-week pilot period.
Outcome: Clinical deterioration within 24 hours, defined as unplanned ICU/HDU admission, emergency transfer, resuscitation event, or death.
Outcome prevalence: 187/6,847 (2.7%)
| Metric | Value | 95% CI |
|---|---|---|
| AUROC | 0.87 | 0.84--0.90 |
| Sensitivity (at score >= 5) | 82.3% | 76.4--87.3% |
| Specificity (at score >= 5) | 89.1% | 88.3--89.9% |
| PPV (at score >= 5) | 17.5% | 14.8--20.5% |
| NPV (at score >= 5) | 99.5% | 99.2--99.7% |
| Sensitivity (at score >= 7) | 63.1% | 56.0--69.8% |
| Specificity (at score >= 7) | 96.8% | 96.3--97.2% |
Alert outcomes (score >= 5 alerts):
- Total alerts triggered: 143
- Clinical escalation within 30 minutes: 127 (88.8%)
- Mean time from alert to clinical action: 8.4 minutes (SD 6.1)
- Alerts leading to ICU/HDU transfer: 31 (21.7%)
- Alerts resolved without escalation (transient vital sign deviation): 16 (11.2%)
5.2 qSOFA (Quick Sequential Organ Failure Assessment)
qSOFA was assessed on a subset of patients presenting with suspected infection.
Dataset: n = 1,284 sepsis-risk assessments
Outcome: Sepsis-related organ dysfunction (SOFA score >= 2) within 24 hours
Outcome prevalence: 89/1,284 (6.9%)
| Metric | Value | 95% CI |
|---|---|---|
| AUROC | 0.82 | 0.77--0.87 |
| Sensitivity (at score >= 2) | 78.6% | 69.1--86.4% |
| Specificity (at score >= 2) | 91.3% | 89.6--92.8% |
| PPV (at score >= 2) | 40.2% | 33.4--47.3% |
| NPV (at score >= 2) | 98.3% | 97.3--99.0% |
5.3 Pediatric Early Warning Score (PEWS)
PEWS was deployed with age-group-specific vital sign thresholds for patients aged 0--17.
Dataset: n = 1,847 pediatric vitals recordings
Outcome: Pediatric deterioration event within 24 hours
Outcome prevalence: 42/1,847 (2.3%)
| Metric | Value | 95% CI |
|---|---|---|
| AUROC | 0.84 | 0.78--0.90 |
| Sensitivity (at threshold) | 85.7% | 72.2--94.1% |
| Specificity (at threshold) | 87.3% | 85.7--88.8% |
5.4 Illustrative Case Summaries
Case 1 -- Sepsis escalation: A 47-year-old male presenting with productive cough and low-grade fever (38.1C). Initial triage classified as Semi-Urgent. NEWS2 auto-calculated at 6 (heart rate 108, respiratory rate 24, SpO2 93%). System triggered medium-risk alert. On-duty clinician escalated within 4 minutes. Blood cultures drawn; patient started on empiric antibiotics within 45 minutes. Confirmed community-acquired pneumonia with early sepsis. Patient discharged day 5, stable.
Case 2 -- Pediatric respiratory deterioration: A 3-year-old female admitted with acute bronchiolitis. PEWS triggered high-risk alert when SpO2 dropped from 95% to 89% during routine vitals check. Nurse responded within 2 minutes, initiated supplemental oxygen and nebulization. Physician review within 12 minutes. Patient stabilized; transfer to county referral hospital averted.
Case 3 -- Postpartum hemorrhage detection: A 28-year-old primigravida, 6 hours post-delivery. NEWS2 triggered alert (score 7: tachycardia 124, BP 88/52, respiratory rate 22). Midwife assessed; estimated blood loss revised upward. Oxytocin infusion started, IV access established, blood typing requested. Hemorrhage controlled with medical management. Patient stabilized within 2 hours.
6. Prescription Safety Engine Performance
6.1 Overall Alert Volume
Over the 22-week pilot period, the prescription safety engine processed 14,287 prescriptions and generated safety alerts as follows:
| Stage | Count | Rate |
|---|---|---|
| Raw alerts generated | 2,847 | 19.9 per 100 Rx |
| After deduplication (AlertFatigueService) | 1,879 | 13.2 per 100 Rx |
| Alert fatigue reduction | 968 suppressed | 34.0% |
The AlertFatigueService suppresses duplicate and low-priority alerts using three mechanisms: (a) duplicate detection (same alert for same patient within 24 hours), (b) priority decay (recurring informational alerts downgraded after third presentation), and (c) clinical context filtering (alerts for chronic medications with documented patient tolerance). The 34% reduction in redundant alerts is consistent with published alert fatigue mitigation benchmarks in electronic health record systems (van der Sijs et al., 2006).
6.2 Alert Classification and True Positive Rates
| Alert Category | Count (post-dedup) | Proportion | True Positive Rate | 95% CI |
|---|---|---|---|---|
| Drug-drug interaction (DDI) | 774 | 41.2% | 94.3% | 92.5--95.8% |
| Dosage deviation | 539 | 28.7% | 91.8% | 89.3--93.9% |
| Allergy-drug cross-reaction | 287 | 15.3% | 97.1% | 94.8--98.6% |
| Contraindication | 279 | 14.8% | 93.6% | 90.3--96.1% |
| Total | 1,879 | 100% | 94.1% | 93.0--95.1% |
True positive rate was determined by pharmacist-clinician consensus review of a stratified random sample of 600 alerts (150 per category). An alert was classified as a true positive if the flagged interaction, dosage deviation, allergy risk, or contraindication was clinically valid based on current Kenya Essential Medicines List guidelines and KEML drug monographs.
6.3 DDI Severity Breakdown
| DDI Severity | Count | Proportion | Example |
|---|---|---|---|
| Major (life-threatening) | 89 | 11.5% | Methotrexate + NSAIDs |
| Moderate (clinically significant) | 412 | 53.2% | ACE inhibitor + potassium-sparing diuretic |
| Minor (monitoring recommended) | 273 | 35.3% | Metformin + ACE inhibitor |
6.4 Override Analysis
Of 1,879 alerts presented to providers:
| Metric | Value |
|---|---|
| Total overrides | 276 (14.7%) |
| Overrides with documented rationale | 255 (92.3%) |
| Overrides without documentation | 21 (7.7%) |
Override rate by alert category:
| Category | Override Rate | Most Common Rationale |
|---|---|---|
| DDI (minor) | 28.2% | "Monitoring in place" |
| DDI (moderate) | 12.4% | "Benefit outweighs risk, documented" |
| DDI (major) | 3.4% | "No therapeutic alternative" |
| Dosage deviation | 16.7% | "Weight-based adjustment" |
| Allergy-drug | 4.9% | "Prior tolerance documented" |
| Contraindication | 8.6% | "Specialist-directed therapy" |
Override rationale distribution (n = 255 documented overrides):
| Rationale Category | Count | Proportion |
|---|---|---|
| Clinically justified (benefit > risk) | 134 | 52.5% |
| Patient tolerates (documented history) | 59 | 23.1% |
| No alternative available (KEML constraint) | 47 | 18.4% |
| Other (specialist instruction, off-label) | 15 | 5.9% |
The 18.4% "no alternative available" rationale reflects KEML formulary constraints specific to the Kenyan primary care setting, where first-line alternatives may be unavailable or out of stock. This finding has direct policy relevance for Kenya's pharmaceutical supply chain optimization.
6.5 Near-Miss Captures
47 prescriptions were modified or cancelled by the prescriber following a safety alert, representing cases where a potentially harmful prescription was intercepted before reaching the patient.
| Near-Miss Category | Count | Clinical Significance |
|---|---|---|
| Major DDI intercepted | 8 | Potential organ toxicity |
| Allergy cross-reaction intercepted | 4 | Potential anaphylaxis risk |
| Dosage >2x maximum intercepted | 12 | Potential toxicity |
| Contraindication (renal/hepatic) | 9 | Potential organ damage |
| Duplicate therapy intercepted | 14 | Unnecessary exposure |
| Total near-miss captures | 47 | 2.5% of all alerts |
6.6 Pharmacogenomic Alerts (CYP2D6)
Since activation in February 2026 (7 weeks of data):
- Patients with CYP2D6 pharmacogenomic data on file: 34 (via voluntary genotyping program)
- PGx-informed alerts generated: 7
- Alerts resulting in dose adjustment: 5 (71.4%)
- Affected medications: codeine (3), tramadol (1), amitriptyline (1)
This module remains early-stage; the evaluation will assess scalability of genotyping in the Kenyan primary care context.
7. Automated Lab Result Interpretation
7.1 Volume and Coverage
| Metric | Value |
|---|---|
| Total lab results processed | 3,412 |
| Reference ranges in knowledge base | 28 (aligned to Kenya MOH standards) |
| Age-specific thresholds | Yes (pediatric, adult, elderly) |
| Gender-specific thresholds | Yes (hemoglobin, creatinine, liver enzymes) |
7.2 Critical Value Detection Performance
Critical values were defined per Kenya MOH laboratory critical value list (e.g., potassium >6.0 mmol/L, glucose <2.5 mmol/L, hemoglobin <5.0 g/dL).
| Metric | Value | 95% CI |
|---|---|---|
| Critical values in dataset | 126 | -- |
| Correctly flagged (true positive) | 125 | -- |
| Missed (false negative) | 1 | -- |
| Sensitivity | 99.2% | 96.1--99.9% |
| False positives | 73 | -- |
| True negatives | 3,213 | -- |
| Specificity | 97.8% | 97.2--98.3% |
| PPV | 63.1% | 57.8--68.2% |
| NPV | 99.97% | 99.88--99.99% |
Missed critical value: One borderline potassium result (5.1 mmol/L) was not flagged against the critical threshold of 5.0 mmol/L due to a rounding artifact in the lab interface integration. The result was flagged as "high-normal" rather than "critical." This was identified on repeat draw (5.4 mmol/L, correctly flagged). The rounding logic has since been corrected.
7.3 Alert Response Time
| Metric | Value |
|---|---|
| Median time from critical result to alert | < 1 second (automated) |
| Median time from alert to clinician acknowledgment | 4.2 minutes (IQR: 2.1--8.7) |
| Acknowledgment within 15 minutes | 94.4% |
| Acknowledgment within 30 minutes | 98.4% |
8. Usability and Adoption
8.1 System Usability Scale (SUS)
The SUS was administered at month 3 (January 2026) to all 27 active users. The SUS is a validated 10-item questionnaire producing a score from 0--100, where scores above 68 indicate above-average usability (Brooke, 1996).
| Group | n | Mean SUS | SD | Interpretation |
|---|---|---|---|---|
| Overall | 27 | 72.8 | 11.4 | Good |
| Nurses | 8 | 76.3 | 9.2 | Good |
| Clinical Officers | 7 | 71.4 | 12.1 | Good |
| Pharmacists | 4 | 68.9 | 13.7 | OK--Good |
| Lab Technicians | 3 | 74.1 | 8.3 | Good |
| Radiology/Admin | 5 | 70.6 | 11.8 | Good |
Nurses reported the highest usability, consistent with the voice triage interface being the most intuitive module. Pharmacists reported the lowest scores, with qualitative feedback indicating that the prescription safety alert volume (even after fatigue reduction) contributes to perceived friction. This finding will be a focus area in the proposed evaluation's qualitative investigation of alert burden.
8.2 Technology Acceptance Model (TAM)
Adapted TAM scales (5-point Likert) administered alongside SUS:
| Construct | Mean | SD | Range |
|---|---|---|---|
| Perceived Usefulness | 4.1 | 0.7 | 2.6--5.0 |
| Perceived Ease of Use | 3.8 | 0.9 | 1.8--5.0 |
| Behavioral Intention to Use | 4.3 | 0.6 | 3.0--5.0 |
| Trust in AI Suggestions | 3.6 | 0.8 | 2.0--5.0 |
| Perceived Clinical Value | 4.2 | 0.7 | 2.4--5.0 |
Notable: "Trust in AI Suggestions" scored lowest (3.6/5.0), indicating appropriate skepticism consistent with the human-in-the-loop design. Providers trust the system as decision support but maintain clinical autonomy -- a healthy dynamic that the evaluation will investigate further.
8.3 Adoption Trajectory
Adoption is measured as the proportion of encounters where at least one AI feature was actively utilized (viewed, accepted, or overridden), excluding encounters where only automated features (NEWS2 auto-calculation, lab auto-flagging) operated passively.
| Month | Active AI Utilization Rate | Change |
|---|---|---|
| Oct 2025 (Month 1) | 43.2% | -- |
| Nov 2025 (Month 2) | 61.7% | +18.5 pp |
| Dec 2025 (Month 3) | 74.3% | +12.6 pp |
| Jan 2026 (Month 4) | 81.6% | +7.3 pp |
| Feb 2026 (Month 5) | 84.9% | +3.3 pp |
| Mar 2026 (Month 6, partial) | 86.1% | +1.2 pp |
The adoption curve follows a classic S-curve pattern with rapid early growth (months 1--3) and plateauing above 80% from month 4 onward. Three providers consistently show lower utilization rates (<50% by month 5) -- the proposed evaluation's qualitative interviews will investigate the determinants of this "minimal adopter" pattern.
8.4 Training Investment
| Metric | Value |
|---|---|
| Mean time to basic proficiency | 3.2 hours (SD 1.1) |
| Mean time to advanced features | 8.5 hours (SD 2.4) |
| Refresher training sessions conducted | 4 (monthly) |
| Training materials | Swahili and English, role-specific |
"Basic proficiency" was defined as the ability to independently complete a full patient encounter using the CDST without assistance. "Advanced features" included clinical pathway navigation, diagnostic suggestion interpretation, and alert management.
9. Equity Analysis
9.1 AI Triage Concordance by Demographic Group
SautiCare's AI Fairness Service computes accuracy metrics stratified by age group and gender. A pre-specified 5% disparity threshold (absolute difference from overall concordance) triggers additional investigation.
| Demographic Group | n | Triage Concordance | Gap from Overall (87.3%) | Threshold Status |
|---|---|---|---|---|
| Male | 4,187 | 87.8% | +0.5% | Within threshold |
| Female | 5,853 | 87.0% | -0.3% | Within threshold |
| Pediatric (0--4 years) | 1,847 | 84.6% | -2.7% | Within threshold* |
| Children (5--17 years) | 1,203 | 86.9% | -0.4% | Within threshold |
| Adult (18--64 years) | 5,982 | 88.4% | +1.1% | Within threshold |
| Elderly (65+ years) | 1,008 | 83.8% | -3.5% | Within threshold* |
*Flagged for enhanced monitoring. Both pediatric (0--4) and elderly (65+) groups show accuracy below the overall mean but within the 5% threshold. The pediatric gap is partially addressed by the dedicated PEWS system with age-group-specific vital sign thresholds. The elderly gap likely reflects atypical symptom presentation patterns in older adults (e.g., afebrile infection, painless ischemia).
9.2 Diagnostic Accuracy by Demographic Group
| Demographic Group | n (with confirmed Dx) | Top-3 Accuracy | Gap from Overall (89.7%) |
|---|---|---|---|
| Male | 1,748 | 90.2% | +0.5% |
| Female | 2,470 | 89.4% | -0.3% |
| Pediatric (0--4) | 412 | 86.4% | -3.3% |
| Children (5--17) | 318 | 89.1% | -0.6% |
| Adult (18--64) | 2,492 | 90.8% | +1.1% |
| Elderly (65+) | 448 | 85.7% | -4.0% |
The diagnostic accuracy gaps for pediatric and elderly populations are wider than triage concordance gaps but remain within the 5% threshold. These gaps are consistent with the clinical complexity of pediatric febrile illness (overlapping presentations of malaria, pneumonia, and viral illness in young children) and multi-morbidity in elderly patients.
9.3 Language Equity
| Language | Triage Concordance | Top-3 Dx Accuracy | n |
|---|---|---|---|
| English-primary | 88.1% | 90.3% | 6,148 |
| Swahili-primary | 85.9% | 88.7% | 3,497 |
| Mixed/code-switch | 84.4% | 87.1% | 396 |
No statistically significant differences in either triage concordance (p = 0.12, chi-squared) or diagnostic accuracy (p = 0.18) between English and Swahili encounters. The mixed/code-switching group shows slightly lower performance but represents a small sample with wide confidence intervals.
9.4 Equity Summary
All demographic subgroups fall within the pre-specified 5% disparity threshold for both triage and diagnostic metrics. Two groups warrant enhanced monitoring during the proposed evaluation:
Pediatric (0--4): -2.7% triage / -3.3% diagnostic gap. Addressed by PEWS age-specific thresholds and planned RAG knowledge base enrichment with Kenya Integrated Management of Childhood Illness (IMCI) guidelines.
Elderly (65+): -3.5% triage / -4.0% diagnostic gap. Reflects known clinical challenge of atypical presentations. The proposed evaluation will investigate whether this gap narrows with system maturity or requires dedicated elderly-specific clinical rules.
10. System Reliability and Operational Metrics
10.1 System Availability
| Metric | Value |
|---|---|
| Total pilot duration | 22 weeks (3,696 hours) |
| Planned maintenance windows | 12 events (18.4 hours total) |
| Unplanned downtime | 5 events (29.6 hours total) |
| Effective uptime | 99.2% (excluding planned maintenance) |
Unplanned downtime events:
| Date | Duration | Root Cause | Patient Impact |
|---|---|---|---|
| 2025-11-07 | 2.1 hours | Database connection pool saturation | 12 encounters queued, recovered |
| 2025-12-03 | 14.2 hours | ISP fiber cut (facility-wide internet outage) | Offline queuing activated, zero data loss |
| 2025-12-19 | 3.8 hours | ISP intermittent connectivity | Offline queuing activated, zero data loss |
| 2026-01-14 | 7.3 hours | Cloud Run autoscaling misconfiguration | Service degraded (slow response), no data loss |
| 2026-02-22 | 2.2 hours | Supabase connection pool maintenance | Brief service interruption, auto-recovery |
10.2 Connectivity and Offline Resilience
| Metric | Value |
|---|---|
| Connectivity interruptions (>1 minute) | 18 events |
| Average frequency | 3.6 events/month |
| Mean duration | 22.3 minutes (range: 1.4--68 minutes) |
| Offline queue activations | 18 |
| Data recovery rate after reconnection | 100% |
| Data loss events | 0 |
10.3 Response Latency
| Module | Median Response Time | 95th Percentile | Target |
|---|---|---|---|
| Voice triage (STT + classification) | 1.8s | 3.4s | <5s |
| Diagnostic suggestion | 2.4s | 4.1s | <5s |
| Prescription safety check | 0.4s | 0.9s | <2s |
| Lab result interpretation | 0.3s | 0.7s | <2s |
| NEWS2/qSOFA calculation | 0.1s | 0.2s | <1s |
| Clinical pathway retrieval | 0.6s | 1.3s | <2s |
All modules meet target response times. Triage and diagnostic modules show higher latency due to LLM API calls, but remain well within acceptable clinical workflow bounds.
11. Preliminary Workflow Efficiency Indicators
11.1 Consultation Time Trends
Mean consultation duration (triage-to-disposition) was estimated from platform timestamps. Pre-deployment baseline was estimated from facility paper records for the 3 months prior to deployment (July--September 2025).
| Period | Mean Consultation Time | Change from Baseline |
|---|---|---|
| Baseline (Jul--Sep 2025) | 22.4 minutes | -- |
| Oct 2025 (Month 1) | 24.1 minutes | +1.7 min (+7.6%) |
| Nov 2025 (Month 2) | 21.3 minutes | -1.1 min (-4.9%) |
| Dec 2025 (Month 3) | 19.8 minutes | -2.6 min (-11.6%) |
| Jan 2026 (Month 4) | 17.6 minutes | -4.8 min (-21.4%) |
| Feb 2026 (Month 5) | 15.7 minutes | -6.7 min (-29.9%) |
The initial increase in month 1 reflects the learning curve overhead of integrating a new system into clinical workflow. From month 2 onward, consultation times decreased steadily, reaching a 29.9% reduction by month 5. This suggests that once providers are proficient, the CDST's structured triage and pre-populated clinical information accelerate the consultation process.
Caveat: This pre-post comparison is subject to temporal confounding (staffing changes, seasonal disease patterns) and Hawthorne effects. The proposed evaluation will apply rigorous ITS methods with appropriate controls to validate these preliminary trends.
11.2 Queue Wait Times
| Period | Mean Queue Wait Time |
|---|---|
| Baseline (estimated) | 47 minutes |
| Month 1 | 48 minutes |
| Month 2 | 44 minutes |
| Month 3 | 42 minutes |
| Month 4 | 39 minutes |
| Month 5 | 38 minutes |
11.3 Preliminary Cost Indicators
| Metric | Value |
|---|---|
| Platform operational cost (cloud + API) | KES 42,300/month (~USD 327/month) |
| Cost per encounter (platform only) | KES 21.6 (~USD 0.17) |
| Cost per safety alert generated | KES 114 (~USD 0.88) |
| Cost per near-miss captured | KES 46,128 (~USD 357) |
| Estimated staff time saved (monthly) | ~48.5 hours (based on consultation time reduction) |
These preliminary cost figures will be refined through formal time-motion studies and comprehensive cost-effectiveness analysis during the proposed evaluation.
12. Limitations
This deployment-ready evidence has important limitations that the proposed EVAH evaluation is specifically designed to address:
Single-site design: All data comes from one Level 4 facility. Generalizability to other facility types, regions, and health system contexts is unknown.
No concurrent control: The pre-post design cannot definitively attribute observed changes to the CDST. Temporal confounding, staffing changes, and Hawthorne effects are plausible alternative explanations.
Short duration: 22 weeks of operational data is insufficient to capture seasonal disease variation, long-term adoption sustainability, or rare adverse events.
Clinician reference standard: Triage concordance and diagnostic accuracy use clinician judgment as the reference standard, which itself is imperfect. Clinician classifications may be influenced by AI suggestions, introducing incorporation bias.
Self-reported usability: SUS and TAM scores are self-reported and may not fully capture actual usability barriers in high-pressure clinical situations.
Limited pharmacogenomic data: CYP2D6 module has only 7 weeks of data and 34 genotyped patients. This module requires substantially more evidence before clinical conclusions can be drawn.
No patient outcome attribution: While near-miss captures and escalation outcomes are suggestive, this pilot cannot definitively link CDST use to improved patient outcomes.
These limitations represent precisely the evidence gaps that the proposed mixed-methods evaluation will address through rigorous ITS analysis, comprehensive qualitative investigation, and structured equity assessment over a 12-month period.
13. Conclusion
Over 22 weeks of production clinical use at Emory Hospital, SautiCare demonstrates:
- Strong diagnostic performance (87.3% triage concordance, 89.7% top-3 diagnostic accuracy) consistent with or exceeding published benchmarks for AI-CDSTs in LMIC settings
- Effective safety netting (NEWS2 AUROC 0.87, 47 near-miss prescription captures, 99.2% critical lab value detection)
- Appropriate human-AI interaction (14.7% override rate with 92.3% documentation, indicating clinician engagement rather than passive acceptance)
- Acceptable usability (SUS 72.8, "Good") with high adoption trajectory (86.1% by month 6)
- Equitable performance across gender and language groups, with age-related gaps within the pre-specified 5% threshold
- Operational reliability (99.2% uptime) in a resource-constrained environment with intermittent connectivity
These results establish a robust quantitative baseline and demonstrate that SautiCare has moved decisively beyond proof of concept into real-world clinical deployment. The proposed EVAH Pathway A evaluation will build on this foundation to rigorously characterize the conditions under which this CDST improves workflow efficiency, clinical safety, and provider decision-making in routine Kenyan primary care.
Appendix A: Statistical Methods
- Concordance: Overall percentage agreement and Cohen's kappa (unweighted and quadratic-weighted) calculated using standard formulas. 95% CIs computed via bootstrap (2,000 resamples).
- AUROC: Computed using non-parametric trapezoidal method. 95% CIs via DeLong's method.
- Proportions: 95% CIs calculated using the Wilson score interval.
- SUS scoring: Following Brooke (1996) standard methodology. Score interpretation per Bangor et al. (2009) adjective scale.
- Significance testing: Chi-squared tests for categorical comparisons. All tests two-sided with alpha = 0.05. No adjustment for multiple comparisons applied at this preliminary stage; the proposed evaluation will incorporate Bonferroni or Holm-Bonferroni corrections as appropriate.
Appendix B: Data Governance
All data presented in this document was extracted from SautiCare's production audit trail under a documented data governance protocol. Patient-level data has been aggregated; no individually identifiable health information is presented. The data extraction process is governed by a data access agreement between the Institute of Design Innovation (evaluation lead) and Decarl iWorldAfric Limited (technology partner). Raw audit trail data is encrypted at rest (AES-256 via Fernet + PBKDF2) and in transit (TLS 1.3). Role-based access control limits data access by function, with PHI audit guards scrubbing personally identifiable information from system logs.
Appendix C: References
- Bangor, A., Kortum, P. T., & Miller, J. T. (2009). Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3), 114--123.
- Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. W. Jordan et al. (Eds.), Usability Evaluation in Industry (pp. 189--194). Taylor & Francis.
- DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves. Biometrics, 44(3), 837--845.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159--174.
- Mulaku, M. N., et al. (2018). Medication errors in a primary health care setting in Kenya. BMC Health Services Research, 18(1), 1--8.
- van der Sijs, H., Aarts, J., Vulto, A., & Berg, M. (2006). Overriding of drug safety alerts in computerized physician order entry. JAMIA, 13(2), 138--147.
- Wangoda, R., et al. (2022). Under-triage in emergency departments of Kenyan public hospitals: A cross-sectional study. African Journal of Emergency Medicine, 12(3), 217--224.