Get in touch
team@kovazu.comGovernment Digital Transformation Initiative
Key Outcome
Evaluation framework deployed across 4 ministries — identified 34% reduction in LLM vendor spend
The Challenge
The initiative had received budget approval to integrate AI across citizen-facing services but lacked any internal capability to evaluate which models were actually fit for purpose. Vendors were self-reporting performance claims with no independent verification. Procurement decisions were being made on marketing material rather than evidence — a significant risk given the regulatory and accountability requirements of the public sector.
Our Approach
We built a vendor-neutral evaluation framework from scratch, designed to be operated by non-technical policy analysts after handoff. The platform supports plug-and-play model comparison (any OpenAI-compatible API endpoint), includes a library of 1,200+ government-domain evaluation prompts built with input from ministry subject matter experts, and produces audit-ready reports with full methodology documentation.
Tech Stack
System Architecture
Build Pipeline
Worked with policy analysts across four ministries to define 12 task categories specific to government operations — from multilingual citizen correspondence to legislative document summarisation. Each category was mapped to measurable output criteria agreed upon before any model was tested.
Built a 1,200-prompt evaluation dataset with human-authored reference outputs, curated by domain experts and reviewed for bias, sensitivity, and edge-case coverage. Datasets are versioned and stored in PostgreSQL with full audit trails for regulatory compliance.
Built a unified adapter layer that wraps any OpenAI-compatible API, allowing the platform to benchmark GPT-5.5, Claude Opus 4.8, Gemini 3.1, Mistral, and open-source models through a single interface. Authentication, rate limiting, and cost tracking are handled per-vendor with no data sent outside defined security boundaries.
Each model runs the full evaluation dataset in parallel via async FastAPI workers. Outputs are scored on a combination of automated metrics (ROUGE, BERTScore, exact match) and LLM-as-judge scoring using a separate evaluation model. All scores are stored with full input/output logs for manual review.
A dedicated evaluation pass tests each model for hallucination rate on factual government data, harmful output generation, and PII leakage risk. Results are flagged with severity classifications aligned to the ministry's existing IT security risk framework.
The platform generates structured benchmark reports in both PDF (for procurement committees) and JSON (for internal IT systems). Reports include model scorecards, cost-per-task estimates, and a recommended vendor matrix. Ministry teams can run new evaluations on demand as models update.
Results
4
Ministries Deployed
The platform was adopted by four ministry departments within six months of initial delivery, with two additional departments in onboarding.
34%
LLM Vendor Cost Reduction
Benchmark data revealed that a lower-cost model outperformed the incumbent vendor on 9 of 12 task categories, enabling a vendor switch that reduced annual AI spend by 34%.
1,200+
Evaluation Prompts in Library
The evaluation dataset is now maintained as an internal asset and has been submitted as a reference implementation to the national AI governance body.
TOGETHERWork with us if average isn't your thing. Drop it, we'll build it!
SAY HELLO