AI Integration

AI Performance Benchmarking Suite for Public Sector

Government Digital Transformation Initiative

Key Outcome

Evaluation framework deployed across 4 ministries — identified 34% reduction in LLM vendor spend

The Challenge

The initiative had received budget approval to integrate AI across citizen-facing services but lacked any internal capability to evaluate which models were actually fit for purpose. Vendors were self-reporting performance claims with no independent verification. Procurement decisions were being made on marketing material rather than evidence — a significant risk given the regulatory and accountability requirements of the public sector.

Our Approach

We built a vendor-neutral evaluation framework from scratch, designed to be operated by non-technical policy analysts after handoff. The platform supports plug-and-play model comparison (any OpenAI-compatible API endpoint), includes a library of 1,200+ government-domain evaluation prompts built with input from ministry subject matter experts, and produces audit-ready reports with full methodology documentation.

Tech Stack

PythonFastAPIPostgreSQLReactLLM-as-judgeEvalsLangSmith
Explore our AI Integration service

System Architecture

How the system flows

Task Suite12 categoriesGolden Set1,200 promptsModel AdapterOpenAI-compatibleGPT-5.5Claude Opus 4.8Gemini 3.1Eval EngineMetrics · LLM-judgeScorecardsProcurement

Build Pipeline

How We Built It

01

Domain Taxonomy & Task Definition

Worked with policy analysts across four ministries to define 12 task categories specific to government operations — from multilingual citizen correspondence to legislative document summarisation. Each category was mapped to measurable output criteria agreed upon before any model was tested.

02

Golden Dataset Construction

Built a 1,200-prompt evaluation dataset with human-authored reference outputs, curated by domain experts and reviewed for bias, sensitivity, and edge-case coverage. Datasets are versioned and stored in PostgreSQL with full audit trails for regulatory compliance.

03

Model Adapter Layer

Built a unified adapter layer that wraps any OpenAI-compatible API, allowing the platform to benchmark GPT-5.5, Claude Opus 4.8, Gemini 3.1, Mistral, and open-source models through a single interface. Authentication, rate limiting, and cost tracking are handled per-vendor with no data sent outside defined security boundaries.

04

Automated Evaluation Pipeline

Each model runs the full evaluation dataset in parallel via async FastAPI workers. Outputs are scored on a combination of automated metrics (ROUGE, BERTScore, exact match) and LLM-as-judge scoring using a separate evaluation model. All scores are stored with full input/output logs for manual review.

05

Hallucination & Safety Audit

A dedicated evaluation pass tests each model for hallucination rate on factual government data, harmful output generation, and PII leakage risk. Results are flagged with severity classifications aligned to the ministry's existing IT security risk framework.

06

Reporting & Procurement Integration

The platform generates structured benchmark reports in both PDF (for procurement committees) and JSON (for internal IT systems). Reports include model scorecards, cost-per-task estimates, and a recommended vendor matrix. Ministry teams can run new evaluations on demand as models update.

Results

What We Delivered

4

Ministries Deployed

The platform was adopted by four ministry departments within six months of initial delivery, with two additional departments in onboarding.

34%

LLM Vendor Cost Reduction

Benchmark data revealed that a lower-cost model outperformed the incumbent vendor on 9 of 12 task categories, enabling a vendor switch that reduced annual AI spend by 34%.

1,200+

Evaluation Prompts in Library

The evaluation dataset is now maintained as an internal asset and has been submitted as a reference implementation to the national AI governance body.

LET'S WORKTOGETHER

Work with us if average isn't your thing. Drop it, we'll build it!

SAY HELLO