Written by: Wayne Birch
This article was originally posted in Public Eye, the magazine of the Public Sector HR Association.
Job classification remains one of HR’s most consequential yet inconsistently executed processes. Despite its direct impact on FLSA compliance, compensation equity, workforce planning and employee trust, most organizations still rely on manual review by individual HR professionals—each bringing different experience levels, contextual knowledge and decision heuristics.
At scale, this approach creates operational bottlenecks, inconsistency and audit exposure. At Metro Nashville Public Schools (MNPS), these challenges were acute: 435 employees held “Specialist” titles across 24 pay grades, creating salary compression, retention risk and persistent classification appeals. A June 2025 pay equity analysis concluded that sustainable reform required a standardized, scalable classification approach grounded in explicit decision criteria.
Recent advances in large language models (LLMs) suggest automation potential, but naïve single-model implementations fall short of enterprise requirements. A single classification output offers no insight into confidence, ambiguity, financial risk, or when human review is warranted. HR leaders require decision intelligence, not just predictions.
These circumstances led to the development of PRISM (Progressive Refinement & Intelligence Synthesis Model), a staged hybrid architecture that adapts multi-model AI principles—most notably Andrej Karpathy’s LLM Council—into a domain-specific system optimized for HR classification.
Rather than relying solely on parallel model consensus, PRISM combines iterative refinement with structured validation to support risk-aware, auditable decision-making.
A staged hybrid approach
Karpathy’s LLM Council demonstrated that multiple AI models working together outperform single-model systems through parallel querying, peer review and synthesis. PRISM builds on this insight, but applies a different architectural pattern optimized for enterprise HR use cases. The two patterns compare as follows:
- Karpathy Council: Parallel multi-model querying → peer critique → synthesis
- PRISM: Iterative single-model refinement → parallel multi-model analysis and validation → independent multi-model audit → cross-model narrative synthesis
This staged hybrid approach separates deep analysis from independent validation, aligning with HR’s need for consistency, transparency and governance rather than open-ended deliberation. Central to this design is what we term architected friction—the deliberate introduction of structured analytical checkpoints that prevent the system from converging too quickly on a single answer.
Where a naïve implementation would produce one classification in a single pass, PRISM forces the model through multiple distinct lenses, each designed to surface a specific category of error (scope inflation, technical misalignment or supervisory ambiguity) that the previous pass may have missed. The result is a system that delivers at scale what would otherwise require months of manual review, without sacrificing the rigor HR professionals expect from human classification and compensation decisions.
PRISM’s primary classifier uses a single foundation model (GPT-4o) in a five-pass iterative refinement pipeline. Each pass is executed as a separate API call with distinct analytical intent, allowing cumulative reasoning rather than probabilistic averaging.
Pass 1 – Initial classification
Pass 2 – Self-consistency verification
Pass 3 – Strategic scope and role elevation review
Pass 4 – Technical role disambiguation
Pass 5 – Supervisory role clarification
This structure emerged empirically through error analysis, which showed that common misclassifications stemmed from distinct failure modes best addressed through targeted analytical lenses rather than additional training data alone.
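To make the five-pass structure concrete, the following is a minimal Python sketch, not PRISM's actual implementation. The pass names follow the list above, but the prompt wording, the `PassResult` type, the `run_pipeline` function and the `call_model` callable (a stand-in for one API call to the foundation model) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical instructions: the article names the five passes but does not
# publish the prompts behind them.
PASSES = [
    ("initial_classification", "Classify this job description."),
    ("self_consistency", "Re-check the prior classification for internal consistency."),
    ("scope_review", "Check for strategic scope inflation or role elevation."),
    ("technical_disambiguation", "Check whether technical duties change the classification."),
    ("supervisory_clarification", "Check whether supervisory duties are real or nominal."),
]


@dataclass
class PassResult:
    name: str
    classification: str
    rationale: str


# Assumed interface: takes a prompt, returns (classification, rationale).
ModelFn = Callable[[str], tuple[str, str]]


def run_pipeline(job_description: str, call_model: ModelFn) -> list[PassResult]:
    """Run each pass as a separate model call, feeding earlier results forward."""
    history: list[PassResult] = []
    for name, instruction in PASSES:
        # Summarize every earlier pass so reasoning accumulates across calls.
        prior = "\n".join(
            f"{r.name}: {r.classification} ({r.rationale})" for r in history
        )
        prompt = (
            f"{instruction}\n\n"
            f"Job description:\n{job_description}\n\n"
            f"Prior passes:\n{prior or 'none'}"
        )
        classification, rationale = call_model(prompt)
        history.append(PassResult(name, classification, rationale))
    return history
```

The last element of the returned list holds the classification after all five lenses have been applied, and the full list doubles as an audit trail of how the answer changed from pass to pass.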
Multi-model validation and risk intelligence
Following primary classification, PRISM applies six independent validation components. Two operate as true multi-model ensembles using different foundation models; others use specialized analytical logic optimized for their function.
Component 1 (Analyst): Generates plausible alternative classifications, serving both validation and job description quality diagnostics.
Component 2 (Likelihood Judge): Produces a likelihood of error score based on human baseline error rates, ambiguity signals, consensus failures and KSAC semantic similarity.
Component 3 (Cost Accountant): Quantifies financial exposure associated with potential misclassification, incorporating salary differentials, benefit load, asymmetric risk weighting and administrative correction costs.
Components 4 and 5 (External Auditors): Two independent models (OpenAI GPT-4o and Anthropic Claude Sonnet) classify roles without access to PRISM outputs, providing unbiased ensemble validation.
Component 6 (Narrative Synthesizer): Integrates outputs from all preceding stages into a unified risk-informed analysis for each position, using a separate foundation model (Meta Llama 4 Maverick) to ensure cross-model objectivity in final reporting.
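As an illustration of how the Cost Accountant's inputs might combine, here is a hedged Python sketch. The article lists the ingredients (salary differentials, benefit load, asymmetric risk weighting, administrative correction costs) but not the formula, so the arithmetic below, the default weights and every name (`ExposureInputs`, `misclassification_exposure`, `expected_exposure`) are assumptions rather than PRISM's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExposureInputs:
    assigned_salary: float        # annual salary at the assigned classification
    alternative_salary: float     # annual salary at the plausible alternative
    benefit_load: float           # e.g. 0.25 means benefits add 25% to salary cost
    admin_correction_cost: float  # flat cost of correcting the record
    # Hypothetical asymmetric weights: underpaying an employee is treated as
    # riskier than overpaying one.
    underpay_weight: float = 1.5
    overpay_weight: float = 1.0


def misclassification_exposure(x: ExposureInputs) -> float:
    """Annual dollar exposure if the assigned classification turns out wrong."""
    diff = x.alternative_salary - x.assigned_salary
    # A positive difference means the employee may currently be underpaid.
    weight = x.underpay_weight if diff > 0 else x.overpay_weight
    return abs(diff) * (1 + x.benefit_load) * weight + x.admin_correction_cost


def expected_exposure(x: ExposureInputs, likelihood_of_error: float) -> float:
    """Scale exposure by a likelihood-of-error score in the range [0, 1]."""
    if not 0.0 <= likelihood_of_error <= 1.0:
        raise ValueError("likelihood_of_error must be between 0 and 1")
    return likelihood_of_error * misclassification_exposure(x)
```

Multiplying exposure by the Likelihood Judge's score is one simple way to turn two separate signals into a single figure an executive can rank positions by; other weightings are equally plausible.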
Disagreement between auditors and the primary classifier automatically escalates review priority, mirroring how independent consulting reviews are used in high-risk HR decisions.
Expert HR professionals independently classified a representative sample of positions, achieving 82% interrater agreement. This established a realistic performance ceiling for any classification system, human or AI.
Against this benchmark, PRISM achieved 76% to 79% accuracy across test and production-validated samples. Importantly, residual variance reflected genuine professional judgment rather than systematic model failure.
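The article does not state which agreement statistic was used for the human baseline. As one plausible reading, the sketch below computes simple percent agreement averaged over every pair of raters. The function names are hypothetical and the technique is an assumption, not the documented method.

```python
from itertools import combinations


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Share of positions on which two raters chose the same classification."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of positions")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(raters: dict[str, list[str]]) -> float:
    """Average percent agreement over every pair of raters."""
    pairs = list(combinations(raters.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)
```

The same `percent_agreement` function can score a model against a reference set of labels, which puts model accuracy and the human ceiling on a comparable scale.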
Production deployment revealed a critical insight: a significant majority of positions exhibited legitimate classification ambiguity due to organizational role design, not AI error. PRISM successfully distinguished the minority of clear, low-risk cases from those requiring human review, enabling risk-stratified workflows.
For HR practitioners, several lessons follow:
- AI classification systems should prioritize risk stratification over raw accuracy.
- Iterative refinement is necessary to address failure modes that additional training data cannot.
- Multi-model validation is most valuable when applied selectively, not universally.
- Financial impact analysis transforms model output into executive-relevant decisions.
- Transparent, auditable reasoning is essential for HR adoption and trust.
- Organizations without explicit classification criteria should standardize first; otherwise, AI will amplify existing ambiguity.
HRIS integration implications
PRISM is designed to function as a decision-support layer rather than a system of record, aligning with HRIS governance best practices. In its current deployment, classifications are exported via structured files for human approval prior to HRIS updates, preserving auditability and change control.
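A file-based handoff of this kind could look like the following sketch, which writes proposed classifications to CSV with an empty approval slot for a human reviewer. The column names, the `PENDING` status value and the `export_for_approval` function are hypothetical; the article does not describe the actual file layout.

```python
import csv
from typing import Iterable, TextIO

# Hypothetical column layout for the approval file.
FIELDS = [
    "position_id",
    "proposed_classification",
    "risk_tier",
    "approval_status",
    "approver",
]


def export_for_approval(records: Iterable[dict], stream: TextIO) -> int:
    """Write proposals as CSV rows awaiting human sign-off; return the row count."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for rec in records:
        writer.writerow({
            "position_id": rec["position_id"],
            "proposed_classification": rec["proposed_classification"],
            "risk_tier": rec["risk_tier"],
            # Nothing is approved by the export itself: a reviewer fills these in.
            "approval_status": "PENDING",
            "approver": "",
        })
        count += 1
    return count
```

Because every row starts as pending with no approver, the file itself records that no classification reached the HRIS without a named human decision.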
For HRIS leaders, the architectural implication is clear: AI classification engines should integrate upstream of position management workflows, not replace them. This pattern enables scalable analysis, risk stratification and financial impact modeling while maintaining HRIS integrity, approval hierarchies and compliance safeguards. Over time, APIs can automate ingestion of approved classifications, but only after governance thresholds are met.
Beyond simple integration, the API enables risk-tiered routing: “Auto-Approve” decisions can flow directly to the HRIS draft stage, while “Critical Risk” flags are automatically routed to a specialist’s queue. This ensures the HRIS remains a system of validated record, not just a repository for automated outputs.
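The routing rule described above might be sketched as follows. Only the "Auto-Approve" and "Critical Risk" tier names come from the article; the middle tier, the thresholds, the destination names and the functions `assign_tier` and `route` are assumptions chosen for illustration.

```python
from enum import Enum


class Tier(Enum):
    AUTO_APPROVE = "Auto-Approve"
    STANDARD_REVIEW = "Standard Review"  # hypothetical middle tier
    CRITICAL_RISK = "Critical Risk"


# Hypothetical destinations for each tier.
DESTINATIONS = {
    Tier.AUTO_APPROVE: "hris_draft",
    Tier.STANDARD_REVIEW: "hr_review_queue",
    Tier.CRITICAL_RISK: "specialist_queue",
}


def assign_tier(
    likelihood_of_error: float,
    expected_exposure: float,
    auditors_agree: bool,
    auto_max_likelihood: float = 0.10,       # assumed threshold
    critical_min_likelihood: float = 0.50,   # assumed threshold
    critical_min_exposure: float = 5000.0,   # assumed threshold, in dollars
) -> Tier:
    """Map risk signals to a review tier."""
    if (
        likelihood_of_error >= critical_min_likelihood
        or expected_exposure >= critical_min_exposure
    ):
        return Tier.CRITICAL_RISK
    # Auto-approval requires both a low error likelihood and auditor consensus;
    # any auditor disagreement escalates the case to human review.
    if auditors_agree and likelihood_of_error <= auto_max_likelihood:
        return Tier.AUTO_APPROVE
    return Tier.STANDARD_REVIEW


def route(tier: Tier) -> str:
    """Return the queue or stage a decision in this tier is sent to."""
    return DESTINATIONS[tier]
```

Keeping the thresholds as parameters lets governance bodies tighten or loosen auto-approval without changing the routing logic.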
As public sector HR organizations modernize in response to growing complexity and public accountability, governance-first approaches to AI adoption are becoming essential. PRISM demonstrates that staged hybrid architectures, combining iterative refinement with multi-model validation, offer a viable, governance-aligned approach to AI-assisted HR classification. By integrating confidence scoring, financial risk, and independent validation, PRISM moves beyond automation toward decision intelligence.
For HR technology leaders, the key lesson is not model selection, but architectural intent. Systems should surface uncertainty, prioritize human judgment where it matters most, and scale responsibly. PRISM provides a replicable framework for achieving these goals at modest cost while maintaining professional accountability.
_______________________
The author wishes to thank Dr. Charreau Bell of Vanderbilt University’s Data Science Institute for her expertise in AI systems and contributions to the development and validation of the PRISM multi-model validation architecture and framework.
Wayne Birch is a strategic compensation and people analytics professional with Metro Nashville Public Schools. He can be reached at wayne.birch@mnps.org.