Generative AI is transforming every sector, but in financial services its adoption comes with unique stakes: compliance, trust, and systemic risk. While large language models (LLMs) and other AI systems promise productivity gains across research, compliance, trading, and customer engagement, they also introduce challenges that generic benchmarks cannot adequately address.
This tension was at the heart of the recent FINOS workshop and techsprint in London, where engineers, architects, developers, and open-source contributors from across the financial services industry converged to shape an Evaluation Framework for AI in Finance.
A personal perspective by Luca Borella
Why Finance Needs Its Own AI Evaluation Framework
AI systems are non-deterministic by design, and financial tasks rarely yield a single “correct” answer. Traditional model-level benchmarks—whether measuring factual accuracy, reasoning, or language fluency—fall short of capturing what matters most in financial contexts:
- Domain-specific correctness. Interpreting regulations, parsing contracts, or analyzing risk requires precise alignment with financial ontologies and standards (e.g., CDM, DORA).
- System-level evaluation. Real-world deployments combine AI and non-AI components. Testing an LLM alone is insufficient; we need to evaluate workflows, agents, and orchestration layers.
- Trust and compliance. A model’s performance must be measurable against bias, fairness, robustness, and explainability standards—factors essential to regulatory compliance (e.g., the EU AI Act).
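The system-level point can be made concrete with a small sketch: rather than scoring a model in isolation, score the full pipeline against expected outputs. The retriever, model, and test case below are illustrative stubs, not components of the FINOS framework.

```python
# Sketch of system-level evaluation: an end-to-end pipeline (stubbed
# retriever plus stubbed model) is scored as a whole, so failures in
# orchestration or retrieval count, not just model quality.

def retrieve(query: str) -> str:
    # Stub retriever: a real system would query a document store.
    docs = {"dora scope": "DORA applies to EU financial entities."}
    return docs.get(query.lower(), "")

def answer(query: str, context: str) -> str:
    # Stub model: a real system would call an LLM with the context.
    return context or "I don't know."

def evaluate_pipeline(cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the full pipeline output contains the expected text."""
    hits = sum(expected in answer(q, retrieve(q)) for q, expected in cases)
    return hits / len(cases)
```

Swapping the stubs for real components leaves the harness unchanged, which is the point: the evaluation contract sits above any single model.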
As Vincent Caldeira (Red Hat CTO for APAC & FINOS TOC) put it in the workshop: “General-purpose AI benchmarks often fall short for the unique requirements of the finance sector. The challenge is to design fine-grained and scalable evaluation methods that reflect cost-efficiency, safety, and robustness—without falling into the trap of ‘safetywashing’.”
From Use Cases to a Taxonomy
The FINOS initiative takes a taxonomy-first approach: mapping financial services use cases to risks, and then to metrics.
This taxonomy underpins every deliverable:
- Use case catalogues (e.g., credit risk decisioning, equity research summarization, regulatory parsing).
- Risk categories (hallucinations, bias, robustness, compliance gaps).
- Metrics and thresholds that define “trustworthy AI” in practice.
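The taxonomy-first structure above lends itself to a simple data model: each use case maps to risk categories, and each risk to metrics with thresholds. The entries below are illustrative placeholders, not the official FINOS taxonomy.

```python
# Minimal sketch of the use case -> risks -> metrics mapping.
# All names and threshold values here are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class UseCase:
    name: str
    risks: list[str]
    metrics: dict[str, float] = field(default_factory=dict)  # metric -> threshold

TAXONOMY = [
    UseCase(
        name="credit_risk_decisioning",
        risks=["bias", "compliance_gap"],
        metrics={"demographic_parity_gap": 0.05},
    ),
    UseCase(
        name="equity_research_summarization",
        risks=["hallucination"],
        metrics={"faithfulness_score": 0.90},
    ),
]

def risks_for(use_case_name: str) -> list[str]:
    """Look up the risk categories mapped to a use case."""
    for uc in TAXONOMY:
        if uc.name == use_case_name:
            return uc.risks
    return []
```

A shared structure like this is what makes results comparable across institutions: everyone evaluates the same use case against the same named risks and thresholds.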
During the workshop, participants worked to define the initiative's identity: the project is not about building models or competing on leaderboards, but about establishing guardrails, repeatable test datasets, and reference architectures that financial institutions can trust.
Figure: identifying the initiative identity in practice, with a more detailed description of what the initiative is, is not, does, and does not do.
Prioritizing Deliverables: Quick Wins vs. Strategic Goals
The initiative is carefully staged. As shown in “Artefact 2: Prioritizing Deliverables”, the first quick wins include:
- Publishing a taxonomy of use cases and risks.
- Developing synthetic data pipelines and test datasets.
- Drafting evaluation guidelines (e.g., for LLM-as-judge).
Longer-term strategic deliverables include reference architectures for retrieval-augmented generation (RAG) and agentic workflows, and monitoring guidance for production systems.
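One of the quick-win deliverables, LLM-as-judge guidelines, follows a well-known pattern: a second model scores a system's output against a rubric. The sketch below stubs out the judge call; the rubric wording, JSON shape, and threshold are illustrative assumptions, not the drafted FINOS guidelines.

```python
# Sketch of the LLM-as-judge pattern. `call_judge_model` is a stub
# standing in for a real LLM API call; a production judge would also
# need calibration against human-labelled examples.
import json

RUBRIC = (
    "Score the answer from 0 to 1 for factual grounding in the provided "
    'source document. Return JSON: {"score": <float>, "reason": <str>}'
)

def call_judge_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM endpoint here.
    return json.dumps({"score": 0.95, "reason": "All figures match the source."})

def judge(question: str, answer: str, source: str, threshold: float = 0.9) -> bool:
    """Return True if the judge scores the answer at or above the threshold."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nSource: {source}"
    verdict = json.loads(call_judge_model(prompt))
    return verdict["score"] >= threshold
```

Keeping the rubric and threshold explicit, rather than buried in a prompt, is what lets a guideline like this be versioned and audited alongside the test datasets.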
This roadmap ensures that the community can start testing and sharing results quickly while building toward a comprehensive benchmarking suite.
Figure: prioritizing deliverables in practice.
Community Contributions and Open Assets
A key strength of the initiative is its open-source backbone. Several assets are already live:
- Synthetic datasets from partners like NayaOne, covering 10 product types, 21 sub-categories, and nearly 900 unique market identifiers, publicly available on Hugging Face.
- An agent quickstart repo for building and testing financial workflows.
These assets allow contributors to move beyond theory and experiment with real, production-grade financial workflows today.
What Comes Next
Looking ahead, the initiative is entering a crucial phase:
- Q4 2025: Publishing curated literature reviews and prototype reference architectures.
- Q1 2026: Piloting with financial institutions.
- Q2 2026: Expanding the taxonomy across the industry and scaling shared benchmarks.
This trajectory reflects the urgency: as regulators like the EU implement the AI Act, financial institutions will need sector-specific tools to demonstrate the compliance and trustworthiness of their AI systems.
A Call to Action
The momentum around FINOS is more than an industry experiment—it’s a recognition that AI in finance requires collective stewardship. Just as the Linux Foundation helped standardize software infrastructure globally, this initiative seeks to define the evaluation infrastructure for AI in financial services.
As I noted in my LinkedIn reflections: this is not only about testing models but about building the conditions for safe, efficient, and compliant AI adoption across the sector. By participating in FINOS, institutions are not just consuming benchmarks—they are shaping the very standards by which the industry will measure trust in AI.
Photo: some of the workshop and techsprint participants in London.
Final Thought
The financial services industry cannot afford to adopt AI blindly. Benchmarks designed for consumer apps won’t cut it in a world of derivatives, compliance, and systemic risk. What FINOS is building—an open, transparent, and finance-specific evaluation suite—isn’t just desirable. It’s necessary.
This initiative would not be possible without the energy, insights, and contributions of everyone who joined the workshop and techsprint. A heartfelt thank you to all participants who helped define the taxonomy, prioritize deliverables, and bring the first version of the evaluation and benchmarking framework to life.
A special thanks goes to Vincent Caldeira and the Red Hat team for generously providing the space in central London and for helping anchor the discussion with a vision for trustworthy, domain-specific AI evaluations in finance. Your support and leadership made these sessions not only productive but inspiring.
Together, we are building the foundations for safe and scalable AI adoption in financial services—and this is only the beginning.
Interested in FINOS open source AI? Click the link below to see how to get involved in the FINOS AI Community.
FINOS Good First Issues - Looking for a place to contribute? Take a look at good first issues across FINOS projects and get your feet wet in the FINOS community.
State of Open Source in Financial Services Report 2024 - Learn about what is really happening around open source in FSI.
This Week at FINOS Blog - See what is happening at FINOS each week.
FINOS Landscape - See our landscape of FINOS open source and open standard projects.
Community Calendar - Scroll through the calendar to find a meeting to join.
FINOS Slack Channels - The FINOS Slack provides our Community another public channel to discuss work in FINOS and open source in finance more generally.
Project Status Dashboard - See a live snapshot of our community contributors and activity.
Events - Check out our upcoming events or email marketing@finos.org if you'd like to partner with us or have an event idea.
FINOS Open Source in Finance Podcasts - Listen and subscribe to the first open source in fintech and banking podcasts for deeper dives on our virtual "meetup" and other topics.