Who decides what AI tells you? Campbell Brown, once Meta’s news chief, has thoughts
At a glance:
- Forum AI aims for 90% consensus between AI judges and expert panels on high‑stakes topics.
- Founded 17 months ago, the startup has recruited experts like Niall Ferguson and former Secretary of State Antony Blinken for geopolitics benchmarks.
- Brown says enterprise liability concerns could push companies to prioritize truth over engagement.
What Forum AI is doing
Campbell Brown, former Facebook (now Meta) news chief, launched Forum AI in New York 17 months ago to address a gap she sees in foundation‑model evaluation. The company’s mission is to build “AI judges” that can assess model outputs on what she calls “high‑stakes topics” – geopolitics, mental health, finance, hiring – areas where answers are rarely black‑and‑white. Brown told TechCrunch’s Tim Fernholz that the goal is to reach roughly 90% consensus between these AI judges and human subject‑matter experts, a threshold she says Forum AI has already achieved in its pilot work.
The approach is deliberately layered. First, Forum AI identifies the world’s foremost experts for a given domain, then works with them to design rigorous benchmarks. Once the benchmark is set, the company trains AI models to evaluate new model outputs at scale, comparing the AI judge’s rating to the experts’ consensus. This feedback loop is intended to surface systematic failures that would otherwise be hidden in generic, engagement‑driven metrics.
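To make the comparison step concrete, here is a minimal sketch of how an agreement rate between an AI judge and an expert panel could be computed. The function and field names (`agreement_rate`, `judge_verdict`, `expert_verdicts`) are illustrative assumptions, not Forum AI's actual tooling.

```python
from collections import Counter

def expert_consensus(expert_verdicts: list[str]) -> str:
    """Majority label across the expert panel for one prompt."""
    return Counter(expert_verdicts).most_common(1)[0][0]

def agreement_rate(benchmark: list[dict]) -> float:
    """Share of prompts where the AI judge matches the expert consensus.

    Each benchmark item is assumed (hypothetically) to look like:
    {"prompt": ..., "judge_verdict": "adequate",
     "expert_verdicts": ["adequate", "adequate", "inadequate"]}
    """
    matches = sum(
        item["judge_verdict"] == expert_consensus(item["expert_verdicts"])
        for item in benchmark
    )
    return matches / len(benchmark)

# Illustrative usage: a 90% agreement rate would mean the judge and the
# expert consensus align on 9 out of every 10 benchmark prompts.
sample = [
    {"prompt": "p1", "judge_verdict": "adequate",
     "expert_verdicts": ["adequate", "adequate", "inadequate"]},
    {"prompt": "p2", "judge_verdict": "inadequate",
     "expert_verdicts": ["adequate", "adequate", "adequate"]},
]
print(f"agreement: {agreement_rate(sample):.0%}")  # -> agreement: 50%
```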
Expert panel and benchmarking approach
For its geopolitics track, Brown assembled a high‑profile advisory group that includes historian Niall Ferguson, journalist Fareed Zakaria, former Secretary of State Antony Blinken, former House Speaker Kevin McCarthy, and Anne Neuberger, who led cybersecurity in the Biden administration. The panel’s role is to craft scenario‑based prompts and define what a correct, nuanced answer looks like. Forum AI then runs leading foundation models – such as Google’s Gemini – against these prompts and measures how often the AI judge’s verdict aligns with the experts.
Brown says the process has already yielded a 90% agreement rate on many of the test cases, suggesting that AI judges can approximate expert judgment when the benchmark is well‑defined. The company hopes this methodology can be replicated across other domains, from mental‑health advice to financial risk assessment, where regulatory compliance and user safety are paramount.
Findings on current foundation models
When Forum AI evaluated the leading models, the results were sobering. Brown highlighted that Gemini was pulling information from Chinese Communist Party websites for stories unrelated to China, indicating a data‑source bias that could skew answers. Across most models, she observed a left‑leaning political tilt, as well as subtler issues such as missing context, absent perspectives, and occasional straw‑manning of arguments without acknowledgment.
These failures are not merely academic; they translate into real‑world risk when users rely on chatbots for health advice, legal guidance, or hiring decisions. Brown argues that many of the problems are “easy fixes” – better curation of training data, more transparent source attribution, and explicit bias‑mitigation steps – yet they remain unaddressed because the industry’s current evaluation standards are too superficial.
Why accuracy has been ignored
Brown points to a cultural mismatch within the AI sector. “Foundation model companies are extremely focused on coding and math,” she said, noting that news and information quality are harder to quantify and therefore often deprioritized. Her experience at Meta, where she built a fact‑checking program that was later dismantled, reinforced the lesson that optimizing for engagement harms societal knowledge.
The prevailing compliance landscape compounds the problem. New York City’s first hiring‑bias law required AI audits, but the state comptroller later found that more than half of the audited systems had violations the required audits failed to catch. Brown calls the existing “checkbox audits and standardized benchmarks” a joke, arguing that true evaluation demands deep domain expertise and the willingness to explore edge cases that could land companies in legal trouble.
Enterprise demand as a lever for change
Despite the technical challenges, Brown sees a pragmatic path forward: enterprise liability. Companies that use AI for credit decisions, insurance underwriting, or hiring have a direct financial incentive to avoid inaccurate or biased outputs. “They’re going to want you to optimize for getting it right,” she explained, suggesting that regulatory risk could push firms to adopt Forum AI’s more rigorous evaluation framework.
Forum AI’s business model is built around this premise, offering bespoke audit services and continuous monitoring for enterprises that need to demonstrate compliance. However, turning sporadic compliance interest into recurring revenue remains difficult, especially when many organizations are satisfied with one‑off, surface‑level checks.
Challenges ahead and funding outlook
The startup raised $3 million in a fall round led by Lerer Hippeau, giving it the runway to expand its expert panels and refine its AI‑judge technology. Brown acknowledges that scaling the approach will require more experts, more nuanced benchmarks, and a cultural shift within the AI industry to value truth over click‑through rates.
She remains cautiously optimistic. “Right now it could go either way,” Brown said. “Companies could give users what they want, or they could give people what’s real, honest, and truthful.” The next few years will likely see a tug‑of‑war between engagement‑driven product roadmaps and the emerging demand from regulated enterprises for verifiable, accurate AI outputs.
FAQ
What consensus level does Forum AI aim to achieve between its AI judges and human experts?
Roughly 90% agreement, a threshold Brown says the company has already reached in its pilot work.
Which experts are part of Forum AI’s geopolitics advisory panel?
Historian Niall Ferguson, journalist Fareed Zakaria, former Secretary of State Antony Blinken, former House Speaker Kevin McCarthy, and Anne Neuberger.
What are some of the key shortcomings Brown identified in current foundation models?
Gemini pulling from Chinese Communist Party websites for stories unrelated to China, a broadly left‑leaning political tilt, missing context, absent perspectives, and straw‑manned arguments.