Date: March 13, 2025
A lot has changed since our last white paper on OpenAI’s ChatGPT two years ago. Whilst it then represented the only accessible consumer model, today there is an overwhelming and ever-expanding set of competing base models, from Meta’s Llama and xAI’s Grok to Anthropic’s Claude and French company Mistral’s 7B. On top of these base models, increasingly sophisticated services are being built, such as Perplexity.ai’s novel search engine, in addition to specialised academic and legal research tools. Not to mention the Chinese company DeepSeek’s widely heralded R1 and V3 models, which are considered competitive with other frontier models on all major benchmarks, but which, it is claimed, cost substantially less to train and less power to run. And just last week, Chinese company Butterfly Effect released its own model, ‘Manus’, to much fanfare, claiming that it represents the world’s first ‘fully autonomous’ AI agent.
Sam Altman of OpenAI and Jensen Huang of Nvidia have forecast we will soon see the emergence of ‘agentic’ AI – AI models which, once set an end goal, can determine for themselves what steps are necessary, and then execute these with minimal human intervention. How well do these current models serve us in the intelligence sector?
At CRI we have been testing various market offerings alongside our casework, to evaluate where efficiencies may exist, and problems may lurk, and to ensure our accuracy, comprehensiveness, and discretion are not compromised. The undisputed leader for general-purpose research remains OpenAI, whose Deep Research agent, launched in early February, presently far exceeds the capabilities of its nearest competitors, such as the agent of the same name built by Perplexity. It is capable of producing a final report that is well put together, logically structured, and impressively detailed, citing as many as 40 discrete sources. It can be a useful time saver, but only as a starting point.
To date we are yet to find it comprehensive on any subject, no matter how narrowly defined. This is partly a question of access: which sites and databases OpenAI is permitted or licensed to query. A great deal of the information one might require in the course of conducting thorough due diligence on a subject appears to remain inaccessible to Deep Research, including litigation records, corporate filings, and much paywalled media. We have also observed that the model entirely lacks a hierarchy of value when it comes to its sources, citing reliable and unreliable material with minimal discrimination.
The model in its current form also lacks the deductive reasoning found in a proficient investigator. On one project, it missed the subject’s inflammatory social media posts because their profile did not explicitly bear their name. Cross-referencing identifying details contained in posts made to the account with personal identifiers confirmed elsewhere would prove a straightforward task for a researcher, but appears to remain beyond Deep Research’s capacity for the time being.
For now, then, Deep Research remains a supplementary research tool – useful for one final check, to confirm that it cannot identify anything a researcher may have missed. This, too, has been both a help and a hindrance: the agent has, for example, appeared to identify new information which upon closer inspection proved inaccurate – an erroneous conflation of two unrelated data points in a large database, or a misreading of a webpage with a counterintuitive layout.
A recent analysis of eight AI search tools by the Columbia Journalism Review concluded that chatbots remain too eager to please, attempting to answer questions to which they did not know the answer with speculative or inaccurate information, presented confidently as fact. Interestingly, the analysis found that even ‘content licensing deals with news sources provided no guarantee of accurate citation’ in chatbot responses.
It has been credibly argued in some quarters that inaccuracies and bias are inherent to LLMs on account of their probabilistic nature, and that these will never be entirely overcome so long as this remains the primary approach to building advanced AI (an article for another day). Whatever the cause, it is clear that questions of accuracy remain the most fundamental barrier to a more meaningful and widespread integration of existing tools into the due diligence research process.