How the Philadelphia Inquirer uses AI to open up its huge archive
One of the oldest newspapers in the USA wants to use semantic search, agents and personas to enable its journalists to research archive material more efficiently
In January 2025, The Philadelphia Inquirer took part in a two-week hackathon as part of an AI initiative organized by the Lenfest Institute for Journalism, Microsoft and OpenAI. The hackathon produced several AI tools intended to improve newsroom workflows.
The most interesting result of the hackathon is an AI-powered research tool that helps the daily newspaper unearth the treasures in its archive and develop potential new verticals and products from them. This should be worthwhile, as the Philadelphia Inquirer, founded in 1829, is one of the oldest daily newspapers in the U.S. with the third-longest uninterrupted publication period of 196 years and a wealth of historical material in its archive.
The goal: Optimize the archival research workflow
The tool is intended to solve a long-standing problem for reporters and editors at the Inquirer: they spend a considerable amount of time running multi-stage keyword searches for background material across fragmented parts of the in-house archive (print archive, digital archive, photo archive), researching external sources, working their way through the collected material and then incorporating the relevant information into their stories.
The AI-supported semantic search was designed to:
enable queries in natural language
recognize the meaning and context of queries
tap into a large number of archive sources simultaneously
create automated summaries
independently suggest meaningful follow-up queries
maintain journalistic integrity, i.e. transparently disclose research steps and sources and not fabricate information
not deviate from the “human in the loop” principle
Three months after the hackathon, Matt Boggie, Chief Technology and Product Officer of the Inquirer, reported on the current status at the International Journalism Festival in Perugia and described details of the project: "We have full archive access to everything from 1978 forward, and that now includes images related to those stories as well. We have some more work to do on the image side, but what we hope to be able to do is to access any of the assets that we have in our archive just by asking for the sort of thing that we're looking for."
The Inquirer's new, as yet unnamed, archive research system is built on GPT-4 and uses Azure Search and Azure OpenAI. Reporters and editors were involved in its development from the outset so that the AI fits real newsroom workflows. The system is being refined step by step through internal user studies and observations of how journalists use the tool in their day-to-day work.
Simpler search with agents and personas
The archive tool uses multiple AI agents to handle different aspects of queries, including deriving data for historical context. Agents and personas perform various tasks in the query process. Agents can independently define relevant time periods and refine queries.
For example, if a reporter asks about the Reagan administration, the system understands that this refers to the period from January 20, 1981 to January 20, 1989.
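The date-range step can be illustrated with a minimal Python sketch. It is not the Inquirer's implementation: the lookup table, the function names and the article record layout are assumptions made for illustration, and a production agent would presumably derive the period with a language model rather than a fixed table. Only the Reagan dates come from the example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    start: date
    end: date


# Hypothetical lookup table; only the Reagan entry is taken from the article.
KNOWN_PERIODS = {
    "reagan administration": DateRange(date(1981, 1, 20), date(1989, 1, 20)),
}


def resolve_period(query: str) -> Optional[DateRange]:
    """Return the date range for a named period mentioned in the query, if any."""
    normalized = query.lower()
    for name, period in KNOWN_PERIODS.items():
        if name in normalized:
            return period
    return None


def filter_by_period(articles: list, period: DateRange) -> list:
    """Keep only articles whose 'published' date falls inside the period."""
    return [a for a in articles if period.start <= a["published"] <= period.end]
```

A query that names no known period returns None, in which case the search would simply run without a date restriction.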
An example of a persona, according to Boggie, is “a librarian who looks through files on famous people and companies in the region and is a research partner for reporters working on such topics.”
A research tool for journalists, but not for readers
For now, the Inquirer wants to limit the tool to internal use. A version that would let readers search the Inquirer archive with the help of AI is not planned for the time being.
According to Boggie, several factors play a role here:
Risk minimization: Errors and fakes are less likely to reach the public with an internal tool.
Historical classification: Perspectives change over the course of history and new facts come to light. The internal team is considered better placed than readers to take this into account.
Value of human journalism: According to Inquirer surveys, the public appreciates the human judgment of reporters and editors.
Boggie noted that the Inquirer's readers specifically value human editorial judgment: "Every time we've done surveys about things like personalization, the pushback we always get is 'all of my other feeds are personalized. What I need is someone to tell me what is most important today.'"
Future plans for the research assistant:
create automated chronologies and timelines
automatically recognize and highlight historically relevant events
open up more distant time periods
make non-digitized historical materials (writings, books, photos, etc.) searchable using image recognition
Why is the Inquirer archive research tool relevant?
At first glance, the Inquirer research tool does not seem to work much differently from the deep research features of OpenAI, Perplexity and other research tools. However, the Inquirer can minimize the risk of fakes and fabrications by developing its own model and training it on the basis of its own archive material. External sources are filtered using retrieval-augmented generation (RAG) on Microsoft's platform so that no unverified information is mixed with data from the archive in the answers.
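The idea of keeping unverified material out of the answers can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Inquirer's or Microsoft's code: the Passage fields, the "archive" and "external" labels, and the prompt wording are all hypothetical, and the retrieval step and the model call are left out.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str     # hypothetical labels: "archive" or "external"
    citation: str   # where the passage came from, shown to the journalist
    text: str
    verified: bool  # whether an external passage has been verified


def select_grounding(passages: list) -> list:
    """Keep archive passages and verified external ones; drop the rest."""
    return [p for p in passages if p.source == "archive" or p.verified]


def build_prompt(question: str, passages: list) -> str:
    """Build a prompt that restricts the model to numbered, citable sources."""
    lines = [
        "Answer using only the numbered sources below and cite them.",
        "If the sources do not contain the answer, say so.",
        "",
    ]
    for i, p in enumerate(select_grounding(passages), start=1):
        lines.append(f"[{i}] ({p.citation}) {p.text}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Numbering the sources in the prompt is one way to support the transparency goal described earlier, since every statement in an answer can be traced back to a cited passage.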
According to the Lenfest Institute, the team wants to open up the code and technical documentation for the archive research tool so that other news organizations around the world can also benefit from it.
Future updates and enhancements are to be recorded in Microsoft's RAG data repository and also made generally available.