Philadelphia's City Archives, housed at 3101 Market Street in West Philadelphia, is sitting on a problem that has been building since the early 2000s: a backlog of duplicate digital image files, numbering in the hundreds of thousands by one internal estimate, that clog storage servers, slow down search systems, and make it harder for residents, researchers, and journalists to find what they're actually looking for. City officials have been working through a phased cleanup effort since late 2024, but the road to this point stretches back more than two decades.
The timing matters. Philadelphia is deep into a push to modernize its public records infrastructure, and the duplicate image problem has become a concrete obstacle. The Mayor's Office of Innovation and Technology has flagged redundant archival data as a cost driver in at least two budget cycles, and the issue surfaced again during City Council appropriations discussions earlier this year. With the city's fiscal year 2027 budget now under negotiation, the Archives is making a direct case for dedicated remediation funding.
How the Backlog Built Up
The roots go back to the early digitization drives of the late 1990s and early 2000s, when city agencies — the Department of Records, the Register of Wills, the Philadelphia Housing Authority — each began scanning their own document collections with little coordination. Equipment varied. File-naming conventions differed from office to office. The same photograph of a North Philadelphia rowhouse demolition, or the same land deed from Fishtown, might be scanned three times by three different departments using three different resolution settings, then uploaded to separate servers that were later merged when the city consolidated its data infrastructure around 2009 and again in 2017.
The Philadelphia City Planning Commission's historical survey work added another layer. Neighborhood documentation projects across Germantown, Kensington, and South Philadelphia generated large photo batches throughout the 2000s and 2010s. When those files migrated into the central archival system, automated deduplication tools — the kind that flag exact binary matches — caught some redundancies but missed thousands of near-duplicate images: the same scene photographed seconds apart, or the same document scanned at slightly different angles.
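To make the limitation concrete, here is a minimal sketch of exact-match deduplication, assuming files are compared by a SHA-256 digest of their bytes. The file names and byte contents are hypothetical, and nothing here describes the city's actual tooling.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a file name to its raw bytes (hypothetical in-memory stand-in
    for files on a server).
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    # Only digests shared by two or more files count as duplicate groups.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical scans of the same deed held by three departments.
scans = {
    "records/deed_001.tif": b"\x00\x01\x02\x03",
    "wills/deed_copy.tif": b"\x00\x01\x02\x03",  # byte-for-byte copy
    "pha/deed_rescan.tif": b"\x00\x01\x02\x04",  # rescan: one byte differs
}

exact_groups = group_exact_duplicates(scans)
print(exact_groups)
```

In this sketch the byte-for-byte copy is caught, while the rescan goes unflagged because a single differing byte produces a completely different digest.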
By one internal estimate cited during a 2024 City Council budget briefing, roughly 18 percent of the digital image holdings in the central archive at that time were functionally duplicate or near-duplicate files. The archive holds millions of items, meaning the redundant files numbered in the hundreds of thousands. Storage costs for city government data infrastructure have climbed steadily, with the city's overall IT operating expenditure crossing $120 million in fiscal year 2025 according to the published city budget.
The Remediation Effort and What Comes Next
The current cleanup program, which the Department of Records began piloting in the fourth quarter of 2024, combines perceptual hashing software (tools that can identify visually similar images even when the underlying files differ in resolution, compression, or size) with manual review by archival staff. The pilot focused first on the 1960s-era urban renewal photography collections tied to the old Redevelopment Authority, a set of images documenting demolition in neighborhoods like the Eastwick section of Southwest Philadelphia and blocks near the former Penn Central rail yards in North Philadelphia.
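The article does not say which perceptual hashing algorithm the Department of Records uses. As an illustration only, the sketch below implements a simple "average hash" in pure Python: it shrinks a grayscale image to an 8x8 grid, marks each cell as brighter or darker than the grid's mean, and compares two images by counting differing bits (the Hamming distance). The pixel grids, function names, and review threshold are all assumptions made for the example.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit average hash of a grayscale image.

    `pixels` is a 2D list of brightness values (0-255) whose height and
    width are multiples of `size` (a simplifying assumption).
    """
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // size, width // size
    cells = []
    for r in range(size):
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


# Hypothetical images: a scan, a higher-resolution and slightly brighter
# rescan of the same scene, and an unrelated image.
scan = [[x * 16 for x in range(16)] for _ in range(16)]
rescan = [[min(255, x * 8 + 3) for x in range(32)] for _ in range(32)]
other = [[y * 16 for _ in range(16)] for y in range(16)]

MAX_DISTANCE = 5  # assumed cutoff for sending a pair to manual review

d_rescan = hamming(average_hash(scan), average_hash(rescan))
d_other = hamming(average_hash(scan), average_hash(other))
print(d_rescan <= MAX_DISTANCE, d_other <= MAX_DISTANCE)
```

Here the rescan hashes identically to the original despite its different dimensions and brightness, while the unrelated image lands far outside the cutoff. A real pipeline would still need the manual review step, since pairs near the threshold can be either true duplicates or distinct images.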
The Free Library of Philadelphia's Digital Collections team has been coordinating with the Archives on best practices, given that the Library faced a comparable, if smaller, redundancy problem with its own neighborhood photograph holdings in 2022. That collaboration has helped establish a shared metadata standard that should prevent future duplications when collections are ingested from outside agencies.
Residents and genealogical researchers who use the Archives regularly — through both the Market Street walk-in facility and the online portal at phila.gov/departments/department-of-records — may notice improved search results as the cleanup progresses. The practical advice for anyone trying to access city historical images now: if a search returns what look like identical results, it's worth clicking through each, since the files may differ in resolution or associated metadata. The remediation effort is expected to reach the bulk of the pre-2010 photograph holdings by the end of calendar year 2026, though archival staff have cautioned that the more recent collections will take longer to process.