Philadelphia's Department of Records is sitting on a backlog of roughly 40,000 duplicate image files spread across at least three separate digital storage systems, according to city staff familiar with the remediation effort currently underway at the Municipal Services Building on JFK Boulevard. The problem didn't happen overnight. It is the product of nearly ten years of piecemeal scanning contracts, a mid-project switch in content management platforms, and at least two rounds of emergency budget cuts that stripped the quality-control staffing needed to catch redundant uploads before they compounded.
The timing matters because Philadelphia is in the middle of a broader push to modernise its public records infrastructure ahead of the 2026 municipal budget cycle. Duplicate files inflate storage costs, slow public-records search tools, and — critically — can cause confusion when archivists and researchers retrieve what they believe is a unique historical image only to discover it is one of several near-identical versions with conflicting metadata tags. For a city whose photographic holdings include building permit imagery from South Philadelphia rowhouse blocks, street-grid surveys from Kensington, and neighbourhood documentation going back to the mid-twentieth century, that confusion has real consequences for planning decisions and historical scholarship alike.
Where the Problem Started
The roots trace back to around 2016, when the city launched its first large-scale digitisation push under a contract awarded through the Philadelphia Water Department's facilities documentation programme and an adjacent initiative housed at the Free Library of Philadelphia's Parkway Central branch on Vine Street. The two projects used different scanning specifications and different file-naming conventions. When the city attempted to consolidate those collections into a single repository between 2019 and 2021, automated migration scripts pulled source files without deduplication checks. The result was a layered archive in which the same image sometimes exists in three versions (the original TIFF, a compressed JPEG derivative, and a second JPEG generated during the failed consolidation), each tagged with different creation dates and different department codes.
A subsequent shift to a new content management system in 2022, part of a citywide technology modernisation contract, introduced a fourth layer. Because the legacy platform exported metadata inconsistently, the new system treated files it had already ingested as new assets when staff attempted manual re-uploads to fill gaps. Nobody had a single authoritative file manifest against which to check.
The Philadelphia City Archives, which operates under the Department of Records and maintains holdings at its repository on Broad Street, flagged the duplication issue in an internal review completed in late 2024. That review identified storage costs running at roughly $180,000 annually for the combined digital holdings, a figure that staff believe could be reduced by at least 20 percent, or about $36,000 a year, once duplicate and derivative files are properly rationalised. The Archives has not released that review publicly.
The Path to Remediation
City staff began a structured deduplication project in early 2025 using open-source file-hashing tools adapted from a model piloted by the Temple University Libraries digital preservation team in North Philadelphia. The approach involves generating cryptographic hash values for every file in the archive, cross-referencing those values against a master index, and flagging matches for human review before any deletion is authorised. Deletion without human sign-off is a hard requirement, because not every duplicate is truly redundant — some apparent duplicates carry unique annotations or represent intentional format derivatives that must be retained under the city's own records retention schedule, last updated in January 2023.
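For readers curious about the mechanics, the hash-and-flag workflow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the city's or Temple's actual tooling: the choice of SHA-256 and the function names `sha256_of_file`, `build_index` and `flag_for_review` are assumptions made for the example. It also shows a limit of the technique: a cryptographic hash matches only byte-identical copies, so a TIFF and a JPEG derived from it would not be grouped together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root: Path) -> dict[str, list[Path]]:
    """Map each digest to every file under root that produces it."""
    index: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            index[sha256_of_file(path)].append(path)
    return dict(index)


def flag_for_review(index: dict[str, list[Path]]) -> list[list[Path]]:
    """Return groups of byte-identical files for a human to inspect.

    Nothing is deleted here; the function only reports candidates.
    """
    return [paths for paths in index.values() if len(paths) > 1]
```

Calling `flag_for_review(build_index(Path("scans")))` on a hypothetical `scans` folder would return one list per set of identical files. Keeping deletion out of the code entirely mirrors the human sign-off requirement.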
Progress has been slow. The project team amounts to fewer than five full-time-equivalent positions, and the July 4th holiday weekend has paused field work this week. Estimates from inside the department suggest the active deduplication phase could run through at least the first quarter of 2027 before a clean, consolidated image repository is operational.
For residents, the practical upshot is limited but real. Anyone requesting historical building photographs or neighbourhood survey images through the city's Right-to-Know portal may continue to receive inconsistently formatted files, or — in some cases — multiple versions of the same image. The department's records office has advised requesters to note in their submissions if they receive what appears to be a duplicate delivery, so staff can log it as part of the ongoing audit. That feedback loop, modest as it sounds, is currently one of the few mechanisms the city has for catching duplicates that automated hashing has not yet reached.