Authors: Zijian Luo, Xi Wu, Hong Jin Kang, Alan Fekete, Rahul Gopinath
Venue: ISSRE 2026
Modern data-processing pipelines rely on input records conforming to strict format specifications. In practice, however, data corruption can arise from many sources, including data-entry errors, faults during input processing and retransmission, inconsistent formatting, and incompatible specifications. Such corrupted data can result in loss of records, reducing the accuracy of processing.
Rather than discarding corrupted records and losing valuable information, one can attempt to repair the data. Data-repair solutions such as regular-expression-based repair and error-correcting parsers require a specification to perform structural repairs.
Specification-free techniques such as ddmax and εRepair are limited in both the repair operations they support and the locations they can repair, and they require specific parser properties that are often unavailable.
To tackle this challenge, we introduce βMax, a novel format-free data-repair algorithm that is optimal with respect to the provided example data, offering maximal data recovery while imposing minimal constraints on the parser.
Despite requiring less information than εRepair, βMax repairs 83% of all corrupt records, 1.77× the rate achieved by εRepair, while using 27.7× fewer oracle calls.
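
To make the specification-free setting and the notion of an oracle call concrete, the following minimal sketch treats a parser as a black-box accept/reject oracle and counts how often it is queried. It is a naive single-character deletion baseline written for illustration only; it is not ddmax, εRepair, or βMax. The names `make_oracle` and `greedy_delete_repair` are hypothetical, and Python's `json.loads` is assumed here as a stand-in for the pipeline's parser.

```python
import json


def make_oracle():
    """Wrap a parser as an accept/reject oracle that counts its own calls.

    json.loads is only an assumed stand-in for an arbitrary black-box parser.
    """
    calls = {"n": 0}

    def accepts(text: str) -> bool:
        calls["n"] += 1
        try:
            json.loads(text)
            return True
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return False

    return accepts, calls


def greedy_delete_repair(text: str, accepts):
    """Naive illustrative baseline: try deleting each single character in turn.

    Returns the first candidate the oracle accepts, or None if no
    single-character deletion yields an accepted input.
    """
    if accepts(text):
        return text
    for i in range(len(text)):
        candidate = text[:i] + text[i + 1:]
        if accepts(candidate):
            return candidate
    return None


if __name__ == "__main__":
    accepts, calls = make_oracle()
    corrupted = '{"a": 1]}'  # a stray ']' corrupts an otherwise valid record
    print(greedy_delete_repair(corrupted, accepts))
    print("oracle calls:", calls["n"])
```

Even on this nine-character record the baseline queries the parser once per candidate position, which suggests why the number of oracle calls is a natural cost measure for repair techniques that have no format specification to guide them.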