Document Recovery from Bag-of-Word Indices
Nathanael Fillmore, Andrew B. Goldberg, Xiaojin Zhu
Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a document's bag-of-words (BOW) vector or other type of index, reconstruct the original ordered document. We investigate a variety of index types, including count-based BOW vectors, stopwords-removed count BOW vectors, indicator BOW vectors, and bigram count vectors. We formulate the problem as hypothesis rescoring using A* search with the Google Web 1T 5-gram corpus. Our experiments on five domains indicate that if original documents are short, the documents can be recovered with high accuracy.
Download this report (PDF)
Return to tech report index