Computer Sciences Dept.

SLIC: On-The-Fly Extraction and Querying of Web Data

Robert McCann, Pedro DeRose, AnHai Doan, Raghu Ramakrishnan
2006

Increasingly, Web data is displayed in pages generated according to a template (e.g., product listings at amazon.com, faculty directories, class schedules). This trend makes structured querying of such Web data a valuable capability for a growing number of applications, including many ad hoc, exploratory, and short-lived tasks. Unfortunately, current methods for answering such queries require writing complex Perl scripts or creating customized wrappers and storing the extracted data in a DBMS, which is often overkill for these types of on-the-fly tasks. In this paper we propose SLIC, a solution to this problem. Given a set of Web pages generated according to some templates, SLIC allows the user to quickly pose SQL queries, obtain initial results, and then iterate with the system to get increasingly better results. At each step, SLIC asks relatively simple questions to solicit minimal structural information from the user in order to extract data and refine the answers. Extensive experiments on real-world domains show that for many practical queries (1) SLIC is significantly faster than current methods, and (2) the user needs to answer only a few relatively simple questions before obtaining useful answers. SLIC thus provides a promising first step toward a principled solution for on-the-fly extraction and querying of Web data.


