Effective early termination techniques for text similarity join operator

Ozalp, Selma; Ulusoy, Özgür

COMPUTER AND INFORMATION SCIENCES - ISCIS 2005, PROCEEDINGS, vol.3733, pp.791-793, 2005 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 3733
Publication Date: 2005
Journal Name: COMPUTER AND INFORMATION SCIENCES - ISCIS 2005, PROCEEDINGS
Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED)
Pages: pp.791-793
Affiliated with Çukurova University: No

Abstract

The text similarity join operator joins two relations if their join attributes are textually similar to each other. It has a variety of application domains, including integration and querying of data from heterogeneous resources, data cleansing, and data mining. Although the text similarity join operator is widely used, its processing is expensive because of the huge number of similarity computations performed. In this paper, we incorporate several shortcut evaluation techniques from the Information Retrieval domain, namely the Harman, quit, continue, and maximal similarity filter heuristics, into previously proposed text similarity join algorithms to reduce the number of similarity computations needed during the join operation. We experimentally evaluate the original and the heuristic-based similarity join algorithms using real data obtained from the DBLP Bibliography database, and observe performance improvements with the continue and maximal similarity filter heuristics.
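
To make the cost argument in the abstract concrete, the following is a minimal sketch of a threshold-based text similarity join with one generic early-termination check. It is an illustration under stated assumptions, not the algorithms or the Harman, quit, continue, or maximal similarity filter heuristics as defined in the paper: cosine similarity over whitespace tokens, a fixed similarity threshold, and all function names (`tf_vector`, `cosine`, `similarity_join`) are hypothetical choices made for this example.

```python
import math
from collections import Counter

def tf_vector(text):
    """Lower-case, whitespace-tokenize, and L2-normalize term frequencies."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def cosine(u, v):
    """Dot product of two normalized sparse vectors (iterate the shorter one)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def similarity_join(left, right, threshold):
    """Return (pairs, comparisons): pairs holds (i, j, sim) with sim >= threshold."""
    right_vecs = [tf_vector(r) for r in right]
    # Largest weight each term takes anywhere in the right relation.
    max_w = {}
    for v in right_vecs:
        for t, w in v.items():
            max_w[t] = max(max_w.get(t, 0.0), w)
    pairs, comparisons = [], 0
    for i, text in enumerate(left):
        u = tf_vector(text)
        # Upper bound on the similarity of u to ANY right tuple (assumed filter,
        # illustrative only): if even the bound misses the threshold, skip u.
        bound = sum(w * max_w.get(t, 0.0) for t, w in u.items())
        if bound < threshold:
            continue
        for j, v in enumerate(right_vecs):
            comparisons += 1
            sim = cosine(u, v)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs, comparisons

left = ["early termination for similarity join", "graph coloring heuristics"]
right = ["similarity join early termination", "query optimization in databases"]
pairs, comparisons = similarity_join(left, right, threshold=0.5)
print(pairs, comparisons)
```

Without the bound check, this toy input would need four similarity computations; with it, the second left tuple is skipped entirely because it shares no terms with the right relation, so only two are performed. A real implementation would typically use an inverted index and TF-IDF weights rather than the nested loop assumed here.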