Michael Bendersky (Data)

Experimental Data and Annotations

Some of the published material used annotated data that was not readily available from traditional sources such as TREC. When possible, I will publish this data here to promote the reproducibility of our research.

WIT: Wikipedia-based Image Text Dataset

The Wikipedia-based Image Text (WIT) dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models. More details on dataset usage can be found in this paper.

GWikiMatch (Long-Form Document Matching)

GWikiMatch is a benchmark dataset for long-form document matching. It contains Wikipedia article pairs that are human-labeled for similarity. The rating scale is 0 (not similar), 1 (somewhat similar) and 2 (strongly similar). In total, ~15K labeled pairs are available. More details on dataset usage can be found in this paper.
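As a rough illustration of how the three-point rating scale above might be handled in code, the following Python sketch models a labeled pair and collapses the graded rating into a binary label. The field names, the example titles, and the binarization threshold are all assumptions made for illustration; they are not taken from the released files or the paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class Similarity(IntEnum):
    """The three-point rating scale described above."""
    NOT_SIMILAR = 0
    SOMEWHAT_SIMILAR = 1
    STRONGLY_SIMILAR = 2


@dataclass(frozen=True)
class LabeledPair:
    # Hypothetical field names; the released files may use different ones.
    source_title: str
    target_title: str
    rating: Similarity


def to_binary(pair: LabeledPair,
              threshold: Similarity = Similarity.SOMEWHAT_SIMILAR) -> int:
    """Collapse the graded rating to 0/1. The default threshold is an assumption."""
    return int(pair.rating >= threshold)


# Made-up example pairs, not real dataset entries.
pairs = [
    LabeledPair("Article A", "Article B", Similarity(2)),
    LabeledPair("Article A", "Article C", Similarity(0)),
]
binary_labels = [to_binary(p) for p in pairs]
```

Raising the threshold to `Similarity.STRONGLY_SIMILAR` would count only the strongest matches as positives, which may suit stricter evaluation setups.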

Query Representation and Understanding Dataset (QRU-1)

The Query Representation and Understanding (QRU-1) data set contains a set of similar queries that can be used in web research such as query transformation and relevance ranking. The similar queries are related to existing benchmark data sets, such as TREC query sets. The data set was created by extracting 100 TREC queries, training a query-generation model and a commercial search engine, generating similar queries from the TREC queries with the model, and removing mistakenly generated queries. More details on dataset usage can be found in this paper.

Syntactic Annotation of Search Queries

M. Bendersky, W. B. Croft, D. A. Smith: "Structural Annotation of Search Queries Using Pseudo-Relevance Feedback" In Proceedings of CIKM 2010 [pdf]
M. Bendersky, W. B. Croft, D. A. Smith: "Joint Annotation of Search Queries" In Proceedings of ACL-HLT 2011 [pdf]
In these two papers, we annotated 250 search queries from a search log with capitalization, POS tagging and segmentation annotations. The annotations can be found in this tar.gz file.

Finding Text Reuse on the Web

M. Bendersky, W. B. Croft: "Finding Text Reuse on the Web" In Proceedings of WSDM 2009 [pdf]
In this paper, we used a set of 50 "text-reuse" queries. These queries were essentially sentence-long excerpts from news articles.
This text file provides a list of these queries. Each query is associated with a source date, to enable reproducing the source date detection results discussed in the paper.

Discovering Key Concepts in Verbose Queries

M. Bendersky, W. B. Croft: "Discovering Key Concepts in Verbose Queries" In Proceedings of SIGIR 2008 [pdf]
In this paper, we annotated 500 TREC "description" queries with a "key concept". The key concept is defined as a single noun phrase that best represents the information need underlying the query. These annotations were used to train a concept weighting method.
This tar.gz file contains the annotated key concepts, as well as the structured Indri queries containing the concept weights learned by our method. See README.txt in the file for details.
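To give a sense of what a structured Indri query with a weighted key concept can look like, here is a minimal Python sketch that builds one using Indri's `#weight` and `#combine` operators. The function name, the query layout, the example terms, and the weight value are all hypothetical; the queries in the archive may be structured differently, so README.txt remains the authoritative description.

```python
from typing import List


def indri_weighted_query(key_concept: str, key_weight: float,
                         other_terms: List[str]) -> str:
    """Build an Indri #weight query that boosts one key concept.

    Illustrative layout only; not necessarily the format used in the archive.
    """
    if not 0.0 <= key_weight <= 1.0:
        raise ValueError("key_weight must be in [0, 1]")
    # #combine treats the words of the noun phrase as a bag of words.
    concept_node = "#combine( " + " ".join(key_concept.split()) + " )"
    rest_node = "#combine( " + " ".join(other_terms) + " )"
    # The remaining weight mass goes to the rest of the query.
    return "#weight( {:.2f} {} {:.2f} {} )".format(
        key_weight, concept_node, 1.0 - key_weight, rest_node)


# Made-up concept, terms, and weight, for illustration only.
query = indri_weighted_query("spanish civil war", 0.8,
                             ["information", "support"])
print(query)
```

In a learned setting, the weight would come from the trained concept weighting method rather than being fixed by hand as it is here.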