- In the context of Information Retrieval, given the following documents:
Document 1: Your dataset is corrupt. Corrupted data does not hash!!!
Document 2: Your data system will transfer corrupted data files to trash.
Document 3: Most politicians are corrupt in many developing countries.
and the query:
Query 1: hashing corrupted data
a) Apply the following term manipulations on document terms: stoplist removal,
capitalisation and stemming, showing the transformed documents. Explain each of
these manipulations. Include in your answer the stoplist you used, making sure it
includes punctuation, but no content words. [20%]
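
A minimal Python sketch of how these three manipulations could be applied is given below; the stoplist and the suffix-stripping rules are illustrative choices only, not a prescribed answer.

    import re

    # Illustrative stoplist: punctuation marks and function words only, no content words.
    STOPLIST = {"your", "is", "does", "not", "will", "to", "are", "in", "many", "most",
                ".", ",", "!", "?"}

    def tokenise(text):
        # Split the text into word tokens and individual punctuation tokens.
        return re.findall(r"\w+|[^\w\s]", text)

    def stem(token):
        # Very crude suffix stripping, for illustration only; a Porter-style
        # stemmer would normally be used instead.
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        tokens = [t.lower() for t in tokenise(text)]       # capitalisation (case folding)
        tokens = [t for t in tokens if t not in STOPLIST]  # stoplist removal
        return [stem(t) for t in tokens]                   # stemming

    documents = {
        1: "Your dataset is corrupt. Corrupted data does not hash!!!",
        2: "Your data system will transfer corrupted data files to trash.",
        3: "Most politicians are corrupt in many developing countries.",
    }

    for doc_id, text in documents.items():
        print(doc_id, preprocess(text))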
b) Show how Document 1, Document 2 and Document 3 would be represented using an
inverted index which includes term frequency information. [10%]
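
A small sketch of one possible term-frequency inverted index is given below; the preprocessed token lists are an illustrative outcome of the part a) manipulations, and the exact stems depend on the stemmer chosen.

    from collections import Counter, defaultdict

    # Illustrative preprocessed documents (exact stems depend on the stemmer used).
    preprocessed = {
        1: ["dataset", "corrupt", "corrupt", "data", "hash"],
        2: ["data", "system", "transfer", "corrupt", "data", "file", "trash"],
        3: ["politician", "corrupt", "develop", "countri"],
    }

    # Inverted index: term -> {doc_id: term frequency in that document}.
    inverted_index = defaultdict(dict)
    for doc_id, tokens in preprocessed.items():
        for term, tf in Counter(tokens).items():
            inverted_index[term][doc_id] = tf

    for term in sorted(inverted_index):
        print(term, dict(inverted_index[term]))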
c) Using term frequency (TF) to weight terms, represent the documents and query as
vectors. Produce rankings of Document 1, Document 2 and Document 3 according
to their relevance to Query 1 using two metrics: Cosine Similarity and Euclidean
Distance. Show which document is ranked first according to each of these metrics.
[30%]
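
The sketch below shows one way the two rankings could be computed, again over illustrative preprocessed forms of the documents and the query.

    import math
    from collections import Counter

    # Illustrative preprocessed documents and query (exact stems depend on the stemmer).
    docs = {
        1: ["dataset", "corrupt", "corrupt", "data", "hash"],
        2: ["data", "system", "transfer", "corrupt", "data", "file", "trash"],
        3: ["politician", "corrupt", "develop", "countri"],
    }
    query = ["hash", "corrupt", "data"]   # "hashing corrupted data" after preprocessing

    vocab = sorted(set(t for toks in docs.values() for t in toks) | set(query))

    def tf_vector(tokens):
        counts = Counter(tokens)
        return [counts[t] for t in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    q = tf_vector(query)
    by_cosine = sorted(docs, key=lambda d: cosine(q, tf_vector(docs[d])), reverse=True)
    by_euclidean = sorted(docs, key=lambda d: euclidean(q, tf_vector(docs[d])))
    print("ranking by cosine similarity (best first):", by_cosine)
    print("ranking by Euclidean distance (best first):", by_euclidean)

Note that cosine similarity ranks higher values first, whereas Euclidean distance ranks lower values first.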
d) Explain the intuition behind using TF.IDF (term frequency inverse document
frequency) to weight terms in documents. Include the formula (or formulae) for
computing TF.IDF values as part of your answer. For the ranking in the previous
question using cosine similarity, discuss whether and how using TF.IDF to weight
terms instead of TF only would change the results (assume here that the document
collection consists solely of Documents 1–3). [20%]
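
One common TF.IDF variant weights term t in document d as w(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents in the collection and df(t) is the number of documents containing t. The sketch below applies this variant to illustrative preprocessed documents; other weighting variants exist.

    import math
    from collections import Counter

    # Illustrative preprocessed documents (exact stems depend on the stemmer used).
    docs = {
        1: ["dataset", "corrupt", "corrupt", "data", "hash"],
        2: ["data", "system", "transfer", "corrupt", "data", "file", "trash"],
        3: ["politician", "corrupt", "develop", "countri"],
    }

    N = len(docs)                       # number of documents in the collection
    df = Counter()                      # df(t): number of documents containing term t
    for tokens in docs.values():
        df.update(set(tokens))

    def tfidf(tokens):
        # w(t, d) = tf(t, d) * log(N / df(t)); one common variant among several.
        counts = Counter(tokens)
        return {t: counts[t] * math.log(N / df[t]) for t in counts}

    for doc_id, tokens in docs.items():
        print(doc_id, {t: round(w, 2) for t, w in tfidf(tokens).items()})

Under this variant, a term that occurs in every one of the three documents (such as the stem of "corrupt") receives idf = log(3/3) = 0 and so stops contributing to the cosine scores, which is relevant to the discussion of whether the ranking would change.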
e) Explain the metrics Precision, Recall and F-measure in the context of evaluating an
Information Retrieval system against a gold-standard set. Discuss why it is not feasible
to compute recall in the context of searches performed on very large collections of
documents, such as the Web. [20%]
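
For reference, the set-based definitions of these metrics can be computed as sketched below; the retrieved and relevant sets here are made-up examples.

    # Set-based Precision, Recall and F-measure against a gold-standard relevance set.
    def precision_recall_f(retrieved, relevant, beta=1.0):
        tp = len(retrieved & relevant)                 # relevant documents that were retrieved
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f

    # Made-up example sets, for illustration only.
    retrieved = {1, 2}      # documents returned by the system
    relevant = {1, 3}       # gold-standard relevant documents
    print(precision_recall_f(retrieved, relevant))   # (0.5, 0.5, 0.5)

Computing recall requires knowing the complete set of relevant documents in the collection, which is precisely what cannot be enumerated for a Web-scale collection.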