Information Retrieval: TF-IDF

Posted on:

Note: the content of this post is mainly notes taken from professor Stefano Mizzaro’s lessons on (Web) Information Retrieval at Uniud.

Note: Some terms used in this post are taken from the glossary.

Boolean model

Uses boolean operators to perform queries. The terms therefore can either be present in the document or not, dividing N in 2 parts.

sim(d_j, q) ϵ {0, 1}

Pros

  • it is very simple and clear to understand

Contra

  • it is not much easy to understand for non-technical users, for instance because in boolean logic the AND operator corresponds to an “or” in natural language
  • it does not perform any kind of ranking
  • partial matches are not possible

Vector space model

Draws the documents and the keywords in a t dimensional space. Each keyword represents a point in this hyperspace from which we can draw a vector that starts from the origin, which represents the document. The query is also one of the vectors. This allows to rank documents by similarity.

keywordpoint
documentvector
queryvector

This model makes possible to compare keywords and their relationship with documents by means of mathematical operations, such as the cos. Each vector is independent, it is not related to the other vectors.

Vector space model similarity function

sim(dj, q) ϵ [0, 1]

Conversely from the Boolean model, now each keyword has a weight wi,j, which is not trivial to compute. Introducing TF-IDF. TF-IDF means: Term Frequency - Inverse Document Frequency.

Term Frequency

Term Frequency formula

The number of occurrencies of a term in a document compared to the maximum frequency of a term in that document. It is therefore a relationship between the selected term and the term that occures more times in a document. It is therefore in the interval (0, 1).

Inverse Document Frequency

Inverse Document Frequency formula

Document Frequency (log argument flipped upside down): relationship between the number of documents that contain the selected term and the total number of documents. The inverse of the document frequency now comes natural.

log is uset to mitigate the importance of IDF, giving more importance to TF.

TF is different for every document. IDF is always the same.

The multiplication of TF and IDF gives the weight wi,j:

Vector model weight using TF-IDF