The information retrieval (IR) task deals with finding all documents relevant to a user query. Central concepts in IR are removing stop words from the corpus (the collection of all documents) and the query, stemming, lemmatization, representing the documents and the query as vectors, and using some measure of proximity or distance to determine which documents are relevant to the query.
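To make the preprocessing steps concrete, here is a minimal sketch. The tiny stop-word list and the naive suffix-stripping "stemmer" are illustrative stand-ins for a real library (e.g. NLTK's stopwords corpus and its PorterStemmer), not a production pipeline.

```python
# Illustrative preprocessing: lowercase, tokenize, drop stop words, stem.
# STOP_WORDS and naive_stem are crude stand-ins for real NLP tooling.

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def naive_stem(word: str) -> str:
    """Crude suffix stripping; a real system would use Porter or Snowball stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Lowercase, split on whitespace, remove stop words, stem the rest."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats are chasing the mice in the garden"))
# ['cat', 'chas', 'mice', 'garden']  (crude, but shows the pipeline)
```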
Each word in a document, after going through stemming and lemmatization, is called a term. Each unique term in the corpus is represented as one dimension in the document space. The vectors representing documents can therefore have more than 10,000 dimensions and suffer from high dimensionality. Since not all words occur in every document, document vectors are very sparse, and term frequencies tend to follow Zipf's distribution (a few terms are very common, most are rare).
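The sketch below (with a toy corpus chosen for illustration) shows why a sparse representation is natural: each unique term is a dimension, but each document touches only a handful of them, so storing just the nonzero counts per document is far more compact than a full vector.

```python
# Term-document representation: one dimension per unique term, but each
# document stores only its nonzero term counts (a dict per document).
from collections import Counter

docs = [
    "information retrieval finds relevant documents",
    "vector space models represent documents as vectors",
    "sparse vectors follow a zipf like distribution",
]

tokenized = [d.split() for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
print(f"{len(vocabulary)} dimensions for {len(docs)} documents")

# Sparse representation: term -> count, zeros omitted entirely.
sparse_vectors = [Counter(doc) for doc in tokenized]
print(sparse_vectors[0])
```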
The value of each term in the document vector can be binary (whether the term occurs in the document or not), a frequency (how often the term occurs in the document), or a term frequency-inverse document frequency (TF-IDF) weight. TF-IDF captures the notion that terms which uniquely characterize a document are given higher weight than terms occurring commonly across the corpus.
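Several TF-IDF weightings exist; the sketch below uses one common variant, raw term frequency times log(N / df), where N is the number of documents and df is the number of documents containing the term. Terms that appear in few documents get boosted, terms that appear everywhere get a weight near zero.

```python
# One common TF-IDF variant: tf * log(N / df). Rare, document-specific
# terms score high; corpus-wide common terms score near zero.
import math
from collections import Counter

def tfidf(tokenized_docs: list[list[str]]) -> list[dict[str, float]]:
    n = len(tokenized_docs)
    # df: in how many documents does each term appear at least once?
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "date"]]
for w in tfidf(docs):
    print(w)
# "apple" (in one doc) outweighs "banana" (in two docs).
```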
Both the documents in the corpus and the user query are represented as high-dimensional vectors in the document space. These vectors are collected into a matrix, with columns representing terms and rows representing documents. Finding the relevant documents is then equivalent to finding the row vectors most similar to the query vector.
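A small sketch of this stacking, using NumPy and an assumed toy vocabulary: each document becomes one row of the matrix, and the query is embedded in the same space so that retrieval reduces to comparing the query row against the document rows.

```python
# Term-document matrix: rows are documents, columns are terms. The query
# is mapped into the same space so it can be compared row-by-row.
import numpy as np

vocabulary = ["cat", "dog", "fish", "bird"]
term_index = {t: i for i, t in enumerate(vocabulary)}

def to_vector(tokens: list[str]) -> np.ndarray:
    v = np.zeros(len(vocabulary))
    for t in tokens:
        if t in term_index:
            v[term_index[t]] += 1
    return v

docs = [["cat", "dog"], ["fish", "bird"], ["cat", "cat", "fish"]]
A = np.vstack([to_vector(d) for d in docs])   # rows: documents, cols: terms
q = to_vector(["cat", "fish"])                # query in the same space
print(A.shape, q.shape)                       # (3, 4) (4,)
```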
To reduce the dimensionality of document vectors, a technique such as Latent Semantic Indexing (LSI), which is somewhat similar to PCA (principal component analysis applied to the covariance matrix), can be applied to the term-document frequency matrix.
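Here is a minimal LSI sketch via truncated SVD, assuming the document-row layout above and a toy matrix. Keeping only the top k singular values projects each document into a k-dimensional latent space, where terms that co-occur end up along shared dimensions.

```python
# LSI sketch: truncated SVD of the (document x term) frequency matrix.
# Keeping the top k singular values gives a k-dimensional latent space.
import numpy as np

A = np.array([            # toy document-term frequency matrix
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0, 0.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = U[:, :k] * s[:k]         # documents projected into k dimensions
print(docs_k)

# A query vector q is folded into the same space via q @ Vt[:k].T / s[:k],
# after which the usual similarity measures apply in k dimensions.
```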
Similarity can be calculated using the cosine, Jaccard, or Dice coefficients, some of the common similarity measures, to find the relevant documents.
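Sketches of the three measures just named: cosine over weighted vectors (here as term-to-weight dicts), and Jaccard and Dice over the binary view of a document, i.e. its set of terms.

```python
# Cosine similarity over weighted term vectors; Jaccard and Dice over
# term sets (the binary representation).
import math

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

doc = {"cat": 2.0, "fish": 1.0}
query = {"cat": 1.0, "dog": 1.0}
print(cosine(doc, query))                     # ~0.632
print(jaccard(set(doc), set(query)))          # 1/3
print(dice(set(doc), set(query)))             # 0.5
```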