Skip to content

TF-IDF

What is TF-IDF?

TF-IDF is like having TWO super-smart friends working together to help you find exactly what you're looking for!

  • Friend 1 (TF): "This query word appears a lot in this document!"
  • Friend 2 (IDF): "But wait... is this query word actually special or just common everywhere?"

TF-IDF = TF × IDF

It's the ULTIMATE combination that makes search engines actually work!

How It Works - The Formula

TF-IDF = TF × IDF

TF = (Word count in document) / (Total words in document)

**IDF(t) = log(N / df_t)**

Where:

- **N** = total number of documents in the corpus
- **df_t** = number of documents containing query term t
- **log** = logarithm (typically natural log or log base 10)

In Plain English:

TF-IDF gives HIGH scores to words that appear FREQUENTLY in a specific document but RARELY across all documents.

The Magic Formula Breakdown

High TF-IDF Score happens when:

  1. High TF - Word appears many times in the document
  2. High IDF - Word is rare across all documents
  3. Multiply them - Both conditions met!

Low TF-IDF Score happens when:

  1. Word doesn't appear in document (TF = 0)
  2. Word is super common like "the" (IDF ≈ 0)
  3. Either one being zero = final score is zero!

Key Insights

TF-IDF automatically filters out:

  • ❌ Common words: "the", "is", "and", "of", "in"
  • ❌ Irrelevant documents
  • ❌ Documents that just spam keywords

TF-IDF automatically promotes:

  • ✅ Meaningful, distinctive words
  • ✅ Documents where key terms appear frequently
  • ✅ Relevant, high-quality search results

Quick Summary

Remember the Formula:

TF-IDF = TF × IDF
       = (How often word appears in THIS doc) 
         × 
         (How rare word is ACROSS ALL docs)

The Golden Rules:

  1. High TF + High IDF = Very Relevant! 🌟🌟🌟
  2. High TF + Low IDF = Probably common word (the, is, and)
  3. Low TF + High IDF = Rare word but not in this doc
  4. Low TF + Low IDF = Not relevant at all

Think About It

Question: Why does TF-IDF work so well?

Answer: Because it mimics how humans think!

  • We care about words that appear OFTEN in a specific context (TF)
  • We ignore words that appear EVERYWHERE (low IDF)
  • We focus on what makes something UNIQUE and RELEVANT