The mathematical measure powering modern semantic search and AI retrieval
Cosine similarity is a metric used to measure how similar two vectors are in a multi-dimensional space. It calculates the cosine of the angle between the vectors, providing a value between -1 and 1 where:
- 1: the vectors point in exactly the same direction (angle = 0°)
- 0: the vectors are perpendicular (angle = 90°), indicating no similarity
- -1: the vectors point in exactly opposite directions (angle = 180°)
The cosine similarity between two vectors A and B is calculated as:
cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)
Where:

- A · B is the dot product of A and B
- ||A|| and ||B|| are the magnitudes (Euclidean norms) of vectors A and B
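For a quick worked example, take the vectors used in the NumPy code further down, A = [1, 2, 3] and B = [4, 5, 6]:

A · B = (1)(4) + (2)(5) + (3)(6) = 32
||A|| = sqrt(1² + 2² + 3²) = sqrt(14) ≈ 3.742
||B|| = sqrt(4² + 5² + 6²) = sqrt(77) ≈ 8.775
cosine_similarity(A, B) ≈ 32 / (3.742 × 8.775) ≈ 0.9746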
A value that close to 1 tells us the vectors are quite similar in direction.
Cosine similarity only considers the angle between vectors, not their magnitudes. This makes it ideal for comparing documents of different lengths in NLP.
The calculation is computationally efficient, especially when using optimized linear algebra libraries, making it practical for large-scale applications.
In high-dimensional spaces (like word embeddings), cosine similarity often captures semantic relationships better than Euclidean distance, because it ignores magnitude differences that can dominate raw distances.
The bounded range (-1 to 1) provides a standardized way to compare similarities across different vector pairs.
Here's how to implement cosine similarity using NumPy:
```python
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors"""
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# Example vectors
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Calculate similarity
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity:.4f}")
```
For text similarity using TF-IDF vectors:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Cosine similarity measures angle between vectors",
    "Vectors can represent documents in NLP",
    "Machine learning uses vector representations"
]

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate pairwise similarities
similarities = cosine_similarity(tfidf_matrix)
print("Pairwise cosine similarities:")
print(similarities)
```
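The same fitted vectorizer can also score a new query against the documents. A minimal sketch (the query string here is just an illustrative example, not from the original corpus):

```python
# Transform the query with the already-fitted vectorizer so it shares
# the same vocabulary, then compare it against every document.
query = "how do vectors represent documents"  # example query
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, tfidf_matrix)[0]

# Index of the most similar document
best = scores.argmax()
print(f"Most similar document: {documents[best]!r} (score={scores[best]:.4f})")
```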
Cosine similarity appears across a wide range of applications:

- Semantic search and RAG: used in retrieval-augmented generation architectures and search engines to find documents with meaning similar to a query, even when they share no exact keywords (see the sketch after this list).
- Recommendation systems: recommends similar items (products, movies, articles) by comparing vector representations of user preferences and items.
- Document clustering: groups similar documents together by measuring their pairwise cosine similarities.
- Image similarity: used in computer vision to find visually similar images by comparing their feature vectors.
- Question answering: retrieves the most relevant responses by comparing question vectors with potential answer vectors.
- Anomaly detection: flags unusual data points by measuring their cosine similarity to normal examples.
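A minimal sketch of that semantic-search pattern, assuming the document collection and the query have already been embedded (the random vectors below are stand-ins for real embedding-model output):

```python
import numpy as np

# Stand-in data: in practice these come from an embedding model.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 384))   # 1000 documents, 384-dim embeddings
query_embedding = rng.normal(size=384)

# Normalize so that a dot product equals cosine similarity.
doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding)

# Cosine similarity of the query against every document in one matrix-vector product.
scores = doc_norms @ query_norm

# Indices of the top-5 most similar documents, highest score first.
top_k = np.argsort(scores)[::-1][:5]
print(top_k, scores[top_k])
```

Pre-normalizing the document matrix once and reusing it for every query is the usual trick that keeps this fast at scale.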