Cosine Similarity
The mathematical measure powering modern semantic search and AI retrieval
What is Cosine Similarity?
Cosine similarity is a metric that measures how similar two vectors are in a multi-dimensional space. It is the cosine of the angle between the vectors, yielding a value between -1 and 1. (For vectors with only non-negative components, such as TF-IDF vectors, the value falls between 0 and 1.) The boundary cases are:
1 (Identical)
Vectors point in exactly the same direction (angle = 0°)
0 (Orthogonal)
Vectors are perpendicular (angle = 90°), no similarity
-1 (Opposite)
Vectors point in exactly opposite directions (angle = 180°)
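The three boundary cases above are easy to verify numerically; the vectors below are arbitrary illustrations:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
print(cos_sim(a, 2 * a))                 # same direction  -> 1.0 (up to rounding)
print(cos_sim(np.array([1.0, 0.0]),
              np.array([0.0, 1.0])))     # perpendicular   -> 0.0
print(cos_sim(a, -a))                    # opposite        -> -1.0 (up to rounding)
```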
Mathematical Definition
The cosine similarity between two vectors A and B is calculated as:
cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)
Where:
A · B is the dot product of A and B
||A|| and ||B|| are the magnitudes (Euclidean norms) of vectors A and B
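To make the formula concrete, here is the calculation carried out term by term for two small example vectors (chosen arbitrarily), using only the standard library:

```python
import math

A = [3, 4]
B = [4, 3]

# Dot product: 3*4 + 4*3 = 24
dot = sum(x * y for x, y in zip(A, B))

# Magnitudes: ||A|| = sqrt(9 + 16) = 5, ||B|| = sqrt(16 + 9) = 5
norm_A = math.sqrt(sum(x * x for x in A))
norm_B = math.sqrt(sum(x * x for x in B))

# Cosine similarity: 24 / (5 * 5) = 0.96
print(dot / (norm_A * norm_B))  # 0.96
```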
Visualizing Cosine Similarity
(Interactive visualization: for example, at an angle of 30° the cosine similarity is cos 30° ≈ 0.87, meaning the vectors point in broadly the same direction.)
Why Use Cosine Similarity?
Magnitude Invariance
Cosine similarity only considers the angle between vectors, not their magnitudes. This makes it ideal for comparing documents of different lengths in NLP.
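A quick sketch of this property: scaling a vector by any positive constant leaves the similarity unchanged. The vectors here are illustrative stand-ins for term-count vectors:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = np.array([1.0, 2.0, 0.0])   # e.g. term counts of a short document
long_doc = 10 * short_doc               # same proportions, ten times longer

other = np.array([2.0, 1.0, 1.0])
print(cos_sim(short_doc, other))  # identical values: scaling changes the
print(cos_sim(long_doc, other))   # magnitude but not the angle
```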
Efficient Computation
The calculation is computationally efficient, especially when using optimized linear algebra libraries, making it practical for large-scale applications.
Semantic Meaning
In high-dimensional spaces (such as word or sentence embeddings), cosine similarity often captures semantic relatedness better than Euclidean distance, which is sensitive to differences in vector magnitude rather than direction.
Normalized Output
The bounded range (-1 to 1) provides a standardized way to compare similarities across different vector pairs.
Practical Implementation
Calculating Cosine Similarity in Python
Here's how to implement cosine similarity using NumPy:
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        # Undefined for zero vectors; return 0 by convention
        return 0.0
    return dot_product / (norm_a * norm_b)

# Example vectors
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Calculate similarity
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity:.4f}")  # Cosine similarity: 0.9746
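When comparing one query vector against many stored vectors at once, the same formula vectorizes naturally. This sketch (with made-up data) normalizes the rows once, so every comparison reduces to a single dot product:

```python
import numpy as np

# A toy "database" of 4 vectors (rows) in 3 dimensions
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [1.0, 0.0, 0.0],
                   [-1.0, -2.0, -3.0]])
query = np.array([1.0, 2.0, 3.0])

# Normalize rows and query to unit length; one matrix-vector product
# then yields all cosine similarities in a single pass.
row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
similarities = (matrix / row_norms) @ (query / np.linalg.norm(query))
print(similarities)  # first entry ≈ 1.0 (same direction), last ≈ -1.0 (opposite)
```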
For text similarity using TF-IDF vectors:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Cosine similarity measures angle between vectors",
    "Vectors can represent documents in NLP",
    "Machine learning uses vector representations"
]

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate pairwise similarities: a 3x3 symmetric matrix with 1.0 on the diagonal
similarities = cosine_similarity(tfidf_matrix)
print("Pairwise cosine similarities:")
print(similarities)
Applications in AI and ML
Semantic Search
Used in RAG architectures and search engines to find documents with similar meaning to a query, even if they don't share exact keywords.
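As a toy illustration of the retrieval step (the embedding vectors and document names below are invented; a real system would obtain them from an embedding model):

```python
import numpy as np

# Hypothetical precomputed document embeddings (id -> vector)
doc_embeddings = {
    "refund_policy": np.array([0.9, 0.1, 0.2]),
    "shipping_info": np.array([0.2, 0.8, 0.3]),
    "contact_page": np.array([0.1, 0.2, 0.9]),
}
query_embedding = np.array([0.85, 0.15, 0.25])  # e.g. "how do I get my money back"

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank documents by cosine similarity to the query
ranked = sorted(doc_embeddings.items(),
                key=lambda kv: cos_sim(query_embedding, kv[1]),
                reverse=True)
print(ranked[0][0])  # refund_policy ranks highest
```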
Recommendation Systems
Helps recommend similar items (products, movies, articles) by comparing vector representations of user preferences and items.
Document Clustering
Enables grouping similar documents together by measuring their pairwise cosine similarities.
Image Retrieval
Used in computer vision to find visually similar images by comparing their feature vectors.
Chatbots & QA Systems
Helps retrieve the most relevant responses by comparing question vectors with potential answer vectors.
Anomaly Detection
Identifies unusual data points by measuring their cosine similarity to normal examples.
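A minimal sketch of the anomaly-detection idea, assuming normal behavior is summarized by a mean vector; all numbers and the threshold are invented for illustration:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Feature vectors of "normal" observations (hypothetical)
normal = np.array([[1.0, 2.0, 1.0],
                   [1.1, 2.1, 0.9],
                   [0.9, 1.9, 1.1]])
centroid = normal.mean(axis=0)

THRESHOLD = 0.9  # tunable; depends on the data

for point in [np.array([1.0, 2.0, 1.0]),    # typical observation
              np.array([5.0, -4.0, 0.5])]:  # points in a very different direction
    flag = "ANOMALY" if cos_sim(point, centroid) < THRESHOLD else "normal"
    print(flag)
```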