Cosine Similarity and Euclidean Distance

Photo by Markus Winkler on Unsplash
This is a quick and direct introduction to Euclidean distance and cosine similarity, with a focus on natural language processing (NLP).
Euclidean Distance
The Euclidean distance metric tells you how far apart two points or two vectors are.
Suppose you are a high school student taking three classes: math, philosophy, and psychology. You want to check how similar these classes are based on the words your teachers use in each one. To keep things simple, consider just two words: "theory" and "harmony". You can then build a table that records how often each word occurs in each class.

In this table, the word "theory" occurs 60 times in math class, 20 times in philosophy class, and 25 times in psychology class, whereas the word "harmony" occurs 10, 40, and 70 times in math, philosophy, and psychology classes respectively. Let's translate these counts into a 2D graph, with each class plotted as the point (theory, harmony).

The Euclidean distance is simply the straight-line distance between two points, marked in the plot as d1 and d2.

Visually, d1, the distance between psychology and philosophy, is smaller than d2, the distance between philosophy and math. But how do we actually calculate d1 and d2?
The general formula for two n-dimensional vectors v and w is:
d(v, w) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}

In our case, for d1, d(v, w) = d(philosophy, psychology), which is:
d_{1} = \sqrt{(20 - 25)^2 + (40 - 70)^2} = \sqrt{925} \approx 30.41

And for d2, d(v, w) = d(philosophy, math):
d_{2} = \sqrt{(20 - 60)^2 + (40 - 10)^2} = \sqrt{2500} = 50

As expected, d2 > d1.
How do we do this in Python?
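import numpy as np

# define the vectors as (theory count, harmony count)
math = np.array([60, 10])
philosophy = np.array([20, 40])
psychology = np.array([25, 70])

# calculate d1: distance between philosophy and psychology
d1 = np.linalg.norm(philosophy - psychology)

# calculate d2: distance between philosophy and math
d2 = np.linalg.norm(philosophy - math)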
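As a cross-check on the NumPy call, the general formula can also be written out by hand. The following is only a minimal sketch using the standard library; the helper name `euclidean_distance` and the variable name `math_class` are illustrative choices, not part of the original example.

```python
import math


def euclidean_distance(v, w):
    """Square root of the summed squared coordinate differences."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


# Word counts as (theory, harmony), taken from the example above.
math_class = (60, 10)
philosophy = (20, 40)
psychology = (25, 70)

d1 = euclidean_distance(philosophy, psychology)
d2 = euclidean_distance(philosophy, math_class)
print(f"d1 = {d1:.2f}, d2 = {d2:.2f}")  # d1 = 30.41, d2 = 50.00
```

The printed values match the hand calculation: d1 is the square root of 925 and d2 is exactly 50.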
Cosine Similarity
Now suppose you have only 2 hours of psychology class per week, compared with 5 hours each of math and philosophy. Because you spend more time in those two classes, the words "theory" and "harmony" occur more often there than in psychology. The updated counts, written as (theory, harmony), are math (80, 45), philosophy (50, 60), and psychology (15, 20).

And the updated 2D graph:

Using the Euclidean distance formula from earlier, we would conclude that d1 is greater than d2 in this case. Yet psychology is still closer in content to philosophy than it is to math. The difference in class hours inflates the raw word counts and misleads the Euclidean distance metric. Cosine similarity is one way to alleviate this problem.
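The claim that d1 now exceeds d2 can be checked directly. This is a small sketch that uses `math.dist` from the standard library (Python 3.8 or later) with the updated counts from this article's cosine example; the name `math_class` is an illustrative choice.

```python
import math

# Updated word counts as (theory, harmony).
math_class = (80, 45)
philosophy = (50, 60)
psychology = (15, 20)

d1 = math.dist(philosophy, psychology)  # philosophy vs. psychology
d2 = math.dist(philosophy, math_class)  # philosophy vs. math

print(f"d1 = {d1:.2f}, d2 = {d2:.2f}")  # d1 = 53.15, d2 = 33.54
print("d1 > d2:", d1 > d2)              # d1 > d2: True
```

With these counts the ordering flips: philosophy now looks nearer to math than to psychology, even though only the number of class hours changed.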
Instead of measuring the straight-line distance between two points, cosine similarity looks at the angle between their vectors.

Zooming in on the graph, we can see that angle α is smaller than angle β. That is all cosine similarity cares about: the smaller the angle, the more similar the two vectors.
The general formula is:
\cos(\beta) = \frac{v \cdot w}{\lVert v \rVert \, \lVert w \rVert}

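Here β is the angle between the vectors philosophy (v) and math (w). With the counts (50, 60) and (80, 45):
\cos(\beta) = \frac{50 \cdot 80 + 60 \cdot 45}{\sqrt{50^2 + 60^2} \, \sqrt{80^2 + 45^2}} \approx 0.93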

By contrast, cos(alpha) = 0.99, which is higher than cos(beta). This means philosophy is closer to psychology than it is to math.
Recall that
\cos(0^\circ) = 1
and
\cos(90^\circ) = 0

So the smaller the angle, the higher the cosine similarity, and the higher the cosine similarity, the more similar the two vectors.
Python implementation
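import numpy as np

# updated vectors as (theory count, harmony count)
math = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

# cosine of beta, the angle between philosophy and math
cos_beta = np.dot(philosophy, math) / (np.linalg.norm(philosophy) * np.linalg.norm(math))
print(cos_beta)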
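The snippet above computes only cos(beta). To compare both angles side by side, a sketch along the following lines may help. It uses only the standard library; the helper name `cosine_similarity` and the variable name `math_class` are illustrative, not from the original article.

```python
import math


def cosine_similarity(v, w):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))


# Updated word counts as (theory, harmony).
math_class = (80, 45)
philosophy = (50, 60)
psychology = (15, 20)

cos_alpha = cosine_similarity(philosophy, psychology)  # philosophy vs. psychology
cos_beta = cosine_similarity(philosophy, math_class)   # philosophy vs. math

# Convert to degrees; min() guards against rounding slightly above 1.
alpha_deg = math.degrees(math.acos(min(1.0, cos_alpha)))
beta_deg = math.degrees(math.acos(min(1.0, cos_beta)))

print(f"cos(alpha) = {cos_alpha:.4f}, alpha = {alpha_deg:.1f} degrees")
print(f"cos(beta)  = {cos_beta:.4f}, beta  = {beta_deg:.1f} degrees")
```

The angle between philosophy and psychology comes out much smaller than the angle between philosophy and math, which is the ordering the text describes.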
Takeaway
By now you should have a clear idea of how Euclidean distance and cosine similarity work. The former measures the straight-line distance between two points, while the latter measures the angle between two vectors.
Euclidean distance is simpler and works well when the features are on comparable scales, for example when every document or class contributes a similar volume of text. In practice, data is often unbalanced, and in those cases cosine similarity is usually the better choice.
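One way to see why cosine similarity copes better with unbalanced data is to rescale a vector and watch which metric moves. The sketch below multiplies the psychology counts by 2.5, a hypothetical factor standing in for 5 hours instead of 2; it assumes word counts grow proportionally with class time, which is a simplification. The helper name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(v, w):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))


philosophy = (50, 60)
psychology = (15, 20)

# Hypothetical rescaling: as if psychology also had 5 hours instead of 2.
scale = 5 / 2
psychology_scaled = tuple(scale * x for x in psychology)

dist_before = math.dist(philosophy, psychology)
dist_after = math.dist(philosophy, psychology_scaled)
cos_before = cosine_similarity(philosophy, psychology)
cos_after = cosine_similarity(philosophy, psychology_scaled)

print(f"Euclidean: {dist_before:.2f} -> {dist_after:.2f}")
print(f"Cosine:    {cos_before:.4f} -> {cos_after:.4f}")
```

The Euclidean distance changes substantially after rescaling, while the cosine similarity stays the same up to floating-point rounding, because scaling a vector changes its length but not its direction.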
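This article has looked at when to use Euclidean distance and when to use cosine similarity.
Euclidean distance measures dissimilarity as the straight-line distance between two points in n-dimensional space.
Cosine similarity, by contrast, measures how similar two non-zero vectors are by computing the cosine of the angle between them.
Which metric is appropriate depends on the characteristics of the data and the needs of the application.
It is also worth understanding how each choice affects the performance of a machine learning model.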
