By 苏剑林 | December 07, 2022
Recently, I learned a new concept called "Geodesic Distance" from the paper 《Unsupervised Opinion Summarization Using Approximate Geodesics》, which I found quite interesting and would like to share with you.
To me, what is "new" is not the concept of geodesic distance itself (which I encountered back when I studied Riemannian geometry), but how the field of semantic similarity can cleverly construct geodesic distances and apply them in certain scenarios. If we want to be fancy, we could even call it "semantic similarity on a manifold"; doesn't that instantly sound much more advanced?
First, let's briefly summarize the main content of the original paper. As the title suggests, the theme of the paper is summarization. Typically, unsupervised summarization is done like this: assume an article consists of $n$ sentences $t_1, t_2, \cdots, t_n$, and design a scoring function $s(t_i)$ for each sentence (classically tf-idf and its variants), then pick several sentences with the highest scores as the summary. Of course, the paper does not just do simple summarization, but "Opinion Summarization." This "Opinion" can be understood as a predefined theme or center $c$; the summary should lean towards extracting sentences related to $c$, so the scoring function should also be related to $c$, i.e., $s(t_i, c)$.
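As a concrete reference point, here is a minimal sketch of the classic tf-idf-style baseline mentioned above: score each sentence by its average tf-idf weight and keep the top-scoring ones. The example sentences and the use of scikit-learn are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of tf-idf-based extractive summarization (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [  # stand-ins for t_1, ..., t_n
    "The battery life of this phone is excellent.",
    "I bought it last week at a local store.",
    "Battery performance stays strong even after heavy use.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)  # (n_sentences, vocab_size)
scores = np.asarray(tfidf.mean(axis=1)).ravel()     # s(t_i): mean tf-idf weight
summary_ids = np.argsort(scores)[::-1][:2]          # indices of the top-2 sentences
```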
Ever since we entered the era of "Everything is an Embedding," a mainstream way to design $s(t_i, c)$ has been to encode both the sentence $t_i$ and the theme $c$ into corresponding sentence vectors $\boldsymbol{v}(t_i), \boldsymbol{v}(c)$, and then use the inverse of some distance as the scoring function:
\begin{equation}s(t_i, c) = \frac{1}{d(\boldsymbol{v}(t_i), \boldsymbol{v}(c))}\end{equation}In this design, both the sentence-encoding model $\boldsymbol{v}(\cdot)$ and the distance function $d(\cdot, \cdot)$ are design choices, and the original paper makes contributions on both. $\boldsymbol{v}(\cdot)$ is not the focus of this article, so it will be skipped; interested readers can refer to the original paper. As for $d(\cdot, \cdot)$, the paper's contribution is to replace the common simple distances with the "geodesic distance" that is the subject of this article.
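A minimal sketch of this scoring function, with Euclidean distance standing in for the generic $d(\cdot, \cdot)$ and random vectors standing in for the outputs of the (unspecified) encoder $\boldsymbol{v}(\cdot)$:

```python
import numpy as np

def score(v_sentence, v_theme, eps=1e-8):
    # s(t_i, c) = 1 / d(v(t_i), v(c)); eps guards against division by zero.
    return 1.0 / (np.linalg.norm(v_sentence - v_theme) + eps)

# Toy usage: rank five "sentence vectors" against a "theme vector" and keep
# the two highest-scoring sentences as the opinion summary.
sent_vecs = np.random.randn(5, 8)  # stand-ins for v(t_1), ..., v(t_5)
theme_vec = np.random.randn(8)     # stand-in for v(c)
scores = [score(v, theme_vec) for v in sent_vecs]
summary_ids = np.argsort(scores)[::-1][:2]
```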
Why use geodesic distance? This starts with how we train sentence vectors.
Sentence vectors can be learned either supervised or unsupervised. Taking supervised learning as an example, it generally involves contrastive learning with positive and negative sample pairs (see 《CoSENT (1): A more effective sentence vector scheme than Sentence-BERT》). A positive sample pair is two sentences labeled as having essentially the same semantics; we take their similarity to be very high, or equivalently their distance to be very small. The problem lies with the negative pairs. Two sentences with different semantics might be deliberately annotated hard negatives, or they might simply be two irrelevant samples picked at random. In principle, these two cases should be assigned different distances, but in practice both are given the same label: "negative."
This leads to the following situation: distances computed from sentence vectors are reliable only for sentences that are semantically close; for sentences with large semantic gaps, the distance values can distinguish positive from negative samples but cannot be reliably compared against one another. For example, we can say that a distance of $1$ indicates more similarity than a distance of $2$, and likewise that $1$ indicates more similarity than $10$, but we cannot confidently say that a distance of $10$ indicates more similarity than $11$, because once the distance is large, its absolute value is no longer accurate.
In retrieval scenarios, one usually seeks to recall samples with very high similarity (i.e., very small distance), so directly using a simple distance function $d(\cdot, \cdot)$ is fine. However, the "Opinion Summarization" scenario of the original paper requires the distance $d(\boldsymbol{v}(t_i), \boldsymbol{v}(c))$ between a "sentence" and a "theme," and the similarity between a sentence and a theme is not necessarily high (the distance tends to be large). In other words, the task requires relative comparisons in precisely the large-distance regime, which is where geodesic distance comes in.
Geodesic distance, simply put, is the shortest distance between two points along the manifold. Since a manifold is generally not flat, this is not necessarily the straight-line (Euclidean) distance between the two points. A classic example is traveling from the South Pole to the North Pole: we cannot walk in a straight line through the center of the earth, but must travel along the earth's surface, past the equator and on to the North Pole, following a curved (semicircular) path.
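To make the contrast concrete, here is a tiny worked example on a unit sphere: the straight-line (chord) distance between the poles is $2$, while the geodesic distance along the surface is $\pi$. This is just the textbook computation, not anything from the paper.

```python
import numpy as np

south = np.array([0.0, 0.0, -1.0])  # South Pole on the unit sphere
north = np.array([0.0, 0.0, 1.0])   # North Pole

chord = np.linalg.norm(north - south)  # Euclidean distance through space: 2.0
angle = np.arccos(np.clip(np.dot(south, north), -1.0, 1.0))
geodesic = 1.0 * angle                 # arc length = radius * angle = pi

print(chord, geodesic)  # 2.0  3.14159...
```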
In a local area (where distances are small), the earth's surface is approximately flat, so Euclidean distance is still usable. But applied to large distances like "South Pole to North Pole" or "South Pole to Equator," it is no longer accurate. This closely mirrors the semantic similarity scenario above: the available distance (such as Euclidean distance) is accurate at close range but inaccurate at long range, essentially because the manifold is not flat.
Fortunately, local distances are enough. We can turn this into a graph problem and use a "shortest path" algorithm to estimate an approximate geodesic distance. Specifically, we use the existing distance function to compute the distance between every pair of points, but for each point we retain only its $k$ nearest neighbors (alternatively, truncate by a distance threshold, depending on the situation), connecting each retained pair with an edge weighted by their distance. All points and edges then form a weighted graph (a "$k$-nearest neighbor graph"), on which we can run Dijkstra's algorithm to find the shortest path between any two points; the length of that path is the approximate geodesic distance.
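A minimal sketch of this approximation, assuming the sentence vectors arrive as an array `vecs`; the choice of $k$, the Euclidean edge weights, and the scikit-learn/SciPy helpers are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

def approx_geodesic_distances(vecs, k=10):
    # Build the k-nearest-neighbor graph: each point keeps edges only to its
    # k nearest neighbors, weighted by their (Euclidean) distance.
    knn = kneighbors_graph(vecs, n_neighbors=k, mode="distance")
    # Run Dijkstra from every node on the (undirected) graph; the resulting
    # shortest-path lengths approximate the geodesic distances.
    return dijkstra(knn, directed=False)

# Toy usage: 100 random 8-dim "sentence vectors".
vecs = np.random.randn(100, 8)
D = approx_geodesic_distances(vecs, k=10)
# D[i, j] is the approximate geodesic distance; it can be inf if the
# k-NN graph leaves i and j in disconnected components.
```

Note that $k$ must be large enough to keep the graph connected; pairs of points in disconnected components come back with infinite distance.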
In summary, under the assumption that "distances between nearby points are accurate, while distances between distant points are not," we can use the $k$-nearest neighbor graph plus shortest paths to estimate the geodesic distance between distant points and use it as a substitute. Since the geodesic distance takes the manifold structure of the vector space into account, it **potentially** yields better results (see Table 8 of the original paper).
This article shared the concept of "Geodesic Distance," pointing out that the similarity measures we usually train may only be reliable locally, and that replacing them with geodesic distance at longer ranges may help with certain semantic similarity problems.