官术网_书友最值得收藏!

Non-Euclidean distances

A non-Euclidean distance is based on the properties of the elements, but not on their location in space. Some well known distances are Jaccard distance, cosine distance, edit distance, and Hamming distance.

Jaccard distance is used to compute the distance between two sets. First, we compute the Jaccard similarity of two sets as the size of their intersection divided by the size of their union, as follows:

The Jaccard distance is then defined as per the following formula:

Cosine distance between two vectors focuses on the orientation and not magnitude, therefore, two vectors with the same orientation have a cosine similarity of 1, while two perpendicular vectors have a cosine similarity of 0. Supposing that we have two multidimensional points, think of a point as a vector from origin (0, 0, ..., 0) to its location. Two vectors make an angle, whose cosine distance is a normalized dot-product of the vectors, as follows:

Cosine distance is commonly used in a high-dimensional feature space; for instance, in text mining, where a text document represents an instance, features that correspond to different words, and their values correspond to the number of times the word appears in the document. By computing cosine similarity, we can measure how likely two documents match in describing similar content.

Edit distance makes sense when we compare two strings. The distance between the a=a1,a2,a3,...an and b=b1,b2,b3,...bn strings is the smallest number of the insert/delete operation of single characters required to convert the string from a to b, for example, a = abcd and b = abbd. To convert a into b, we have to delete the second b and insert c in its place. No smallest number of operations would convert a into b, hence the distance is d(a, b) =2.

Hamming distance compares two vectors of the same size and counts the number of dimensions in which they differ. In other words, it measures the number of substitutions required to convert one vector into another.

There are many distance measures focusing on various properties, for instance, correlation measures the linear relationship between two elements; Mahalanobis distance measures the distance between a point and distribution of other points and SimRank, which is based on graph theory, measures similarity of the structure in which elements occur, and so on. As you can already imagine, selecting and designing the right similarity measure for your problem is more than half of the battle. An impressive overview and evaluation of similarity measures is collected in Chapter 2, Similarity and Dissimilarity Measures, in the book Image Registration: Principles, Tools and Methods by A. A. Goshtasby, Springer Science and Business Media (2012).

主站蜘蛛池模板: 保山市| 彩票| 凉山| 河西区| 吴堡县| 甘谷县| 增城市| 邓州市| 锡林郭勒盟| 太仓市| 宣武区| 浮梁县| 昭通市| 凌海市| 化州市| 鄂伦春自治旗| 襄汾县| 武功县| 佛冈县| 蓬溪县| 澳门| 阿克苏市| 三台县| 霍邱县| 长寿区| 河北省| 松溪县| 璧山县| 澎湖县| 张北县| 都匀市| 买车| 亳州市| 铅山县| 莎车县| 和顺县| 昔阳县| 甘德县| 夏河县| 措勤县| 高雄县|