Measuring the relatedness of posts
From the machine learning point of view, raw text is useless. Only if we manage to transform it into meaningful numbers can we feed it into our machine learning algorithms, such as clustering. The same is true for more mundane operations on text, such as similarity measurement.
How not to do it
One text similarity measure is the Levenshtein distance, which also goes by the name edit distance. Let's say we have two words, "machine" and "mchiene". The similarity between them can be expressed as the minimum number of edits that are necessary to turn one word into the other. In this case, the edit distance is 2, as we have to add an "a" after the "m" and delete the first "e". This algorithm is, however, quite costly, as its runtime is bounded by the product of the lengths of the two words.
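To make this concrete, here is a minimal dynamic-programming sketch of the Levenshtein distance (illustrative only; the function name is our own choice). The nested loops make the quadratic cost obvious:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences.

    Works on any sequences, so it can compare two strings character by
    character, or two lists of words word by word.
    """
    # previous[j] = distance between the first i-1 items of a and the first j items of b
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(edit_distance("machine", "mchiene"))  # 2
```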
Looking at our posts, we could cheat by treating whole words as if they were single characters and performing the edit distance calculation at the word level. Let's say we have two posts (let's concentrate on their titles for the sake of simplicity), "How to format my hard disk" and "Hard disk format problems"; we would have an edit distance of five (for example, replacing "how", "to", and "my" with "hard", "disk", and "problems", keeping "format", and deleting the trailing "hard" and "disk"). Therefore, one could express the difference between two posts as the number of words that have to be added, deleted, or replaced so that one text morphs into the other. Although we could speed up the overall approach quite a bit this way, the time complexity stays the same.
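The same sketch can be reused at the word level simply by passing in lists of words instead of strings (again, purely illustrative):

```python
post1 = "How to format my hard disk".lower().split()
post2 = "Hard disk format problems".lower().split()

# Every word is now treated as a single, indivisible symbol.
print(edit_distance(post1, post2))  # 5
```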
Even if it were fast enough, there is another problem. In the posts above, the word "format" alone accounts for an edit distance of two (deleting it at one position, then adding it at another). So, our distance does not seem to be robust enough to take word reordering into account.
How to do it
More robust than edit distance is the so-called bag-of-words approach. It uses simple word counts as its basis. For each word in the post, its number of occurrences is counted and noted in a vector. Not surprisingly, this step is also called vectorization. The vector is typically huge, as it contains as many elements as there are words occurring in the whole dataset. Take, for instance, two example posts with the following word counts:
Word        Post 1    Post 2
disk        1         1
format      1         1
hard        1         1
how         1         0
my          1         0
problems    0         1
to          1         0
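Such count vectors could, for example, be produced with scikit-learn's CountVectorizer; the following lines are only a sketch of that idea, with illustrative parameters:

```python
from sklearn.feature_extraction.text import CountVectorizer

posts = ["How to format my hard disk", "Hard disk format problems"]

# The vectorizer lowercases and tokenizes the posts, builds the vocabulary,
# and returns one count vector per post.
vectorizer = CountVectorizer(min_df=1)
counts = vectorizer.fit_transform(posts)

print(vectorizer.get_feature_names_out())  # get_feature_names() in older versions
print(counts.toarray())  # one row of counts per post, one column per word
```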
The columns Post 1 and Post 2 can now be treated as simple vectors. We could simply calculate the Euclidean distance between the vectors of all posts and take the nearest one, but that is too slow, as we have just found out (a small sketch of this naive distance calculation follows the list below). Instead, we will use them later as feature vectors in the following clustering steps:
- Extract the salient features from each post and store them as a vector per post.
- Compute clustering on the vectors.
- Determine the cluster for the post in question.
- From this cluster, fetch a handful of posts that are different from the post in question. This will increase diversity.
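As promised above, here is a small illustration of the naive approach: computing the Euclidean distance between count vectors directly with NumPy. The vectors and the dist_raw name are assumptions made for this sketch:

```python
import numpy as np

# Count vectors over the vocabulary (disk, format, hard, how, my, problems, to)
posts = {
    "How to format my hard disk": np.array([1, 1, 1, 1, 1, 0, 1]),
    "Hard disk format problems":  np.array([1, 1, 1, 0, 0, 1, 0]),
}

def dist_raw(v1, v2):
    """Plain Euclidean distance between two count vectors."""
    return np.linalg.norm(v1 - v2)

# Naive nearest-post lookup: compare the new post against every stored post.
new_post = np.array([1, 1, 1, 0, 0, 0, 0])  # e.g. the counts of "hard disk format"
nearest = min(posts, key=lambda title: dist_raw(posts[title], new_post))
print(nearest)  # "Hard disk format problems" is the closest post
```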
However, there is some more work to be done before we get there, and before we can do that work, we need some data to work on.