官术网_书友最值得收藏!

Sets

Sets are unordered collections of elements, with the additional restriction that the elements must be unique.  The main use-cases where sets are a good choice are membership tests (testing if an element is present in the collection) and, unsurprisingly, set operations such as union, difference, and intersection.

In Python, sets are implemented using a hash-based algorithm just like dictionaries; therefore, the time complexities for addition, deletion, and test for membership scale as O(1) with the size of the collection.

Sets contain only unique elements. An immediate use case of sets is the removal of duplicates from a collection, which can be accomplished by simply passing the collection through the set constructor, as follows:

    # create a list that contains duplicates
x = list(range(1000)) + list(range(500))
# the set *x_unique* will contain only
# the unique elements in x
x_unique = set(x)

The time complexity for removing duplicates is O(N), as it requires to read the input and put each element in the set.

Sets expose a number of operations like union, intersection, and difference. The union of two sets is a new set containing all the elements of both the sets; the intersection is a new set that contains only the elements in common between the two sets, and the difference is a new set containing the element of the first set that are not contained in the second set. The time complexities for these operations are shown in the following table. Note that since we have two different input sizes, we will use the letter S to indicate the size of the first set (called s), and T to indicate the size of the second set (called t):

An application of set operations are, for example, Boolean queries. Going back to the inverted index example of the previous subsection, we may want to support queries that include multiple terms. For example, we may want to search for all the documents that contain the words cat and table. This kind of a query can be efficiently computed by taking the intersection between the set of documents containing cat and the set of documents containing table.

In order to efficiently support those operations, we can change our indexing code so that each term is associated to a set of documents (rather than a list). After applying this change, calculating more advanced queries is a matter of applying the right set operation. In the following code, we show the inverted index based on sets and the query using set operations:

    # Building an index using sets
index = {}
for i, doc in enumerate(docs):
# We iterate over each term in the document
for word in doc.split():
# We build a set containing the indices
# where the term appears
if word not in index:
index[word] = {i}
else:
index[word].add(i)

# Querying the documents containing both "cat" and "table"
index['cat'].intersection(index['table'])
主站蜘蛛池模板: 龙川县| 蕲春县| 万州区| 阆中市| 于都县| 新竹县| 抚宁县| 花垣县| 阿拉善右旗| 衡水市| 平乐县| 鄂伦春自治旗| 昌图县| 烟台市| 南投市| 黄梅县| 洪江市| 麟游县| 明星| 通河县| 岢岚县| 贵溪市| 九龙坡区| 湄潭县| 香河县| 文化| 临猗县| 宁安市| 马鞍山市| 安乡县| 濮阳市| 仁化县| 勐海县| 温泉县| 玛曲县| 五大连池市| 南京市| 周宁县| 信阳市| 柳州市| 勃利县|