Using reduce to transform collections

Sequences and functions that operate on sequences preserve the sequential ordering between elements. Lazy sequences avoid the unnecessary realization of elements in a collection until they are required for a computation, but the realization of these values is still performed in a sequential manner. However, this sequential ordering may not be desirable for all computations performed over a collection. For example, it's not possible to map a function over a vector and then lazily realize values in the resulting collection out of order, since the map function converts the supplied collection into a sequence. Also, functions such as map and filter are lazy, but still sequential by nature.

What's wrong with sequences?

One of the limitations of sequences is that they are realized in chunks. Let's study a simple example to illustrate what this means. Consider a unary function, as shown in Example 3.1, which we intend to map over a given vector. The function must compute a value from the one it is supplied, and also perform a side effect so that we can observe its application over the elements in a collection.

Note

The following examples can be found in src/m_clj/c3/reducers.clj of the book's source code.

(defn square-with-side-effect [x]
  (do
    (println (str "Side-effect: " x))
    (* x x)))

Example 3.1: A simple unary function

The square-with-side-effect function simply returns the square of a number x using the * function. This function also prints the value of x using a println form whenever it is called. Suppose this function is mapped over a given vector. The resulting collection will be realized completely as soon as a computation is performed over it, even if not all of its elements are required. This can be demonstrated as follows:

user> (def mapped (map square-with-side-effect [0 1 2 3 4 5]))
#'user/mapped
user> (reduce + (take 3 mapped))
Side-effect: 0
Side-effect: 1
Side-effect: 2
Side-effect: 3
Side-effect: 4
Side-effect: 5
5

As shown previously, the mapped variable contains the result of mapping the square-with-side-effect function over a vector. If we try to sum the first three values in the resulting collection using the reduce, take, and + functions, all the values in the [0 1 2 3 4 5] vector are printed as a side effect, as shown in the preceding output. This means that the square-with-side-effect function was applied to all the elements in the initial vector, despite the fact that only the first three elements were actually required by the reduce form. Of course, this can be avoided by converting the vector into a non-chunked sequence, such as a list, before mapping the square-with-side-effect function over it. But then, we lose the ability to efficiently access elements of the collection in a random order.

To understand why this actually happens, we first need to understand how the standard map function is actually implemented. A simplified definition of the map function is shown in Example 3.2:

(defn map [f coll]
  (cons (f (first coll))
        (lazy-seq (map f (rest coll)))))

Example 3.2: A simplified definition of the map function

The definition of map in Example 3.2 is a simplified and rather incomplete one, as it doesn't check for an empty collection and cannot be used over multiple collections. That aside, this definition of map does indeed apply a function f to all the elements in a collection coll. It is implemented using a composition of the cons, first, rest, and lazy-seq forms.

This implementation can be interpreted as "applying the function f to the first element in the collection coll, and then mapping f over the rest of the collection in a lazy manner". An interesting consequence of this implementation is that the map function has the following characteristics:

  • The ordering among elements in the collection coll is preserved.
  • This computation is performed recursively.
  • The lazy-seq form is used to perform the computation in a lazy manner.
  • The use of the first and rest forms indicates that coll must be a sequence, and the cons form will also produce a result that is a sequence. Hence, the map function accepts a sequence and builds a new one.
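Incidentally, the empty-collection check that Example 3.2 omits fits naturally into this lazy structure. The following is a minimal sketch, in which the map* name is hypothetical, and multiple collections and chunking are still ignored:

(defn map* [f coll]
  (lazy-seq
   (when-let [s (seq coll)]        ; stop once the collection is exhausted
     (cons (f (first s))
           (map* f (rest s))))))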

However, none of these properties of sequences are needed to transform a given collection into a result that is not a sequence. Another characteristic of lazy sequences is how they are realized. By the term realized, we mean that a given lazy sequence is evaluated to produce concrete values. Lazy sequences are realized in chunks of 32 elements, as an optimization. Sequences that behave this way are termed chunked sequences. Of course, not all sequences are chunked, and we can check whether a given sequence is chunked using the chunked-seq? predicate. The range function returns a chunked sequence, as shown here:

user> (first (map #(do (print \!) %) (range 70)))
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
0
user> (nth (map #(do (print \!) %) (range 70)) 32)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
32

Both the statements in the preceding output select a single element from a sequence returned by the map function. The function passed to the map function in both the preceding statements prints the ! character and returns the value supplied to it. In the first statement, the first 32 elements of the resulting sequence are realized even though only the first element is required. Similarly, the second statement is observed to realize the first 64 elements of the resulting sequence when the element at index 32 is obtained using the nth function. But again, realizing a collection in chunks isn't required to perform a computation over the elements in the collection.
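The chunked-seq? predicate can also be used to check this behavior directly. For instance, the sequence of a vector is chunked, while that of a list is not, as this hypothetical REPL session shows:

user> (chunked-seq? (seq [1 2 3]))
true
user> (chunked-seq? (seq '(1 2 3)))
false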

Note

Chunked sequences have been an integral part of Clojure since version 1.1.

If we are to handle such computations efficiently, we cannot build on functions that return sequences, such as map and filter. Incidentally, the reduce function does not necessarily produce a sequence. It also has a couple of other interesting properties:

  • The reduce function actually lets the supplied collection define how it is computed over or reduced. Thus, reduce is collection independent.
  • Also, the reduce function is versatile enough to build either a single value or an entirely new collection. For example, using reduce with the * or + functions will create a single-valued result, while using it with the cons or concat functions can create a new collection as a result. Thus, reduce can build anything, as the short example after this list demonstrates.

To summarize, the reduce function can be used as a premise to generalize any computation or transformation that has to be applied on a collection.
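For instance, the following hypothetical REPL session uses reduce to build a single value with the + function, and a new vector with the conj function playing the role of a collection-building reducing function:

user> (reduce + [1 2 3 4])
10
user> (reduce conj [] '(1 2 3 4))
[1 2 3 4]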

Introducing reducers

A collection is said to be reducible when it defines its behavior with the reduce function. The binary function used by the reduce function along with a collection is also termed a reducing function. A reducing function requires two arguments: one to represent the accumulated result of the reduction, and another to represent an input value that has to be combined into the result. Several reducing functions can be composed into one, which effectively changes how the reduce function processes a given collection. This composition is done using reducing function transformers, or simply reducers.
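For instance, the + function is already a reducing function: each step of a reduction combines the accumulated result with the next input value, as this hypothetical REPL session spells out:

user> (reduce (fn [result input] (+ result input)) 0 [1 2 3 4])
10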

The use of sequences and laziness, as opposed to reducers, for performing a given computation can be illustrated by Rich Hickey's well-known pie-maker analogy. Suppose a pie-maker has been supplied a bag of apples, with an intent to reduce the apples to a pie. There are a couple of transformations needed to perform this task. First, the stickers on all the apples have to be removed, which is analogous to mapping a "take the sticker off" function over the apples in the collection. Also, all the rotten apples will have to be removed, which is analogous to using the filter function to remove elements from a collection. Instead of performing this work herself, the pie-maker delegates it to her assistant. The assistant could first take the stickers off all the apples, thus producing a new collection, and then take out the rotten apples to produce another new collection, which illustrates the use of lazy sequences. But then, the assistant would be doing unnecessary work by removing the stickers from the rotten apples, which will have to be discarded later.

On the other hand, the assistant could delay this work until the actual reduction of the processed apples into a pie is performed. Once the work is actually needed to be performed, the assistant will compose the two tasks of mapping and filtering the collection of apples, thus avoiding any unnecessary work. This case depicts the use of reducers to compose and transform the tasks needed to effectively reduce the collection of apples into a pie. Thus, the use of intermediary collections between each transformation is avoided, which is an optimization in terms of memory allocations performed to produce the result.

Of course, a smart assistant would simply discard the rotten apples first, which is essentially filtering the apples before mapping them. However, not all recipes are that trivial, and moreover, we can achieve a more interesting optimization through the use of reducers: parallelism. By using reducers, we create a recipe of tasks to reduce a collection of apples into a pie that can be parallelized. Also, all processing is delayed until the final reduction, instead of dealing with collections as the intermediary results of each task. This is the gist of how reducers achieve performance through function composition and parallelization.

Note

The following namespaces must be included in your namespace declaration for the upcoming examples:

(ns my-namespace
  (:require [clojure.core.reducers :as r]))

The clojure.core.reducers namespace requires the jsr166y.jar library on Java 6, or Java 7 and above, for fork/join support.

Let's now briefly explore how reducers are actually implemented. Functions that operate on sequences use the clojure.lang.ISeq interface to abstract the behavior of a collection. In the case of reducers, the common interface that we must build upon is that of a reducing function. As mentioned earlier, a reducing function is a two-arity function in which the first argument is the accumulated result so far, and the second argument is the current input that has to be combined with it. The process of performing a computation over a collection and producing a result can be generalized into three distinct cases, described as follows:

  • The computation produces a new collection with the same number of elements as the collection it is supplied. This one-to-one case is analogous to using the map function.
  • The computation shrinks the supplied collection by removing elements from it. This can be done using the filter function.
  • The computation could also be expansive, in which case it produces a new collection that contains an increased number of elements. This is like what the mapcat function does.

These cases depict the different ways in which a collection can be transformed into the desired result. Any computation, or reduction, over a collection can be thought of as an arbitrary sequence of such transformations. These transformations are represented by transformers, which are essentially functions that transform a reducing function. They can be implemented as shown in Example 3.3:

(defn mapping [f]
  (fn [rf]
    (fn [result input]
      (rf result (f input)))))

(defn filtering [p?]
  (fn [rf]
    (fn [result input]
      (if (p? input)
        (rf result input)
        result))))

(defn mapcatting [f]
  (fn [rf]
    (fn [result input]
      (reduce rf result (f input)))))

Example 3.3: Transformers

The mapping, filtering, and mapcatting functions in Example 3.3 represent the core logic of the map, filter, and mapcat functions respectively. Each of these functions takes a single argument and returns a new function. The returned function transforms a supplied reducing function, represented by rf, into a new reducing function, created using the expression (fn [result input] ... ). Functions returned by the mapping, filtering, and mapcatting functions are termed reducing function transformers.

The mapping function applies the f function to the current input, represented by the input variable. The value returned by the function f is then combined with the accumulated result, represented by result, using the reducing function rf. This transformer is a strikingly pure abstraction of the standard map function that applies a function f over a collection. The mapping function makes no assumptions about the structure of the collection it is supplied, or about how the values returned by the function f are combined to produce the final result.

Similarly, the filtering function uses a predicate p? to check whether the current input of the reducing function rf must be combined into the final result, represented by result. If the predicate is not true, then the reducing function will simply return the value result without any modification. The mapcatting function uses the reduce function to combine the value result with the result of the expression (f input). In this transformer, we can assume that the function f will return a new collection and the reducing function rf will somehow combine two collections.
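These transformers can already be put to work on their own. The following hypothetical REPL session transforms the + reducing function using the mapping and filtering transformers from Example 3.3, and passes the result to the standard reduce function:

user> (reduce ((mapping inc) +) 0 [1 2 3 4])
14
user> (reduce ((filtering even?) +) 0 [1 2 3 4])
6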

One of the foundations of the reducers library is the CollReduce protocol defined in the clojure.core.protocols namespace. This protocol abstracts the behavior of a collection when it is passed as an argument to the reduce function, and is declared as shown in Example 3.4:

(defprotocol CollReduce
  (coll-reduce [coll rf init]))

Example 3.4: The CollReduce protocol
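As a sketch of what this protocol enables, any type can define its own reduction without materializing a sequence. The following hypothetical Repeated record yields the value x exactly n times, implemented against the simplified protocol of Example 3.4 and reducible with the simplified reduce function shown later in Example 3.6:

(defrecord Repeated [n x]
  CollReduce
  (coll-reduce [_ rf init]
    ;; combine x into the accumulated result n times
    (loop [i 0, acc init]
      (if (< i n)
        (recur (inc i) (rf acc x))
        acc))))

;; (reduce + 0 (->Repeated 3 10)) would evaluate to 30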

The clojure.core.reducers namespace defines a reducer function that creates a reducible collection by dynamically extending the CollReduce protocol, as shown in Example 3.5:

(defn reducer
  ([coll xf]
   (reify
     CollReduce
     (coll-reduce [_ rf init]
       (coll-reduce coll (xf rf) init)))))

Example 3.5: The reducer function

The reducer function combines a collection coll and a reducing function transformer xf, such as those returned by the mapping, filtering, and mapcatting functions, to produce a new reducible collection. When reduce is invoked on a reducible collection, it will ultimately ask the collection to reduce itself using the reducing function returned by the expression (xf rf). Using this mechanism, several reducing functions can be composed into a single computation to be performed over a given collection. Also, the reducer function needs to be defined only once, and the actual implementation of coll-reduce is provided by the collection supplied to the reducer function.
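As a consequence, transformers can be layered by nesting calls to the reducer function. In this hypothetical REPL session, the mapping and filtering transformers from Example 3.3 are composed over the same vector, so the elements are incremented and the even results summed without any intermediate collection between the two steps:

user> (r/reduce + 0 (r/reducer (r/reducer [1 2 3 4] (mapping inc))
                               (filtering even?)))
6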

Now, we can redefine the reduce function to simply invoke the coll-reduce function implemented by a given collection, as shown in Example 3.6:

(defn reduce
  ([rf coll]
   (reduce rf (rf) coll))
  ([rf init coll]
   (coll-reduce coll rf init)))

Example 3.6: Redefining the reduce function

As shown in Example 3.6, the reduce function delegates the job of reducing a collection to the collection itself using the coll-reduce function. Also, when the init argument is not specified, the reduce function calls the reducing function rf with no arguments to obtain it. An interesting consequence of this definition of reduce is that the function rf must produce an identity value when supplied no arguments. The standard reduce function also uses the CollReduce protocol to delegate the job of reducing a collection to the collection itself, but will fall back on the default definition of reduce in case the supplied collection does not implement the CollReduce protocol.
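This identity property already holds for reducing functions such as + and *, which return their identity values when called with no arguments:

user> (+)
0
user> (*)
1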

Note

Since Clojure 1.4, the reduce function allows a collection to define how it is reduced using the clojure.core.protocols.CollReduce protocol. Clojure 1.5 introduced the clojure.core.reducers namespace, which extends the use of this protocol.

All of the standard Clojure collections, namely lists, vectors, sets, and maps, implement the CollReduce protocol. The reducer function can be used to build a sequence of transformations to be applied to a collection when it is passed as an argument to the reduce function. This can be demonstrated as follows:

user> (r/reduce + 0 (r/reducer [1 2 3 4] (mapping inc)))
14
user> (reduce + 0 (r/reducer [1 2 3 4] (mapping inc)))
14

In the preceding output, the mapping function is used with the inc function to create a reducing function transformer that increments all the elements in a given collection. This transformer is then combined with a vector using the reducer function to produce a reducible collection. The call to reduce in both of the preceding statements is effectively equivalent to the expression (reduce + 0 [2 3 4 5]), thus producing the result 14. We can now redefine the map, filter, and mapcat functions using the reducer function, as shown in Example 3.7:

(defn map [f coll]
  (reducer coll (mapping f)))

(defn filter [p? coll]
  (reducer coll (filtering p?)))

(defn mapcat [f coll]
  (reducer coll (mapcatting f)))

Example 3.7: Redefining the map, filter, and mapcat functions using the reducer form

As shown in Example 3.7, the map, filter, and mapcat functions are now simply compositions of the reducer form with the mapping, filtering, and mapcatting transformers respectively.

Note

The definitions of CollReduce, reducer, reduce, map, filter, and mapcat as shown in this section are simplified versions of their actual definitions in the clojure.core.reducers namespace.

The definitions of the map, filter, and mapcat functions shown in Example 3.7 have the same shape as the standard versions of these functions, as shown here:

user> (r/reduce + (r/map inc [1 2 3 4]))
14
user> (r/reduce + (r/filter even? [1 2 3 4]))
6
user> (r/reduce + (r/mapcat range [1 2 3 4]))
10

Hence, the map, filter, and mapcat functions from the clojure.core.reducers namespace can be used in the same way as the standard versions of these functions. The reducers library also provides a take function that can be used as a replacement for the standard take function. We can use this function to reduce the number of calls to the square-with-side-effect function (from Example 3.1) when it is mapped over a given vector, as shown here:

user> (def mapped (r/map square-with-side-effect [0 1 2 3 4 5]))
#'user/mapped
user> (reduce + (r/take 3 mapped))
Side-effect: 0
Side-effect: 1
Side-effect: 2
Side-effect: 3
5

Thus, using the map and take functions from the clojure.core.reducers namespace as shown here avoids applying the square-with-side-effect function to all six elements in the vector [0 1 2 3 4 5], as only the first three are required.

The reducers library also provides variants of the standard take-while, drop, flatten, and remove functions, which are based on reducers. Effectively, functions based on reducers require fewer allocations than sequence-based functions, thus leading to an improvement in performance. For example, consider the process and process-with-reducer functions shown in Example 3.8:

(defn process [nums]
  (reduce + (map inc (map inc (map inc nums)))))

(defn process-with-reducer [nums]
  (reduce + (r/map inc (r/map inc (r/map inc nums)))))

Example 3.8: Functions to process a collection of numbers using sequences and reducers

The process function in Example 3.8 applies the inc function three times over a collection of numbers represented by nums using nested calls to the map function. The process-with-reducer function performs the same computation, but uses the reducer variant of the map function. The process-with-reducer function takes less time than the process function to produce its result from a large vector, as shown here:

user> (def nums (vec (range 1000000)))
#'user/nums
user> (time (process nums))
"Elapsed time: 471.217086 msecs"
500002500000
user> (time (process-with-reducer nums))
"Elapsed time: 356.767024 msecs"
500002500000

The process-with-reducer function gets a slight performance boost, as it requires fewer memory allocations than the process function. We should note that the available memory must be large enough to hold the entire collection, or else we could run out of memory. The performance of this computation can be improved even further if we can somehow parallelize it, and we shall examine how this can be done in the following section.
