Mastering Clojure
Akhil Wali
Using reduce to transform collections
Sequences and functions that operate on sequences preserve the sequential ordering between elements. Lazy sequences avoid the unnecessary realization of elements in a collection until they are required for a computation, but the realization of these values is still performed in a sequential manner. However, this sequential ordering may not be desirable for all computations performed over a collection. For example, it's not possible to map a function over a vector and then lazily realize values in the resulting collection out of order, since the map function converts the supplied collection into a sequence. Also, functions such as map and filter are lazy, but still sequential by nature.
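A quick REPL check illustrates this point: mapping even a simple function over a vector yields a sequence, not a new vector, so the efficient indexed access that the vector provided is lost:
user> (map inc [1 2 3])
(2 3 4)
user> (vector? (map inc [1 2 3]))
false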
What's wrong with sequences?
One of the limitations of sequences is that they are realized in chunks. Let's study a simple example to illustrate what this means. Consider a unary function, as shown in Example 3.1, which we intend to map over a given vector. The function must compute a value from the one it is supplied, and also perform a side effect so that we can observe its application over the elements in a collection.
Note
The following examples can be found in src/m_clj/c3/reducers.clj of the book's source code.
(defn square-with-side-effect [x]
  (do (println (str "Side-effect: " x))
      (* x x)))
Example 3.1: A simple unary function
The square-with-side-effect function simply returns the square of a number x using the * function. This function also prints the value of x using a println form whenever it is called. Suppose this function is mapped over a given vector. The resulting collection would have to be realized completely if a computation has to be performed over it, even if all the elements from the resulting vector are not required. This can be demonstrated as follows:
user> (def mapped (map square-with-side-effect [0 1 2 3 4 5]))
#'user/mapped
user> (reduce + (take 3 mapped))
Side-effect: 0
Side-effect: 1
Side-effect: 2
Side-effect: 3
Side-effect: 4
Side-effect: 5
5
As shown previously, the mapped variable contains the result of mapping the square-with-side-effect function over a vector. If we try to sum the first three values in the resulting collection using the reduce, take, and + functions, all the values in the [0 1 2 3 4 5] vector are printed as a side effect, as shown in the preceding output. This means that the square-with-side-effect function was applied to all the elements in the initial vector, despite the fact that only the first three elements were actually required by the reduce form. Of course, this can be solved using the seq function to convert the vector to a sequence before mapping the square-with-side-effect function over it. But then, we lose the ability to efficiently access elements in a random order in the resulting collection.
To understand why this actually happens, we first need to understand how the standard map function is implemented. A simplified definition of the map function is shown in Example 3.2:
(defn map [f coll]
  (cons (f (first coll))
        (lazy-seq (map f (rest coll)))))
Example 3.2: A simplified definition of the map function
The definition of map in Example 3.2 is a simplified and rather incomplete one, as it doesn't check for an empty collection and cannot be used over multiple collections. That aside, this definition of map does indeed apply a function f to all the elements in a collection coll. This is implemented using a composition of the cons, first, rest, and lazy-seq forms.
This implementation can be interpreted as "applying the function f to the first element in the collection coll, and then mapping f over the rest of the collection in a lazy manner". An interesting consequence of this implementation is that the map function has the following characteristics:
- The ordering among elements in the collection coll is preserved.
- This computation is performed recursively.
- The lazy-seq form is used to perform the computation in a lazy manner (a short REPL illustration follows this list).
- The use of the first and rest forms indicates that coll must be a sequence, and the cons form will also produce a result that is a sequence. Hence, the map function accepts a sequence and builds a new one.
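As a small illustration of the lazy behavior, suppose the simplified map from Example 3.2 is in scope, shadowing clojure.core/map. Since the cons form applies f to the first element eagerly and only the recursive call is wrapped in lazy-seq, merely constructing the mapped sequence realizes the first element, while the remaining elements stay unrealized:
user> (def s (map #(do (println "Realized:" %) %) '(1 2 3)))
Realized: 1
#'user/s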
However, none of these properties of sequences are needed to transform a given collection into a result that is not a sequence. Another characteristic of lazy sequences is how they are realized. By the term realized, we mean that a given lazy sequence is evaluated to produce concrete values. Lazy sequences are realized in chunks. Each chunk comprises 32 elements, and this is done as an optimization. Sequences that behave this way are termed as chunked sequences. Of course, not all sequences are chunked, and we can check whether a given sequence is chunked using the chunked-seq? predicate. The range function returns a chunked sequence, as shown here:
user> (first (map #(do (print \!) %) (range 70)))
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 0
user> (nth (map #(do (print \!) %) (range 70)) 32)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 32
Both the statements in the preceding output select a single element from a sequence returned by the map function. The function passed to the map function in both the preceding statements prints the ! character and returns the value supplied to it. In the first statement, the first 32 elements of the resulting sequence are realized even though only the first element is required. Similarly, the second statement is observed to realize the first 64 elements of the resulting sequence when the element at the 32nd position is obtained using the nth function. But again, realizing a collection in chunks isn't required to perform a computation over the elements in the collection.
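We can verify chunking directly with the chunked-seq? predicate mentioned earlier. For example, on recent Clojure versions, the seq of a vector is chunked while the seq of a list is not:
user> (chunked-seq? (seq [1 2 3]))
true
user> (chunked-seq? (seq '(1 2 3)))
false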
Note
Chunked sequences have been an integral part of Clojure since version 1.1.
If we are to handle such computations efficiently, we cannot build on functions that return sequences, such as map and filter. Incidentally, the reduce function does not necessarily produce a sequence. It also has a couple of other interesting properties:
- The reduce function actually lets the supplied collection define how it is computed over, or reduced. Thus, reduce is collection independent.
- Also, the reduce function is versatile enough to build a single value or an entirely new collection as well. For example, using reduce with the * or + functions will create a single-valued result, while using it with the cons or concat functions can create a new collection as a result. Thus, reduce can build anything (see the short example after this list).
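For instance, both of the following are ordinary reduce calls; the first builds a single value and the second builds an entirely new collection (conj is used here instead of cons so that the accumulated result remains a vector):
user> (reduce + [1 2 3 4])
10
user> (reduce conj [] '(1 2 3 4))
[1 2 3 4]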
To summarize, the reduce function can be used as a premise to generalize any computation or transformation that has to be applied on a collection.
Introducing reducers
A collection is said to be reducible when it defines its behavior with the reduce function. The binary function used by the reduce function along with a collection is also termed as a reducing function. A reducing function requires two arguments: one to represent the accumulated result of the reduction, and another to represent an input value that has to be combined into the result. Several reducing functions can be composed into one, which effectively changes how the reduce function processes a given collection. This composition is done using reducing function transformers, or simply reducers.
The use of sequences and laziness, as compared to the use of reducers, can be illustrated by Rich Hickey's famous pie-maker analogy. Suppose a pie-maker has been supplied a bag of apples, with the intent of reducing the apples to a pie. A couple of transformations are needed to perform this task. First, the stickers on all the apples have to be removed, as in we map a function to "take the sticker off" over the apples in the collection. Also, all the rotten apples will have to be removed, which is analogous to using the filter function to remove elements from a collection. Instead of performing this work herself, the pie-maker delegates it to her assistant. The assistant could first take the stickers off all the apples, thus producing a new collection, and then take out the rotten apples to produce another new collection, which illustrates the use of lazy sequences. But then, the assistant would be doing unnecessary work by removing the stickers from the rotten apples, which will have to be discarded later.
On the other hand, the assistant could delay this work until the actual reduction of the processed apples into a pie is performed. Once the work actually needs to be performed, the assistant will compose the two tasks of mapping and filtering the collection of apples, thus avoiding any unnecessary work. This case depicts the use of reducers to compose and transform the tasks needed to effectively reduce the collection of apples into a pie. Thus, the use of intermediary collections between each transformation is avoided, which is an optimization in terms of the memory allocations performed to produce the result.
Of course, a smart assistant would simply discard the rotten apples first, which is essentially filtering the apples before mapping them. However, not all recipes are that trivial, and moreover, we can achieve a more interesting optimization through the use of reducers: parallelism. By using reducers, we create a recipe of tasks to reduce a collection of apples into a pie that can be parallelized. Also, all processing is delayed until the final reduction, instead of dealing with collections as intermediary results of each task. This is the gist of how reducers achieve performance through function composition and parallelization.
Note
The following namespace must be included in your namespace declaration for the upcoming examples:
(ns my-namespace
  (:require [clojure.core.reducers :as r]))
The clojure.core.reducers namespace requires Java 6 along with the jsr166y.jar JAR, or Java 7+, for fork/join support.
Let's now briefly explore how reducers are actually implemented. Functions that operate on sequences use the clojure.lang.ISeq interface to abstract the behavior of a collection. In the case of reducers, the common interface that we must build upon is that of a reducing function. As we mentioned earlier, a reducing function is a two-arity function in which the first argument is the accumulated result so far and the second argument is the current input that has to be combined with the first argument. The process of performing a computation over a collection and producing some result can be generalized into three distinct cases, which can be described as follows:
- A new collection with the same number of elements as the supplied collection needs to be produced. This one-to-one case is analogous to using the map function.
- The computation shrinks the supplied collection by removing elements from it. This can be done using the filter function.
- The computation could also be expansive, in which case it produces a new collection that contains an increased number of elements. This is like what the mapcat function does.
These cases depict the different ways in which a collection can be transformed into the desired result. Any computation, or reduction, over a collection can be thought of as an arbitrary sequence of such transformations. These transformations are represented by transformers, which are essentially functions that transform a reducing function. They can be implemented as shown in Example 3.3:
(defn mapping [f]
  (fn [rf]
    (fn [result input]
      (rf result (f input)))))

(defn filtering [p?]
  (fn [rf]
    (fn [result input]
      (if (p? input)
        (rf result input)
        result))))

(defn mapcatting [f]
  (fn [rf]
    (fn [result input]
      (reduce rf result (f input)))))
Example 3.3: Transformers
The mapping, filtering, and mapcatting functions in Example 3.3 represent the core logic of the map, filter, and mapcat functions respectively. All of these functions are transformers that take a single argument and return a new function. The returned function transforms a supplied reducing function, represented by rf, and returns a new reducing function, created using the expression (fn [result input] ... ). Functions returned by the mapping, filtering, and mapcatting functions are termed as reducing function transformers.
The mapping function applies the f function to the current input, represented by the input variable. The value returned by the function f is then combined with the accumulated result, represented by result, using the reducing function rf. This transformer is a frighteningly pure abstraction of the standard map function that applies a function f over a collection. The mapping function makes no assumptions about the structure of the collection it is supplied, or about how the values returned by the function f are combined to produce the final result.
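We can observe this purity by using a transformer directly, without any of the reducers machinery. Applying (mapping inc) to an ordinary reducing function such as + yields a new reducing function that increments each input before combining it with the result:
user> (reduce ((mapping inc) +) 0 [1 2 3 4])
14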
Similarly, the filtering function uses a predicate p? to check whether the current input of the reducing function rf must be combined into the final result, represented by result. If the predicate is not true, then the reducing function will simply return the value result without any modification. The mapcatting function uses the reduce function to combine the value result with the result of the expression (f input). In this transformer, we can assume that the function f will return a new collection and the reducing function rf will somehow combine two collections.
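The other two transformers can be exercised in the same way; here, (filtering even?) drops the odd inputs, and (mapcatting range) splices the collection produced by range for each input into the result, using conj as the underlying reducing function:
user> (reduce ((filtering even?) +) 0 [1 2 3 4])
6
user> (reduce ((mapcatting range) conj) [] [1 2 3])
[0 0 1 0 1 2]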
One of the foundations of the reducers library is the CollReduce protocol defined in the clojure.core.protocols namespace. This protocol abstracts the behavior of a collection when it is passed as an argument to the reduce function, and is declared as shown in Example 3.4:
(defprotocol CollReduce
  (coll-reduce [coll rf init]))
Example 3.4: The CollReduce protocol
The clojure.core.reducers namespace defines a reducer function that creates a reducible collection by dynamically extending the CollReduce protocol, as shown in Example 3.5:
(defn reducer
  ([coll xf]
   (reify CollReduce
     (coll-reduce [_ rf init]
       (coll-reduce coll (xf rf) init)))))
Example 3.5: The reducer function
The reducer function combines a collection coll and a reducing function transformer xf, such as those returned by the mapping, filtering, and mapcatting functions, to produce a new reducible collection. When reduce is invoked on a reducible collection, it will ultimately ask the collection to reduce itself using the reducing function returned by the expression (xf rf). Using this mechanism, several reducing functions can be composed into a single computation to be performed over a given collection. Also, the reducer function needs to be defined only once, and the actual implementation of coll-reduce is provided by the collection supplied to the reducer function.
Now, we can redefine the reduce function to simply invoke the coll-reduce function implemented by a given collection, as shown in Example 3.6:
(defn reduce
  ([rf coll] (reduce rf (rf) coll))
  ([rf init coll] (coll-reduce coll rf init)))
Example 3.6: Redefining the reduce function
As shown in Example 3.6, the reduce function delegates the job of reducing a collection to the collection itself using the coll-reduce function. Also, when the init argument is not specified, the reduce function obtains it by calling the reducing function rf with no arguments. An interesting consequence of this definition of reduce is that the function rf must produce an identity value when supplied no arguments. The standard reduce function also uses the CollReduce protocol to delegate the job of reducing a collection to the collection itself, but will fall back on the default definition of reduce in case the supplied collection does not implement the CollReduce protocol.
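Several core functions already satisfy this requirement; for example, + and * return their identity values when called with no arguments, which is what makes them usable with the two-argument arity of reduce defined above:
user> (+)
0
user> (*)
1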
Note
Since Clojure 1.4, the reduce function allows a collection to define how it is reduced using the clojure.core.protocols/CollReduce protocol. Clojure 1.5 introduced the clojure.core.reducers namespace that extends the use of this protocol.
All of the standard Clojure collections, namely lists, vectors, sets, and maps, implement the CollReduce protocol. The reducer function can be used to build a sequence of transformations to be applied to a collection when it is passed as an argument to the reduce function. This can be demonstrated as follows:
user> (r/reduce + 0 (r/reducer [1 2 3 4] (mapping inc)))
14
user> (reduce + 0 (r/reducer [1 2 3 4] (mapping inc)))
14
In the preceding output, the mapping function is used with the inc function to create a reducing function transformer that increments all the elements in a given collection. This transformer is then combined with a vector using the reducer function to produce a reducible collection. The call to reduce in both of the preceding statements is transformed into the expression (reduce + [2 3 4 5]), thus producing the result 14. We can now redefine the map, filter, and mapcat functions using the reducer function, as shown in Example 3.7:
(defn map [f coll]
  (reducer coll (mapping f)))

(defn filter [p? coll]
  (reducer coll (filtering p?)))

(defn mapcat [f coll]
  (reducer coll (mapcatting f)))
Example 3.7: Redefining the map, filter and mapcat functions using the reducer form
As shown in Example 3.7, the map, filter, and mapcat functions are now simply compositions of the reducer form with the mapping, filtering, and mapcatting transformers respectively.
Note
The definitions of CollReduce, reducer, reduce, map, filter, and mapcat shown in this section are simplified versions of their actual definitions in the clojure.core.reducers namespace.
The definitions of the map, filter, and mapcat functions shown in Example 3.7 have the same shape as the standard versions of these functions, as shown here:
user> (r/reduce + (r/map inc [1 2 3 4]))
14
user> (r/reduce + (r/filter even? [1 2 3 4]))
6
user> (r/reduce + (r/mapcat range [1 2 3 4]))
10
Hence, the map, filter, and mapcat functions from the clojure.core.reducers namespace can be used in the same way as the standard versions of these functions. The reducers library also provides a take function that can be used as a replacement for the standard take function. We can use this function to reduce the number of calls to the square-with-side-effect function (from Example 3.1) when it is mapped over a given vector, as shown here:
user> (def mapped (r/map square-with-side-effect [0 1 2 3 4 5]))
#'user/mapped
user> (reduce + (r/take 3 mapped))
Side-effect: 0
Side-effect: 1
Side-effect: 2
Side-effect: 3
5
Thus, using the map and take functions from the clojure.core.reducers namespace as shown here avoids applying the square-with-side-effect function to all six elements in the vector [0 1 2 3 4 5], as only the first three are required.
The reducers library also provides reducer-based variants of the standard take-while, drop, flatten, and remove functions. Effectively, functions based on reducers require fewer allocations than sequence-based functions, thus leading to an improvement in performance. For example, consider the process and process-with-reducer functions shown in Example 3.8:
(defn process [nums]
  (reduce + (map inc (map inc (map inc nums)))))

(defn process-with-reducer [nums]
  (reduce + (r/map inc (r/map inc (r/map inc nums)))))
Example 3.8: Functions to process a collection of numbers using sequences and reducers
The process function in Example 3.8 increments every number in the collection nums three times, using nested calls to the map function, and then sums the results. The process-with-reducer function performs the same computation, but uses the reducer variant of the map function. The process-with-reducer function takes less time to produce its result from a large vector when compared to the process function, as shown here:
user> (def nums (vec (range 1000000)))
#'user/nums
user> (time (process nums))
"Elapsed time: 471.217086 msecs"
500002500000
user> (time (process-with-reducer nums))
"Elapsed time: 356.767024 msecs"
500002500000
The process-with-reducer function gets a slight performance boost as it requires fewer memory allocations than the process function. We should note that the available memory must be large enough to hold the entire collection, or else we could run out of memory. The performance of this computation can be improved to a much greater extent if we can somehow parallelize it, and we shall examine how this can be done in the following section.