
Benchmarking Hadoop MapReduce using TeraSort

Hadoop TeraSort is a well-known benchmark that aims to sort 1 TB of data as fast as possible using Hadoop MapReduce. The TeraSort benchmark stresses almost every part of the Hadoop MapReduce framework as well as the HDFS filesystem, making it an ideal choice for fine-tuning the configuration of a Hadoop cluster.

The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, which makes it possible to configure the total size of the data.
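Because every record is 100 bytes, the record count for a target data size is the size in bytes divided by 100. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and decimal units (1 GB = 10^9 bytes) are assumed:

```shell
# Each TeraSort record is 100 bytes.
record_bytes=100

# Target sizes in bytes, using decimal units (an assumption).
one_gb=$((1000 * 1000 * 1000))
one_tb=$((1000 * one_gb))

# Number of records to request from teragen for each target size.
records_for_1gb=$((one_gb / record_bytes))
records_for_1tb=$((one_tb / record_bytes))

echo "1 GB: $records_for_1gb records"
echo "1 TB: $records_for_1tb records"
```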

Getting ready

You must set up and deploy HDFS and Hadoop v2 YARN MapReduce prior to running these benchmarks, and locate the hadoop-mapreduce-examples-*.jar file in your Hadoop installation.
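If you are not sure where the examples JAR lives, a small helper along the following lines can search for it. This is only a sketch: find_examples_jar is a hypothetical name, and the exclusion of sources JARs assumes your distribution ships them alongside the binary JAR:

```shell
# find_examples_jar DIR: print the path of the first MapReduce examples JAR
# found under DIR (typically $HADOOP_HOME), skipping any *sources* JARs.
find_examples_jar() {
    find "$1" -type f -name 'hadoop-mapreduce-examples-*.jar' \
        ! -name '*sources*' | head -n 1
}
```

For example, find_examples_jar "$HADOOP_HOME" prints the path to use in the commands of this recipe, or nothing if no matching file exists.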

How to do it...

The following steps will show you how to run the TeraSort benchmark on the Hadoop cluster:

  1. The first step of the TeraSort benchmark is the data generation. You can use the teragen command to generate the input data for the TeraSort benchmark. The first parameter of teragen is the number of records, and the second parameter is the HDFS directory in which to generate the data. The following command generates 1 GB of data, consisting of 10 million records, in the tera-in directory in HDFS. Change the location of the hadoop-mapreduce-examples-*.jar file in the following commands according to your Hadoop installation:
    $ hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teragen 10000000 tera-in
    

    Tip

    It's a good idea to specify the number of Map tasks for the teragen computation to speed up the data generation. This can be done by specifying the -Dmapred.map.tasks parameter (Hadoop v2 names this property mapreduce.job.maps; the older name is deprecated but still accepted).

    Also, you can increase the HDFS block size for the generated data so that the Map tasks of the TeraSort computation are coarser grained (the number of Map tasks for a Hadoop computation typically equals the number of input data blocks). This can be done by specifying the -Ddfs.block.size parameter (dfs.blocksize in Hadoop v2). The value is given in bytes; 536870912 is 512 MB.

    $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teragen -Ddfs.block.size=536870912 \
    -Dmapred.map.tasks=256 10000000 tera-in
    
  2. The second step of the TeraSort benchmark is the execution of the TeraSort MapReduce computation on the data generated in step 1, using the following command. The first parameter of the terasort command is the HDFS input data directory, and the second parameter is the HDFS output data directory.
    $ hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    terasort tera-in tera-out
    

    Tip

    It's a good idea to specify the number of Reduce tasks for the TeraSort computation to speed up the Reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter (mapreduce.job.reduces in Hadoop v2) as follows:

    $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    terasort -Dmapred.reduce.tasks=32 tera-in tera-out
    
  3. The last step of the TeraSort benchmark is the validation of the results. This can be done using the teravalidate application as follows. The first parameter is the directory with the sorted data and the second parameter is the directory to store the report containing the results.
    $ hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teravalidate tera-out tera-validate
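
Since the point of the benchmark is to measure how fast the cluster sorts, it can be convenient to run the three steps above from one script that also reports the wall-clock time of each step. The following is a sketch under stated assumptions, not part of the Hadoop distribution: run_step, run_terasort_benchmark, HADOOP_CMD, and EXAMPLES_JAR are hypothetical names, and EXAMPLES_JAR must be set to the full path of the examples JAR before the functions are called:

```shell
# Command used to launch Hadoop. Overriding it (for example HADOOP_CMD=echo)
# gives a dry run that only prints the arguments.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# run_step PROGRAM ARGS...: run one example program from the JAR named by
# EXAMPLES_JAR and report its wall-clock time in whole seconds.
run_step() {
    step_start=$(date +%s)
    "$HADOOP_CMD" jar "$EXAMPLES_JAR" "$@" || return 1
    step_end=$(date +%s)
    echo "$1 took $((step_end - step_start)) s"
}

# run_terasort_benchmark RECORDS: generate RECORDS records, sort them, and
# validate the result, stopping at the first step that fails.
run_terasort_benchmark() {
    run_step teragen "$1" tera-in &&
    run_step terasort tera-in tera-out &&
    run_step teravalidate tera-out tera-validate
}
```

With EXAMPLES_JAR set, calling run_terasort_benchmark 10000000 would run the 1 GB case from step 1. MapReduce jobs fail if their output directory already exists, so tera-in, tera-out, and tera-validate need to be removed from HDFS between runs.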
    

How it works...

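TeraSort uses the sorting capability of the MapReduce framework together with a custom range Partitioner to divide the Map output among the Reduce tasks, ensuring a globally sorted order. Each Reduce task receives a contiguous range of keys, so the sorted outputs of the Reduce tasks, taken in partition order, form one fully sorted dataset.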
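
The idea of range partitioning can be illustrated outside Hadoop with a toy example. The sketch below is not TeraSort's Partitioner: it uses small integer keys and fixed, made-up split points (SPLIT_POINTS), and the function names partition_for and range_sort are hypothetical. It only shows why sorting each key range independently still yields a global order:

```shell
# Illustrative split points: keys below 100 go to partition 0,
# keys from 100 to 199 to partition 1, and the rest to partition 2.
SPLIT_POINTS="100 200"

# partition_for KEY: print the index of the key range that contains KEY.
partition_for() {
    part=0
    for split in $SPLIT_POINTS; do
        [ "$1" -lt "$split" ] && break
        part=$((part + 1))
    done
    echo "$part"
}

# range_sort KEYS...: route each key to its partition (the "Map" side),
# sort each partition on its own (the "Reduce" side), and print the
# partitions in index order.
range_sort() {
    p0=""
    p1=""
    p2=""
    for key in "$@"; do
        case $(partition_for "$key") in
            0) p0="$p0 $key" ;;
            1) p1="$p1 $key" ;;
            2) p2="$p2 $key" ;;
        esac
    done
    for partition in "$p0" "$p1" "$p2"; do
        if [ -n "$partition" ]; then
            # Each partition is sorted independently of the others.
            printf '%s\n' $partition | sort -n
        fi
    done
}
```

Calling range_sort 250 42 130 7 199 300 100 prints the keys in ascending order, even though no single sort ever sees all of them.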
