Importing data from another Hadoop cluster

Sometimes, we may want to copy data from one HDFS cluster to another, whether for development, testing, or a production migration. In this recipe, we will learn how to copy data from one HDFS cluster to another.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Hadoop provides a utility called DistCp, which helps us copy data from one cluster to another. Using this utility is as simple as copying from one folder to another:

hadoop distcp hdfs://hadoopCluster1:9000/source hdfs://hadoopCluster2:9000/target

This runs a MapReduce job to copy the data from one cluster to the other. You can also specify multiple source paths to be copied to the target. There are a couple of other options that we can use:
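Copying multiple sources can be sketched as follows. The cluster hostnames, ports, and paths here are placeholders carried over from the example above; substitute your own NameNode addresses. The `command -v` guard simply skips the copy when no Hadoop client is on the PATH:

```shell
# Placeholder source and target URIs; adjust to your clusters.
SRC1="hdfs://hadoopCluster1:9000/source/dir1"
SRC2="hdfs://hadoopCluster1:9000/source/dir2"
TARGET="hdfs://hadoopCluster2:9000/target"

# Run only if the hadoop client is installed.
if command -v hadoop >/dev/null; then
  # Both source directories are copied under the single target directory.
  hadoop distcp "$SRC1" "$SRC2" "$TARGET"
fi
```

When multiple sources are given, each source's contents end up under the target directory, so overlapping file names across sources will collide there.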

  • -update: When we use DistCp with the update option, it copies only those source files that are missing from the target or differ from their target copies.
  • -overwrite: When we use DistCp with the overwrite option, it unconditionally overwrites the files in the target directory with the source files.
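The two options above can be sketched like this, again with placeholder cluster URIs and a guard so the commands only run where a Hadoop client exists:

```shell
# Placeholder URIs; adjust to your clusters.
SRC="hdfs://hadoopCluster1:9000/source"
DST="hdfs://hadoopCluster2:9000/target"

if command -v hadoop >/dev/null; then
  # Incremental sync: copy only files missing from, or different at, the target.
  hadoop distcp -update "$SRC" "$DST"

  # Full refresh: replace target files with fresh copies from the source.
  hadoop distcp -overwrite "$SRC" "$DST"
fi
```

A typical workflow is an initial plain copy followed by periodic `-update` runs to pick up new or changed files, reserving `-overwrite` for when the target must exactly mirror the source regardless of cost.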

How it works...

When DistCp is executed, it uses MapReduce to copy the data, which also gives it error handling and reporting. It expands the list of source files and directories into input for the map tasks, which perform the copies in parallel. When copying from multiple sources, collisions at the destination are resolved according to the option (-update/-overwrite) provided. By default, a file is skipped if it is already present at the target. Once the copy completes, the count of skipped files is reported.
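Because the copy runs as a MapReduce job, you can bound its parallelism with DistCp's `-m` flag, which caps the number of simultaneous map tasks (and hence concurrent copies). A small sketch, reusing the placeholder URIs from earlier:

```shell
# Placeholder URIs; adjust to your clusters.
SRC="hdfs://hadoopCluster1:9000/source"
DST="hdfs://hadoopCluster2:9000/target"
MAX_MAPS=20

if command -v hadoop >/dev/null; then
  # Limit the job to at most 20 map tasks to reduce load on both clusters.
  hadoop distcp -m "$MAX_MAPS" -update "$SRC" "$DST"
fi
```

Lowering `-m` trades copy throughput for a lighter footprint on the source and destination clusters, which is useful when copying between production systems.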

Note

You can read more on DistCp at https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html.
