
Using PyArrow's filesystem interface for HDFS

PyArrow has a C++-based interface for HDFS. By default, it uses libhdfs, a JNI-based interface to the Java Hadoop client. Alternatively, we can use libhdfs3, a C++ library for HDFS. We connect to the NameNode using hdfs.connect:

import pyarrow as pa

# 'hostname' and 8020 are the NameNode's host and RPC port
hdfs = pa.hdfs.connect(host='hostname', port=8020, driver='libhdfs')

If we change the driver to libhdfs3, we will instead use the C++ HDFS library from Pivotal Labs. Once the connection to the NameNode is made, the filesystem is accessed using the same methods as with hdfs3.
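Because the two clients share the same file-system-style methods (ls, open, and so on), code written against one works with the other. A minimal sketch, assuming only that shared interface; the helper name and the directory path are placeholders of our own, not part of either library:

```python
def first_bytes(fs, path, n=16):
    """Return the first n bytes of every file listed under path.

    fs can be any client exposing ls() and open(), e.g. the object
    returned by pa.hdfs.connect() or an hdfs3.HDFileSystem instance.
    """
    out = {}
    for name in fs.ls(path):
        with fs.open(name, 'rb') as f:
            out[name] = f.read(n)
    return out

# With pyarrow the call would look like (cluster details are placeholders):
#   hdfs = pa.hdfs.connect(host='hostname', port=8020)
#   first_bytes(hdfs, '/user/data')
```

Writing helpers against the shared method surface like this lets us swap drivers, or even filesystem implementations, without touching the calling code.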

HDFS is preferred when the data is extremely large. It allows us to read and write data in chunks, which is helpful for accessing and processing streaming data. A nice comparison of the three native RPC client interfaces is presented in the following blog post: http://wesmckinney.com/blog/python-hdfs-interfaces/.
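The chunked-read pattern itself is ordinary file-handle code. A minimal sketch: here an in-memory buffer stands in for the file-like object that hdfs.open() would return on a real cluster, and the chunk size and helper name are our own choices for illustration:

```python
import io

def read_in_chunks(f, chunk_size=64 * 1024):
    """Yield successive chunks from a file-like object until EOF."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Stand-in for a handle from hdfs.open('/user/data/file.bin', 'rb')
buf = io.BytesIO(b'x' * 200_000)

total = sum(len(chunk) for chunk in read_in_chunks(buf))
print(total)  # 200000
```

Because each chunk is processed and discarded before the next is read, memory use stays bounded by the chunk size rather than the file size, which is what makes this viable for files far larger than RAM.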
