
Using PyArrow's filesystem interface for HDFS

PyArrow has a C++-based interface for HDFS. By default, it uses libhdfs, a JNI-based interface to the Java Hadoop client. Alternatively, we can use libhdfs3, a C++ library for HDFS. We connect to the NameNode using hdfs.connect:

import pyarrow as pa
hdfs = pa.hdfs.connect(host='hostname', port=8020, driver='libhdfs')

If we change the driver to libhdfs3, we will be using the C++ HDFS library developed at Pivotal Labs. Once the connection to the NameNode is made, the filesystem is accessed using the same methods as for hdfs3.
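These shared methods include ls, mkdir, and open, among others. As a minimal sketch of working against that common interface, the hypothetical helper below filters a directory listing for Parquet files; it only requires an object with an hdfs3-style ls method, so a stand-in class is used here instead of a live cluster (with a real cluster, we would pass the object returned by hdfs.connect):

```python
def list_parquet_files(fs, path):
    """Return Parquet file paths under `path`, given any hdfs3-style
    filesystem object (e.g. the result of pa.hdfs.connect(...))."""
    return [p for p in fs.ls(path) if p.endswith('.parquet')]

# Stand-in for a live HDFS connection, for illustration only.
class FakeHDFS:
    def ls(self, path):
        return [path + '/a.parquet', path + '/b.csv', path + '/c.parquet']

print(list_parquet_files(FakeHDFS(), '/data'))
# → ['/data/a.parquet', '/data/c.parquet']
```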

HDFS is preferred when the data is extremely large. It allows us to read and write data in chunks, which is helpful for accessing and processing streaming data. A nice comparison of the three native RPC client interfaces is presented in the following blog post: http://wesmckinney.com/blog/python-hdfs-interfaces/.
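The chunked-read pattern can be sketched as follows. hdfs.open returns a file-like object, so the same generic loop works for HDFS files; here an in-memory io.BytesIO buffer stands in for a real cluster connection (the buffer contents and chunk size are illustrative assumptions):

```python
import io

def read_in_chunks(f, chunk_size=64 * 1024):
    """Yield successive chunks from any file-like object, e.g. one
    returned by hdfs.open('/path/to/file', 'rb')."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Illustrated with an in-memory buffer; with a live cluster the same
# loop applies to the file object from hdfs.open(...).
data = b'x' * 150_000
chunks = list(read_in_chunks(io.BytesIO(data)))
print([len(c) for c in chunks])
# → [65536, 65536, 18928]
```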
