Using PyArrow's filesystem interface for HDFS

PyArrow has a C++-based interface for HDFS. By default, it uses libhdfs, a JNI-based interface to the Java Hadoop client. Alternatively, we can use libhdfs3, a pure C++ library for HDFS. We connect to the NameNode using hdfs.connect:

import pyarrow as pa
hdfs = pa.hdfs.connect(host='hostname', port=8020, driver='libhdfs')

If we change the driver to libhdfs3, we will be using the C++ HDFS library originally developed at Pivotal Labs. Once the connection to the NameNode is made, the filesystem is accessed using the same methods as for hdfs3.
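As a minimal sketch of those shared filesystem methods, the helper below only relies on an `ls()` call of the kind both hdfs3 and PyArrow's connected filesystem object expose; the `fs` argument is assumed to be the object returned by `pa.hdfs.connect(...)` above, and the path and filename filter are illustrative:

```python
def list_parquet_files(fs, path):
    """Return only the .parquet files under `path`, using the
    filesystem's hdfs3-style ls() method, which yields file paths."""
    return [p for p in fs.ls(path) if p.endswith('.parquet')]
```

For example, `list_parquet_files(hdfs, '/data/warehouse')` would return only the Parquet files in that directory. Because the helper relies on duck typing, it works unchanged whether the connection was made with the libhdfs or the libhdfs3 driver.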

HDFS is preferred when the data is extremely large. It allows us to read and write data in chunks, which is helpful for accessing and processing streaming data. A comparison of the three native RPC client interfaces is presented in the following blog post: http://wesmckinney.com/blog/python-hdfs-interfaces/.
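The chunked-read pattern mentioned above can be sketched as a small generator. It works on any binary file-like object, such as the handle a connected filesystem's `open()` method returns; the path and chunk size in the usage comment are illustrative:

```python
def read_in_chunks(f, chunk_size=64 * 1024):
    """Yield successive byte chunks from a file-like object until EOF,
    so a large file never has to be held in memory all at once."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Hypothetical usage against a connected filesystem:
#   with hdfs.open('/data/big_file.bin', 'rb') as f:
#       for chunk in read_in_chunks(f):
#           process(chunk)
```

Processing one chunk at a time keeps memory usage bounded regardless of file size, which is exactly why chunked access matters for the very large files HDFS is designed to hold.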