Using PyArrow's filesystem interface for HDFS

PyArrow has a C++-based interface for HDFS. By default, it uses libhdfs, a JNI-based interface to the Java Hadoop client. Alternatively, we can use libhdfs3, a pure C++ library for HDFS. We connect to the NameNode using hdfs.connect:

import pyarrow as pa
hdfs = pa.hdfs.connect(host='hostname', port=8020, driver='libhdfs')

If we change the driver to libhdfs3, we will be using the C++ HDFS library originally developed at Pivotal Labs. Once the connection to the NameNode is made, the filesystem is accessed using the same methods as for hdfs3.
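As a minimal sketch of those shared filesystem methods, the helper below only relies on an `ls()` call of the kind both hdfs3 and PyArrow's connected filesystem object expose; the `fs` argument is assumed to be the object returned by `pa.hdfs.connect(...)` above, and the path and filename filter are illustrative:

```python
def list_parquet_files(fs, path):
    """Return only the .parquet files under `path`, using the
    filesystem's hdfs3-style ls() method, which yields file paths."""
    return [p for p in fs.ls(path) if p.endswith('.parquet')]
```

For example, `list_parquet_files(hdfs, '/data/warehouse')` would return only the Parquet files in that directory. Because the helper relies on duck typing, it works unchanged whether the connection was made with the libhdfs or the libhdfs3 driver.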

HDFS is preferred when the data is extremely large. It allows us to read and write data in chunks, which is helpful for accessing and processing streaming data. A comparison of the three native RPC client interfaces is presented in the following blog post: http://wesmckinney.com/blog/python-hdfs-interfaces/.
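The chunked-read pattern mentioned above can be sketched as a small generator. It works on any binary file-like object, such as the handle a connected filesystem's `open()` method returns; the path and chunk size in the usage comment are illustrative:

```python
def read_in_chunks(f, chunk_size=64 * 1024):
    """Yield successive byte chunks from a file-like object until EOF,
    so a large file never has to be held in memory all at once."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Hypothetical usage against a connected filesystem:
#   with hdfs.open('/data/big_file.bin', 'rb') as f:
#       for chunk in read_in_chunks(f):
#           process(chunk)
```

Processing one chunk at a time keeps memory usage bounded regardless of file size, which is exactly why chunked access matters for the very large files HDFS is designed to hold.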