How do I upload a file to HDFS in Python?

Code:

import pyhdfs
client = pyhdfs.HdfsClient(':')

Listing directories works:

client.listdir('/')
['apps', 'benchmarks', 'data', 'gj_data', 'hbase', 'system', 'test', 'tmp', 'user']

But copying a local file up to HDFS fails:

client.copy_from_local(dest='/tmp/test', localsrc='/Users/1.py', user_name='hdfs')


JSONDecodeError                           Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py in _json(response)
    789     try:
--> 790         return response.json()
    791     except simplejson.scanner.JSONDecodeError:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/models.py in json(self, **kwargs)
    895             pass
--> 896         return complexjson.loads(self.text, **kwargs)
    897

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/simplejson/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
    517             and not use_decimal and not kw):
--> 518         return _default_decoder.decode(s)
    519     if cls is None:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/simplejson/decoder.py in decode(self, s, _w, _PY3)
    369             s = str(s, self.encoding)
--> 370         obj, end = self.raw_decode(s)
    371         end = _w(s, end).end()

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/simplejson/decoder.py in raw_decode(self, s, idx, _w, _PY3)
    399             idx += 3
--> 400         return self.scan_once(s, idx=_w(s, idx).end())

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

HdfsException                             Traceback (most recent call last)
<ipython-input-20-c3f47894efe1> in <module>()
----> 1 client.copy_from_local('/Users/lw/Desktop/1.py', '/tmp/test/')

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py in copy_from_local(self, localsrc, dest, **kwargs)
    751         """
    752         with io.open(localsrc, 'rb') as f:
--> 753             self.create(dest, f, **kwargs)
    754
    755     def copy_to_local(self, src, localdest, **kwargs):

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py in create(self, path, data, **kwargs)
    425         data_response = self._requests_session.put(
    426             metadata_response.headers['location'], data=data, **self._requests_kwargs)
--> 427         _check_response(data_response, expected_status=httplib.CREATED)
    428         assert not data_response.content
    429

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py in _check_response(response, expected_status)
    797     if response.status_code == expected_status:
    798         return
--> 799     remote_exception = _json(response)['RemoteException']
    800     exception_name = remote_exception['exception']
    801     python_name = 'Hdfs' + exception_name

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py in _json(response)
    791     except simplejson.scanner.JSONDecodeError:
    792         raise HdfsException(
--> 793             "Expected JSON. Is WebHDFS enabled? Got {!r}".format(response.text))
    794
    795

HdfsException: Expected JSON. Is WebHDFS enabled? Got '<html><head><title>Apache Tomcat/6.0.53 - Error report</title><style></style></head><body>
HTTP Status 400 - Data upload requests must have content-type set to 'application/octet-stream'
type: Status report
message: Data upload requests must have content-type set to 'application/octet-stream'
description: The request sent by the client was syntactically incorrect.
Apache Tomcat/6.0.53
</body></html>'



1 Reply

The most commonly used libraries for uploading files to HDFS from Python are hdfs and pyarrow. Here are the two most practical approaches.

Method 1: the hdfs library (simple and direct). Install it first: pip install hdfs

from hdfs import InsecureClient

# 1. Connect to HDFS
# Note: replace with your actual address and username
client = InsecureClient('http://namenode_host:50070', user='your_username')

# 2. Upload the file
# Arguments: HDFS target path, local file path, whether to overwrite
client.upload('/hdfs/target/path', '/local/path/to/your/file.txt', overwrite=True)
print("File uploaded successfully!")

This is the most hands-off approach: upload handles the chunking and the transfer for you.
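
If the data you want to upload lives in memory rather than in a local file, the same library also exposes a write() call. A minimal sketch, assuming the same placeholder namenode_host address and user, and that your installed hdfs version supports the context-manager form of write():

from hdfs import InsecureClient

client = InsecureClient('http://namenode_host:50070', user='your_username')

# Used without a data argument, write() gives back a writer you can
# feed bytes into piece by piece instead of loading everything at once.
with client.write('/hdfs/target/path/generated.csv', overwrite=True) as writer:
    for row in (b'a,1\n', b'b,2\n'):  # hypothetical in-memory rows
        writer.write(row)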

Method 2: the pyarrow library (more powerful). If you are already using PyArrow to process your data, it will feel more natural.

import pyarrow.fs as fs

# 1. Create an HDFS filesystem object
hdfs = fs.HadoopFileSystem(host='namenode_host', port=9000, user='your_username')

# 2. Upload the file (essentially a copy)
with open('/local/path/to/your/file.txt', 'rb') as local_file:
    with hdfs.open_output_stream('/hdfs/target/path/file.txt') as hdfs_file:
        hdfs_file.write(local_file.read())
print("Done!")

Which one should you pick?

  • For plain uploads and downloads, use the hdfs library; its API is friendlier.
  • If you also need dataset reads and writes, Parquet handling, and so on, use pyarrow, since it is part of that ecosystem.

Two key points:

  1. Network and permissions: make sure your machine can reach the HDFS NameNode (Web port 50070 or RPC port 9000) and that you have write permission on the target directory.
  2. Large files: when uploading large files with pyarrow, read and write in chunks so memory doesn't blow up (see the sketch after this list); the hdfs library's upload method chunks by default.
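
The chunked copy mentioned in point 2, as a minimal sketch (same placeholder namenode_host/port/user as above; the 8 MB buffer size is just an example):

import pyarrow.fs as fs

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per read, so memory use stays bounded

hdfs = fs.HadoopFileSystem(host='namenode_host', port=9000, user='your_username')

with open('/local/path/to/your/big_file.bin', 'rb') as local_file:
    with hdfs.open_output_stream('/hdfs/target/path/big_file.bin') as hdfs_file:
        while True:
            chunk = local_file.read(CHUNK_SIZE)
            if not chunk:
                break
            hdfs_file.write(chunk)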

One-line recommendation: for everyday uploads, client.upload() from the hdfs library is the least hassle.
