Pitfalls Encountered When Connecting Python to Hadoop Data (Summary)

Published: 2025/3/8 | python | 豆豆

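I have recently been preparing to use Python + Hadoop + Pandas for some in-depth analysis and machine-learning work. (As my learning progressed, I now plan to build the later working environment on Python + Spark + Hadoop, but that is a story for another time.)

The first prerequisite, though, is getting Python to talk to Hadoop. I assumed this would be easy, but I ran into one pitfall after another and lost a full half day. I later saw others online asking about the same problems, yet almost no post actually solved them. So here I summarize the pitfalls I hit while connecting Python to a Hadoop database, in the hope of saving you the time.

(A note: solving these pitfalls meant reading through countless posts online. The last one was resolved only thanks to a single sentence tucked into the corner of a post on Git, without which I would have been stuck. There were too many posts to list them all here.)

First, the choice of components: I use impala (the impyla package) with Python 3.7 to connect to the Hadoop database. If your setup is different, the rest of this article may not apply to you.

The code being run is as follows: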

import impala.dbapi as ipdb

conn = ipdb.connect(host="192.168.XX.XXX", port=10000, user="xxx", password="xxxxxx", database="xxx", auth_mechanism='PLAIN')
cursor = conn.cursor()

# xxxx is the table name; it is masked to avoid exposing company information, so substitute your own table name
cursor.execute('select * From xxxx')
print(cursor.description)  # prints the result set's schema

for rowData in cursor.fetchall():
    print(rowData)

conn.close()
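The script above leaves the cursor and connection open if the query raises. A minimal sketch of a helper that always closes the cursor is shown below. It relies only on the generic DB-API 2.0 (PEP 249) interface that impyla's dbapi module follows; the helper name fetch_all, the demo table, and the use of sqlite3 as a stand-in connection are assumptions made so the example runs without a Hadoop cluster.

```python
import sqlite3
from contextlib import closing


def fetch_all(conn, sql):
    """Run one query on a DB-API 2.0 connection; return (column names, rows)."""
    # closing() guarantees cursor.close() runs even if execute() raises.
    with closing(conn.cursor()) as cursor:
        cursor.execute(sql)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()


# Stand-in connection for demonstration only; with impyla this would be
# the object returned by impala.dbapi.connect(...).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
demo_conn.execute("INSERT INTO demo VALUES (1, 'a'), (2, 'b')")

columns, rows = fetch_all(demo_conn, "SELECT * FROM demo ORDER BY id")
print(columns)
print(rows)
demo_conn.close()
```

Because the helper only touches cursor(), execute(), description, fetchall() and close(), the same function should work unchanged when passed the impyla connection from the script above.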

Pitfall 1: Syntax error

Symptom:

/Users/wangxxin/miniconda3/bin/python3.7 /Users/wangxxin/Documents/Python/PythonDataAnalyze/project/knDt/pyHiveTest.py
Traceback (most recent call last):
  File "/Users/wangxxin/Documents/Python/PythonDataAnalyze/project/knDt/pyHiveTest.py", line 1, in <module>
    import impala.dbapi as ipdb
  File "/Users/wangxxin/miniconda3/lib/python3.7/site-packages/impala/dbapi.py", line 28, in <module>
    import impala.hiveserver2 as hs2
  File "/Users/wangxxin/miniconda3/lib/python3.7/site-packages/impala/hiveserver2.py", line 340
    async=True)

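Solution: rename every async parameter to "async_" (any name works, as long as it is used consistently and is not a keyword). Cause: as of Python 3.7, async is a reserved keyword, so using it as a parameter name is a syntax error. The places to change include the following: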

# hiveserver2.py, around line 338
op = self.session.execute(self._last_operation_string,
                          configuration,
                          async_=True)

# hiveserver2.py, around line 1022
def execute(self, statement, configuration=None, async_=False):
    req = TExecuteStatementReq(sessionHandle=self.handle,
                               statement=statement,
                               confOverlay=configuration,
                               runAsync=async_)
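To confirm that the keyword is what breaks the import, the following sketch uses only the standard library to check whether async is reserved in the running interpreter and whether a call using it as a keyword argument even parses. The helper name compiles and the sample call strings are illustrative; nothing here touches impyla itself.

```python
import keyword
import sys


def compiles(source):
    """Return True if the source parses, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Interpreter version and whether it treats "async" as a reserved keyword.
print(sys.version_info[:2], keyword.iskeyword("async"))

# The original spelling versus the renamed parameter.
print(compiles("execute(statement, async=True)"))
print(compiles("execute(statement, async_=True)"))
```

On Python 3.7 and later the first call string fails to parse while the renamed one is accepted, which matches the traceback above stopping at the async=True line.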

Pitfall 2: The bundled parser.py (part of thriftpy) is faulty and raises an error when loaded

Solution:

# Adjust the original code as suggested online
elif url_scheme in ('c', 'd', 'e', 'f'):
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
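The article does not say why single letters appear in the scheme check. One plausible reading, offered here as an assumption, is that they are Windows drive letters: urlparse treats the text before the first colon as a URL scheme, so a path such as C:\... is reported with scheme 'c'. The sketch below shows that behavior with the standard library; the paths are hypothetical.

```python
from urllib.parse import urlparse


def scheme_of(path):
    """Return the scheme urlparse assigns to a path or URL (lower-cased)."""
    return urlparse(path).scheme


# Hypothetical paths, used only to show how the scheme is derived.
print(repr(scheme_of(r"C:\thrift\hive.thrift")))          # drive letter
print(repr(scheme_of("/usr/local/thrift/hive.thrift")))   # plain POSIX path
print(repr(scheme_of("https://example.com/hive.thrift")))
```

Note that a plain POSIX path yields an empty scheme. Whether '' also needs to be accepted by this branch depends on the rest of the function, which the excerpt above does not show.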

Pitfalls 1 and 2 should simply be patched directly; both changes are definitely required.

Pitfall 3: With the two issues above fixed, running again produces the following error:

TProtocolException: TProtocolException(type=4)

Solution:

The cause is that the connect call is missing the parameter auth_mechanism='PLAIN'. Change it as follows:

import impala.dbapi as ipdb

conn = ipdb.connect(host="192.168.XX.XXX", port=10000, user="xxx", password="xxxxxx", database="xxx", auth_mechanism='PLAIN')

Pitfall 4: With pitfall 3 fixed, running the program again produces yet another error:

AttributeError: 'TSocket' object has no attribute 'isOpen'

Solution:

The cause is that the thrift-sasl version is too new (0.3.0), so downgrade thrift-sasl to 0.2.1:

pip uninstall thrift-sasl
pip install thrift-sasl==0.2.1

Pitfall 5: With that fixed, running again produces yet another error (by this point you may be close to giving up, but hang in there; you are actually very near the end):

thriftpy.transport.TTransportException: TTransportException(type=1, message="Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'")

Solution: this is the most troublesome one, and currently the hardest to find a fix for.

I solved the issue, had to uninstall the package SASL and install PURE-SASL, when impyla can't find the sasl package it works with pure-sasl and then everything goes well.

The main cause is that sasl and pure-sasl conflict; in this case, simply uninstalling the sasl package is enough.

pip uninstall SASL
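Before and after uninstalling, it can help to see which SASL-related modules Python can actually locate. The following is a small sketch using the standard library only; the helper name is_importable is made up for illustration, and the import names sasl, puresasl and thrift_sasl are the top-level modules those packages are expected to provide.

```python
from importlib.util import find_spec


def is_importable(module_name):
    """True if Python can locate the top-level module, without importing it."""
    return find_spec(module_name) is not None


# If "sasl" is still importable after the uninstall, the conflict
# described above may still apply.
for name in ("sasl", "puresasl", "thrift_sasl"):
    print(name, "importable:", is_importable(name))
```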

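Pitfall 6: After that, running again may still produce an error:

TypeError: can't concat str to bytes

Trace the error to its last frame, at line 94 of __init__.py (the thrift_sasl package's, judging by the code):

header = struct.pack(">BI", status, len(body))

# Add handling for body, as suggested online
if (type(body) is str):
    body = body.encode()

self._trans.write(header + body)
self._trans.flush()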
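The error can be reproduced outside the library, which makes the patch easier to reason about. The sketch below is a standalone illustration, not the library's code: build_frame is a hypothetical helper that packs the same 1-byte status and 4-byte big-endian length header and accepts either str or bytes.

```python
import struct


def build_frame(status, body):
    """Return a 1-byte status + 4-byte big-endian length header, then the body."""
    # Encode before measuring so the length counts bytes, not characters.
    if isinstance(body, str):
        body = body.encode()
    header = struct.pack(">BI", status, len(body))
    return header + body


try:
    # What the unpatched line effectively does when body is a str.
    struct.pack(">BI", 1, 5) + "PLAIN"
except TypeError as exc:
    print("unpatched:", exc)

print("patched:", build_frame(1, "PLAIN"))
```

One design difference from the patch above: this sketch encodes the body before computing the length. For ASCII-only bodies the two orders give the same header, but for non-ASCII text the byte length and the character count differ.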

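After the steps above, you should be able to connect to the Hive database and query data without further problems.

Finally, to summarize: please check the dependency packages for connecting to the Hadoop database carefully (ideally your set matches exactly, no more and no fewer [I mean the relevant packages]; this really does avoid many problems).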

No.  Package      Version  Install command
1    pure_sasl    0.5.1    pip install pure_sasl==0.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
2    thrift       0.9.3    pip install thrift==0.9.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
3    bitarray     0.8.3    pip install bitarray==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
4    thrift_sasl  0.2.1    pip install thrift_sasl==0.2.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
5    thriftpy     0.3.9    pip install thriftpy==0.3.9 -i https://pypi.tuna.tsinghua.edu.cn/simple
6    impyla       0.14.1   pip install impyla==0.14.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
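Checking the installed versions against the list above by hand is error-prone, so a short script can do the comparison. This is a sketch under stated assumptions: it uses importlib.metadata, which is in the standard library only from Python 3.8 (on the Python 3.7 used in this article the importlib_metadata backport would be needed), and the names PINS, installed_version and find_mismatches are made up for illustration.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Versions copied from the table above.
PINS = {
    "pure_sasl": "0.5.1",
    "thrift": "0.9.3",
    "bitarray": "0.8.3",
    "thrift_sasl": "0.2.1",
    "thriftpy": "0.3.9",
    "impyla": "0.14.1",
}


def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for every package that differs."""
    found = {name: lookup(name) for name in pins}
    return {
        name: (wanted, found[name])
        for name, wanted in pins.items()
        if found[name] != wanted
    }


for name, (wanted, found) in find_mismatches(PINS).items():
    print(f"{name}: want {wanted}, found {found or 'not installed'}")
```

The lookup parameter exists so the comparison can be exercised without the packages installed, for example by passing the get method of a plain dictionary of versions.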

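I recommend installing in the order listed. I had dependency problems earlier on my side, and in the end I installed through conda.

While installing thriftpy, thrift_sasl, and impyla I hit errors. Since I had conda, I used conda install directly, which downloads the dependent packages automatically, as shown below (for reference by readers without a conda environment):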

Package         Build           Size
ply-3.11        py37_0          80 KB
conda-4.6.1     py37_0          1.7 MB
thriftpy-0.3.9  py37h1de35cc_2  171 KB

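Good luck! If you still run into problems in practice, please leave a comment.

One final tip:

Do not end the SQL with a semicolon, or it will raise an error. This is not an environment issue, though. The error looks like this: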

impala.error.HiveServer2Error: Error while compiling statement: FAILED: ParseException line 2:83 cannot recogniz
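If statements are copied from a SQL console, the trailing semicolon is easy to miss. A minimal sketch of a helper that strips it before the statement is passed to cursor.execute follows; the name strip_trailing_semicolon is made up for illustration and is not part of impyla.

```python
def strip_trailing_semicolon(sql):
    """Remove trailing whitespace and semicolons from a single statement."""
    # Only the end of the string is touched; semicolons inside the
    # statement (for example in a string literal) are left alone.
    return sql.rstrip("; \t\r\n")


print(strip_trailing_semicolon("select * From xxxx;\n"))
```

The cleaned string can then be used in place of the raw statement, for example cursor.execute(strip_trailing_semicolon(sql)).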

Article URL: http://www.cppcns.com/jiaoben/python/307541.html
