
python (40): Telling Chinese and English characters apart by their UTF-8 encoding

Published: 2025/7/14 51 豆豆


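#!/usr/bin/env python
# -*- coding:GBK -*-
# Python 2 script: it relies on unichr and the print statement.
"""Tools for handling Chinese characters:

Determine whether a unicode character is a Chinese character, a digit, an English letter, or something else.
Convert full-width symbols to half-width symbols."""

__author__="internetsweeper <zhengbin0713@gmail.com>"
__date__="2007-08-04"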

def is_chinese(uchar):
    """Return True if the unicode character is a Chinese character."""
    return u'\u4e00' <= uchar <= u'\u9fa5'

def is_number(uchar):
    """Return True if the unicode character is a digit."""
    return u'\u0030' <= uchar <= u'\u0039'

def is_alphabet(uchar):
    """Return True if the unicode character is an English letter."""
    return (u'\u0041' <= uchar <= u'\u005a') or (u'\u0061' <= uchar <= u'\u007a')

def is_other(uchar):
    """Return True if the character is not a Chinese character, a digit or an English letter."""
    return not (is_chinese(uchar) or is_number(uchar) or is_alphabet(uchar))

def B2Q(uchar):
    """Convert a half-width character to full-width."""
    inside_code = ord(uchar)
    if inside_code < 0x0020 or inside_code > 0x7e:  # not a half-width character: return it unchanged
        return uchar
    if inside_code == 0x0020:  # the space is special; for all other characters, half-width = full-width - 0xfee0
        inside_code = 0x3000
    else:
        inside_code += 0xfee0
    return unichr(inside_code)

def Q2B(uchar):
    """Convert a full-width character to half-width."""
    inside_code = ord(uchar)
    if inside_code == 0x3000:
        inside_code = 0x0020
    else:
        inside_code -= 0xfee0
    if inside_code < 0x0020 or inside_code > 0x7e:  # the result is not a half-width character: return the original
        return uchar
    return unichr(inside_code)

def stringQ2B(ustring):
    """Convert every full-width character in a string to half-width."""
    return "".join([Q2B(uchar) for uchar in ustring])

def uniform(ustring):
    """Normalize a string: full-width to half-width, then upper case to lower case."""
    return stringQ2B(ustring).lower()

def string2List(ustring):
    """Split ustring into runs of Chinese characters, letters and digits, using every other character as a separator."""
    retList = []
    utmp = []
    for uchar in ustring:
        if is_other(uchar):
            # any other character ends the current run, if there is one
            if utmp:
                retList.append("".join(utmp))
                utmp = []
        else:
            utmp.append(uchar)
    if utmp:
        retList.append("".join(utmp))
    return retList

if __name__=="__main__":
    # test Q2B and B2Q
    for i in range(0x0020, 0x007F):
        print Q2B(B2Q(unichr(i))), B2Q(unichr(i))
    # test uniform
    ustring = u'中國人名ａ高頻Ａ'
    ustring = uniform(ustring)
    ret = string2List(ustring)
    print ret

The above is reposted from http://hi.baidu.com/fenghua1893/item/d1a71d5ac47ffdcfd3e10cd1
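For readers on Python 3, where unichr and the print statement no longer exist, the width conversion in the script above can be sketched as follows. This is a minimal port that assumes str already holds Unicode text. The names to_fullwidth, to_halfwidth and normalize are illustrative and do not appear in the original script.

```python
FULLWIDTH_OFFSET = 0xFEE0  # full-width code point = half-width code point + 0xFEE0


def to_fullwidth(ch):
    """Return the full-width form of a printable ASCII character."""
    code = ord(ch)
    if code == 0x20:                 # the space maps to U+3000, not to U+FF00
        return chr(0x3000)
    if 0x21 <= code <= 0x7E:
        return chr(code + FULLWIDTH_OFFSET)
    return ch                        # not a half-width character: unchanged


def to_halfwidth(ch):
    """Return the half-width form of a full-width character."""
    code = ord(ch)
    if code == 0x3000:
        return chr(0x20)
    if 0xFF01 <= code <= 0xFF5E:
        return chr(code - FULLWIDTH_OFFSET)
    return ch                        # not a full-width character: unchanged


def normalize(text):
    """Convert full-width characters to half-width, then lowercase."""
    return "".join(to_halfwidth(ch) for ch in text).lower()


if __name__ == "__main__":
    print(normalize("中國人名ａ高頻Ａ"))
```

Checking the full-width range explicitly, instead of subtracting first and testing afterwards, is a design choice of this sketch. Both approaches leave characters outside the range unchanged.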

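I worked this problem out while writing a preprocessor for MkIV: splitting a string that mixes Chinese and English into Chinese substrings and English substrings. For example, "我的 English 學的不好" is split into three substrings: "我的", " English " and "學的不好".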

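1. A uniform encoding for mixed Chinese-English strings

The least laborious way to handle a mixed Chinese-English string is to convert it to Unicode, so that a Chinese character and an English letter occupy the same width in memory. Python suits this job well: its internal encoding is Unicode, and to support operations on Unicode strings it provides a built-in unicode type that re-implements every method of the string object, along with codec objects that handle decoding and encoding between the various string encodings.

For example, the following Python 2 code converts a UTF-8 encoded mixed Chinese-English string to Unicode:

# -*- coding:utf-8 -*-
a = "我的 English 學的不好"
print type(a), len(a), a
b = unicode(a, "utf-8")
print type(b), len(b), b

The string a is UTF-8 encoded; Python's built-in unicode object converts it to the Unicode string b. Running the code produces the output below. Comparing the lengths of a and b, len(b) is clearly the sensible result: it counts characters, whereas len(a) counts bytes.

<type 'str'> 27 我的 English 學的不好
<type 'unicode'> 15 我的 English 學的不好

One thing to note is that although Unicode is billed as the "unified code", it exists in two forms:

UCS-2: a 16-bit code with 2^16 = 65536 code points;
UCS-4: a 32-bit code. The current rule is that the first bit of its first byte is 0, so it has 2^31 = 2147483648 code points, although only those from 0x00000000 to 0x0010FFFF are in use at present, 1114112 in total.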
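In Python 3 the two types carry different names: bytes plays the role of the encoded str, and str plays the role of unicode. Here is a minimal sketch of the same length comparison, together with the code-point counts quoted above. The variable names are illustrative.

```python
text = "我的 English 學的不好"

encoded = text.encode("utf-8")     # bytes: what Python 2 called str
decoded = encoded.decode("utf-8")  # str: what Python 2 called unicode

# len() counts bytes for the encoded form and code points for the decoded form.
print(type(encoded).__name__, len(encoded))
print(type(decoded).__name__, len(decoded))

# Code-point counts for the two forms of Unicode.
ucs2_points = 2 ** 16          # 16-bit code
ucs4_points = 2 ** 31          # 32-bit code with the top bit fixed at 0
used_points = 0x10FFFF + 1     # code points 0x000000 through 0x10FFFF
print(ucs2_points, ucs4_points, used_points)
```

Each of the six Chinese characters takes three bytes in UTF-8, and the nine ASCII characters (seven letters and two spaces) take one byte each, which accounts for the difference between the two lengths.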
The value of maxunicode, a variable provided by Python's sys module, tells you whether the running Python uses UCS-2 or UCS-4 Unicode:

import sys
print sys.maxunicode

If sys.maxunicode is 1114111, Python is using UCS-4; if it is 65535, Python is using UCS-2.
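On Python 3.3 and later the distinction between the two builds no longer exists, and sys.maxunicode is always 1114111, so the check matters only for Python 2 and early Python 3 builds. The sketch below reports the value and shows how a character outside the 16-bit range is counted. The helper name describe_build is an assumption made for illustration.

```python
import sys


def describe_build():
    """Classify the interpreter by the largest code point it supports."""
    if sys.maxunicode == 0x10FFFF:
        return "UCS-4"
    if sys.maxunicode == 0xFFFF:
        return "UCS-2"
    return "unknown"


if __name__ == "__main__":
    print(sys.maxunicode, describe_build())
    ch = "\U00020000"                        # first code point of CJK Extension B
    print(len(ch))                           # number of code points in the string
    print(len(ch.encode("utf-16-le")) // 2)  # number of 16-bit units in UTF-16
```

A character such as U+20000 needs two 16-bit units (a surrogate pair) in UTF-16, which is why a UCS-2 build could not treat it as a single character.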

2. Splitting mixed Chinese-English strings

Once the Chinese and English text share a single encoding, splitting them apart is simple. First prepare one collector each for Chinese and English text; two empty string objects will do, say zh_gather and en_gather. Then prepare a list object that stores the values of zh_gather and en_gather in the order they are split off. The Python function below accepts a mixed Chinese-English Unicode string and returns a list holding the Chinese and English substrings.

def split_zh_en(zh_en_str):
    zh_en_group = []
    zh_gather = ""
    en_gather = ""
    zh_status = False

    for c in zh_en_str:
        if not zh_status and is_zh(c):
            # switching from English to Chinese: store the English run
            zh_status = True
            if en_gather != "":
                zh_en_group.append([mark["en"], en_gather])
                en_gather = ""
        elif not is_zh(c) and zh_status:
            # switching from Chinese to English: store the Chinese run
            zh_status = False
            if zh_gather != "":
                zh_en_group.append([mark["zh"], zh_gather])
        if zh_status:
            zh_gather += c
        else:
            en_gather += c
            zh_gather = ""

    # store whatever remains after the last character
    if en_gather != "":
        zh_en_group.append([mark["en"], en_gather])
    elif zh_gather != "":
        zh_en_group.append([mark["zh"], zh_gather])

    return zh_en_group

In detail, the code examines the mixed string zh_en_str one character at a time: a Chinese character is appended to zh_gather, and an English character is appended to en_gather. zh_status records which of the two kinds of text is currently being collected; whenever its value flips, the Chinese or English substring collected so far is appended to zh_en_group.
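The same state-switching idea can be expressed more compactly with itertools.groupby, which groups consecutive characters that share a key. The Python 3 sketch below is not the original implementation. The names split_runs and MARK are illustrative, and the simplified is_cjk predicate (which covers only the basic CJK Unified Ideographs block) is an assumption made to keep the example self-contained.

```python
from itertools import groupby

MARK = {"en": 1, "zh": 2}


def is_cjk(ch):
    """Simplified test: only the basic CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_runs(text, is_zh=is_cjk):
    """Split text into [mark, substring] pairs of consecutive zh / non-zh runs."""
    return [
        [MARK["zh"] if zh else MARK["en"], "".join(run)]
        for zh, run in groupby(text, key=is_zh)
    ]


if __name__ == "__main__":
    for mark, part in split_runs("我的 English 學的不好"):
        print(mark, repr(part))
```

Passing the predicate as a parameter lets a fuller test, such as one that also accepts full-width punctuation, be swapped in without touching the splitting logic.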
The conditions that test whether a character of zh_en_str is Chinese call a function is_zh(), implemented as follows:

def is_zh(c):
    x = ord(c)
    # Punct & Radicals
    if x >= 0x2e80 and x <= 0x33ff:
        return True

    # Fullwidth Latin Characters
    elif x >= 0xff00 and x <= 0xffef:
        return True

    # CJK Unified Ideographs
    # (the original comment also names Extension A, U+3400-U+4DBF,
    # but that block lies outside this range)
    elif x >= 0x4e00 and x <= 0x9fbb:
        return True

    # CJK Compatibility Ideographs
    elif x >= 0xf900 and x <= 0xfad9:
        return True

    # CJK Unified Ideographs Extension B
    elif x >= 0x20000 and x <= 0x2a6d6:
        return True

    # CJK Compatibility Supplement
    elif x >= 0x2f800 and x <= 0x2fa1d:
        return True

    else:
        return False

This code comes from the XeTeX preprocessor written by jjgod.
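The chain of elif branches can also be written as a table of ranges, which makes it easier to add or correct a block later. This is a Python 3 sketch using the same ranges as the function above. ZH_RANGES and is_zh_table are illustrative names.

```python
# (first, last) code points, inclusive, copied from the ranges above.
ZH_RANGES = (
    (0x2E80, 0x33FF),    # punctuation and radicals
    (0x4E00, 0x9FBB),    # CJK Unified Ideographs
    (0xF900, 0xFAD9),    # CJK Compatibility Ideographs
    (0xFF00, 0xFFEF),    # full-width Latin characters
    (0x20000, 0x2A6D6),  # CJK Unified Ideographs Extension B
    (0x2F800, 0x2FA1D),  # CJK Compatibility Supplement
)


def is_zh_table(ch):
    """Return True if ch falls in any of the listed ranges."""
    code = ord(ch)
    return any(first <= code <= last for first, last in ZH_RANGES)


if __name__ == "__main__":
    for ch in "我E，, \U00020000":
        print(repr(ch), is_zh_table(ch))
```

Note that full-width punctuation such as "，" counts as Chinese under these ranges, while the ASCII comma and the space do not.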
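To make the separated Chinese and English substrings easier to use, I tag each one when storing it in the zh_en_group list, with mark["zh"] or mark["en"] respectively. mark is a dict object, defined as follows:

mark = {"en":1, "zh":2}

When the English or Chinese substrings in zh_en_group are processed later, the tag makes it quick to decide whether a substring is Chinese or English, for example:

for item in zh_en_group:
    if item[0] == mark["en"]:
        pass  # do something with the English substring item[1]
    else:
        pass  # do something with the Chinese substring item[1]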
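As a concrete illustration of how the tags might be used, the Python 3 sketch below walks a tagged list and applies a different handler to each kind of substring. The handlers here (upper-casing English, wrapping Chinese in brackets) are placeholders invented for the example, not what the preprocessor actually does, and the names MARK and render are likewise assumptions.

```python
MARK = {"en": 1, "zh": 2}


def render(zh_en_group):
    """Join tagged substrings, treating each kind differently."""
    out = []
    for tag, text in zh_en_group:
        if tag == MARK["en"]:
            out.append(text.upper())        # placeholder English handling
        else:
            out.append("[" + text + "]")    # placeholder Chinese handling
    return "".join(out)


if __name__ == "__main__":
    group = [[2, "我的"], [1, " English "], [2, "學的不好"]]
    print(render(group))
```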

