Python regular expressions: the regex module

Published: 2025/3/21

Contents

1. Two behaviours for compatibility with the re module

2. Case-insensitive matches in Unicode

3. Flags

4. Groups

5. Other features


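Reference: the official page of the regex extension module, regex 2020.5.7

The regex module is a regular expression implementation that is backwards-compatible with the standard "re" module but offers additional functionality.

The re module's zero-width match behaviour changed in Python 3.7, and this module follows that behaviour when compiled for Python 3.7.

1. Two behaviours for compatibility with the re module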

  • Version 0 (old behaviour, compatible with the re module):

    Please note that the re module's behaviour may change over time, and I'll endeavour to match that behaviour in version 0.

    • Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
    • Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:
      • .split won't split a string at a zero-width match.
      • .sub will advance by one character after a zero-width match.
    • Inline flags apply to the entire pattern, and they can't be turned off.
    • Only simple sets are supported.
    • Case-insensitive matches in Unicode use simple case-folding by default.
  • Version 1 (new behaviour, possibly different from the re module):

    • Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
    • Zero-width matches are handled correctly.
    • Inline flags apply to the end of the group or pattern, and they can be turned off.
    • Nested sets and set operations are supported.
    • Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module defaults to regex.DEFAULT_VERSION.

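2. Case-insensitive matches in Unicode

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Full case-folding can be turned on with the FULLCASE or F flag, or (?f) in the pattern. Note that this flag affects how the IGNORECASE flag works; the FULLCASE flag does not itself turn on case-insensitive matching.

  • In the version 0 behaviour, the flag is off by default.
  • In the version 1 behaviour, the flag is on by default.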
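
The difference between simple and full case-folding can be sketched with the German sharp s, which full case-folding treats as equivalent to "ss". This is a minimal illustration that assumes the third-party regex package is installed; the word "straße" is just a convenient test string.

```python
import regex

SHARP_S_WORD = "stra\u00dfe"  # "straße", 6 characters

# Version 1: IGNORECASE uses full case-folding, so "ss" can match the sharp s.
full = regex.match(r"(?iV1)strasse", SHARP_S_WORD)
print(full.span())  # (0, 6)

# Version 0: simple case-folding only, so the same pattern does not match.
simple = regex.match(r"(?iV0)strasse", SHARP_S_WORD)
print(simple)  # None

# Version 0 with FULLCASE requested explicitly through (?f).
explicit = regex.match(r"(?ifV0)strasse", SHARP_S_WORD)
print(explicit)
```

Writing the version into the pattern, as here, keeps the result independent of regex.DEFAULT_VERSION.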

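3. Flags

There are two kinds of flags: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

Scoped flags: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

Global flags: ASCII, BESTMATCH, ENHANCEMATCH, LOCALE, POSIX, REVERSE, UNICODE, VERSION0, VERSION1.

If none of the ASCII, LOCALE or UNICODE flags is specified, the default is UNICODE when the pattern is a Unicode string and ASCII when it is a bytestring.

  • The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit of the next match that it finds.

  • The BESTMATCH flag makes fuzzy matching search for the best match instead of the next match.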

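4. Groups

All capture groups have a group number, starting from 1. Groups with the same group name have the same group number, and groups with different group names have different group numbers.

The same name can be used by more than one group, with later captures "overwriting" earlier ones. All the captures of the group are available from the captures method of the match object.

Group numbers are reused across the different branches of a branch reset, e.g. (?|(first)|(second)) has only group 1. If capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 ("foo") and group 2 ("bar").

In the regex (\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+) there are 2 groups:

  • (\s+) is group 1.
  • (?P<foo>[A-Z]+) is group 2, also called "foo".
  • (\w+) is group 2 because of the branch reset.
  • (?P<foo>[0-9]+) is group 2 because it's called "foo".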
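
The numbering rules listed above can be checked with a short sketch. The input string " AB 12" is a hypothetical example chosen so that the first branch of the branch reset is taken; the regex package is assumed to be installed.

```python
import regex

pattern = regex.compile(r"(\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+)")

m = pattern.match(" AB 12")

# Group 1 is the leading whitespace.
print(repr(m.group(1)))   # ' '

# Group 2 and "foo" are the same group, and the last capture wins.
print(m.group(2))         # 12
print(m.group("foo"))     # 12

# Earlier captures of the shared group are still available.
print(m.captures("foo"))  # ['AB', '12']
```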

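5. Other features

Each entry below gives a pattern or feature name, followed by its description.

\m, \M, \b

Start of word, end of word, word boundary.

regex uses \m for the start of a word and \M for the end of a word.

\b is a word boundary, but it cannot tell whether the position is the start or the end of a word.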
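
As a small sketch of the distinction, \m and \M can pick out the first and last letter of each word, while \b reports every boundary without saying which kind it is. The sample sentence is arbitrary; the regex package is assumed to be installed.

```python
import regex

text = "hello big world"

# \m anchors at the start of a word, \M at the end of a word.
first_letters = regex.findall(r"\m\w", text)
last_letters = regex.findall(r"\w\M", text)

# \b matches at both kinds of position.
boundaries = [m.start() for m in regex.finditer(r"\b", text)]

print(first_letters)  # ['h', 'b', 'w']
print(last_letters)   # ['o', 'g', 'd']
print(boundaries)     # [0, 5, 6, 9, 10, 15]
```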

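(?flags-flags:...) scoped

(?flags-flags) global

Scoped control:

(?i:...) turns case-insensitivity on for the enclosed subpattern, and (?-i:...) turns it off.

Several flags can be written next to each other, as in (?is-f:...): flags to the left of the minus sign are turned on, and flags to the right are turned off.

>>> regex.search(r"<B>(?i:good)</B>", "<B>GOOD</B>")
<regex.Match object; span=(0, 11), match='<B>GOOD</B>'>

Global control:

(?si-f)<B>good</B>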
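
Turning a flag off for part of a pattern can be sketched as follows. The example assumes version 1 behaviour, requested explicitly with (?V1), and uses made-up tag strings.

```python
import regex

# (?iV1) turns IGNORECASE on for the whole pattern under version 1;
# (?-i:...) turns it back off for the word between the tags only.
pattern = regex.compile(r"(?iV1)<b>(?-i:good)</b>")

tags_any_case = pattern.search("<B>good</B>")
word_wrong_case = pattern.search("<b>GOOD</b>")

print(tags_any_case)    # a match: the tags may be in either case
print(word_wrong_case)  # None: "good" must be lower case
```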

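lookaround

Lookarounds are supported in conditional patterns:

>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>

This is not quite the same as putting a lookaround in the first branch of a pair of alternatives:

>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))  # branch 1 fails, so branch 2 is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))   # branch 1 fails, but branch 2 is not tried
None

In the second example the lookahead succeeded, which commits the conditional to its first branch.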

(?p)

POSIX matching (leftmost longest)

Normal matching:
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>

POSIX matching:
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>
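
One place where leftmost-longest matching matters is an alternation whose first branch is a prefix of a later one. The following sketch, with a made-up input string, shows the effect on findall; the regex package is assumed to be installed.

```python
import regex

text = "3.14 and 42"

# Normal matching tries alternatives left to right, so \d+ wins as soon as it matches.
first_wins = regex.findall(r"\d+|\d+\.\d+", text)

# With (?p) the longest alternative at each position is chosen.
longest_wins = regex.findall(r"(?p)\d+|\d+\.\d+", text)

print(first_wins)    # ['3', '14', '42']
print(longest_wins)  # ['3.14', '42']
```

Without the POSIX flag, the same result would need the alternatives to be reordered by hand.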

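[[a-z]--[aeiou]]

V0: simple sets, compatible with the re module.

V1: nested sets with extended functionality; this set contains 'a' to 'z' but excludes "a", "e", "i", "o" and "u".

For example, both of the following calls return the same match:

>>> regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde')
<regex.Match object; span=(1, 4), match='bcd'>
>>> regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1)
<regex.Match object; span=(1, 4), match='bcd'>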

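(?(DEFINE)...)

Defines named groups for later use: if there is no group called "DEFINE", then ... is ignored, except that any groups defined within it can be called and the normal rules for numbering groups still apply.

For example:

>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>

# Defined patterns at both ends with one CJK character required between them; the second 哈 is consumed by \w+
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>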

\K

Keeps the part of the match that follows the position where \K occurs and discards the part before it. The enclosing capture group is unaffected.

>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')
>>> m
<regex.Match object; span=(2, 5), match='cde'>
>>> m[0]   # keeps cde, discards ab
'cde'
>>> m[1]
'abcde'

>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')
>>> m
<regex.Match object; span=(1, 3), match='bc'>
>>> m[0]   # reverse search: keeps bc, discards def
'bc'
>>> m[1]
'bcdef'

(?r)

Reverse searching:

>>> regex.findall(r".", "abc")
['a', 'b', 'c']
>>> regex.findall(r"(?r).", "abc")
['c', 'b', 'a']

Note: the result of a reverse search is not necessarily the reverse of a forward search:

>>> regex.findall(r"..", "abcde")
['ab', 'cd']
>>> regex.findall(r"(?r)..", "abcde")
['de', 'bc']

expandf

Subscripting gives access to all the captures of a repeated capture group:

>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")   # later captures overwrite earlier ones, so {1} is the same as {1[-1]}
'c'
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'

With a named group:

>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'

Groups can also be reordered:

>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'

>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'

expandf works in the same way on match objects returned by search().

capturesdict(), groupdict(), captures()

capturesdict() is a combination of groupdict() and captures():

groupdict() returns a dict where key = group name and value = the last capture of that group.

captures() returns a list of all the captures of a group.

capturesdict() returns a dict where key = group name and value = the list of all the captures of that group.

>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}

Accessing groups

(1) By subscript or slice:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> m["before"]
'pqr'
>>> len(m)
4
>>> m[:]
('pqr123stu', 'pqr', '123', 'stu')

(2) By name, with group("name"):
>>> m.group('num')
'123'

(3) By group number:
>>> m.group(0)
'pqr123stu'
>>> m.group(1)
'pqr'

subf, subfn

subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.

>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo'

partial

Partial matching: match, search, fullmatch and finditer support partial matches, requested with the partial keyword argument. A partial match is one that reaches the end of the string and could become a complete match if more text were appended. Match objects have a partial attribute, which is True for a partial match and False for a complete match.

>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)    # complete match: no partial=True in the repr
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial    # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial      # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial     # complete match
False
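
A typical use of partial matching is validating input as it is typed. The sketch below assumes a hypothetical four-digit input field; the function name classify and its three labels are made up for illustration.

```python
import regex

FOUR_DIGITS = regex.compile(r"\d{4}")

def classify(entered: str) -> str:
    """Classify the text typed so far into a hypothetical 4-digit field."""
    m = FOUR_DIGITS.fullmatch(entered, partial=True)
    if m is None:
        return "invalid"      # can never become a match
    # m.partial is True when more input is still needed.
    return "incomplete" if m.partial else "complete"

for entered in ["", "12", "1234", "12a", "12345"]:
    print(repr(entered), classify(entered))
```

Using fullmatch rather than search means that extra trailing characters are rejected instead of being ignored.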

(?P<name>)

Duplicate group names are allowed

Group names may be repeated, with later captures overwriting earlier ones.

Optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> # Only the second group captures.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']

>>> # Only the first group captures.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']

Mandatory groups (both groups always capture, even if the capture is empty):
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', '']


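detach_string

A match object holds a reference to the string that was searched, via its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which might save valuable memory if the string is very large.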

>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None

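(?0), (?1), (?2)

(?R) or (?0) tries to match the entire regex recursively.
(?1), (?2), etc. try to match the relevant capture group (group 1, group 2, ...). For example, (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane).
(?&name) tries to match the named capture group.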

>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)

>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None)
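
Recursion is what makes nested structures matchable. As a minimal sketch, the pattern below calls its own named group with (?&name) to check balanced parentheses; the group name "paren" and the helper is_balanced are hypothetical names chosen for this example.

```python
import regex

# A parenthesised group contains non-parenthesis characters or,
# recursively, further parenthesised groups.
BALANCED = regex.compile(r"(?P<paren>\((?:[^()]|(?&paren))*\))")

def is_balanced(text: str) -> bool:
    """Return True if text is a single, fully balanced parenthesised expression."""
    return BALANCED.fullmatch(text) is not None

print(is_balanced("(a(b)c)"))   # True
print(is_balanced("((a)(b))"))  # True
print(is_balanced("(a(b)c"))    # False
print(is_balanced("a(b)"))      # False
```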

Fuzzy matching

There are three types of error:

  • Insertion, indicated by "i"
  • Deletion, indicated by "d"
  • Substitution, indicated by "s"

In addition, "e" indicates any type of error.

Examples:

  • foo match "foo" exactly
  • (?:foo){i} match "foo", permitting insertions
  • (?:foo){d} match "foo", permitting deletions
  • (?:foo){s} match "foo", permitting substitutions
  • (?:foo){i,s} match "foo", permitting insertions and substitutions
  • (?:foo){e} match "foo", permitting errors

If a certain type of error is specified, then any type not specified is not permitted. In the following examples the item is omitted and only the fuzziness is written:

  • {d<=3} permit at most 3 deletions, but no other types
  • {i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
  • {1<=e<=3} permit at least 1 and at most 3 errors
  • {i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

It’s also possible to state the costs of each type of error and the maximum permitted total cost.

Examples:

  • {2i+2d+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
  • {i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

A test can also be applied to a character that is substituted or inserted.

Examples:

  • {s<=2:[a-z]} at most 2 substitutions, which must be in the character set [a-z].
  • {s<=2,i<=3:\d} at most 2 substitutions, at most 3 insertions, which must be digits.

By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.

The BESTMATCH flag makes it search for the best match instead.

  • regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
  • regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
  • regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search matches " dog" with 1 error, which is within the limit, and the (?e) then makes it attempt a better fit.

Match objects have a fuzzy_counts attribute, which gives the total number of substitutions, insertions and deletions:

>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.

>>> # A better match might be possible if the ENHANCEMATCH flag used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.
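
The error-type constraints and the BESTMATCH flag can be combined in a short sketch. It assumes the inline form (?b) for BESTMATCH; the misspelled words are made up for illustration.

```python
import regex

# Only substitutions are permitted, and at most one of them.
one_typo = regex.compile(r"(?:hello){s<=1}")

close = one_typo.fullmatch("hallo")
too_far = one_typo.fullmatch("hxllx")
print(close.fuzzy_counts)  # (1, 0, 0): one substitution
print(too_far)             # None: two substitutions would be needed

# BESTMATCH, written (?b) inline, looks for the match with the fewest errors.
best = regex.search(r"(?b)(dog){e<=1}", "cat and dog")
print(best[1], best.fuzzy_counts)  # dog (0, 0, 0)
```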

Match objects also have a fuzzy_changes attribute, which gives a tuple of the positions of the substitutions, insertions and deletions:

>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11])

\L<name>

Named lists

The old way:

p = regex.compile(r"first|second|third|fourth|fifth")

If the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered, for example, "cats" comes before "cat".

The new way: the order of the items is irrelevant, and they are treated as a set.

>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)

The named_lists attribute:

>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])}
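
A brief sketch of a named list in use follows. The list name "colours" and its items are hypothetical; the item "a.b" is included to show that list items are matched literally and need no escaping.

```python
import regex

colour_words = ["red", "green", "blue", "a.b"]
p = regex.compile(r"\L<colours>", colours=colour_words)

found = p.findall("red sky, blue sea")
whole = p.fullmatch("green")
# The "." in "a.b" is a literal dot, not a wildcard.
literal_only = p.search("axb")

print(found)              # ['red', 'blue']
print(whole is not None)  # True
print(literal_only)       # None
```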

Set operators

Sets and nested sets

Version 1 behaviour only

Set operators have been added, and a set can contain nested sets.

The operators, in order of increasing precedence, are:

  • || for union ("x||y" means "x or y")
  • ~~ (double tilde) for symmetric difference ("x~~y" means "x or y, but not both")
  • && for intersection ("x&&y" means "x and y")
  • -- (double dash) for difference ("x--y" means "x but not y")

Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus [ab&&cd] is the same as [[a||b]&&[c||d]].

Examples:

  • [ab] # Set containing 'a' and 'b'
  • [a-z] # Set containing 'a' .. 'z'
  • [[a-z]--[qw]] # Set containing 'a' .. 'z', but not 'q' or 'w'
  • [a-z--qw] # Same as above
  • [\p{L}--QW] # Set containing all letters except 'Q' and 'W'
  • [\p{N}--[0-9]] # Set containing all numbers except '0' .. '9'
  • [\p{ASCII}&&\p{Letter}] # Set containing all characters which are ASCII and letter
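
The operators can be tried out with a few small patterns. This sketch requests version 1 explicitly with (?V1), since set operations are only available there; the sample strings are arbitrary.

```python
import regex

text = "abcdefgh QW"

# Difference: 'a'..'z' without 'q' and 'w'.
difference = regex.findall(r"(?V1)[a-z--qw]+", "qwerty")

# Intersection: characters in both 'a'..'f' and 'd'..'z'.
intersection = regex.findall(r"(?V1)[[a-f]&&[d-z]]+", text)

# Symmetric difference: in 'a'..'c' or 'b'..'d', but not both.
symmetric = regex.findall(r"(?V1)[[a-c]~~[b-d]]", text)

# All letters except lower-case ASCII letters.
non_lowercase = regex.findall(r"(?V1)[\p{L}--[a-z]]+", text)

print(difference)     # ['erty']
print(intersection)   # ['def']
print(symmetric)      # ['a', 'd']
print(non_lowercase)  # ['QW']
```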

Start and end indexes

Match objects have additional methods which return information on all the successful matches of a repeated capture group. These methods are:

  • matchobject.captures([group1, ...])
  • matchobject.starts([group])
  • matchobject.ends([group])
  • matchobject.spans([group])

>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)]

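\G

A search anchor: it matches at the position where each search started or continued. It can be used for contiguous matches, or in negative variable-length lookbehinds to limit how far back the lookbehind goes:

>>> regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']
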
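(?|...|...)

Branch reset

Capture group numbers are reused across all the alternatives, but groups with different names have different group numbers.

>>> regex.match(r"(?|(first)|(second))", "first").groups()
('first',)
>>> regex.match(r"(?|(first)|(second))", "second").groups()
('second',)

Note: there is only one group.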

Timeout

The matching methods and functions support timeouts. The timeout (in seconds) applies to the entire operation:

>>> from time import sleep
>>>
>>> def fast_replace(m):
...     return 'X'
...
>>> def slow_replace(m):
...     sleep(0.5)
...     return 'X'
...
>>> regex.sub(r'[a-z]', fast_replace, 'abcde', timeout=2)
'XXXXX'
>>> regex.sub(r'[a-z]', slow_replace, 'abcde', timeout=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\site-packages\regex\regex.py", line 276, in sub
    endpos, concurrent, timeout)
TimeoutError: regex timed out
