
Python regular expressions: the regex module

Published: 2025/3/21 python 41 豆豆


1. Two behaviours for compatibility with the re module

2. Case-insensitive matches in Unicode

3. Flags

4. Groups

5. Other features (listed below)

Reference: regex 2020.5.7, from the extension module's official site

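The regex regular expression implementation is backwards-compatible with the standard "re" module, but offers additional functionality.

The re module's zero-width match behaviour changed in Python 3.7, and this module follows that behaviour when compiled for Python 3.7.

1. Two behaviours for compatibility with the re module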

Version 0 (old behaviour, compatible with the re module):

Please note that the re module's behaviour may change over time, and I'll endeavour to match that behaviour in version 0.
- Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
- Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:
  - .split won't split a string at a zero-width match.
  - .sub will advance by one character after a zero-width match.
- Inline flags apply to the entire pattern, and they can't be turned off.
- Only simple sets are supported.
- Case-insensitive matches in Unicode use simple case-folding by default.

Version 1 (new behaviour, possibly different from the re module):
- Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
- Zero-width matches are handled correctly.
- Inline flags apply to the end of the group or pattern, and they can be turned off.
- Nested sets and set operations are supported.
- Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module defaults to regex.DEFAULT_VERSION.

2. Case-insensitive matches in Unicode

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Full case-folding can be turned on with the FULLCASE or F flag, or with (?f) in the pattern. Note that this flag affects how the IGNORECASE flag works; the FULLCASE flag does not itself turn on case-insensitive matching.

In the version 0 behaviour, the flag is off by default.
In the version 1 behaviour, the flag is on by default.
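The difference between the two defaults can be seen with a word containing "ß", which full case-folding treats as equivalent to "ss". The following is a minimal sketch, assuming the regex package is installed; the variable names are illustrative only.

```python
import regex

sharp_s = "stra\N{LATIN SMALL LETTER SHARP S}e"  # the 6-character string "straße"

# Version 1: full case-folding is on by default, so "ss" can match "ß".
m_v1 = regex.match(r"(?iV1)strasse", sharp_s)
print(m_v1.span())   # (0, 6)

# Version 0: simple case-folding is the default, so the same pattern fails.
m_v0 = regex.match(r"(?iV0)strasse", sharp_s)
print(m_v0)          # None
```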

3. Flags

There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

Scoped flags: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

Global flags: ASCII, BESTMATCH, ENHANCEMATCH, LOCALE, POSIX, REVERSE, UNICODE, VERSION0, VERSION1.

If neither the ASCII, LOCALE nor UNICODE flag is specified, the default is UNICODE when the pattern is a Unicode string and ASCII when it is a bytestring.
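The default described above can be checked by running the same \w+ pattern as a str and as bytes. This is a small sketch with made-up sample text; (?a) is the inline form of the ASCII flag.

```python
import regex

text = "naïve café"

# str pattern: UNICODE is the default, so \w matches accented letters.
unicode_words = regex.findall(r"\w+", text)
print(unicode_words)   # ['naïve', 'café']

# The ASCII flag restricts \w to ASCII characters.
ascii_words = regex.findall(r"(?a)\w+", text)
print(ascii_words)     # ['na', 've', 'caf']

# bytes pattern: ASCII is the default, so the UTF-8 bytes of the
# accented letters do not count as word characters.
byte_words = regex.findall(rb"\w+", text.encode("utf-8"))
print(byte_words)      # [b'na', b've', b'caf']
```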

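The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit of the next match that it finds.
The BESTMATCH flag makes fuzzy matching search for the best match instead of the next match.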

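4. Groups

All capture groups have a group number, starting from 1. Groups with the same group name have the same group number, and groups with different group names have different group numbers.

The same name can be used by more than one group, with later captures "overwriting" earlier captures. All the captures of the group are available from the captures method of the match object.

Group numbers are reused across the different branches of a branch reset, e.g. (?|(first)|(second)) has only group 1. If the capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 ("foo") and group 2 ("bar").

The regex (\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+) has 2 groups:

(\s+) is group 1.
(?P<foo>[A-Z]+) is group 2, also called "foo".
(\w+) is group 2 because of the branch reset.
(?P<foo>[0-9]+) is group 2 because it's called "foo".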
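The numbering rules above can be inspected on compiled patterns through their groups and groupindex attributes. This is a sketch under the assumption that the regex package is installed; the variable names are illustrative.

```python
import regex

# Branch reset: both alternatives share group number 1.
unnamed = regex.compile(r"(?|(first)|(second))")
print(unnamed.groups)            # 1

# Different names force different numbers, even inside a branch reset.
named = regex.compile(r"(?|(?P<foo>first)|(?P<bar>second))")
name_to_number = dict(named.groupindex)
print(name_to_number)            # {'foo': 1, 'bar': 2}

# A reused name keeps a single number; captures() returns every capture.
m = regex.match(r"(?P<item>\w+) or (?P<item>\w+)", "first or second")
last_capture = m.group("item")
all_captures = m.captures("item")
print(last_capture, all_captures)
```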

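5. Other features (listed below)

Each entry gives the pattern or feature name, followed by its description.

Word start, word end and word boundary positions

regex uses \m to match at the start of a word and \M to match at the end of a word.

\b is a word boundary, but it cannot tell whether the position is the start or the end of a word.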
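As a quick illustration of the two anchors, the sketch below picks out the first and last character of each word; the sample text is made up.

```python
import regex

text = "hello big world"

# \m anchors at the start of a word, so \m\w is each word's first character.
first_chars = regex.findall(r"\m\w", text)
print(first_chars)   # ['h', 'b', 'w']

# \M anchors at the end of a word, so \w\M is each word's last character.
last_chars = regex.findall(r"\w\M", text)
print(last_chars)    # ['o', 'g', 'd']
```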

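(?flags-flags:...)  scoped

(?flags-flags)  global

Scoped control:

(?i:...) turns case-insensitive matching on for the enclosed subpattern, and (?-i:...) turns it off.

Several flags can be written together, as in (?is-f:...): flags to the left of the minus sign are turned on, and flags to the right are turned off.

>>> regex.search(r"(?i:good)", "GOOD")
<regex.Match object; span=(0, 4), match='GOOD'>

Global control:

(?si-f)good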
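To make the scoping concrete, the following sketch compares a flag applied to one subpattern with a flag applied to the whole pattern. The sample text is invented for illustration.

```python
import regex

text = "ab Ab aB AB"

# Scoped: only the "a" is case-insensitive; the "b" must be lower case.
scoped = regex.findall(r"(?i:a)b", text)
print(scoped)        # ['ab', 'Ab']

# Whole pattern: both letters are case-insensitive.
whole = regex.findall(r"(?i)ab", text)
print(whole)         # ['ab', 'Ab', 'aB', 'AB']
```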

lookaround

Support for lookaround in conditional patterns:

>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>

This is not quite the same as putting a lookaround in the first branch of a pair of alternatives:

>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))   # if branch 1 fails, the 2nd branch is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))    # if branch 1 fails, the 2nd branch is NOT tried
None

(?p)

POSIX matching (leftmost longest)

Normal matching:
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>

POSIX matching:
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>

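[[a-z]--[aeiou]]

V0: simple sets, compatible with the re module

V1: nested sets with extended functionality; this set contains 'a' to 'z' but excludes 'a', 'e', 'i', 'o' and 'u'

e.g.:

>>> regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde')
<regex.Match object; span=(1, 4), match='bcd'>

or, with the version passed as a flag:

>>> regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1)
<regex.Match object; span=(1, 4), match='bcd'>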

(?(DEFINE)...)

Defines named groups for later use. If there is no group called "DEFINE", then ... is ignored during matching, except that any groups defined within it can still be called and the normal rules for numbering groups still apply.

e.g.:

>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>

# Fixed subpatterns at both ends, with other content in between
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>

\K

Keeps the part of the match that follows the position where \K occurs, and discards the part that precedes it.

>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')
>>> m    # keeps 'cde', discards 'ab'
<regex.Match object; span=(2, 5), match='cde'>
>>> m[0]
'cde'
>>> m[1]
'abcde'

>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')
>>> m    # reversed: keeps 'bc', discards 'def'
<regex.Match object; span=(1, 3), match='bc'>
>>> m[0]
'bc'
>>> m[1]
'bcdef'

(?r)  reverse search

>>> regex.findall(r".", "abc")
['a', 'b', 'c']
>>> regex.findall(r"(?r).", "abc")
['c', 'b', 'a']

Note: the result of a reverse search is not necessarily the reverse of a forward search.

>>> regex.findall(r"..", "abcde")
['ab', 'cd']
>>> regex.findall(r"(?r)..", "abcde")
['de', 'bc']

expandf

Subscripts can be used to get the captures of a repeated capture group:

>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")    # same as m.expandf("{1[-1]}"): later captures overwrite earlier ones, so {1} is 'c'
'c'
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'

With a named group:
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'

>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'

>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'

The same works on match objects returned by search().

capturesdict()

groupdict()

captures()

capturesdict() is a combination of groupdict() and captures():

groupdict(): returns a dict whose keys are the group names and whose values are the last capture of each group

captures(): returns a list of all the captures of a group

capturesdict(): returns a dict whose keys are the group names and whose values are lists of all the captures of each group

>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}

>>> m.captures("word")
['one', 'two', 'three']

>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}

Ways to access groups

(1) By subscripting and slicing:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> m["before"]
'pqr'
>>> len(m)
4
>>> m[:]
('pqr123stu', 'pqr', '123', 'stu')

(2) By name, with group("name"):
>>> m.group('num')
'123'

(3) By group number:
>>> m.group(0)
'pqr123stu'

>>> m.group(1)
'pqr'

subf

subfn

subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.

>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo'
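subfn is not demonstrated above. Like subn, it returns the new string together with the number of substitutions made. A minimal sketch with an invented input string:

```python
import regex

# Swap each pair of words; subfn also reports how many pairs were replaced.
swapped, count = regex.subfn(r"(\w+) (\w+)", "{2} {1}", "foo bar, baz qux")
print(swapped)   # 'bar foo, qux baz'
print(count)     # 2
```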

partial

Partial matches: match, search, fullmatch and finditer all support partial matching, enabled with the partial keyword argument. The match object has a partial attribute, which is True for a partial match and False for a complete match.

>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)    # complete match: no partial=True in the repr
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial    # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial      # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial     # complete match
False
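One use for partial matching is validating input while it is still being typed. The helper below is a hypothetical sketch (classify is not part of the regex module) that combines fullmatch with partial=True:

```python
import regex

def classify(text, pattern=r"\d{4}"):
    """Classify text typed so far as 'complete', 'partial' or 'invalid'."""
    m = regex.fullmatch(pattern, text, partial=True)
    if m is None:
        return "invalid"            # can never become a match
    return "partial" if m.partial else "complete"

print(classify("12"))     # partial: more digits could still complete it
print(classify("1234"))   # complete
print(classify("12a"))    # invalid
```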

(?P<name>)

Duplicate group names are allowed

Group names can be duplicated; later captures overwrite earlier ones.
Optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']

Mandatory groups:
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', '']

detach_string

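A match object holds a reference to the string that was searched, via its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which may save valuable memory if the string is very large.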

>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None

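(?0), (?1), (?2)

(?R) or (?0) tries to match the entire regex recursively.
(?1), (?2), etc. try to match the relevant capture group (group 1, group 2, and so on). (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane).
(?&name) tries to match the named capture group.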

>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)

>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None)
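A common use of group recursion is matching balanced delimiters, which a non-recursive pattern cannot do for arbitrary depth. The pattern below is an illustrative sketch, not taken from the regex documentation:

```python
import regex

# Group 1 matches "(", then any mix of non-parenthesis characters and
# nested group-1 matches (via the recursive call (?1)), then ")".
balanced = regex.compile(r"^(\((?:[^()]|(?1))*\))$")

nested_ok = balanced.search("(a(b)c)") is not None
deep_ok = balanced.search("((()))") is not None
unclosed_ok = balanced.search("(a(b)c") is not None

print(nested_ok, deep_ok, unclosed_ok)   # True True False
```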

Fuzzy matching

There are three types of error:

Insertion: "i"
Deletion: "d"
Substitution: "s"
Any type of error: "e"

Examples:

foo match "foo" exactly
(?:foo){i} match "foo", permitting insertions
(?:foo){d} match "foo", permitting deletions
(?:foo){s} match "foo", permitting substitutions
(?:foo){i,s} match "foo", permitting insertions and substitutions
(?:foo){e} match "foo", permitting errors

If a certain type of error is specified, then any type not specified is not permitted. In the following examples I'll omit the item and write only the fuzziness:

{d<=3} permit at most 3 deletions, but no other types
{i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
{1<=e<=3} permit at least 1 and at most 3 errors
{i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

It's also possible to state the costs of each type of error and the maximum permitted total cost.

Examples:

{2i+2d+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
{i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

A test can also be added that any substituted or inserted character must pass.

Examples:

{s<=2:[a-z]} at most 2 substitutions, which must be in the character set [a-z].
{s<=2,i<=3:\d} at most 2 substitutions, at most 3 insertions, which must be digits.
By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.

The BESTMATCH flag makes it search for the best match instead.

regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search matches " dog" with 1 error, which is within the limit, and the (?e) then makes it attempt a better fit.
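The three search modes can be compared side by side on the same input. The sketch below reuses the "cat and dog" example and adds the inline form of BESTMATCH, (?b):

```python
import regex

text = "cat and dog"

# Default: the first match within the error limit (note the leading space).
first = regex.search(r"(dog){e<=1}", text)

# ENHANCEMATCH: try to improve the fit of the match that was found.
enhanced = regex.search(r"(?e)(dog){e<=1}", text)

# BESTMATCH: search for the match with the fewest errors.
best = regex.search(r"(?b)(dog){e<=1}", text)

print(repr(first[1]), repr(enhanced[1]), repr(best[1]))
print(best.fuzzy_counts)
```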

The match object has an attribute fuzzy_counts, which gives the total number of substitutions, insertions and deletions:

>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.

>>> # A better match might be possible if the ENHANCEMATCH flag used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.

The match object also has an attribute fuzzy_changes, which gives a tuple of the positions of the substitutions, insertions and deletions:

>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11])

\L<name>

Named lists

The old way:

p = regex.compile(r"first|second|third|fourth|fifth"). If the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered; for example, "cats" must come before "cat".

The new way: the order of the items is irrelevant, because they are treated as a set.

>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)

The named_lists attribute:

>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])}
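The ordering problem mentioned above ("cats" versus "cat") can be checked directly with a named list. In this sketch the list name animals and the sample sentence are made up; the \b anchors require whole-word matches.

```python
import regex

# "cat" is listed before "cats"; with a named list the order does not matter.
animals = ["cat", "cats", "dog"]
p = regex.compile(r"\b\L<animals>\b", animals=animals)

found = p.findall("cats chase a dog, not a cat or a catsup")
print(found)   # ['cats', 'dog', 'cat']
```

"catsup" is not reported because neither "cat" nor "cats" is followed by a word boundary there.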

Set operators

Sets and nested sets

Version 1 behaviour only

Set operators have been added, and a set can contain nested sets.

The operators, in order of increasing precedence, are:

|| for union ("x||y" means "x or y")
~~ (double tilde) for symmetric difference ("x~~y" means "x or y, but not both")
&& for intersection ("x&&y" means "x and y")
-- (double dash) for difference ("x--y" means "x but not y")

Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus, [ab&&cd] is the same as [[a||b]&&[c||d]].

e.g.:

[ab]  # Set containing 'a' and 'b'
[a-z]  # Set containing 'a' .. 'z'
[[a-z]--[qw]]  # Set containing 'a' .. 'z', but not 'q' or 'w'
[a-z--qw]  # Same as above
[\p{L}--QW]  # Set containing all letters except 'Q' and 'W'
[\p{N}--[0-9]]  # Set containing all numbers except '0' .. '9'
[\p{ASCII}&&\p{Letter}]  # Set containing all characters which are ASCII and letter
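Two of the sets listed above can be exercised as follows. This is a small sketch with invented sample strings; (?V1) is required because set operators belong to the version 1 behaviour.

```python
import regex

# Difference: lower-case letters except 'q' and 'w'.
no_qw = regex.findall(r"(?V1)[a-z--qw]+", "qwerty")
print(no_qw)           # ['erty']

# Intersection: characters that are both ASCII and letters.
ascii_letters = regex.findall(r"(?V1)[\p{ASCII}&&\p{Letter}]+", "abc déf 123")
print(ascii_letters)   # ['abc', 'd', 'f']
```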

Start and end indexes

Match objects have additional methods that return information about all the successful matches of a repeated capture group. These methods are:

matchobject.captures([group1, ...])
matchobject.starts([group])
matchobject.ends([group])
matchobject.spans([group])

>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)]

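\G

A search anchor. It matches at the position where each search started or continued, and can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes:

>>> regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']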

(?|...|...)  branch reset

Capture group numbers are reused across all the alternatives, but groups with different names have different group numbers.

>>> regex.match(r"(?|(first)|(second))", "first").groups()
('first',)
>>> regex.match(r"(?|(first)|(second))", "second").groups()
('second',)

Note: there is only one group.

Timeout

The matching methods and functions support timeouts. The timeout (in seconds) applies to the entire operation:

>>> from time import sleep
>>>
>>> def fast_replace(m):
...     return 'X'
...
>>> def slow_replace(m):
...     sleep(0.5)
...     return 'X'
...
>>> regex.sub(r'[a-z]', fast_replace, 'abcde', timeout=2)
'XXXXX'
>>> regex.sub(r'[a-z]', slow_replace, 'abcde', timeout=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\site-packages\regex\regex.py", line 276, in sub
    endpos, concurrent, timeout)
TimeoutError: regex timed out
