

Labelling Unstructured Text Data in Python

Published: 2023/12/15

Labelled data is a crucial requirement for supervised machine learning and has spawned a whole new industry. For unstructured text data, labelling is an expensive and time-consuming activity that requires custom techniques and rules to assign appropriate labels.

With the advent of state-of-the-art ML models and framework pipelines such as TensorFlow and PyTorch, data science practitioners have come to rely on them for a wide range of problems. But these models can only be consumed when well-labelled training datasets are provided, and the cost and quality of this activity are positively correlated with the involvement of subject matter experts (SMEs). These constraints have directed practitioners towards Weak Supervision, an alternative way of labelling training data that uses high-level supervision from SMEs and abstracts over noisier inputs with task-specific heuristics and regular-expression patterns. These techniques have been employed in open-source labelling frameworks such as Snorkel, via its labelling functions, and in paid proprietary tools such as Ground Truth and Dataturks.
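
The idea behind labelling functions can be illustrated in plain Python. The sketch below is not the Snorkel API itself; the keyword rules and the majority-vote resolver are illustrative assumptions, but they mirror how weak-supervision sources are combined:

```python
import re

# Integer labels following the Snorkel convention, where -1 means "abstain".
ABSTAIN, NON_OS, OS = -1, 0, 1

def lf_os_keywords(text):
    """Vote OS when a known OS phrase appears; otherwise abstain."""
    if re.search(r"\b(windows activation|os install|vmware|rhel)\b", text.lower()):
        return OS
    return ABSTAIN

def lf_hardware_keywords(text):
    """Vote Non-OS on hardware-fault vocabulary; otherwise abstain."""
    if re.search(r"\b(disk error|hw)\b", text.lower()):
        return NON_OS
    return ABSTAIN

def majority_label(text, lfs):
    """Combine labelling functions by simple majority vote over non-abstains."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN
```

Each rule either votes for a label or abstains, so noisy heuristics can be added cheaply and resolved downstream.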

The proposed solution is for a multinational enterprise information technology client that develops a wide variety of hardware components as well as software-related services for consumers and businesses. The client deploys a robust service team that supports customers through after-sales services, and recognized the need for an in-depth, automated, and near-real-time analysis of customer communication logs. Such analysis has several benefits, such as enabling proactive identification of product shortcomings and pinpointing improvements for future product releases.

We developed a two-phase solution strategy to address the problem at hand.

The first task was a binary classification to segregate customer calls into Operating System (OS) and Non-Operating System (Non-OS) calls. Since labelled data was not available in this case, we resorted to regular expressions for this classification exercise. Using regex has the added utility of labelling the data into the respective categories. In the second phase, we targeted the 'Non-OS' category to tag other features.

The stepwise solution approach is as follows:

Preprocessing:

1. Create a corpus of frequently used OS phrases and abbreviations (e.g. windows install, windows activation, deployment issue, windows, VMware).

2. Similarly, form a corpus of phrases and words that may occur alongside the OS phrases but indicate non-OS calls.
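
As a minimal sketch, the two corpora might look like the following; the entries beyond the examples given in this article are hypothetical placeholders that a real project would source from SMEs and log inspection:

```python
# Phrases that indicate an OS-related call (illustrative subset).
OS_PHRASES = [
    "windows install", "windows activation", "deployment issue",
    "windows", "vmware", "rhel", "redhat", "os install", "no boot",
]

# Phrases that may co-occur with OS words yet point to non-OS issues,
# e.g. hardware faults mentioned alongside the system configuration.
NON_OS_PHRASES = [
    "hw", "disk error",              # from the sample search terms below
    "memory module", "power supply", # hypothetical additions
]
```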

Core steps:

1. Standard text cleaning procedures such as:

a) Convert all text to lower case

b) Remove multiple spaces

c) Remove punctuation and special characters

d) Remove non-ASCII characters

2. In the first search pass, identify OS-related words and phrases to tag the relevant calls as OS calls.

3. In the second search pass, identify non-OS-related words and phrases to tag calls concerning features other than operating systems. This is needed because most call logs record the system configuration, which can lead to calls being falsely tagged as OS.

Details for the phrase and word search:

a) Split the text of each row into words and store the word list in a dictionary, keyed by the text or its unique id.

b) Split each phrase of the corpus into words and search for each word of the phrase in each element of the dictionary. If all the words of a phrase are present in a given element, tag the respective text or unique id accordingly.

c) Similarly, search all the text for the single words in the corpus and tag successful matches accordingly.

Code Snippets

Text cleaning:
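
The original post embeds this snippet as an image, so the code below is a reconstruction of the four cleaning steps listed above rather than the author's exact implementation:

```python
import re

def clean_text(text):
    """Apply the standard cleaning steps: lower-case, drop non-ASCII,
    strip punctuation/special characters, and collapse whitespace."""
    text = text.lower()                             # a) lower case
    text = text.encode("ascii", "ignore").decode()  # d) drop non-ASCII
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # c) punctuation/specials
    text = re.sub(r"\s+", " ", text).strip()        # b) collapse spaces
    return text
```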

Phrase search:
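
Again reconstructed from the description, since the original snippet is an image, a phrase search following steps a) to c) above might look like:

```python
def build_word_index(records):
    """a) Map each unique id to the set of words in its (cleaned) text."""
    return {uid: set(text.split()) for uid, text in records.items()}

def phrase_matches(words, phrase):
    """b) A phrase matches when every one of its words is present."""
    return all(w in words for w in phrase.split())

def tag_records(records, phrases, tag):
    """c) Tag every record whose text matches any phrase in the corpus."""
    index = build_word_index(records)
    return {uid: tag for uid, words in index.items()
            if any(phrase_matches(words, p) for p in phrases)}
```

Running the OS pass first and the non-OS pass second reproduces the two search passes described in the core steps.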

Limitations

1. Currently, the text is searched only for the phrases of a single product and tagged accordingly. As an improvement, phrases for multiple products could be included and the calls tagged in a similar fashion.

2. We can also include translation of foreign-language logs and check for spelling mistakes.

3. Domain experts can help create an exclusive set of words and phrases for each product, making the solution more customizable for different industry segments.

Sample search results

1. OS Terms: RHEL, RedHat, OS install, no boot, subscription

2. Non-OS Terms: HW (Hardware), Disk Error

Proposed Future Enhancements

1. The labelled training data can be used to train an NLP-based binary classification model that classifies call logs into OS and Non-OS classes.

2. Textual data needs to be converted into vectorized form, which can be achieved by using word embeddings for each token in a sentence. We can use pre-trained open-source embeddings such as FastText, BERT, or GloVe.
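
As a toy illustration of the idea (the three-dimensional vectors below are made up; a real pipeline would load pre-trained FastText or GloVe vectors), a sentence can be represented by averaging the embeddings of its tokens:

```python
# Toy 3-d vectors; in practice these come from FastText, GloVe, etc.
EMBEDDINGS = {
    "disk":    [0.9, 0.1, 0.0],
    "error":   [0.8, 0.2, 0.1],
    "windows": [0.1, 0.9, 0.3],
}

def sentence_vector(sentence, dim=3):
    """Average the embeddings of known tokens (a simple sentence encoder)."""
    vecs = [EMBEDDINGS[t] for t in sentence.lower().split() if t in EMBEDDINGS]
    if not vecs:
        return [0.0] * dim
    return [sum(axis) / len(vecs) for axis in zip(*vecs)]
```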

3. State-of-the-art neural network models can be used for the classification task, with RNN/GRU/LSTM layers to learn representations of text sequences.

Translated from: https://towardsdatascience.com/labelling-unstructured-text-data-in-python-974e809b98d9
