How to Identify Media Bias: Descriptive Language Understanding to Identify Potential Bias in Text
This blog describes what GumGum can do to bring change by utilizing our Natural Language Processing technology to shed light on potential bias that websites may have in their content. The ideas and techniques shared in this blog are the result of the GumGum Hackathon project Verity E-Quality (Aditya Ramesh, Erica Nishimura, Ishan Shrivastava, Lane Schechter and Trung Do).
In this blog, we will look into how we can utilize and build upon the existing product offerings from GumGum to understand the Gender Representation in a website’s content. We aren’t saying that one publisher is more biased than another; rather, we are merely providing awareness of the representation as it exists. With Natural Language Processing, we can compare the descriptive language used around Males and Females to provide this awareness.
In order to facilitate meaningful change, we need to be aware and mindful of where that change is needed. — Lane Schechter, Product Manager, GumGum Inc.
GumGum’s Product Offerings
Before we move ahead to understand how we build upon the existing product offerings, let us first take a brief look at them. GumGum’s Verity Product does a complete contextual analysis of a publisher’s webpage. Some of the key offerings of this product are:
Contextual Classification & Targeting: This feature identifies and scores a publisher’s content (webpages) for contextual classification based on the standard IAB Content Taxonomy v1.0 and v2.0. Some of those categories are “Sports”, “Food & Drinks”, “Automotive”, “Medical Health”, etc. Going forward, we will refer to them as IAB verticals.
Brand Safety & Suitability: This feature flags and rates brand safety threats based on GumGum’s proprietary threat classification taxonomy and in compliance with The 4A’s Advertising Assurance Brand Safety Framework.
Named Entity Recognition (NER): This feature identifies and extracts any mention of a named entity in the publisher’s content. A named entity could be any mention of a ‘Person’, ‘Location’ or ‘Organization’.
Sentiment Analysis: This feature analyzes the attitudes, opinions and emotions expressed online to provide the most nuanced brand safety and contextual insights.
Here is one way we can provide Descriptive Language Understanding Associated with Gender. We can use the Named Entity Recognition (NER) feature to extract names of the “Person” entity type, which can be used to identify the gender of the person being talked about. We can also use the Sentiment Analysis feature to extract the sentiment of the sentences in which Males and Females are talked about. We can use all of this information to understand the descriptive language used around Males and Females (more on how to do this in the next section) and compare it across the different IAB verticals extracted using our Contextual Classification feature.
Approach for Descriptive Language Understanding Associated with Gender
Fig 1: Flowchart diagram describing the approach for Descriptive Language Understanding Associated with Gender

We start by running a Domain Specific Query on our NLP databases to extract URLs for the given publisher. We then utilize the Named Entity Recognition feature of Verity to filter out pages that do not contain any “Person” named entity. From the remaining pages, we extract all “Person Names” and the sentences in which those “Person Names” occur. As a future step, we could also perform coreference resolution to extract additional sentences in which the “Persons” are mentioned by their respective pronouns.
We then use the “Person Names” to detect the gender of the person with an open source package called Gender Guesser. We also use the “Sentences” to extract sentence-level sentiment, utilizing our own FastText based Sentiment Classification model. This model, trained on our publisher data, classifies a sentence as Negative, Neutral or Positive.
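To make the gender-detection step concrete, here is a minimal sketch of the name-lookup idea behind Gender Guesser. The tiny `NAME_TABLE` below is a hypothetical stand-in; the real package ships a large, curated, multi-language name database and also returns graded labels such as “mostly_male”.

```python
# Minimal stand-in for the Gender Guesser lookup: map a first name to a
# gender label, falling back to "unknown" for names missing from the table.
# NAME_TABLE is illustrative only, not the package's actual data.
NAME_TABLE = {
    "erica": "female",
    "aditya": "male",
    "ishan": "male",
    "lane": "mostly_male",
}

def guess_gender(person_name: str) -> str:
    """Guess a gender label from the first token of a person's name."""
    first_name = person_name.strip().split()[0].lower()
    return NAME_TABLE.get(first_name, "unknown")
```

In the actual pipeline, every extracted “Person Name” would be run through such a lookup, and mentions labelled “unknown” or ambiguous could simply be dropped from the analysis.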
We also use “Person Names” and the sentences they occur in to extract the Adjectives used in the surrounding context for a given person. To achieve this, we use spaCy’s part-of-speech tagger and extract the adjectives used within a given proximity of a mention of a person’s name. Consider the example given below:
Fig 2

We use all this information to create a Word Cloud of the Adjectives used around each Gender and Sentiment pair, across the entire content as well as specific to different IAB verticals.
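The proximity-based adjective extraction described above can be sketched as follows. To keep the example self-contained it operates on pre-tagged (token, POS) pairs rather than calling spaCy directly; in the actual pipeline the tags would come from spaCy’s tagger, and the window size of 5 tokens is an assumed parameter.

```python
def adjectives_near_person(tagged_tokens, person_token, window=5):
    """Collect adjectives (tag ADJ) within `window` tokens of a person mention.

    tagged_tokens: list of (token, pos_tag) pairs, e.g. from spaCy's tagger.
    """
    positions = [i for i, (tok, _) in enumerate(tagged_tokens)
                 if tok == person_token]
    found = []
    for pos in positions:
        lo, hi = max(0, pos - window), min(len(tagged_tokens), pos + window + 1)
        for tok, tag in tagged_tokens[lo:hi]:
            if tag == "ADJ":
                found.append(tok.lower())
    return found

# Hypothetical pre-tagged sentence for illustration.
sentence = [("Serena", "PROPN"), ("was", "AUX"), ("brilliant", "ADJ"),
            ("and", "CCONJ"), ("graceful", "ADJ"), ("today", "NOUN")]
```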
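The frequency tables that drive those word clouds can be sketched with a simple counter over (gender, sentiment, adjective) records; the records below are hypothetical, and the actual rendering would be done with a word-cloud library.

```python
from collections import Counter

# Hypothetical extracted records: (gender, sentiment, adjective).
records = [
    ("male", "positive", "proud"),
    ("male", "positive", "fantastic"),
    ("male", "positive", "proud"),
    ("female", "positive", "beautiful"),
    ("female", "positive", "sweet"),
    ("female", "positive", "beautiful"),
]

def adjective_frequencies(records, gender, sentiment):
    """Frequency table feeding the word cloud for one (gender, sentiment) pair."""
    return Counter(adj for g, s, adj in records
                   if g == gender and s == sentiment)
```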
For example, consider the following four word clouds, built from the Adjectives used around Males and Females in Positive and Negative contexts extracted from a Publisher’s content:
Word clouds of the Adjectives used around Males and around Females in a Negative Sentiment context

Nothing stereotypical stands out here: similarly or equally negative adjectives are used around Males and Females alike.
Word clouds of the Adjectives used around Males and around Females in a Positive Sentiment context

What we see here is that more Intellectual Type Adjectives are used around Males, while more Appearance Type Adjectives are used around Females.
It becomes even clearer if we look at the most frequent Adjectives used around ONLY Males or ONLY Females. We do this by considering the top 15 adjectives for each gender, extracting only the Uncommon Adjectives between the two genders, and comparing them across the Positive and Negative contexts.
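The “uncommon adjectives” step can be sketched as a set difference over the per-gender top-k lists; the adjective counts below are hypothetical.

```python
from collections import Counter

def uncommon_top_adjectives(male_counts, female_counts, k=15):
    """Return the top-k adjectives unique to each gender (shared ones dropped)."""
    top_male = {adj for adj, _ in Counter(male_counts).most_common(k)}
    top_female = {adj for adj, _ in Counter(female_counts).most_common(k)}
    return top_male - top_female, top_female - top_male

# Hypothetical per-gender frequency tables, using k=3 for brevity.
male_only, female_only = uncommon_top_adjectives(
    {"proud": 9, "perfect": 7, "great": 5},
    {"beautiful": 8, "sweet": 6, "great": 5},
    k=3,
)
```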
Fig 7: Most Frequent Adjectives used around Only Males/Only Females, based on the top 15 Adjectives for each Gender in each Sentiment Context

Here we can clearly see that in the Negative context, the most frequent Adjectives used around Only Males and Only Females can be considered equally negative. But in the Positive context, that is clearly not the case. Around Males we see adjectives like “Proud”, “Sized”, “Perfect”, “Fantastic”, etc., while around Females we see adjectives like “Beautiful”, “Healthy”, “Amazing”, “Sweet”, “Supporting”, “Lucky”, etc. This suggests more Intellectual Type Adjectives are used around Males and more Appearance Type Adjectives around Females.
This sort of analysis of the descriptive language used around different Genders in different Sentiment Contexts can really help in understanding what sort of Bias, if any, is present in a publisher’s content. But how can we quantify this? For this, we introduce a Context Based Similarity Score.
Context Based Similarity Score
The idea here is to find a way to compute a single score that shows the degree of similarity between the most frequent adjectives used around only Males and only Females. To achieve this we make use of the famous Transformer based Deep Learning model: BERT by Google Research.
Besides being awesome at a variety of NLP tasks and breaking the State of the Art results on them, BERT is also great at providing Contextualized Word Vector Representations (Embeddings). What that means is that BERT doesn’t provide a single, constant representation of a word; rather, it looks at the context in which the word is used in the sentence and spits out a context sensitive representation of that word. This is particularly useful as it captures more information than static representations such as Word2Vec or GloVe. A famous example used to point this out is that BERT will provide different representations for the word “Bank” depending on the context in which it is used: the context could be a river bank or a financial bank. Therefore, to extract a word representation from BERT, you need to send in a sentence in which the word is used to get its Contextualized Word Vector Representation. (Apart from reading the original paper here, you can also look at this and this for a more visual way of understanding Transformers and BERT.)
Therefore, along with the most frequent Male only and Female only adjectives, we also extract the sentences in which these adjectives are used. We send these sentences into BERT to extract a Contextualized Vector Representation of length 768 for each of these Adjectives, based on the context in which it was used.
We use these representations, which carry rich context information, to compute a Context Based Similarity Score between the Male only Adjectives and Female only Adjectives used in a Positive or a Negative context. We take the mean of the contextual representations of all Male only Adjectives and of all Female only Adjectives to get an averaged representation for each group. We then take the cosine similarity between the two averaged vector representations to compute the Context Based Similarity Score, as shown in the figure below:
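The score computation itself reduces to a mean and a cosine similarity over the 768-dimensional vectors. Here is a sketch with random embeddings standing in for BERT output:

```python
import numpy as np

def context_similarity_score(male_vecs, female_vecs):
    """Cosine similarity between the mean contextual embeddings of the
    male-only and female-only adjectives (each row is one 768-d vector)."""
    m = np.mean(male_vecs, axis=0)
    f = np.mean(female_vecs, axis=0)
    return float(np.dot(m, f) / (np.linalg.norm(m) * np.linalg.norm(f)))

# Dummy 768-d embeddings standing in for BERT output.
rng = np.random.default_rng(0)
male_vecs = rng.normal(size=(5, 768))
female_vecs = rng.normal(size=(5, 768))
score = context_similarity_score(male_vecs, female_vecs)
```

Because cosine similarity is bounded, the score always lands in [-1, 1], with identical adjective sets scoring exactly 1.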
Fig 8: Calculating the Context Based Similarity Score from the Contextualized Word Vector Representations of the Adjectives used around only Males and around only Females

This score is calculated for a given sentiment and a given IAB vertical.
The higher this score, the better the balance between the Adjectives used around each gender in the context of a given sentiment and a given IAB vertical.
Let us look at the Context Based Similarity score in action:
Fig 9: The Context Based Similarity Score based on the most Frequent Adjectives used around Only Males and Only Females in each Sentiment Context

Comparing the two scores, we can see that we get a higher score in the case of Negative sentiment, where similar kinds of Adjectives (equally negative in this case) were used around Males and Females. On the other hand, we get a lower score in the case of Positive sentiment, where we did see some form of Bias: Intellectual Type Adjectives used around Males and Appearance Type Adjectives used around Females.
Conclusion
In this blog we saw how we can analyze the Descriptive Language used around Males and Females. We analyzed the insights found from such an analysis and saw how it can guide and point us to where change might be required. We took a look at how GumGum can leverage Product Offerings like Contextual Classification and Named Entity Recognition from its vast feature arsenal and build upon them to quantify the degree of similarity in the descriptive language used around Males and Females. As part of our future work, we can identify Race mentions in a piece of text and easily extend this work to understand the Descriptive Language used around different Races.
About Me: I graduated with a Master’s in Computer Science from ASU. I am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.
We’re always looking for new talent! View jobs.
Follow us: Facebook | Twitter | Linkedin | Instagram
Translated from: https://medium.com/gumgum-tech/descriptive-language-understanding-to-identify-potential-bias-in-text-89936fefbae7