Using the Twitter REST API in Python to Search and Download Tweets in Bulk
Data Mining, Programming
Getting Twitter data
Let’s use the Tweepy package in Python instead of handling the Twitter API directly. We will do two things with the package: authorize ourselves to use the API, and then use a cursor to access the Twitter search API.
Let’s go ahead and get our imports loaded.
import tweepy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set()
%matplotlib inline
Twitter authorization
To use the Twitter API, you must first register to get API keys. To get Tweepy, just install it via pip install tweepy. The Tweepy documentation is the best place to read about authentication in detail, but I’ll go over the basic steps.
Once you register your app you will receive API keys; next, use Tweepy to get an OAuthHandler. I have the keys stored in a separate config dict.
config = {"twitterConsumerKey": "XXXX", "twitterConsumerSecretKey": "XXXX"}
auth = tweepy.OAuthHandler(config["twitterConsumerKey"], config["twitterConsumerSecretKey"])
redirect_url = auth.get_authorization_url()
redirect_url
Now that we’ve given Tweepy our keys to generate an OAuthHandler, we can use the handler to get a redirect URL. Open that URL in a browser and allow the app to access your account so you can use the API.
Once you’ve authorized your account with the app, you’ll be given a PIN. Pass that number to Tweepy to let it know that you’ve authorized it with the API.
pin = "XXXX"
auth.get_access_token(pin)
Searching for tweets
After getting authorization, we can use it to search for all the tweets containing the term “British Airways”; the maximum number of results is capped by the max_tweets variable below.
api = tweepy.API(auth)

query = 'British Airways'
max_tweets = 10

searched_tweets = [status for status in tweepy.Cursor(api.search, q=query, tweet_mode='extended').items(max_tweets)]

search_dict = {"text": [], "author": [], "created_date": []}
for item in searched_tweets:
    # keep only original tweets, skipping retweets
    if not hasattr(item, "retweeted_status") and not item.full_text.startswith("RT"):
        search_dict["text"].append(item.full_text)
        search_dict["author"].append(item.author.name)
        search_dict["created_date"].append(item.created_at)

df = pd.DataFrame.from_dict(search_dict)
df.head()
#Out:
text author created_date
0 @RwandAnFlyer @KenyanAviation @KenyaAirways @U... Bkoskey 2019-03-06 10:06:14
1 @PaulCol56316861 Hi Paul, I'm sorry we can't c... British Airways 2019-03-06 10:06:09
2 @AmericanAir @British_Airways do you agree wit... Hat 2019-03-06 10:05:38
3 @Hi_Im_AlexJ Hi Alex, I'm glad you've managed ... British Airways 2019-03-06 10:02:58
4 @ZRHworker @British_Airways @Schmidy_87 @zrh_a... Stefan Paetow 2019-03-06 10:02:33
Language detection
The tweets downloaded by the code above can be in any language, and before we use this data for further text mining, we should classify it by performing language detection.
In general, language detection is performed by a pre-trained text classifier based on either the Naive Bayes algorithm or more modern neural networks. Google’s compact language detector library is an excellent choice for production-level workloads where you have to analyze hundreds of thousands of documents in less than a few minutes. However, it’s a bit tricky to set up and as a result, a lot of people rely on calling a language detection API from third-party providers like Algorithmia which are free to use for hundreds of calls a month (free sign up required with no credit cards needed).
Let’s keep things simple in this example and just use a Python library called Langid, which is orders of magnitude slower than the options discussed above but should be fine here since we are only analyzing about a hundred tweets.
from langid.langid import LanguageIdentifier, model

# build the identifier once rather than on every call
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def get_lang(document):
    prob_tuple = identifier.classify(document)
    return prob_tuple[0]

df["language"] = df["text"].apply(get_lang)
We find that there are tweets in four unique languages present in the output, and only 45 out of 100 tweets are in English, which are filtered as shown below.
print(df["language"].unique())
df_filtered = df[df["language"]=="en"]
print(df_filtered.shape)

#Out:
array(['en', 'rw', 'nl', 'es'], dtype=object)
(45, 4)
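Beyond the unique labels, it can help to see how many tweets fall in each language before filtering. A short sketch using `value_counts` (the small frame below is illustrative stand-in data, not the article’s actual results):

```python
import pandas as pd

# Illustrative stand-in for the `df` built above, where the
# `language` column was filled in by get_lang
df = pd.DataFrame({"language": ["en", "en", "rw", "nl", "es", "en"]})

# distribution of detected languages
counts = df["language"].value_counts()
print(counts)

# keep only the English tweets
df_filtered = df[df["language"] == "en"]
print(df_filtered.shape)
```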
Getting sentiment scores for tweets
We can take the df_filtered created in the preceding section and run it through a pre-trained sentiment analysis library. For illustration purposes we are using the one in TextBlob; however, I would highly recommend using a more accurate sentiment model such as those in CoreNLP, or training your own model using scikit-learn or Keras.
Alternatively, if you choose to go the API route, there is a pretty good sentiment API at Algorithmia.
from textblob import TextBlob

def get_sentiments(text):
    # label a tweet by its polarity, with a small neutral band
    blob = TextBlob(text)
    if blob.sentiment.polarity > 0.1:
        return 'positive'
    elif blob.sentiment.polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'

def get_sentiments_score(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
df_filtered["sentiments"]=df_filtered["text"].apply(get_sentiments)
df_filtered["sentiments_score"]=df_filtered["text"].apply(get_sentiments_score)
df_filtered.head()
#Out:
text author created_date language sentiments sentiments_score
0 @British_Airways Having some trouble with our ... Rosie Smith 2019-03-06 10:24:57 en neutral 0.025
1 @djban001 This doesn't sound good, Daniel. Hav... British Airways 2019-03-06 10:24:45 en positive 0.550
2 First #British Airways Flight to #Pakistan Wil... Developing Pakistan 2019-03-06 10:24:43 en positive 0.150
3 I don’t know why he’s not happy. I thought he ... Joyce Stevenson 2019-03-06 10:24:18 en negative -0.200
4 Fancy winning a global holiday for you and a f... Selective Travel Mgt 🌍 2019-03-06 10:23:40 en positive 0.360
Let us plot the sentiment scores to see how many negative, neutral, and positive tweets people are sending about “British Airways”. You can also save the data as a CSV file for further processing later.
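A minimal sketch of that plot and the CSV export, assuming `df_filtered` has the `sentiments` and `sentiments_score` columns created above (the small frame and file names below are illustrative, not from the article):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative stand-in for df_filtered from the previous section
df_filtered = pd.DataFrame({
    "sentiments": ["positive", "neutral", "negative", "positive", "neutral"],
    "sentiments_score": [0.55, 0.025, -0.2, 0.36, 0.0],
})

sns.set()
# one bar per sentiment class
ax = sns.countplot(x="sentiments", data=df_filtered,
                   order=["negative", "neutral", "positive"])
ax.set_title('Tweet sentiments for "British Airways"')
plt.savefig("ba_sentiments.png")

# save the scored tweets for later processing
df_filtered.to_csv("british_airways_tweets.csv", index=False)
```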
Originally published at http://jaympatel.com on February 1, 2019.
Translated from: https://medium.com/towards-artificial-intelligence/using-twitter-rest-apis-in-python-to-search-and-download-tweets-in-bulk-da234b5f155a