當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

scrape创建_确实在2分钟内对Scrape公司进行了评论和评分

發(fā)布時間：2023/11/29 编程问答 53 豆豆

生活随笔收集整理的這篇文章主要介紹了 scrape创建_确实在2分钟内对Scrape公司进行了评论和评分小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

scrape創(chuàng)建

網(wǎng)頁搜羅，數(shù)據(jù)科學(xué) (Web Scraping, Data Science)

In this tutorial, I will show you how to perform web scraping using Anaconda Jupyter notebook and the BeautifulSoup library.

在本教程中，我將向您展示如何使用Anaconda Jupyter筆記本和BeautifulSoup庫執(zhí)行Web抓取。

We’ll be scraping Company reviews and ratings from Indeed platform, and then we will export them to Pandas library dataframe and then to a .CSV file.

我們將從Indeed平臺上抓取公司的評論和評分，然后將它們導(dǎo)出到Pandas庫數(shù)據(jù)框，然后導(dǎo)出到.CSV文件。

Let us get straight down to business, however, if you’re looking on a guide to understanding Web Scraping in general, I advise you of reading this article from Dataquest.

但是，讓我們直接從事業(yè)務(wù)，但是，如果您正在尋找一般理解Web爬網(wǎng)的指南，建議您閱讀Dataquest的這篇文章。

Let us start by importing our 3 libraries

讓我們從導(dǎo)入3個庫開始

from bs4 import BeautifulSoup
import pandas as pd
import requests

Then, let’s go to indeed website and examine which information we want, we will be targeting Ernst & Young firm page, you can check it from the following link

然后，讓我們轉(zhuǎn)到確實的網(wǎng)站并檢查我們想要的信息，我們將以安永會計師事務(wù)所為目標(biāo)頁面，您可以從以下鏈接中進(jìn)行檢查

https://www.indeed.com/cmp/Ey/reviews?fcountry=IT

Based on my location, the country is indicated as Italy but you can choose and control that if you want.

根據(jù)我的位置，該國家/地區(qū)顯示為意大利，但您可以根據(jù)需要選擇和控制該國家/地區(qū)。

In the next picture, we can see the multiple information that we can tackle and scrape:

在下一張圖片中，我們可以看到我們可以解決和抓取的多種信息：

1- Review Title

1-評論標(biāo)題

2- Review Body

2-審查機(jī)構(gòu)

3- Rating

3-評分

4- The role of the reviewer

4-審稿人的角色

5- The location of the reviewer

5-評論者的位置

6- The review date

6-審查日期

However, you can notice that Points 4,5&6 are all in one line and will be scraped together, this can cause a bit of confusion for some people, but my advice is to scrape first then solve problems later. So, let’s try to do this.

但是，您會注意到，點4,5＆6都在同一行中，并且將被刮擦在一起，這可能會使某些人感到困惑，但是我的建議是先刮擦然后再解決問題。因此，讓我們嘗試執(zhí)行此操作。

After knowing what we want to scrape, we need to find out how much do we need to scrape, do we want only 1 review? 1 page of reviews or all pages of reviews? I guess the answer should be all pages!!

知道要抓取的內(nèi)容后，我們需要找出需要抓取的數(shù)量，我們只需要進(jìn)行1次審核嗎？ 1頁評論或所有頁面評論？我想答案應(yīng)該是所有頁面！

If you scrolled down the page and went over to page 2 you will find that the link for that page became as following:

如果您向下滾動頁面并轉(zhuǎn)到頁面2，則會發(fā)現(xiàn)該頁面的鏈接如下：

https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start=20

Then try to go to page 3, you will find the link became as following:

然后嘗試轉(zhuǎn)到第3頁，您會發(fā)現(xiàn)鏈接如下所示：

https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start=4

Looks like we have a pattern here, page 2=20 , page 3 = 40, then page 4 = 60, right? All untill page 8 = 140

看起來我們這里有一個模式，第2頁= 20，第3頁= 40，然后第4頁= 60，對嗎？全部直到第8頁= 140

Let’s get back to coding, start by defining your dataframe that you want.

讓我們回到編碼，首先定義所需的數(shù)據(jù)框。

df = pd.DataFrame({‘review_title’: [],’review’:[],’author’:[],’rating’:[]})

In the next code I will make a for loop that starts from 0, jumps 20 and stops at 140.

在下一個代碼中，我將創(chuàng)建一個for循環(huán)，該循環(huán)從0開始，跳20，然后在140處停止。

1- Inside that for loop we will make a GET request to the web server, which will download the HTML contents of a given web page for us.

1-在該for循環(huán)內(nèi)，我們將向Web服務(wù)器發(fā)出GET請求，該服務(wù)器將為我們下載給定網(wǎng)頁HTML內(nèi)容。

2- Then, We will use the BeautifulSoup library to parse this page, and extract the text from it. We first have to create an instance of the BeautifulSoup class to parse our document

2-然后，我們將使用BeautifulSoup庫解析此頁面，并從中提取文本。我們首先必須創(chuàng)建BeautifulSoup類的實例來解析我們的文檔

3- Then by inspecting the html, we choose the classes from the web page, classes are used when scraping to specify specific elements we want to scrape.

3-然后通過檢查html，我們從網(wǎng)頁上選擇類，在抓取時使用這些類來指定要抓取的特定元素。

4- And then we can conclude by adding the results to our DataFrame created before.

4-然后我們可以通過將結(jié)果添加到之前創(chuàng)建的DataFrame中來得出結(jié)論。

“I added a picture down for how the code should be in case you copied and some spaces were added wrong”

“我在圖片上添加了圖片，以防萬一您復(fù)制了代碼并添加了錯誤的空格，應(yīng)該如何處理”

for i in range(10,140,20):
url = (f’https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start={i}')
header = {“User-Agent”:”Mozilla/5.0 Gecko/20100101 Firefox/33.0 GoogleChrome/10.0"}
page = requests.get(url,headers = header)
soup = BeautifulSoup(page.content, ‘lxml’)
results = soup.find(“div”, { “id” : ‘cmp-container’})
elems = results.find_all(class_=’cmp-Review-container’)
for elem in elems:
title = elem.find(attrs = {‘class’:’cmp-Review-title’})
review = elem.find(‘div’, {‘class’: ‘cmp-Review-text’})
author = elem.find(attrs = {‘class’:’cmp-Review-author’})
rating = elem.find(attrs = {‘class’:’cmp-ReviewRating-text’})
df = df.append({‘review_title’: title.text,
‘review’: review.text,
‘a(chǎn)uthor’: author.text,
‘rating’: rating.text
}, ignore_index=True)

DONE. Let’s check our dataframe

完成。讓我們檢查一下數(shù)據(jù)框

df.head()

Now, once scraped, let’s try solve the problem we have.

現(xiàn)在，一旦刮掉，讓我們嘗試解決我們遇到的問題。

Notice the author coulmn had 3 differnt information seperated by (-)

請注意，作者可能有3個不同的信息，并以(-)分隔

So, let’s split them

所以，讓我們分開

author = df[‘a(chǎn)uthor’].str.split(‘-’, expand=True)

Now, let’s rename the columns and delete the last one.

現(xiàn)在，讓我們重命名列并刪除最后一列。

author = author.rename(columns={0: “job”, 1: “l(fā)ocation”,2:’time’})del author[3]

Then let’s join those new columns to our original dataframe and delete the old author column

然后，將這些新列添加到原始數(shù)據(jù)框中，并刪除舊的author列

df1 = pd.concat([df,author],axis=1)
del df1[‘a(chǎn)uthor’]

let’s examine our new dataframe

讓我們檢查一下新的數(shù)據(jù)框

df1.head()

Let’s re-organize the columns and remove any duplicates

讓我們重新整理各列并刪除所有重復(fù)項

df1 = df1[[‘job’, ‘review_title’, ‘review’, ‘rating’,’location’,’time’]]
df1 = df1.drop_duplicates()

Then finally let’s save the dataframe to a CSV file

最后，讓我們將數(shù)據(jù)框保存到CSV文件中

df1.to_csv(‘EY_indeed.csv’)

You should now have a good understanding of how to scrape and extract data from Indeed. A good next step for you if you are familiar a bit with web scraping it to pick a site and try some web scraping on your own.

您現(xiàn)在應(yīng)該對如何從Indeed抓取和提取數(shù)據(jù)有很好的了解。如果您對網(wǎng)絡(luò)抓取有點熟悉，可以選擇一個不錯的下一步來選擇一個站點，然后自己嘗試一些網(wǎng)絡(luò)抓取。

Happy Coding:)

快樂編碼：)

翻譯自: https://towardsdatascience.com/scrape-company-reviews-ratings-from-indeed-in-2-minutes-59205222d3ae

scrape創(chuàng)建

總結(jié)

以上是生活随笔為你收集整理的scrape创建_确实在2分钟内对Scrape公司进行了评论和评分的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。