[Python] Running a Website Scraper on a Schedule
Today we discuss how to use Python, a SQLite database, and the crontab utility to deploy a scraper to a server and have it fetch and store data on a schedule.

Writing the scraper code

We write a scraper that uses the requests and beautifulsoup4 packages to fetch and parse the Yahoo! Stock TSE (listed) and OTC (over-the-counter) closing-price ranking pages, then uses pandas to display the parsed results.
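Before the full program, here is a minimal sketch of the CSS-selector extraction the scraper relies on. The HTML fragment below is hypothetical, shaped like the ranking table's name cells ("ticker stock-name" inside an anchor):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the ranking table's ".name" cells
html = """
<table>
  <td class="name"><a href="#">2330 台積電</a></td>
  <td class="name"><a href="#">3008 大立光</a></td>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# ".name a" selects the anchor inside each cell with class "name";
# splitting its text separates the ticker from the stock name
tickers = [a.text.split()[0] for a in soup.select(".name a")]
stocks = [a.text.split()[1] for a in soup.select(".name a")]
print(tickers)  # ['2330', '3008']
print(stocks)   # ['台積電', '大立光']
```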
```python
import datetime

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_price_ranks():
    # One shared timestamp for all 200 rows scraped in this run
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]
    stock_types = ["tse", "otc"]
    price_rank_urls = ["https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e={}&n=100".format(st) for st in stock_types]
    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []
    ttl_steps = 10 * 100  # 100 rows, 10 <td> cells apart
    each_step = 10
    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Each ".name a" anchor holds "ticker stock-name"
        ticker = [i.text.split()[0] for i in soup.select(".name a")]
        tickers += ticker
        stock = [i.text.split()[1] for i in soup.select(".name a")]
        stocks += stock
        # Price, volume, and market value sit at fixed offsets in the table
        price = [float(soup.find_all("td")[2].find_all("td")[i].text) for i in range(5, 5 + ttl_steps, each_step)]
        prices += price
        volume = [int(soup.find_all("td")[2].find_all("td")[i].text.replace(",", "")) for i in range(11, 11 + ttl_steps, each_step)]
        volumes += volume
        mkt_value = [float(soup.find_all("td")[2].find_all("td")[i].text) * 100000000 for i in range(12, 12 + ttl_steps, each_step)]
        mkt_values += mkt_value
    types = ["上市" for _ in range(100)] + ["上柜" for _ in range(100)]
    ky_registered = ["KY" in st for st in stocks]
    df = pd.DataFrame()
    df["scrapingTime"] = current_dts
    df["type"] = types
    df["kyRegistered"] = ky_registered
    df["ticker"] = tickers
    df["stock"] = stocks
    df["price"] = prices
    df["volume"] = volumes
    df["mktValue"] = mkt_values
    return df

price_ranks = get_price_ranks()
print(price_ranks.shape)
```
Running this prints:

## (200, 8)

Next we use pandas to preview the first and last few rows:

```python
price_ranks.head()
price_ranks.tail()
```
Now we deploy it to a server.

Choosing a server and configuring its environment are outside the scope of this lesson; we mainly want to cover how to set up the scheduled task.

First we adapt the code so that the results are stored in SQLite.
```python
import datetime
import sqlite3

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_price_ranks():
    # One shared timestamp for all 200 rows scraped in this run
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]
    stock_types = ["tse", "otc"]
    price_rank_urls = ["https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e={}&n=100".format(st) for st in stock_types]
    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []
    ttl_steps = 10 * 100  # 100 rows, 10 <td> cells apart
    each_step = 10
    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, 'html.parser')
        ticker = [i.text.split()[0] for i in soup.select(".name a")]
        tickers += ticker
        stock = [i.text.split()[1] for i in soup.select(".name a")]
        stocks += stock
        price = [float(soup.find_all("td")[2].find_all("td")[i].text) for i in range(5, 5 + ttl_steps, each_step)]
        prices += price
        volume = [int(soup.find_all("td")[2].find_all("td")[i].text.replace(",", "")) for i in range(11, 11 + ttl_steps, each_step)]
        volumes += volume
        mkt_value = [float(soup.find_all("td")[2].find_all("td")[i].text) * 100000000 for i in range(12, 12 + ttl_steps, each_step)]
        mkt_values += mkt_value
    types = ["上市" for _ in range(100)] + ["上櫃" for _ in range(100)]
    ky_registered = ["KY" in st for st in stocks]
    df = pd.DataFrame()
    df["scrapingTime"] = current_dts
    df["type"] = types
    df["kyRegistered"] = ky_registered
    df["ticker"] = tickers
    df["stock"] = stocks
    df["price"] = prices
    df["volume"] = volumes
    df["mktValue"] = mkt_values
    return df

price_ranks = get_price_ranks()
# Append this run's 200 rows to the price_ranks table
conn = sqlite3.connect('/home/ubuntu/yahoo_stock.db')
price_ranks.to_sql("price_ranks", conn, if_exists="append", index=False)
conn.close()
```
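With `if_exists="append"`, to_sql creates the table on the first run and appends one batch of rows on every later run. A minimal sketch of this behavior, using an in-memory database and a tiny hypothetical DataFrame instead of the yahoo_stock.db file above:

```python
import sqlite3

import pandas as pd

# In-memory database stands in for /home/ubuntu/yahoo_stock.db
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"ticker": ["2330"], "price": [500.0]})

# First write creates the table; the second appends to it --
# exactly what happens once per scheduled run
df.to_sql("price_ranks", conn, if_exists="append", index=False)
df.to_sql("price_ranks", conn, if_exists="append", index=False)

count = pd.read_sql("SELECT COUNT(*) AS n FROM price_ranks", conn)["n"][0]
print(count)  # 2
conn.close()
```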
To have it start on a schedule, we use the Linux crontab utility. Suppose we want the script to run once an hour between 09:30 and 16:30 every day. We save the file as price_rank_scraper.py, then add the following line to the crontab (for example via crontab -e):

```
30 9-16 * * * /home/ubuntu/miniconda3/bin/python /home/ubuntu/price_rank_scraper.py
```
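For reference, the five leading fields of a crontab entry are minute, hour, day of month, month, and day of week, read left to right:

```
# ┌──────── minute        (30   → at minute 30)
# │  ┌───── hour          (9-16 → every hour from 09 through 16)
# │  │   ┌─ day of month  (*    → every day)
# │  │   │ ┌ month        (*    → every month)
# │  │   │ │ ┌ day of week (*   → every day of the week)
# 30 9-16 * * * <command to run>
```

So `30 9-16 * * *` fires at 09:30, 10:30, ..., 16:30 each day.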
With that, we have successfully set up a scheduled scraping task.