go-colly official documentation, translated (translation in progress)

Published: 2024/1/18


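Introduction

How to install

Colly has only one prerequisite: the Go programming language. You can install it by following the official installation guide at https://golang.org/doc/install

To install Colly, type the following command in a terminal and press Enter.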

go get -u github.com/gocolly/colly/...

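Getting started

Before using Colly, make sure you have the latest version. For details, see the installation section above.

Let's start with some simple examples.

First, import Colly into your codebase: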

import "github.com/gocolly/colly"

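Collector

Colly's main entity is the Collector object. A Collector manages the network communication and is responsible for executing the attached callbacks while a collector job is running. To work with Colly, you have to initialize a Collector: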

c := colly.NewCollector()

Callbacks

You can attach different types of callback functions to a Collector to control a collecting job or to retrieve information. See the relevant section of the package documentation.

Adding callbacks to a Collector

c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})

c.OnError(func(_ *colly.Response, err error) {
	log.Println("Something went wrong:", err)
})

c.OnResponseHeaders(func(r *colly.Response) {
	fmt.Println("Visited", r.Request.URL)
})

c.OnResponse(func(r *colly.Response) {
	fmt.Println("Visited", r.Request.URL)
})

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

c.OnHTML("tr td:nth-of-type(1)", func(e *colly.HTMLElement) {
	fmt.Println("First column of a table row:", e.Text)
})

c.OnXML("//h1", func(e *colly.XMLElement) {
	fmt.Println(e.Text)
})

c.OnScraped(func(r *colly.Response) {
	fmt.Println("Finished", r.Request.URL)
})

Call order of callbacks

1. OnRequest

Called before a request is made.

2. OnError

Called if an error occurs during the request.

3. OnResponseHeaders

Called after the response headers are received.

4. OnResponse

Called after the response is received.

5. OnHTML

Called right after OnResponse if the received content is HTML.

6. OnXML

Called right after OnHTML if the received content is HTML or XML.

7. OnScraped

Called after the OnXML callbacks.

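Configuration

Colly is a highly customizable scraping framework. It has sensible defaults and provides plenty of options to change them.

Collector configuration

The full list of collector attributes is in the package documentation for the Collector type. The recommended way to initialize a collector is colly.NewCollector(options...).

Create a collector with default settings: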

c1 := colly.NewCollector()

Create another collector and change the User-Agent and URL revisit options:

c2 := colly.NewCollector(
	colly.UserAgent("xy"),
	colly.AllowURLRevisit(),
)

or

c2 := colly.NewCollector()
c2.UserAgent = "xy"
c2.AllowURLRevisit = true

The configuration can be changed at any point of a scraping job by overriding the collector's attributes.

A good example is a User-Agent switcher, which changes the User-Agent on every request:

const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

// RandomString returns a random string of 10 to 19 letters (uses math/rand).
func RandomString() string {
	b := make([]byte, rand.Intn(10)+10)
	for i := range b {
		b[i] = letterBytes[rand.Intn(len(letterBytes))]
	}
	return string(b)
}

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", RandomString())
})

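Configuration via environment variables

A collector's default configuration can be changed via environment variables. This lets us fine-tune collectors without recompiling. Environment parsing is the last step of collector initialization, so every configuration change made after initialization overrides the configuration parsed from the environment.

Environment variables

COLLY_ALLOWED_DOMAINS (comma-separated list of domains)
COLLY_CACHE_DIR (string)
COLLY_DETECT_CHARSET (y/n)
COLLY_DISABLE_COOKIES (y/n)
COLLY_DISALLOWED_DOMAINS (comma-separated list of domains)
COLLY_IGNORE_ROBOTSTXT (y/n)
COLLY_FOLLOW_REDIRECTS (y/n)
COLLY_MAX_BODY_SIZE (int)
COLLY_MAX_DEPTH (int, 0 means infinite)
COLLY_PARSE_HTTP_ERROR_RESPONSE (y/n)
COLLY_USER_AGENT (string)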

HTTP configuration

Colly uses Golang's default http client as its networking layer. HTTP options can be tuned by changing the default HTTP roundtripper.

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
	}).DialContext,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
})

Best practices

Debugging

Sometimes a few log.Println() calls in your callbacks are enough, but sometimes they are not. Colly has built-in support for debugging collectors: a Debugger interface and several debugger implementations are available.

Attaching a debugger to a collector

Attaching a basic logging debugger requires the debug package (github.com/gocolly/colly/debug) from Colly's repository.

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	c := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))
	// [..]
}

Implementing a custom debugger

You can create any kind of custom debugger by implementing the debug.Debugger interface. LogDebugger is a good example.

Distributed scraping

Distributed scraping can be implemented in different ways, depending on the requirements of the scraping task. Most of the time it is enough to scale the network communication layer, which is easy to do with proxies and Colly's proxy switchers.

Proxy switchers

With a proxy switcher, scraping stays centralized while the HTTP requests are distributed among multiple proxies. Colly supports proxy switching through its SetProxyFunc member. Any custom function with the signature func(*http.Request) (*url.URL, error) can be passed to SetProxyFunc().

Colly has a built-in proxy switcher that rotates a list of proxies on every request.

Usage

package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	if p, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"socks5://127.0.0.1:1338",
		"http://127.0.0.1:8080",
	); err == nil {
		c.SetProxyFunc(p)
	}
	// ...
}

Implementing a custom proxy switcher:

var proxies []*url.URL = []*url.URL{
	&url.URL{Host: "127.0.0.1:8080"},
	&url.URL{Host: "127.0.0.1:8081"},
}

// randomProxySwitcher picks a proxy at random (rand is math/rand).
func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
	return proxies[rand.Intn(len(proxies))], nil
}

// ...
c.SetProxyFunc(randomProxySwitcher)
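
As a complement to the random switcher, here is a minimal sketch of a round-robin function with the same func(*http.Request) (*url.URL, error) signature, written with the standard library only. The name newRoundRobin is hypothetical, and this is not the code behind Colly's built-in switcher; it only shows one way to rotate safely when requests run concurrently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// newRoundRobin parses the proxy URLs once and returns a function with the
// signature that SetProxyFunc expects.
func newRoundRobin(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, fmt.Errorf("at least one proxy URL is required")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		proxies = append(proxies, u)
	}

	var next uint32 // advanced atomically so concurrent requests are safe
	return func(_ *http.Request) (*url.URL, error) {
		i := atomic.AddUint32(&next, 1) - 1
		return proxies[i%uint32(len(proxies))], nil
	}, nil
}

func main() {
	pick, err := newRoundRobin("http://127.0.0.1:8080", "http://127.0.0.1:8081")
	if err != nil {
		panic(err)
	}
	// The request argument is ignored by this switcher, so nil is fine here.
	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u.Host)
	}
}
```

Parsing full URLs also fills in the scheme, which the literal url.URL values in the snippet above leave empty. The returned function could be handed to c.SetProxyFunc in the same way as randomProxySwitcher.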

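Distributed scrapers

To manage independent, distributed scrapers, the best approach is to wrap the scrapers in a server. The server can be any kind of service, such as an HTTP server, a TCP server, or Google App Engine. Use a custom storage backend to get centralized and persistent handling of cookies and visited URLs.

An example implementation is the scraper server in the examples section of this document.

Distributed storage

Visited URLs and cookie data are stored in memory by default. This is convenient for short-lived scraper jobs, but it can be a serious limitation for large-scale or long-running crawling jobs.

Colly can replace the default in-memory storage with any storage backend that implements the colly/storage.Storage interface. Check out the existing storage backends.

Storage backends

Colly has an in-memory storage backend for cookies and visited URLs, but it can be overridden by any custom storage backend that implements colly/storage.Storage.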

Existing storage backends

In-memory backend

The default backend of Colly. Use collector.SetStorage() to override it.

Redis backend

See the Redis example for details.

BoltDB backend

SQLite3 backend

MongoDB backend

PostgreSQL backend

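Using multiple collectors

If a task is complex enough or has different kinds of subtasks, it is advisable to use multiple collectors for one scraping job. A good example is the Coursera course scraper, which uses two collectors: one parses the list views and handles paging, the other collects the course details.

Colly has some built-in methods to support the use of multiple collectors.

Cloning collectors

You can use a collector's Clone() method if the collectors have a similar configuration. Clone() duplicates a collector with an identical configuration but without the attached callbacks.

c := colly.NewCollector(
	colly.UserAgent("myUserAgent"),
	colly.AllowedDomains("foo.com", "bar.com"),
)
// Custom User-Agent and allowed domains are cloned to c2
c2 := c.Clone()

Passing custom data between collectors

Use the collector's Request() function to share a context with other collectors.

Example of sharing a context:

c.OnResponse(func(r *colly.Response) {
	// Put takes a key and a value; the key name used here is an arbitrary choice.
	r.Ctx.Put("Custom-Header", r.Headers.Get("Custom-Header"))
	c2.Request("GET", "https://foo.com/", nil, r.Ctx, nil)
})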

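Crawler configuration

Colly's default configuration is optimized for scraping a small number of sites in one job. If you want to scrape millions of sites, this setup is not the best choice. Here are some tweaks:

Use a persistent storage backend

By default, Colly stores cookies and already visited URLs in memory. You can replace the built-in in-memory storage backend with any custom backend. See the storage backends section for more details.

Use async for long-running jobs with recursive calls

By default, Colly blocks while a request is in progress, so calling Collector.Visit recursively from callbacks produces a constantly growing stack. Setting Collector.Async = true avoids this. (Do not forget to call c.Wait() when using async.)

Disable or limit connection keep-alives

Colly uses HTTP keep-alive to speed up scraping. Keep-alive requires open file descriptors, so long-running jobs can easily reach the max-fd limit.

HTTP keep-alive can be disabled with the following code:

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})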
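
The effect of DisableKeepAlives can be observed without Colly. The sketch below uses a plain net/http client against a local httptest server and counts how many connections the server accepts; the countConnections helper is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConnections sends n sequential requests to a local test server and
// returns how many TCP connections the server accepted.
func countConnections(disableKeepAlives bool, n int) int {
	var opened int32
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, "ok")
		}))
	// ConnState is called by the server whenever a connection changes state.
	srv.Config.ConnState = func(_ net.Conn, state http.ConnState) {
		if state == http.StateNew {
			atomic.AddInt32(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{
		Transport: &http.Transport{DisableKeepAlives: disableKeepAlives},
	}
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
	}
	return int(atomic.LoadInt32(&opened))
}

func main() {
	fmt.Println("keep-alive disabled:", countConnections(true, 3))
	fmt.Println("keep-alive enabled: ", countConnections(false, 3))
}
```

With keep-alive disabled, every request opens its own connection and closes it afterwards, so no idle descriptors pile up; with keep-alive enabled, the client normally reuses connections, which is faster but keeps descriptors open between requests.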

Extensions

Extensions are small helper utilities shipped with Colly. The list of extensions is available in the extensions package (github.com/gocolly/colly/extensions).

Usage

The following example enables the random User-Agent switcher and the Referer setter extensions, then visits httpbin.org twice.

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector()
	visited := false

	extensions.RandomUserAgent(c)
	extensions.Referer(c)

	c.OnResponse(func(r *colly.Response) {
		log.Println(string(r.Body))
		if !visited {
			visited = true
			r.Request.Visit("/get?q=2")
		}
	})

	c.Visit("http://httpbin.org/get")
}

Examples

Basic

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}

Error handling

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Create a collector
	c := colly.NewCollector()

	// Set HTML callback
	// Won't be called if error occurs
	c.OnHTML("*", func(e *colly.HTMLElement) {
		fmt.Println(e)
	})

	// Set error handler
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
	})

	// Start scraping
	c.Visit("https://definitely-not-a.website/")
}

Login

package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// create a new collector
	c := colly.NewCollector()

	// authenticate
	err := c.Post("http://example.com/login", map[string]string{"username": "admin", "password": "admin"})
	if err != nil {
		log.Fatal(err)
	}

	// attach callbacks after login
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
	})

	// start scraping
	c.Visit("https://example.com/")
}

Max depth

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// MaxDepth is 1, so only the links on the scraped page
		// are visited, and no further links are followed
		colly.MaxDepth(1),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Println(link)
		// Visit link found on page
		e.Request.Visit(link)
	})

	// Start scraping on https://en.wikipedia.org
	c.Visit("https://en.wikipedia.org/")
}

Multipart

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"time"

	"github.com/gocolly/colly"
)

// generateFormData builds the multipart form fields, including the image file.
func generateFormData() map[string][]byte {
	f, _ := os.Open("gocolly.jpg")
	defer f.Close()

	imgData, _ := ioutil.ReadAll(f)

	return map[string][]byte{
		"firstname": []byte("one"),
		"lastname":  []byte("two"),
		"email":     []byte("onetwo@example.com"),
		"file":      imgData,
	}
}

// setupServer starts a local HTTP server that accepts the multipart upload.
func setupServer() {
	var handler http.HandlerFunc = func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("received request")
		err := r.ParseMultipartForm(10000000)
		if err != nil {
			fmt.Println("server: Error")
			w.WriteHeader(500)
			w.Write([]byte("<html><body>Internal Server Error</body></html>"))
			return
		}
		w.WriteHeader(200)
		fmt.Println("server: OK")
		w.Write([]byte("<html><body>Success</body></html>"))
	}

	go http.ListenAndServe(":8080", handler)
}

func main() {
	// Start a single route http server to post an image to.
	setupServer()

	c := colly.NewCollector(colly.AllowURLRevisit(), colly.MaxDepth(5))

	// On every html element, print its text and post the form again
	c.OnHTML("html", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
		time.Sleep(1 * time.Second)
		e.Request.PostMultipart("http://localhost:8080/", generateFormData())
	})

	// Before making a request print "Posting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Posting gocolly.jpg to", r.URL.String())
	})

	// Start scraping
	c.PostMultipart("http://localhost:8080/", generateFormData())
	c.Wait()
}

Parallel

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// MaxDepth is 2, so only the links on the scraped page
		// and links on those pages are visited
		colly.MaxDepth(2),
		colly.Async(true),
	)

	// Limit the maximum parallelism to 2
	// This is necessary if the goroutines are dynamically
	// created to control the limit of simultaneous requests.
	//
	// Parallelism can be controlled also by spawning fixed
	// number of go routines.
	c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Println(link)
		// Visit link found on page on a new thread
		e.Request.Visit(link)
	})

	// Start scraping on https://en.wikipedia.org
	c.Visit("https://en.wikipedia.org/")
	// Wait until threads are finished
	c.Wait()
}

Proxy switcher

package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Rotate two socks5 proxies
	rp, err := proxy.RoundRobinProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	// Print the response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/ip five times
	for i := 0; i < 5; i++ {
		c.Visit("https://httpbin.org/ip")
	}
}

Queue

package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/queue"
)

func main() {
	url := "https://httpbin.org/delay/1"

	// Instantiate default collector
	c := colly.NewCollector()

	// create a request queue with 2 consumer threads
	q, _ := queue.New(
		2, // Number of consumer threads
		&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL)
	})

	for i := 0; i < 5; i++ {
		// Add URLs to the queue
		q.AddURL(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Consume URLs
	q.Run(c)
}

Random delay

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	url := "https://httpbin.org/delay/2"

	// Instantiate default collector
	c := colly.NewCollector(
		// Attach a debugger to the collector
		colly.Debugger(&debug.LogDebugger{}),
		colly.Async(true),
	)

	// Limit the number of threads started by colly to two
	// when visiting links whose domains match the "*httpbin.*" glob
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*httpbin.*",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	})

	// Start scraping in four threads on https://httpbin.org/delay/2
	for i := 0; i < 4; i++ {
		c.Visit(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Start scraping on https://httpbin.org/delay/2
	c.Visit(url)
	// Wait until threads are finished
	c.Wait()
}

Rate limit

package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	url := "https://httpbin.org/delay/2"

	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
		// Attach a debugger to the collector
		colly.Debugger(&debug.LogDebugger{}),
	)

	// Limit the number of threads started by colly to two
	// when visiting links whose domains match the "*httpbin.*" glob
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*httpbin.*",
		Parallelism: 2,
		//Delay:      5 * time.Second,
	})

	// Start scraping in five threads on https://httpbin.org/delay/2
	for i := 0; i < 5; i++ {
		c.Visit(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Wait until threads are finished
	c.Wait()
}

Redis backend

package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/queue"
	"github.com/gocolly/redisstorage"
)

func main() {
	urls := []string{
		"http://httpbin.org/",
		"http://httpbin.org/ip",
		"http://httpbin.org/cookies/set?a=b&c=d",
		"http://httpbin.org/cookies",
	}

	c := colly.NewCollector()

	// create the redis storage
	storage := &redisstorage.Storage{
		Address:  "127.0.0.1:6379",
		Password: "",
		DB:       0,
		Prefix:   "httpbin_test",
	}

	// add storage to the collector
	err := c.SetStorage(storage)
	if err != nil {
		panic(err)
	}

	// delete previous data from storage
	if err := storage.Clear(); err != nil {
		log.Fatal(err)
	}

	// close redis client
	defer storage.Client.Close()

	// create a new request queue with redis storage backend
	q, _ := queue.New(2, storage)

	c.OnResponse(func(r *colly.Response) {
		log.Println("Cookies:", c.Cookies(r.Request.URL.String()))
	})

	// add URLs to the queue
	for _, u := range urls {
		q.AddURL(u)
	}
	// consume requests
	q.Run(c)
}

Request context

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector()

	// Before making a request put the URL with
	// the key of "url" into the context of the request
	c.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
	})

	// After making a request get "url" from
	// the context of the request
	c.OnResponse(func(r *colly.Response) {
		fmt.Println(r.Ctx.Get("url"))
	})

	// Start scraping on https://en.wikipedia.org
	c.Visit("https://en.wikipedia.org/")
}

Scraper server

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/gocolly/colly"
)

type pageInfo struct {
	StatusCode int
	Links      map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
	URL := r.URL.Query().Get("url")
	if URL == "" {
		log.Println("missing URL argument")
		return
	}
	log.Println("visiting", URL)

	c := colly.NewCollector()

	p := &pageInfo{Links: make(map[string]int)}

	// count links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link != "" {
			p.Links[link]++
		}
	})

	// extract status code
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
		p.StatusCode = r.StatusCode
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("error:", r.StatusCode, err)
		p.StatusCode = r.StatusCode
	})

	c.Visit(URL)

	// dump results
	b, err := json.Marshal(p)
	if err != nil {
		log.Println("failed to serialize response:", err)
		return
	}
	w.Header().Add("Content-Type", "application/json")
	w.Write(b)
}

func main() {
	// example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'
	addr := ":7171"

	http.HandleFunc("/", handler)

	log.Println("listening on", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}

URL filter

package main

import (
	"fmt"
	"regexp"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only root url and urls which start with "e" or "h" on httpbin.org
		colly.URLFilters(
			regexp.MustCompile("http://httpbin\\.org/(|e.+)$"),
			regexp.MustCompile("http://httpbin\\.org/h.+"),
		),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are matched by any of the URLFilter regexps
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on http://httpbin.org
	c.Visit("http://httpbin.org/")
}
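
To see which URLs the two expressions in the example above let through, they can be tried on their own with the regexp package. This is a small sketch: the allowed helper and the sample URLs are chosen for illustration, and the "any filter matches" rule is taken from the comment in the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// filters holds the two expressions used in the URL filter example.
var filters = []*regexp.Regexp{
	regexp.MustCompile("http://httpbin\\.org/(|e.+)$"),
	regexp.MustCompile("http://httpbin\\.org/h.+"),
}

// allowed reports whether at least one filter matches the URL.
func allowed(u string) bool {
	for _, re := range filters {
		if re.MatchString(u) {
			return true
		}
	}
	return false
}

func main() {
	for _, u := range []string{
		"http://httpbin.org/",              // root URL, matched by the first filter
		"http://httpbin.org/encoding/utf8", // path starts with "e"
		"http://httpbin.org/headers",       // path starts with "h"
		"http://httpbin.org/ip",            // matched by neither filter
	} {
		fmt.Printf("%-34s %v\n", u, allowed(u))
	}
}
```

The first three URLs are accepted and the last one is rejected, so a crawl started at the root only follows links whose path begins with "e" or "h".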
