Go Web Crawler Tutorial - Go语言中文网 - Golang中文社区

Published: 2024/2/28


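I used to write crawlers in Python, but I recently found that Go makes the job fairly convenient too. Here is a brief introduction.

The crawler discussed here does not loop continuously over large numbers of resources across the web; it simply uses a program to extract specific information from particular pages. The work splits into two parts: fetching the page, and parsing it.

Fetching a page. This usually means sending an HTTP GET or POST request to the server and reading the response. Go's net/http package handles this well.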

GET:

    resp, err := http.Get("http://www.legendtkl.com")
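The one-liner above leaves out reading and closing the response body. The following is a minimal sketch of a complete GET, not a definitive implementation. The helper names `fetch` and `fetchFromLocalServer` are made up for illustration, and the throwaway `httptest` server is an assumption that stands in for a real site so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// fetch issues a GET request and returns the response body as a string.
// (Hypothetical helper, not part of net/http.)
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	// Always close the body so the underlying connection can be reused.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchFromLocalServer starts a local test server (a stand-in for a real
// website) and fetches its only page.
func fetchFromLocalServer() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>hello</body></html>")
	}))
	defer srv.Close()
	return fetch(srv.URL)
}

func main() {
	page, err := fetchFromLocalServer()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(page)
}
```

Checking the status code before reading the body is a design choice here: http.Get only returns an error for transport failures, so a 404 or 500 page would otherwise be treated as valid content.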

POST:

    // Post a raw body with an explicit content type.
    resp, err := http.Post("http://example.com/upload", "image/jpeg", &buf)

    // Post URL-encoded form fields.
    resp, err := http.PostForm("http://example.com/form", url.Values{"key": {"Value"}, "id": {"123"}})
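To show what the server receives from PostForm, here is a small sketch that posts the same two fields to a local test server and echoes them back. The helpers `submitForm` and `postToLocalServer` and the echo handler are assumptions made for illustration; a real target would of course define its own form fields.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// submitForm posts two URL-encoded fields and returns the server's reply.
// (Hypothetical helper for illustration.)
func submitForm(endpoint string) (string, error) {
	resp, err := http.PostForm(endpoint, url.Values{"key": {"Value"}, "id": {"123"}})
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// postToLocalServer runs submitForm against a local echo server so the
// example works without network access.
func postToLocalServer() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// FormValue parses the request body on first use.
		fmt.Fprintf(w, "%s key=%s id=%s", r.Method, r.FormValue("key"), r.FormValue("id"))
	}))
	defer srv.Close()
	return submitForm(srv.URL)
}

func main() {
	reply, err := postToLocalServer()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply)
}
```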

For finer control over the HTTP client, you can create a Client yourself:

    client := &http.Client{
        CheckRedirect: redirectPolicyFunc,
    }

The Client struct looks like this:

    type Client struct {
        // Transport specifies the mechanism by which HTTP requests are made.
        Transport RoundTripper
        // CheckRedirect sets the redirect policy; if nil, the default policy
        // is used, which stops after 10 consecutive requests.
        CheckRedirect func(req *Request, via []*Request) error
        // Jar specifies the cookie jar.
        Jar CookieJar
        // Timeout is the time limit for a request.
        Timeout time.Duration
    }
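As a rough illustration of the CheckRedirect and Timeout fields, the sketch below builds a client with a custom redirect limit and points it at a local server that redirects forever. The names `newClient` and `countRequests`, the 5-second timeout, and the `/loop` path are all assumptions chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// newClient returns a client that times out after 5 seconds and follows at
// most maxRedirects redirects. (Hypothetical helper for illustration.)
func newClient(maxRedirects int) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// via holds the requests made so far, oldest first.
			if len(via) > maxRedirects {
				return errors.New("too many redirects")
			}
			return nil
		},
	}
}

// countRequests points the client at a local server that redirects forever
// and reports how many requests reached it before the client gave up.
func countRequests(maxRedirects int) (int, error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt32(&hits, 1)
		http.Redirect(w, r, "/loop", http.StatusFound)
	}))
	defer srv.Close()

	resp, err := newClient(maxRedirects).Get(srv.URL)
	if resp != nil {
		resp.Body.Close()
	}
	return int(atomic.LoadInt32(&hits)), err
}

func main() {
	hits, err := countRequests(2)
	fmt.Println("requests sent:", hits)
	fmt.Println("error:", err)
}
```

With a limit of 2, the server sees the initial request plus two followed redirects; the third redirect is rejected and surfaces as an error from Get.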

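HTTP header. These are the key-value pairs sent along with a request, for example when posting. The type is defined as follows:

    type Header map[string][]string

It provides operations such as Add, Del, and Set.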
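The difference between these operations is easiest to see side by side. The following sketch uses a made-up `buildHeader` helper and hypothetical header values (`example-crawler/0.1`, `X-Debug`): Set replaces any existing values, Add appends another value, and Del removes the key entirely.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildHeader demonstrates Set, Add, and Del on an http.Header.
// (Hypothetical helper; the header values are made up.)
func buildHeader() http.Header {
	h := http.Header{}
	h.Set("User-Agent", "example-crawler/0.1") // Set replaces existing values
	h.Add("Accept", "text/html")               // Add appends a value
	h.Add("Accept", "application/xhtml+xml")   // the key now holds two values
	h.Set("X-Debug", "1")
	h.Del("X-Debug") // Del removes the key and all its values
	return h
}

func main() {
	h := buildHeader()
	fmt.Println(h.Get("Accept"))     // Get returns only the first value
	fmt.Println(h.Values("Accept"))  // Values returns every value for the key
	fmt.Println(h.Get("user-agent")) // lookups through Get are case-insensitive
}
```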

HTTP request. Earlier we passed a server address straight to Get; for finer control, build a Request and send it with Do():

    func (c *Client) Do(req *Request) (*Response, error)

    client := &http.Client{
        ...
    }
    req, err := http.NewRequest("GET", "http://example.com", nil)
    req.Header.Add("If-None-Match", `W/"wyzzy"`)
    resp, err := client.Do(req)

The fields inside Request belong to HTTP itself, so I won't go into detail here.
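To make the If-None-Match snippet concrete, here is a sketch of a conditional GET sent with Do(). It assumes a local test server that answers 304 Not Modified when the ETag matches; the helpers `conditionalGet` and `statusesFromLocalServer` and the server's behavior are illustrative assumptions, not something the original example specifies.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// conditionalGet sends a GET with an optional If-None-Match header and
// returns the response status code. (Hypothetical helper.)
func conditionalGet(client *http.Client, url, etag string) (int, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, err
	}
	if etag != "" {
		req.Header.Add("If-None-Match", etag)
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// statusesFromLocalServer requests the same page twice: once without an
// ETag and once with the ETag the server uses.
func statusesFromLocalServer() (first, second int, err error) {
	const etag = `W/"wyzzy"`
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", etag)
		fmt.Fprint(w, "fresh content")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	if first, err = conditionalGet(client, srv.URL, ""); err != nil {
		return
	}
	second, err = conditionalGet(client, srv.URL, etag)
	return
}

func main() {
	first, second, err := statusesFromLocalServer()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("without ETag:", first)
	fmt.Println("with ETag:", second)
}
```

For a crawler, a 304 response means the cached copy is still valid and the page does not need to be downloaded or parsed again.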

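Everything above only retrieves the page content; the more important part is parsing it. Here we use goquery, a jQuery-like library that makes parsing HTML convenient. Let's start with an example that scrapes Qiushibaike (糗事百科).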

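    package main

    import (
        "fmt"
        "log"

        "github.com/PuerkitoBio/goquery"
    )

    func ExampleScrape() {
        // Fetch and parse the page. NewDocument is marked deprecated in newer
        // goquery releases, which recommend NewDocumentFromReader instead.
        doc, err := goquery.NewDocument("http://www.qiushibaike.com")
        if err != nil {
            log.Fatal(err)
        }
        // Visit every post on the page.
        doc.Find(".article").Each(func(i int, s *goquery.Selection) {
            // Keep only posts that contain neither an image nor a video.
            if len(s.Find(".thumb").Nodes) == 0 && len(s.Find(".video_holder").Nodes) == 0 {
                content := s.Find(".content").Text()
                fmt.Printf("%s", content)
            }
        })
    }

    func main() {
        ExampleScrape()
    }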

Running the program prints the text of each post that passes the filter.

The core of goquery is the Find function, whose prototype is:

    func (s *Selection) Find(selector string) *Selection

The Selection it returns has the following structure:

    type Selection struct {
        Nodes []*html.Node
        // unexported fields omitted
    }

The html package is golang.org/x/net/html, and its Node struct looks like this:

    type Node struct {
        Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node
        Type      NodeType
        DataAtom  atom.Atom
        Data      string
        Namespace string
        Attr      []Attribute
    }

This package is used to parse HTML pages. The second if in the code above filters out Qiushibaike posts that contain videos or images.

goquery has a rich API that makes parsing convenient; you can look up the details in its godoc.
