Web Programming Module 1 (HTML): Core Python Programming Notes, Chapter 20, Web Programming

Published: 2023/12/10

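Chapter 20: Web Programming

20.1 Introduction

Client/server (C/S) architecture: the server runs indefinitely, waiting for client requests.

HTTP is a stateless protocol: it does not keep track of information from one client request to the next, so each request is handled as an independent service request.

Information is carried between requests by using URLs and cookies.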

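URL: Uniform Resource Locator

URI: Uniform Resource Identifier

URLs are a subset of URIs.

A URL has the following format:

prot_sch://net_loc/path;params?query#frag

prot_sch    network protocol or download scheme
net_loc     location of the server, possibly with user information, in the form user:passwd@host:port
path        slash (/) delimited path to a file or CGI application
params      optional parameters
query       key=value pairs joined by ampersands (&)
frag        fragment identifying a specific anchor within the document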
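As a rough illustration of the components listed above, the sketch below splits a made-up URL into its parts. It assumes Python 3, where the functions of the urlparse module live in urllib.parse; the host name, credentials, and query values are invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical URL exercising every component of
# prot_sch://net_loc/path;params?query#frag
url = "http://user:passwd@www.example.com:8080/cgi-bin/app.py;v=1?a=1&b=2#top"

parts = urlparse(url)

print(parts.scheme)    # prot_sch
print(parts.netloc)    # net_loc (user:passwd@host:port)
print(parts.path)      # path
print(parts.params)    # params
print(parts.query)     # query
print(parts.fragment)  # frag

# net_loc can be broken down further into its user and host parts
print(parts.username, parts.password, parts.hostname, parts.port)
```

The attribute names (scheme, netloc, and so on) are Python's own; the notes' names such as prot_sch are only descriptive labels.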

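20.2.2 The urlparse module

.urlparse(urlstr, defProSch=None, allowFrag=None)
    Parses urlstr into its separate components. defProSch supplies a default protocol scheme when urlstr does not specify one, and allowFrag determines whether a fragment part is allowed.

>>> urlparse.urlparse('http://www.python.org/doc/FAQ.html')
('http', 'www.python.org', '/doc/FAQ.html', '', '', '')

.urlunparse(urlup)
    Merges the tuple urlup of URL components into a single URL string.

.urljoin(baseurl, newurl, allowFrag=None)
    Joins the root of baseurl (its net_loc and the path up to the last '/') with newurl, which does not need to be a complete URL, and returns the resulting string.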
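The three functions can be tried together as in the following sketch. It assumes Python 3, where they are imported from urllib.parse rather than urlparse; the relative path passed to urljoin() is only an example value.

```python
from urllib.parse import urlparse, urlunparse, urljoin

base = "http://www.python.org/doc/FAQ.html"

# urlparse() splits the URL into a 6-tuple of components
parts = urlparse(base)
print(tuple(parts))

# urlunparse() is the inverse: it rebuilds the URL string from the tuple
rebuilt = urlunparse(parts)
print(rebuilt)

# urljoin() keeps the base up to its last '/', then appends the new part
joined = urljoin(base, "current/lib/lib.html")
print(joined)
```

Here FAQ.html is dropped from the base and replaced by the relative path, which is why newurl does not have to be a full URL.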

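20.2.3 The urllib module

Provides a high-level Web library that avoids the need to use lower-level modules such as httplib, ftplib, and gopherlib directly.

1. urllib.urlopen()

f = urllib.urlopen(urlstr, postQueryData=None)    # opens the URL and returns a file-like object

.read([bytes])    reads all of f, or bytes bytes from it
.readline()       reads one line from f
.readlines()      reads all lines from f and returns them as a list
.close()          closes the URL connection for f
.fileno()         returns the file handle (descriptor) of f
.info()           returns the MIME headers of f (which tell the client, for example, what kind of application should open the content)
.geturl()         returns the actual URL that was opened for f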
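A minimal sketch of the file-like interface, assuming Python 3, where urlopen() is found in urllib.request. To keep the example runnable without a network connection it opens a data: URL, which is an assumption made for this illustration; with an http:// URL the same methods would be used.

```python
from urllib.request import urlopen

# A data: URL carries its own content, so no network access is needed.
url = "data:text/plain,first%20line%0Asecond%20line%0A"

with urlopen(url) as f:
    first = f.readline()                        # one line, as bytes
    rest = f.readlines()                        # remaining lines as a list
    content_type = f.info().get_content_type()  # taken from the MIME headers
    opened = f.geturl()                         # URL that was actually opened

print(first, rest, content_type, opened)
```

Note that in Python 3 the data comes back as bytes, not str. The with statement closes the object, so no explicit close() call is needed here.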

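2. urllib.urlretrieve(urlstr, localfile=None, downloadStatusHook=None)

Downloads the complete document at urlstr to localfile, or to a temporary file if localfile is not given. If downloadStatusHook is given, it is called after each block is downloaded, with the number of blocks read so far, the block size, and the total size of the file. The call returns a tuple of the local file name and the MIME headers.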
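The status hook is easiest to understand by recording its calls. The sketch below assumes Python 3 (urllib.request.urlretrieve) and, so that it runs offline, downloads from a data: URL into a temporary directory; the hook name status_hook and the file name page.txt are made up for the example.

```python
import os
import tempfile
from urllib.request import urlretrieve

calls = []

def status_hook(blocks_so_far, block_size, total_size):
    """Record each progress report made by urlretrieve()."""
    calls.append((blocks_so_far, block_size, total_size))

# A data: URL stands in for a remote document (assumption for offline use).
url = "data:text/plain,hello%20from%20urlretrieve"

target = os.path.join(tempfile.mkdtemp(), "page.txt")
filename, headers = urlretrieve(url, target, status_hook)

with open(filename, "rb") as fh:
    body = fh.read()

print(filename, body, calls)
```

The hook is called once before the first block is read and once more after every block, so a progress display could compute a percentage from blocks_so_far * block_size and total_size.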

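3. urllib.quote(urldata, safe='/')

Encodes the characters of urldata that are not valid in a URL, leaving the characters listed in safe unencoded, so that the result is suitable for printing or for sending to a Web server.

urllib.quote_plus(urldata, safe='')

Same as quote(), except that spaces are encoded as plus signs (+) rather than %20. The default for safe is empty, so '/' is encoded as well.

4. urllib.unquote(urldata)

Decodes the encoded version of urldata.

urllib.unquote_plus(urldata)

Same as unquote(), but also converts plus signs back to spaces.

5. urllib.urlencode(dict)

Encodes the key-value pairs of dict into a valid string for a CGI request, encoding each key and value string with quote_plus().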
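The differences between these functions show up with a short string containing a space, a slash, and a question mark. This sketch assumes Python 3, where all five functions are imported from urllib.parse; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

raw = "a b/c?d"

quoted = quote(raw)      # '/' is safe by default, the space becomes %20
plus = quote_plus(raw)   # the space becomes '+', and '/' is encoded too
print(quoted, plus)

# The unquote functions reverse the encoding
print(unquote(quoted), unquote_plus(plus))

# urlencode() builds a query string from key-value pairs
query = urlencode({"q": "core python", "lang": "en&zh"})
print(query)
```

In the urlencode() result the ampersand inside a value is escaped, so it cannot be confused with the ampersand that separates the pairs.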

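20.2.4 The urllib2 module

Handles more complex requests, such as those that need a login name and password.

Approach 1: create a basic authentication handler (urllib2.HTTPBasicAuthHandler) and register a login and password for a base URL or realm.

Approach 2: simulate what happens when a user enters a user name and password at the browser prompt, that is, send the request with the appropriate Authorization header already attached.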

# urlopenAuth.py
import urllib2

LOGIN = 'wesley'                 # placeholder login name
PASSWD = "you'll never guess"
URL = 'http://localhost'

def handler_version(url):
    # Approach 1: install an opener that answers Basic-auth challenges
    from urlparse import urlparse as up
    hdlr = urllib2.HTTPBasicAuthHandler()
    hdlr.add_password('Archives', up(url)[1], LOGIN, PASSWD)
    opener = urllib2.build_opener(hdlr)
    urllib2.install_opener(opener)
    return url

def request_version(url):
    # Approach 2: attach the Authorization header to the request directly
    from base64 import encodestring
    req = urllib2.Request(url)
    # credentials are "login:password"; [:-1] strips the trailing newline
    b64str = encodestring('%s:%s' % (LOGIN, PASSWD))[:-1]
    req.add_header("Authorization", "Basic %s" % b64str)
    return req

for funcType in ('handler', 'request'):
    print '*** using %s' % funcType.upper()
    url = eval('%s_version' % funcType)(URL)
    f = urllib2.urlopen(url)
    print f.readline()
    f.close()
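For readers on Python 3, the same two approaches might look roughly like the sketch below. It assumes the Python 3 module layout (urllib.request and urllib.parse in place of urllib2 and urlparse), uses invented credentials, and only builds the opener and the request without contacting any server.

```python
import base64
from urllib.parse import urlparse
from urllib.request import HTTPBasicAuthHandler, Request, build_opener

LOGIN = "wesley"              # hypothetical credentials
PASSWD = "you'll never guess"
URL = "http://localhost"
REALM = "Archives"

def handler_version(url):
    """Approach 1: an opener that answers Basic-auth challenges for REALM."""
    handler = HTTPBasicAuthHandler()
    handler.add_password(REALM, urlparse(url).netloc, LOGIN, PASSWD)
    return build_opener(handler)

def request_version(url):
    """Approach 2: a request that carries the Authorization header itself."""
    credentials = ("%s:%s" % (LOGIN, PASSWD)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    req = Request(url)
    req.add_header("Authorization", "Basic %s" % token)
    return req

opener = handler_version(URL)
req = request_version(URL)
print(req.get_header("Authorization"))
```

With a server actually running at URL, opener.open(URL) or urllib.request.urlopen(req) would then send the request. The user name and password are joined with a colon before being Base64-encoded, which is what the Basic scheme expects.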

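20.3 Advanced Web Clients

Web crawlers are used to build indexes for search engines, to browse offline, to download and archive pages for historical or framework purposes, and to cache Web pages to save access time.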

# coding=UTF-8
# crawl.py

from sys import argv
from os import makedirs, unlink, sep
from os.path import dirname, exists, isdir, splitext
from string import replace, find, lower
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urlparse, urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO

class Retriever(object):                        # downloads Web pages

    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)          # local file name derived from the URL

    def filename(self, url, deffile='index.htm'):
        # build the local path and create its directory
        parsedurl = urlparse(url, 'http:', 0)
        path = parsedurl[1] + parsedurl[2]      # host name plus remote file path
        ext = splitext(path)
        if ext[1] == '':                        # no file extension, use the default file
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path)                    # local directory
        if sep != '/':
            ldir = replace(ldir, '/', sep)      # use the system's path separator
        if not isdir(ldir):                     # create the directory if it does not exist
            if exists(ldir): unlink(ldir)
            makedirs(ldir or 'undenied')
        return path

    def download(self):                         # fetch the page into self.file
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % self.url,)
        print "download done"
        return retval

    def parseAndGetLinks(self):                 # parse the HTML and return its links
        self.parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist

class Crawler(object):

    count = 0                                   # number of pages downloaded so far

    def __init__(self, url):
        self.q = [url]                          # queue of links to download
        self.seen = []                          # links already processed
        self.dom = urlparse(url)[1]             # domain of the starting URL

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()                   # (filename, headers), or an error tuple
        if retval[0].startswith('*** ERROR'):   # download failed, nothing to parse
            print retval, '... skipping parse'
            return
        Crawler.count += 1
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)                   # record the page as processed

        links = r.parseAndGetLinks()
        for eachLink in links:
            if eachLink[:4] != 'http' and find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)   # make relative links absolute
            print '* ', eachLink,

            if find(lower(eachLink), 'mailto:') != -1:   # skip e-mail links
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:
                if find(eachLink, self.dom) == -1:       # skip links outside the domain
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)          # new link, add it to the queue
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'
            else:
                print '... discarded, already processed'

    def go(self):
        while self.q:                           # keep crawling until the queue is empty
            url = self.q.pop()                  # take the next URL off the queue
            self.getPage(url)                   # download it and collect its links

def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter starting URL: ')   # e.g. http://www.baidu.com/index.html
        except (KeyboardInterrupt, EOFError):
            url = ''
    if not url: return
    robot = Crawler(url)                        # entry point, initializes the queue
    robot.go()

if __name__ == "__main__":
    main()                                      # start reading at main(), then follow go()
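The htmllib, formatter, and cStringIO modules used by crawl.py do not exist in current Python 3 releases, so the link-handling step can be sketched with html.parser instead. This is not the script above ported line by line: the class AnchorCollector and the function links_to_queue are hypothetical names, the sample HTML is invented, and the domain check compares host names exactly rather than searching for the domain as a substring.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)

def links_to_queue(page_url, html):
    """Return absolute same-host links, skipping mailto: links."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    host = urlparse(page_url).netloc
    queue = []
    for link in parser.anchorlist:
        if link.lower().startswith("mailto:"):
            continue                         # e-mail link, nothing to crawl
        absolute = urljoin(page_url, link)   # make relative links absolute
        if urlparse(absolute).netloc == host and absolute not in queue:
            queue.append(absolute)
    return queue

sample = ('<a href="doc/faq.html">FAQ</a>'
          '<a href="mailto:me@example.com">mail</a>'
          '<a href="http://other.example.org/">elsewhere</a>'
          '<a href="/about.html">About</a>')
print(links_to_queue("http://www.example.com/index.html", sample))
```

Only the two links on www.example.com survive the filtering; the mailto link and the link to another host are dropped, mirroring the "discarded" branches in getPage().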

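20.4 CGI: Helping Web Servers Process Client Data

CGI programs differ from ordinary applications in their input, their output, and the way the user interacts with the computer.

The main class of the cgi module is FieldStorage. Once instantiated, it holds a set of key-value pairs, much like a dictionary. Each value can itself be a FieldStorage object, a MiniFieldStorage object, or a list of such objects.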
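The key-value view of form data, including the case where one key has several values, can be seen without a Web server. Because the cgi module is no longer shipped with recent Python 3 releases, this sketch swaps in urllib.parse.parse_qs, which is a different API from FieldStorage but shows the same idea; the form fields are invented.

```python
from urllib.parse import parse_qs

# Hypothetical form submission as it would arrive in a query string
query = "name=Ann&lang=python&lang=c"

form = parse_qs(query)
print(form)

# A key submitted once still maps to a one-element list
name = form["name"][0]
# A key submitted several times maps to all of its values
langs = form["lang"]
print(name, langs)
```

The repeated lang field corresponds to the situation in which a FieldStorage value is a list of objects rather than a single one.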

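20.5 Building CGI Applications

20.5.1 Setting Up a Web Server

Apache can be used.

Alternatively, you can start a simple Web server that handles CGI requests; it listens on port 8000 by default:

$ python -m CGIHTTPServer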

20.9 Related Modules

Web applications: cgi, cgitb, htmllib, HTMLParser, htmlentitydefs, Cookie, cookielib, webbrowser, sgmllib, robotparser, httplib
XML processing: xmllib, xml, xml.sax, xml.dom, xml.etree, xml.parsers.expat, xmlrpclib, SimpleXMLRPCServer, DocXMLRPCServer
Web servers: BaseHTTPServer, SimpleHTTPServer, CGIHTTPServer, wsgiref
Third-party packages: HTMLgen, BeautifulSoup
Mail and MIME: poplib, imaplib, email, mailbox, mailcap, mimetools, mimetypes, MimeWriter, multifile, quopri, rfc822, smtplib, base64, binascii, binhex, uu
Internet protocols: httplib, ftplib, gopherlib, telnetlib, nntplib
