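WEB Data Mining (4): Data Collection
I wrote a data collection program some time ago. Recently I dug it out and refactored the code, and it still has plenty of room for improvement.
By HTTP submission method, web data collection splits into GET requests and POST requests (HTTP defines other methods as well, but browsers do not currently support them for form submission). To handle both, I implemented a request-parameter wrapper class for each method by extending an abstract parent class. The POST wrapper class additionally carries the POST parameters, stored in a Map member variable.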
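A minimal sketch of that class hierarchy might look like the following. The names Param, GetParam and PostParam appear in the code later in this post, but the fields and accessors here (url, getUrl, isPost, getFormFields) and the example.com address in the usage note are assumptions for illustration, not the original implementation.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Abstract parent: what every request needs regardless of HTTP method.
abstract class Param {
    private final String url;

    protected Param(String url) {
        this.url = url;
    }

    public String getUrl() {
        return url;
    }

    // Hypothetical helper so the fetching code can pick the HTTP method.
    public abstract boolean isPost();
}

// GET request: the URL alone describes the request.
class GetParam extends Param {
    public GetParam(String url) {
        super(url);
    }

    @Override
    public boolean isPost() {
        return false;
    }
}

// POST request: adds the form fields, kept in a Map member variable.
class PostParam extends Param {
    private final Map<String, String> formFields;

    public PostParam(String url, Map<String, String> formFields) {
        super(url);
        // Defensive copy that preserves insertion order.
        this.formFields = new LinkedHashMap<>(formFields);
    }

    public Map<String, String> getFormFields() {
        return Collections.unmodifiableMap(formFields);
    }

    @Override
    public boolean isPost() {
        return true;
    }
}
```

With this shape, the fetching code can hold a plain Param reference, for example `new GetParam("http://example.com/list?page=1")`, and only look at the form fields when isPost() returns true.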
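Originally, for multi-page requests against a given site or site section, I built the whole collection of request-parameter objects in one go, and then iterated over that collection to fetch the web data when executing the HTTP requests.
Later I found that pre-initializing the collection this way is slow when the page count is large, for example thousands or tens of thousands of list pages, and the overall performance is not ideal either.
For this scenario I decided to refactor with the Iterator pattern, so that a request's parameter object is constructed only when that request is about to be submitted.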
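The difference between the two approaches can be sketched with plain URL strings. This is an illustration only: the class PageUrls and its methods are hypothetical, and it uses the standard java.util.Iterator rather than the interfaces shown later. Only the "{*}" page placeholder is taken from the code in this post.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class PageUrls {
    // Eager: every URL is built and held in memory before the first request.
    static List<String> buildAll(String template, int firstPage, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(template.replace("{*}", String.valueOf(page)));
        }
        return urls;
    }

    // Lazy: each URL is built only when the caller asks for it.
    static Iterator<String> lazily(final String template, final int firstPage, final int lastPage) {
        return new Iterator<String>() {
            private int page = firstPage;

            @Override
            public boolean hasNext() {
                return page <= lastPage;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return template.replace("{*}", String.valueOf(page++));
            }
        };
    }
}
```

The lazy version keeps only a page counter as state, so its startup cost does not grow with the number of list pages.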
The prototype of the Iterator pattern is as follows:
public interface NodeIterator {
    /**
     * Check if more nodes are available.
     * @return <code>true</code> if a call to <code>nextHTMLNode()</code> will succeed.
     */
    public boolean hasMoreNodes() throws ParserException;

    /**
     * Get the next node.
     * @return The next node in the HTML stream, or null if there are no more nodes.
     */
    public Node nextNode() throws ParserException;
}
By implementing this interface, the returned objects are constructed one at a time instead of being pre-initialized into a List. A reference implementation follows:
public class IteratorImpl implements NodeIterator
{
    Lexer mLexer;
    ParserFeedback mFeedback;
    Cursor mCursor;

    public IteratorImpl (Lexer lexer, ParserFeedback fb)
    {
        mLexer = lexer;
        mFeedback = fb;
        mCursor = new Cursor (mLexer.getPage (), 0);
    }

    /**
     * Check if more nodes are available.
     * @return <code>true</code> if a call to <code>nextNode()</code> will succeed.
     */
    public boolean hasMoreNodes() throws ParserException
    {
        boolean ret;
        mCursor.setPosition (mLexer.getPosition ());
        ret = Page.EOF != mLexer.getPage ().getCharacter (mCursor); // more characters?
        return (ret);
    }

    /**
     * Get the next node.
     * @return The next node in the HTML stream, or null if there are no more nodes.
     * @exception ParserException If an unrecoverable error occurs.
     */
    public Node nextNode () throws ParserException
    {
        Tag tag;
        Scanner scanner;
        NodeList stack;
        Node ret;
        try
        {
            ret = mLexer.nextNode ();
            if (null != ret)
            {
                // kick off recursion for the top level node
                if (ret instanceof Tag)
                {
                    tag = (Tag)ret;
                    if (!tag.isEndTag ())
                    {
                        // now recurse if there is a scanner for this type of tag
                        scanner = tag.getThisScanner ();
                        if (null != scanner)
                        {
                            stack = new NodeList ();
                            ret = scanner.scan (tag, mLexer, stack);
                        }
                    }
                }
            }
        }
        catch (ParserException pe)
        {
            throw pe; // no need to wrap an existing ParserException
        }
        catch (Exception e)
        {
            StringBuffer msgBuffer = new StringBuffer ();
            msgBuffer.append ("Unexpected Exception occurred while reading ");
            msgBuffer.append (mLexer.getPage ().getUrl ());
            msgBuffer.append (", in nextNode");
            // TODO: appendLineDetails (msgBuffer);
            ParserException ex = new ParserException (msgBuffer.toString (), e);
            mFeedback.error (msgBuffer.toString (), ex);
            throw ex;
        }
        return (ret);
    }
}
The code above comes from the source of the htmlparser component. It constructs Node objects by advancing the current cursor.
Following the same approach, I first declared an interface:
public interface ParamIterator
{
    public boolean hasMoreParams();
    public Param nextParam();
}
The concrete implementation is as follows (it is an inner class, i.e. an intrinsic iterator):
private class ConcreteIterator implements ParamIterator
{
    private int currentIndex=0;
    private int start = 0;
    private int end = 0;
    private int step = 0;
    // Single-page links, separated by whitespace
    private StringTokenizer st = new StringTokenizer(WebCate.this.single_links.trim());
    // Paging URL expression; "{*}" marks where the page number goes
    private String urlexp=WebCate.this.expression.trim();
    public ConcreteIterator()
    {
        // Begin parsing the paging expression
        if(StringUtils.hasLength(urlexp))
        {
            // Begin parsing the paging parameter, written as "start,end" or "start,end:step"
            //initpageparam(this.pageparam,start,end,step);
            String pageparamstr=WebCate.this.pageparam.trim();
            if(StringUtils.hasLength(pageparamstr))
            {
                if(pageparamstr.indexOf(",")>-1)
                {
                    String[] arr=pageparamstr.split(",");
                    if(arr.length==2)
                    {
                        start=Integer.valueOf(arr[0]);
                        String endstr=arr[1];
                        // The step defaults to 1 when no ":step" suffix is given
                        step=1;
                        if(endstr.contains(":"))
                        {
                            String[] arr2=endstr.split(":");
                            end=Integer.valueOf(arr2[0]);
                            step=Integer.valueOf(arr2[1]);
                        }
                        else
                        {
                            end=Integer.valueOf(endstr);
                        }
                    }
                }
            }
        }
        currentIndex=start;
        // End parsing the paging parameter
    }
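    @Override
    public boolean hasMoreParams() {
        // Single-page links are only consumed for GET requests (httpmethod==0)
        if(WebCate.this.httpmethod==0&&st.hasMoreTokens())
        {
            return true;
        }
        // Without a paging expression there are no paged requests to build
        if(!StringUtils.hasLength(urlexp))
        {
            return false;
        }
        // Pages remain until currentIndex passes "end" in the direction of "step"
        if(step>0)
        {
            return currentIndex<=end;
        }
        if(step<0)
        {
            return currentIndex>=end;
        }
        return false;
    }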
    @Override
    public Param nextParam() {
        // GET requests (httpmethod==0) serve the single-page links first
        if(WebCate.this.httpmethod==0&&st.hasMoreTokens())
        {
            // Return early so the page counter does not advance while single links are served
            return new GetParam(st.nextToken().trim());
        }
        Param param=null;
        if(StringUtils.hasLength(urlexp))
        {
            urlexp=transfer(urlexp,currentIndex);
            // The current page is valid until it passes "end" in the direction of "step"
            boolean inRange=(step>0&&currentIndex<=end)||(step<0&&currentIndex>=end);
            if(inRange)
            {
                if(WebCate.this.httpmethod==0)
                {
                    // GET: substitute the page number into the URL expression
                    param=new GetParam(urlexp.replace("{*}", String.valueOf(currentIndex)));
                }
                else
                {
                    // POST: the page number goes into the form parameters instead
                    param=new PostParam(urlexp,buildmap(WebCate.this.postparam.trim(),currentIndex));
                }
            }
            currentIndex=currentIndex+step;
        }
        return param;
    }
}
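The paging parameter parsed in the constructor above has the form "start,end" or "start,end:step". As a hedged sketch, that parsing could be isolated into a small value class that is testable on its own. PageRange, parse and stillInRange are hypothetical names that do not appear in the original program, and unlike the original, this version rejects malformed input instead of silently leaving the values at 0.

```java
class PageRange {
    final int start;
    final int end;
    final int step;

    private PageRange(int start, int end, int step) {
        this.start = start;
        this.end = end;
        this.step = step;
    }

    // Parses "start,end" (step defaults to 1) or "start,end:step".
    static PageRange parse(String text) {
        String[] parts = text.trim().split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "expected \"start,end\" or \"start,end:step\": " + text);
        }
        int start = Integer.parseInt(parts[0].trim());
        String[] endAndStep = parts[1].split(":");
        if (endAndStep.length > 2) {
            throw new IllegalArgumentException("too many ':' in: " + text);
        }
        int end = Integer.parseInt(endAndStep[0].trim());
        int step = endAndStep.length == 2 ? Integer.parseInt(endAndStep[1].trim()) : 1;
        if (step == 0) {
            throw new IllegalArgumentException("step must not be zero");
        }
        return new PageRange(start, end, step);
    }

    // True until the page passes "end" in the direction of "step".
    boolean stillInRange(int page) {
        return step > 0 ? page <= end : page >= end;
    }
}
```

A negative step counts downward, so `PageRange.parse("50,1:-5")` describes pages 50, 45, 40 and so on down to the last value that is not below 1.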
The parameter object (Param) for the next request is obtained by advancing the current index (int currentIndex).
The enclosing class then returns the iterator object:
public ParamIterator elements()
{
    return new ConcreteIterator();
}
When executing the HTTP requests, we can then obtain the Param request-parameter objects by iterating.
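A minimal sketch of such a loop is shown below. To stay self-contained it declares reduced stand-ins for Param and ParamIterator as nested types. The Fetcher interface, the crawl method and the example.com URLs are assumptions for illustration, and the actual HTTP call is left out.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlLoopSketch {
    // Stand-ins for the article's types, reduced to what the loop needs.
    static class Param {
        final String url;

        Param(String url) {
            this.url = url;
        }
    }

    interface ParamIterator {
        boolean hasMoreParams();
        Param nextParam();
    }

    // Hypothetical fetch step; a real crawler would issue the HTTP request here.
    interface Fetcher {
        void fetch(Param param);
    }

    // Pulls one Param at a time, so request objects are created only on demand.
    static int crawl(ParamIterator params, Fetcher fetcher) {
        int submitted = 0;
        while (params.hasMoreParams()) {
            Param param = params.nextParam();
            if (param == null) {
                continue; // nextParam() may return null, as in the implementation above
            }
            fetcher.fetch(param);
            submitted++;
        }
        return submitted;
    }

    public static void main(String[] args) {
        ParamIterator threePages = new ParamIterator() {
            private int page = 1;

            public boolean hasMoreParams() {
                return page <= 3;
            }

            public Param nextParam() {
                return new Param("http://example.com/list_" + (page++) + ".html");
            }
        };
        final List<String> seen = new ArrayList<>();
        int count = crawl(threePages, new Fetcher() {
            public void fetch(Param p) {
                seen.add(p.url);
            }
        });
        System.out.println(count + " requests: " + seen);
    }
}
```

The null check relies on the iterator advancing its own position even when it returns null, which is how nextParam() above behaves.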
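The final collection result: [screenshot in the original post]
The source page it was collected from: [screenshot in the original post]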
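Erratum:
The sentence "The parameter object (Param) for the next request is obtained by advancing the current index (int currentIndex)"
should say that the current page number is advanced; currentPage would be a better name than currentIndex.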
---------------------------------------------------------------------------
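This WEB data mining series is my own original work.
Author: 刺猬的温驯 (博客园)
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/05/27/3100883.html
Copyright of this article belongs to the author. Reproduction or commercial distribution without the author's consent is strictly prohibited, and violations will be pursued legally.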