當前位置：首頁 >

网络爬虫：URL去重策略之布隆过滤器(BloomFilter)的使用

發布時間：2025/3/20 59 豆豆

生活随笔收集整理的這篇文章主要介紹了网络爬虫：URL去重策略之布隆过滤器(BloomFilter)的使用小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

前言：

? 最近被網絡爬蟲中的去重策略所困擾。使用一些其他的“理想”的去重策略，不過在運行過程中總是會不太聽話。不過當我發現了BloomFilter這個東西的時候，的確，這里是我目前找到的最靠譜的一種方法。

? 如果，你說URL去重嘛，有什么難的。那么你可以看完下面的一些問題再說這句話。

關于BloomFilter：

? Bloom filter 是由 Howard Bloom 在 1970 年提出的二進制向量數據結構，它具有很好的空間和時間效率，被用來檢測一個元素是不是集合中的一個成員。如果檢測結果為是，該元素不一定在集合中；但如果檢測結果為否，該元素一定不在集合中。因此Bloom filter具有100%的召回率。這樣每個檢測請求返回有“在集合內（可能錯誤）”和“不在集合內（絕對不在集合內）”兩種情況，可見 Bloom filter 是犧牲了正確率以節省空間。

以前的去重策略：

1.想到過的URL去重策略

在數據庫中創建字段的UNIQUE屬性
在數據庫中創建一個唯一的索引，在插入數據之前檢查待插入的數據是否存在
使用Set或HashSet保存數據，確保唯一
使用Map或是一個定長數組記錄某一個URL是否被訪問過

2.以上去重策略存在的問題

? （1）對于在數據庫中創建字段的UNIQUE屬性，的確是可以避免一些重復性操作。不過在多次MySQL報錯之后，程序可能會直接崩潰，因此這種方式不可取

? （2）如果我們要在每一次插入數據之前都去檢查待插入的數據是否存在，這樣勢必會影響程序的效率

? （3）這種方式是我在第一次嘗試的時候使用的，放棄繼續使用的原因是：OOM。當然，這里并不是程序的內存泄露，而程序中真的有這么多內存需要被占用（因為從待訪問隊列中解析出來的URL要遠比它本身要多得多）

? （4）在前幾篇博客中，我就有提到使用Map對象來保存URL的訪問信息。不過，現在我要否定它。因為，在長時間運行之后，Map也是會占用大量的內存。只不過，會比第3種方式要小一些。下面是使用Map<Integer,?Integer>去重，在長時間運行中內存的使用情況：

BloomFilter的使用：

1.一般情況下BloomFilter使用內存的情況：

2.爬蟲程序中BloomFilter使用內存的情況（已運行4小時）：

3.程序結構圖

4.BloomFilter的一般使用

? 此處關于BloomFilter的Java代碼部分，參考于：http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html

? 如果你看了上面的文章，相信你已經了解到布隆過濾器的空間復雜度是S(n)=O(n)。關于這一點，相信你已經從上面的內存使用情況中了解到了這一點。那么以下會是一些相關的Java代碼展示。而在查重過程也很有效率，時間復雜度是T(n)=O(1)。

BloomFilter.java

import java.util.BitSet;public class BloomFilter {/* BitSet初始分配2^24個bit */private static final int DEFAULT_SIZE = 1 << 25;/* 不同哈希函數的種子，一般應取質數 */private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };private BitSet bits = new BitSet(DEFAULT_SIZE);/* 哈希函數對象 */private SimpleHash[] func = new SimpleHash[seeds.length];public BloomFilter() {for (int i = 0; i < seeds.length; i++) {func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);}}// 將字符串標記到bits中public void add(String value) {for (SimpleHash f : func) {bits.set(f.hash(value), true);}}// 判斷字符串是否已經被bits標記public boolean contains(String value) {if (value == null) {return false;}boolean ret = true;for (SimpleHash f : func) {ret = ret && bits.get(f.hash(value));}return ret;}/* 哈希函數類 */public static class SimpleHash {private int cap;private int seed;public SimpleHash(int cap, int seed) {this.cap = cap;this.seed = seed;}// hash函數，采用簡單的加權和hashpublic int hash(String value) {int result = 0;int len = value.length();for (int i = 0; i < len; i++) {result = seed * result + value.charAt(i);}return (cap - 1) & result;}} } ????
Test.java

public class Test {private final String[] URLS = {"http://www.csdn.net/","http://www.baidu.com/","http://www.google.com.hk","http://www.cnblogs.com/","http://www.zhihu.com/","https://www.shiyanlou.com/","http://www.google.com.hk","https://www.shiyanlou.com/","http://www.csdn.net/"};private void testBloomFilter() {BloomFilter filter = new BloomFilter();for (int i = 0; i < URLS.length; i++) {if (filter.contains(URLS[i])) {System.out.println("contain: " + URLS[i]);continue;}filter.add(URLS[i]);}}public static void main(String[] args) {Test t = new Test();t.testBloomFilter();} }

???????????????????

5.BloomFilter在爬蟲中過濾重復的URL

public class ParserRunner implements Runnable {private SpiderSet mResultSet = null;private WebInfoModel mInfoModel = null;private int mIndex;private final boolean DEBUG = false;private SpiderBloomFilter mFlagBloomFilter = null;public ParserRunner(SpiderSet set, WebInfoModel model, int index, SpiderBloomFilter filter) {mResultSet = set;mInfoModel = model;mIndex = index;mFlagBloomFilter = filter;}@Overridepublic void run() {long t = System.currentTimeMillis();SpiderQueue tmpQueue = new SpiderQueue();PythonUtils.fillAddressQueueByPython(tmpQueue, mInfoModel.getAddress(), mInfoModel.getLevel());WebInfoModel model = null;while (!tmpQueue.isQueueEmpty()) {model = tmpQueue.poll();if (model == null || mFlagBloomFilter.contains(model.getAddress())) {continue;}mResultSet.add(model);mFlagBloomFilter.add(model.getAddress());}tmpQueue = null;model = null;System.err.println("Thread-" + mIndex + ", UsedTime-" + (System.currentTimeMillis() - t) + ", SetSize = " + mResultSet.size());t = 0;}@SuppressWarnings("unused")private void sleep(long millis) {try {Thread.sleep(millis);} catch (InterruptedException e) {e.printStackTrace();}} } ? 如果你看過我之前的博客，那么上面的這一段代碼相信你會比較熟悉。

? 這段代碼的功能是：生產者。從待訪問隊列中消費一個model，然后調用Python生產鏈接的列表Queue，并將生成的列表Queue offer到結果SpiderSet中。

總結

以上是生活随笔為你收集整理的网络爬虫：URL去重策略之布隆过滤器(BloomFilter)的使用的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：网络爬虫：采用“负载均衡”策略来优化网络
下一篇：网站分类前导：获取网站标题和描述及对相关