Scraping Website Data into a Database: A Detailed Walkthrough, with Illustrations

Published: 2024/1/18

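1. Analyzing the requirements

1.1 Requirements

  • The requirement: scrape all the data from the pages of the site below and store it in a MySQL database.
  • The entry page is: the page to scrape
    • Select the year, month and day
    • Select the hour of birth and the sex (male or female):
    • After the year, month, day, hour and sex are selected, the site jumps to a result page (this is the page we want to scrape):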

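1.2 Feasibility analysis

  • Clicking through the result pages for various years, months, days, hours and both sexes reveals the following:
  • The page data is static: it is already present in the returned HTML rather than loaded from a backend at runtime (this determines which technologies can be used)
  • The page data has a fixed key:value structure, for example 生肖: 牛 (zodiac: Ox) and 星座: 雙魚座 (constellation: Pisces). Every page has the same keys; only the values change with the year, month, day, hour and sex (this informs how the database columns are defined)
  • The page paths follow a pattern. For example, for a female born at hour 0 on January 1, 1950, the path is http://www.8gua.cn/huashengsuanming/1950/w-1950-1-1-0.html, so the path has these properties:
  • The path has the form url/year/sex-year-month-day-hour.html, where sex is m for male and w for female;
  • The selectable birth hours are 0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 and 23 (the values of hourList in section 4.3);
  • Each selectable hour corresponds one-to-one to a page path;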
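
The path pattern above can be sketched as a small URL builder. This is a minimal illustration only: `buildPageUrl` and `BASE_URL` are hypothetical names introduced here, not the article's own `setUrls` method (which is not shown), and it builds only the single URL form given in the example.

```kotlin
// Hypothetical helper (not from the article): builds the URL for one
// combination of birth date, hour slot and sex.
const val BASE_URL = "http://www.8gua.cn/huashengsuanming"

fun buildPageUrl(year: Int, month: Int, day: Int, hour: Int, sex: String): String {
    require(sex == "m" || sex == "w") { "sex must be \"m\" (male) or \"w\" (female)" }
    // Month, day and hour are not zero-padded in the article's example URL.
    return "$BASE_URL/$year/$sex-$year-$month-$day-$hour.html"
}
```

For instance, `buildPageUrl(1950, 1, 1, 0, "w")` reproduces the example path quoted above.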

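2. Choosing the technology

  • To parse static pages we can use Jsoup, which loads the elements of a page into a Document that we can then query;
    • What is Jsoup? jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM, CSS and jQuery-like methods.
    • With Jsoup we can fetch a given page, grab its content and turn it into text (scraping the text)
  • To extract the keys and values we can use regular expressions that cut out the data matching a pattern (extracting from the text)
  • Along the way we also need:
  • A method that returns the number of days in a given year and month (date handling)
  • A method that returns the list of integers in a given range (for example the years 1950 to 2019, or the months 1 to 12)

When implementing a feature, first work out piece by piece what each part needs. Going from the parts to the whole is how a large feature gets delivered.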
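
As a rough illustration of the regular-expression step, the sketch below pulls several key:value pairs out of a piece of plain text. The function name `extractFields` and the sample text are assumptions made for this example; it assumes each value sits between "key:" and the next "。".

```kotlin
// Illustrative sketch: extract "key:value。" pairs from plain text with a regex.
// extractFields is a hypothetical name, not a function from the article.
fun extractFields(text: String, keys: List<String>): Map<String, String> =
    keys.associateWith { key ->
        // Regex.escape makes the key and the terminator match literally;
        // (.*?) lazily captures the value between them.
        val regex = Regex(Regex.escape("$key:") + "(.*?)" + Regex.escape("。"))
        // A key that does not occur maps to an empty string.
        regex.find(text)?.groupValues?.get(1) ?: ""
    }
```

Called as `extractFields("生肖:牛。星座:雙魚座。", listOf("生肖", "星座"))`, it yields one map entry per key, which lines up with the idea of one database column per key.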

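3. Business flow

  • Flowchart:
  • Implementation summary: iterate over year, month, day and hour, then over male and female. This yields the data for both sexes for each of the fixed hour slots on every day from 1950 to 2019. The logic goes in the innermost loop:
  • Build the URL from the year, month, day, hour, sex and page number
  • Fetch the data at that URL with Jsoup, convert the main content to text and filter out what is not needed
  • Extract the data with regular expressions into key:value form;
  • Wrap the result in an object and store it in the database;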
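
To get a feel for the size of the traversal described above, the combinations can be enumerated lazily and counted. This is a sketch under assumptions: `PageKey`, `HOUR_SLOTS` and `pageKeys` are names invented for the example, and `java.time.YearMonth` is used for month lengths.

```kotlin
import java.time.YearMonth

// One page to fetch: a birth date, an hour slot and a sex ("m" or "w").
data class PageKey(val year: Int, val month: Int, val day: Int, val hour: Int, val sex: String)

// The selectable hour slots.
val HOUR_SLOTS = listOf(0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23)

// Lazily enumerates every combination: year, month, day, hour, then male and female.
fun pageKeys(startYear: Int = 1950, endYear: Int = 2019): Sequence<PageKey> = sequence {
    for (year in startYear..endYear)
        for (month in 1..12)
            for (day in 1..YearMonth.of(year, month).lengthOfMonth())
                for (hour in HOUR_SLOTS)
                    for (sex in listOf("m", "w"))
                        yield(PageKey(year, month, day, hour, sex))
}
```

`pageKeys().count()` gives 664,742 for 1950 to 2019 (25,567 days × 13 hour slots × 2 sexes), which is the number of result pages the job has to fetch and store.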

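4. Implementation

4.1 Shared helpers and dependencies

  • The sample code in this article is written in Kotlin, which differs little from Java.
  • Add the Jsoup dependency: compile group: 'org.jsoup', name: 'jsoup', version: '1.13.1'

    For a Java project, search for jsoup to find the corresponding Maven dependency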

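  • Get the number of days in a given year and month:

    import java.util.Calendar

    /**
     * Returns the number of days in the given month of the given year.
     */
    fun getDaysByYearMonth(year: Int, month: Int): Int {
        val a = Calendar.getInstance()
        a[Calendar.YEAR] = year
        a[Calendar.MONTH] = month - 1   // Calendar months are zero-based
        a[Calendar.DATE] = 1
        a.roll(Calendar.DATE, -1)       // roll back from the 1st to the last day of the same month
        return a[Calendar.DATE]
    }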
  • Get the text between two markers (markers included):

    import java.util.regex.Pattern

    /**
     * Returns the first substring of content that starts with pre and ends with post,
     * including pre and post themselves, or "" if nothing matches.
     */
    fun parseTextAll(content: String, pre: String, post: String): String {
        // Quote pre and post so they are matched literally; (.*?) lazily captures the text between them
        val pattern = Pattern.quote(pre) + "(.*?)" + Pattern.quote(post)
        val m = Pattern.compile(pattern).matcher(content)
        // group(0) is the whole match; group(1) would be only the text between the markers
        return if (m.find()) m.group(0) else ""
    }
  • Get the text between two markers (markers excluded):

    /**
     * Returns the first substring of content between pre and post,
     * with pre and post removed, or "" if nothing matches.
     */
    fun parseText(content: String, pre: String, post: String): String {
        // Quote pre and post so they are matched literally
        val pattern = Pattern.quote(pre) + "(.*?)" + Pattern.quote(post)
        val m = Pattern.compile(pattern).matcher(content)
        // Take the whole match and strip the markers; note that this also removes
        // any further occurrences of pre or post inside the matched text
        return if (m.find()) m.group(0).replace(pre, "").replace(post, "") else ""
    }
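  • Use Jsoup to fetch the given URLs and extract the page text:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import java.net.URL

    fun parseContent(urls: List<String>): String {
        val builder = StringBuilder()
        urls.forEach {
            try {
                // Fetch and parse the page with a 3-second timeout
                val document: Document = Jsoup.parse(URL(it), 3 * 1000)
                builder.append(document.getElementsByTag("p").text())
            } catch (e: Exception) {
                // Pages that fail to load or parse are skipped silently
            }
        }
        return builder.toString()
    }

    Here document.getElementsByTag("p").text() returns the combined text of all elements with the given tag name, in this case everything inside the <p> tags;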

  • Get the list of integers in a given range:

    fun getListByRange(startInt: Int, endInt: Int): List<Int> {
        // Inclusive at both ends, e.g. getListByRange(1, 12) gives 1 through 12
        return (startInt..endInt).toList()
    }
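
As a possible alternative to the Calendar-based helper, `java.time.YearMonth` (Java 8 and later) reports month lengths directly, leap years included. The names `daysInMonth` and `daysOf` below are assumptions for this sketch, not part of the article's code.

```kotlin
import java.time.YearMonth

// Number of days in the given month; YearMonth takes leap years into account.
fun daysInMonth(year: Int, month: Int): Int = YearMonth.of(year, month).lengthOfMonth()

// All day numbers of the given month, 1 through the last day.
fun daysOf(year: Int, month: Int): List<Int> = (1..daysInMonth(year, month)).toList()
```

This avoids the mutable Calendar instance and the zero-based month index, at the cost of requiring Java 8 or later.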

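4.2 Database and storage design

  • The database design is as follows:

    content_text stores the full text of the page. Because it is large and probably rarely used, the author leaves this column unset for now;

  • Configuration on the Java side:
    • Persistence framework: Mybatis-plus
    • Table name: professional_letter
    • The Mapper: interface ProfessionalLetterMapper : BaseMapper<ProfessionalLetterEntity>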
    • The Entity:

      @TableName("professional_letter")
      class ProfessionalLetterEntity {
          @ApiModelProperty("主鍵id") var id: Long? = 0L
          @ApiModelProperty("所屬年月日-時分秒,開始時間") var startTime: Date? = null
          @ApiModelProperty("所屬年月日-時分秒,結束時間") var endTime: Date? = null
          @ApiModelProperty("性別") var sex: Short? = null
          @ApiModelProperty("標題") var title: String? = null
          @ApiModelProperty("陽歷") var solarCalendar: String? = null
          @ApiModelProperty("農歷") var lunarCalendar: String? = null
          @ApiModelProperty("節氣") var solarTerms: String? = null
          @ApiModelProperty("星座") var constellation: String? = null
          @ApiModelProperty("十二生肖") var chineseZodiac: String? = null
          @ApiModelProperty("二十八星宿") var twentyEightNights: String? = null
          @ApiModelProperty("命主福元") var fortune: String? = null
          @ApiModelProperty("文本版內容") var contentText: String? = null
          @ApiModelProperty("json版本內容") var contentJson: String? = null
          @ApiModelProperty("胎元") var foetus: String? = null
          @ApiModelProperty("命宮") var mingGong: String? = null
          @ApiModelProperty("起大運周歲") var qiDaYun: String? = null
      }
    • The ContentJson class (holds all the parsed fields):

      import com.sino.hardware.common.JsonSerializable
      import io.swagger.annotations.ApiModelProperty

      class ContentJson : JsonSerializable() {
          @ApiModelProperty("陽歷") var solarCalendar: String? = null
          @ApiModelProperty("農歷") var lunarCalendar: String? = null
          @ApiModelProperty("節氣") var solarTerms: String? = null
          @ApiModelProperty("起大運周歲") var qiDaYun: String? = null
          @ApiModelProperty("星座") var constellation: String? = null
          @ApiModelProperty("十二生肖") var chineseZodiac: String? = null
          @ApiModelProperty("二十八星宿") var twentyEightNights: String? = null
          @ApiModelProperty("命主福元") var fortune: String? = null
          @ApiModelProperty("八字納音") var baZiNaYin: String? = null
          @ApiModelProperty("排大運") var paiDaYun: String? = null
          @ApiModelProperty("排流年") var paiLiuNian: String? = null
          @ApiModelProperty("胎元") var foetus: String? = null
          @ApiModelProperty("命宮") var mingGong: String? = null
          @ApiModelProperty("終身卦") var zhongShenGua: String? = null
          @ApiModelProperty("吉神兇煞") var jiShenXiongSha: String? = null
          @ApiModelProperty("吉神兇煞提示") var jiShenXiongShaTiShi: String? = null
          @ApiModelProperty("命局生克制化") var mingJuShengKeZhiHua: String? = null
          @ApiModelProperty("日主綜得分") var riZhuZhongDeiFen: String? = null
          @ApiModelProperty("日主綜得分提示") var riZHuZhongDeiFenTiShi: String? = null
          @ApiModelProperty("三命通會論斷") var sanMingTongHuiLunDuan: String? = null
          @ApiModelProperty("窮通寶鑒-調候用神參考") var qiongTongBaoJian: String? = null
          @ApiModelProperty("十神定位論斷") var shiShenDingWeiLunDuan: String? = null
          @ApiModelProperty("八字重量") var baZiZhongLiang: String? = null
          @ApiModelProperty("八字重量提示") var baZiZhongLiangTiShi: String? = null
          @ApiModelProperty("命宮寓意") var mingGongYuYi: String? = null
          @ApiModelProperty("性格特征") var xingGeTeZheng: String? = null
          @ApiModelProperty("性格特征提示") var xingGeTeZhengTiShi: String? = null
          @ApiModelProperty("職業財運") var zhiYeCaiYun: String? = null
          @ApiModelProperty("功名官運") var gongMingGuanYun: String? = null
          @ApiModelProperty("婚姻擇偶") var hunYingZeOu: String? = null
          @ApiModelProperty("配偶方向") var peiOuFangXiang: String? = null
          @ApiModelProperty("配偶方向提示:") var peiOuFangXiangTiShi: String? = null
          @ApiModelProperty("祖業遺產") var zuYeYiChan: String? = null
          @ApiModelProperty("體質健康") var tiZhiJianKang: String? = null
          @ApiModelProperty("體質健康提示") var tiZhiJianKangTiShi: String? = null
          @ApiModelProperty("有利選擇") var youLiXuanZe: String? = null
          @ApiModelProperty("流年") var liuNianMap: Map<String, String>? = null
          @ApiModelProperty("起大運運勢") var qiDaYunMap: Map<String, String>? = null
      }
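
The startTime and endTime columns hold the bounds of the selected hour slot (slots 0 and 23 span one hour, the odd-numbered slots in between span two). A minimal sketch of that mapping with `java.time` follows; `slotRange` is a hypothetical helper, and the article itself builds these values from formatted strings instead.

```kotlin
import java.time.LocalDate
import java.time.LocalDateTime

// Hypothetical helper: maps the starting hour of a slot to its start and end timestamps.
// Slots 0 and 23 cover one hour; every other (odd) slot covers two hours.
fun slotRange(date: LocalDate, hour: Int): Pair<LocalDateTime, LocalDateTime> {
    require(hour == 0 || (hour in 1..23 && hour % 2 == 1)) { "unknown hour slot: $hour" }
    val endHour = if (hour == 0 || hour == 23) hour else hour + 1
    return date.atTime(hour, 0, 0) to date.atTime(endHour, 59, 59)
}
```

For example, slot 1 on 1950-01-01 runs from 01:00:00 to 02:59:59 on that day.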

4.3 Core code

  • Inject the Mapper:

    @Autowired
    private lateinit var professionalLetterMapper: ProfessionalLetterMapper
  • The entry-point method, which defines the ranges to iterate over and calls steps 3 and 4 (the two methods that follow):

    @Async
    fun insertDB() {
        val yearList = getListByRange(1950, 2019)
        val monthList = getListByRange(1, 12)
        val hourList = listOf(0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23)
        // Iterate over every year
        yearList.forEach { year ->
            // Iterate over every month
            monthList.forEach { month ->
                // Number of days in this month of this year
                val dayMaxCount = getDaysByYearMonth(year, month)
                // Wrap the days in a list
                val dayList = getListByRange(1, dayMaxCount)
                // Fetch and store the data for every day and hour slot
                dayList.forEach { day ->
                    hourList.forEach { hour ->
                        insertDB(year, month, day, hour)
                    }
                }
            }
        }
        // Import finished: insert a marker row as the completion notice
        val entity = ProfessionalLetterEntity()
        entity.title = "導出完成"
        professionalLetterMapper.insert(entity)
    }
  • The method that stores the data:

    // Store one record per sex for the given date and hour slot
    private fun insertDB(year: Int, month: Int, day: Int, hour: Int) {
        val entityM = parseUrl2Content(year, month, day, hour, "m")
        professionalLetterMapper.insert(entityM)
        val entityW = parseUrl2Content(year, month, day, hour, "w")
        professionalLetterMapper.insert(entityW)
    }
  • The method that fetches the data for a URL, converts it to text with Jsoup, filters out unwanted content and wraps the result in the Entity to be stored:

    private fun parseUrl2Content(year: Int, month: Int, day: Int, hour: Int, sex: String): ProfessionalLetterEntity {
        // setUrls (not shown in this article) returns the page URLs for this combination
        val urls = setUrls(year, month, day, hour, sex)
        val entity = ProfessionalLetterEntity()
        // Page text with site branding and footer removed; each "【" starts a new line
        val content = parseContent(urls)
            .replace("華盛算命", "")
            .replace("www.8gua.cn", "")
            .replace("- ", "")
            .replace("Copyright ? 2014-2017華盛算命 www.8gua.cn", "")
            .replace("Copyright?2014-2017華盛算命 www.8gua.cn", "")
            .replace("【", "\n 【")
            .replace("Copyright?2014-2017", "")
            .replace("Copyright ? 2014-2017", "")
            .replace("聯系QQ:139238028", "")
            .replace("Copyright?2014-2017", "") + "\n"

        // Extract each field by the markers that surround it on the page
        val birthDay = parseText(content, "出生公歷:", "。")
        val birthDayNong = parseText(content, "出生農歷:", "。")
        val jieqi = parseText(content, "節氣:", "。")
        val qidayun = parseText(content, "起大運周歲:", "。")
        val xingzuo = parseText(content, "星座:", "。")
        val shengxiao = parseText(content, "生肖:", "。")
        val ershibaxiu = parseText(content, "二十八宿:", "。")
        val minzhufuyuan = parseText(content, "命主福元:", "。")
        val bazi = parseText(content, "。", "節氣:")
        val paidayun = parseText(content, "排大運:", "排流年:")
        val pailiunian = parseText(content, "排流年:", "※胎元:")
        val taiyuan = parseText(content, "※胎元:", "命宮:")
        val minggong = parseText(content, "命宮:", "終身卦:")
        val zhongshengua = parseText(content, "終身卦:", "吉神兇煞:")
        val jishenxiongsha = parseText(content, "吉神兇煞:", "☆星座:")
        val tishijishenxiongsha = parseTextAll(content, "※提示:神煞", "性質的作用。")
        val minjushengkezhihua = parseText(content, "命局生克制化:", "※日主綜合得分:")
        val rizongzhudeifen = parseText(content, "※日主綜合得分:", "※提示:這一步給出了整個八字最有價值的信息,")
        val tishirizongzhudeifen = parseTextAll(content, "※提示:這一步給出了整個八字最有價值的信息,", "利用這些數據,切記!")
        val sanmingtonghuilunduan = parseText(content, "三命通會論斷:", "窮通寶鑒-調侯用神參考: ")
        val qiongtongbaojian_diaotongyongshencankao = parseText(content, "窮通寶鑒-調侯用神參考:", "※提示:這里對命局生克關系有較好的論述,")
        val shishendingweilunduan = parseText(content, "十神定位論斷:", "※提示:分宮論斷,")
        val bazizhongliang = parseText(content, "\n", "※提示:這是一種神奇的斷命法,")
        val tishi_bazizhongliang = parseTextAll(content, "※提示:這是一種神奇的斷命法,", "命運輪廓,可參考。")
        val minggongyuyi = parseText(content, "※ 命宮寓意:", "※ 性格特征:")
        val xinggetezheng = parseText(content, "※ 性格特征:", "※提示:個人性格除稟受天賦外,")
        val tishi_xinggetezheng = parseTextAll(content, "※提示:個人性格除稟受天賦外,", "時代背景等。")
        val zhiyecaiyun = parseText(content, "職業財運:", "※提示:職業和財運密切像關")
        val gongmingguanyun = parseText(content, "功名官運:", "※提示:")
        val hunyingzeou = parseText(content, "※ 婚姻擇偶:", "\n")
        val peioufangxiang = parseText(content, "【配偶方向】", "※提示:看婚姻除看個人八字外,")
        val tishi_hunyingzeou = parseTextAll(content, "※提示:看婚姻除看個人八字外,", "不喜克害。")
        val zuyeyichan = parseText(content, "※ 祖業遺產:", "※ 家庭子女:")
        val tizhijiankang = parseText(content, "※ 體質健康:", "※提示:八字陰陽五行平衡,")
        val tishi_tizhijiankang = parseTextAll(content, "※提示:八字陰陽五行平衡,", "※ 有利選擇:")
        val youlixuanze = parseText(content, "※ 有利選擇:", "※ 未交大運前的運勢:")

        // Ages 1 to 82 for the yearly fortune entries
        val liunianyunshiList = (1..82).toList()

        // The eight 大運 steps: extract each one and remove it from the working copy of the text
        val qidayunCountList = listOf("一", "二", "三", "四", "五", "六", "七", "八")
        var contentNew = content
        val qiDaYunMap = mutableMapOf<String, String>()
        for (index in qidayunCountList) {
            val qidayunContent = parseText(contentNew, "第${index}步大運:", "\n")
            contentNew = contentNew.replace("${index}${qidayunContent}", "")
            qiDaYunMap.put(index, qidayunContent)
        }

        // Yearly fortune entries, keyed by age
        val liuNianMap = mutableMapOf<String, String>()
        liunianyunshiList.forEach {
            val liuNianContent = parseTextAll(contentNew, "【${it}歲流年:", "\n")
            liuNianMap.put("${it}", liuNianContent)
        }

        // Hour slot -> "start|end" time of day
        val map = mapOf(
            "0" to "00:00|00:59",
            "1" to "01:00|02:59",
            "3" to "03:00|04:59",
            "5" to "05:00|06:59",
            "7" to "07:00|08:59",
            "9" to "09:00|10:59",
            "11" to "11:00|12:59",
            "13" to "13:00|14:59",
            "15" to "15:00|16:59",
            "17" to "17:00|18:59",
            "19" to "19:00|20:59",
            "21" to "21:00|22:59",
            "23" to "23:00|23:59"
        )

        // Put the parsed fields into the JSON object
        val contentJson = ContentJson()
        contentJson.solarCalendar = birthDay
        contentJson.lunarCalendar = birthDayNong
        contentJson.solarTerms = jieqi
        contentJson.qiDaYun = qidayun
        contentJson.constellation = xingzuo
        contentJson.chineseZodiac = shengxiao
        contentJson.twentyEightNights = ershibaxiu
        contentJson.fortune = minzhufuyuan
        contentJson.baZiNaYin = bazi
        contentJson.paiDaYun = paidayun
        contentJson.paiLiuNian = pailiunian
        contentJson.foetus = taiyuan
        contentJson.mingGong = minggong
        contentJson.zhongShenGua = zhongshengua
        contentJson.jiShenXiongSha = jishenxiongsha
        contentJson.jiShenXiongShaTiShi = tishijishenxiongsha
        contentJson.mingJuShengKeZhiHua = minjushengkezhihua
        contentJson.riZhuZhongDeiFen = rizongzhudeifen
        contentJson.riZHuZhongDeiFenTiShi = tishirizongzhudeifen
        contentJson.sanMingTongHuiLunDuan = sanmingtonghuilunduan
        contentJson.qiongTongBaoJian = qiongtongbaojian_diaotongyongshencankao
        contentJson.shiShenDingWeiLunDuan = shishendingweilunduan
        contentJson.baZiZhongLiang = bazizhongliang
        contentJson.baZiZhongLiangTiShi = tishi_bazizhongliang
        contentJson.mingGongYuYi = minggongyuyi
        contentJson.xingGeTeZheng = xinggetezheng
        contentJson.xingGeTeZhengTiShi = tishi_xinggetezheng
        contentJson.zhiYeCaiYun = zhiyecaiyun
        contentJson.gongMingGuanYun = gongmingguanyun
        contentJson.hunYingZeOu = hunyingzeou
        contentJson.peiOuFangXiang = peioufangxiang
        contentJson.peiOuFangXiangTiShi = tishi_hunyingzeou
        contentJson.zuYeYiChan = zuyeyichan
        contentJson.tiZhiJianKang = tizhijiankang
        contentJson.tiZhiJianKangTiShi = tishi_tizhijiankang
        contentJson.youLiXuanZe = youlixuanze
        contentJson.liuNianMap = liuNianMap
        contentJson.qiDaYunMap = qiDaYunMap

        // Put the results into the entity
        val hourList = map.get("$hour")!!.split("|")
        val months = if (month < 10) "0$month" else "$month"
        val days = if (day < 10) "0$day" else "$day"
        // DateUtil is a project utility that is not shown in this article
        entity.startTime = DateUtil.parseStrToDate("$year-${months}-${days} ${hourList[0]}:00", "yyyy-MM-dd HH:mm:ss")
        entity.endTime = DateUtil.parseStrToDate("$year-${months}-${days} ${hourList[1]}:59", "yyyy-MM-dd HH:mm:ss")
        entity.sex = if (sex == "m") 0 else 1
        // Both branches of the sex prefix are empty strings in the published code
        entity.title = "八字詳批-${if (sex == "m") "" else ""}命公歷${year}年${months}月${days}日生于${hourList[0]}時~${hourList[1]}時的人"
        entity.solarCalendar = birthDay
        entity.lunarCalendar = birthDayNong
        entity.solarTerms = jieqi
        entity.constellation = xingzuo
        entity.chineseZodiac = shengxiao
        entity.twentyEightNights = ershibaxiu
        entity.fortune = minzhufuyuan
        entity.contentText = "" // not filled in for now
        entity.contentJson = contentJson.toJSON()
        entity.mingGong = minggong
        entity.foetus = taiyuan
        entity.qiDaYun = qidayun
        return entity
    }
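5. Notes after launch

  • With everything in place, start the application and it scrapes the pages and stores them automatically:
  • The rows are inserted successfully one after another:
  • Closing thoughts:
  • Because the data volume is large, the job can be run on a server;
  • Further optimization is possible, such as sharding across databases and tables, depending on actual needs later on
  • The work can be multi-threaded so that parts run at the same time (for example one thread per stage doing the inserts, making full use of the multiple cores of modern CPUs)
  • When solving a requirement, choose the technical approach that fits it; no single approach suits every requirement.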
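
One way to read the multi-threading suggestion is to split the work by year and run each year as a task on a fixed-size thread pool. The sketch below assumes that split; `runPerYear` is a hypothetical helper, and `processYear` is a placeholder for "scrape and insert one year", returning for example the number of rows stored.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Runs one task per year on a fixed-size pool and waits for all of them to finish.
// processYear is a placeholder for the real per-year scraping and inserting work.
fun runPerYear(years: IntRange, threads: Int, processYear: (Int) -> Int): Map<Int, Int> {
    val pool = Executors.newFixedThreadPool(threads)
    try {
        // Submit every year first so the tasks can run concurrently
        val futures = years.associateWith { year -> pool.submit(Callable { processYear(year) }) }
        // Then block until each result is available
        return futures.mapValues { (_, future) -> future.get() }
    } finally {
        pool.shutdown()
    }
}
```

A call such as `runPerYear(1950..2019, 4) { year -> /* scrape and insert */ 0 }` would keep four years in flight at a time; the pool size is a tuning choice that also determines how hard the target site and the database are hit.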
