How to use the browser console to scrape and save data in a file with JavaScript
by Praveen Dubey
A while back I had to crawl a site for links, and then use those page links to crawl data using Selenium or Puppeteer. The way content was set up on the site was a bit unusual, so I couldn't start directly with Selenium and Node. Unfortunately, the site also held a huge amount of data. I had to quickly come up with an approach to first crawl all the links and then pass them on for a detailed crawl of each page.
That's where I learned this cool trick with the browser Console API. You can use it on any website without much setup, as it's just JavaScript.
Let’s jump into the technical details.
High Level Overview
For crawling all the links on a page, I wrote a small piece of JS in the console. This JavaScript crawls all the links (it takes 1–2 hours, as it also handles pagination) and dumps a JSON file with all the crawled data. The thing to keep in mind is that the website needs to behave like a single-page application, so that it does not reload the page when you crawl more than one page. If the page does reload, your console code will be gone.
Medium does not refresh the page in some scenarios. For now, let's crawl a story and automatically save the scraped data to a file from the console once scraping is done.
But before we do that here’s a quick demo of the final execution.
1. Get the console object instance from the browser

    // Console API to clear console before logging new data
    console.API;
    if (typeof console._commandLineAPI !== 'undefined') {
        console.API = console._commandLineAPI; // Chrome
    } else if (typeof console._inspectorCommandLineAPI !== 'undefined') {
        console.API = console._inspectorCommandLineAPI; // Safari
    } else if (typeof console.clear !== 'undefined') {
        console.API = console;
    }

The code is simply trying to get the console object instance based on the user's current browser. You can skip these checks and directly assign the instance for your browser.
For example, if you are using Chrome, the code below should be sufficient.

    if (typeof console._commandLineAPI !== 'undefined') {
        console.API = console._commandLineAPI; // Chrome
    }

2. Defining the Junior helper function
I'll assume that you now have a Medium story open in your browser. Lines 6 to 12 define the DOM element attributes which can be used to extract the story title, clap count, user name, profile image URL, profile description and read time of the story, respectively.
These are the basic things I want to show for this story. You can add a few more elements, such as links from the story, all images, or embed links.
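
As a rough illustration of this kind of helper, here is a minimal sketch that maps field names to CSS selectors and reads the text of each matching element. The selectors, the SELECTORS map, and the function names textOf and scrapeStory are all hypothetical; Medium's real markup is different and changes over time, so you would need to inspect the page and substitute your own selectors.

```javascript
// Hypothetical selectors: inspect the page and replace these with real ones.
const SELECTORS = {
  title: 'h1',
  userName: 'a[rel="author"]',
  readTime: 'span.read-time',
};

// Return the trimmed text of the first element matching the selector, or null.
function textOf(root, selector) {
  const el = root.querySelector(selector);
  return el ? el.textContent.trim() : null;
}

// Build one record by looking up every configured field under the given root.
function scrapeStory(root, selectors = SELECTORS) {
  const story = {};
  for (const [field, selector] of Object.entries(selectors)) {
    story[field] = textOf(root, selector);
  }
  return story;
}

// In the browser console you would call: scrapeStory(document)
```

Taking the root element as a parameter keeps the helper reusable: you can pass document for the whole page or a single card element when a page lists several stories.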
3. Defining our Senior helper function: the beast
As we are crawling the page for different elements, we will save them in a collection. This collection will be passed to one of the main functions.
We have defined a function named console.save. Its task is to dump a CSV/JSON file with the data passed to it.
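
To make the two output formats concrete, here is a minimal sketch of how a collection of records could be turned into JSON or CSV text before it is written to a file. The helper names toJson and toCsv are hypothetical and are not taken from the original console.save; the CSV version assumes every record has the same keys as the first one.

```javascript
// Pretty-print any JSON-serializable value.
function toJson(data) {
  return JSON.stringify(data, null, 2);
}

// Quote a CSV cell only when it contains a comma, quote, or line break.
function escapeCell(value) {
  const text = value == null ? '' : String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Convert an array of flat objects to CSV, using the first row's keys as headers.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const lines = rows.map((row) => headers.map((h) => escapeCell(row[h])).join(','));
  return [headers.map(escapeCell).join(','), ...lines].join('\n');
}
```

JSON is the simpler choice for nested data; CSV only fits flat records, which is why the sketch does not try to handle nested objects.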
It creates a Blob Object with our data. A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format.
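
As a small sketch of that step, the snippet below wraps serialized data in a Blob using the standard Blob constructor. The function name makeJsonBlob and the sample records are assumptions made for illustration only.

```javascript
// Wrap JSON text in a Blob and label it with a JSON MIME type.
function makeJsonBlob(data) {
  const text = JSON.stringify(data, null, 2);
  return new Blob([text], { type: 'application/json' });
}

// Hypothetical sample data standing in for scraped records.
const sampleRecords = [{ title: 'Story A', claps: 10 }];
const sampleBlob = makeJsonBlob(sampleRecords);

// size is the byte length of the content; type is the MIME type given above.
console.log(sampleBlob.size, sampleBlob.type);
```

Because a Blob is immutable, any change to the data means building a new Blob rather than editing the existing one.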
The created Blob is attached to a link tag <a>, on which a click event is triggered.
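
The link-and-click step can be sketched roughly as follows, using URL.createObjectURL and the download attribute of the anchor element. This is not the original console.save implementation; the name downloadBlob and the optional doc parameter (which defaults to the page's document) are assumptions introduced so the sketch can also be exercised outside a browser.

```javascript
// Trigger a file download for a Blob by clicking a temporary <a> element.
// `doc` defaults to the page's document; it is a parameter only for testability.
function downloadBlob(blob, filename, doc = document) {
  const url = URL.createObjectURL(blob); // temporary blob: URL for the data
  const link = doc.createElement('a');
  link.href = url;
  link.download = filename; // suggests the file name for the download
  doc.body.appendChild(link);
  link.click();
  doc.body.removeChild(link);
  // Release the object URL shortly after the click has been dispatched.
  setTimeout(() => URL.revokeObjectURL(url), 0);
  return link;
}

// In the browser console, for example:
// downloadBlob(new Blob(['[1, 2, 3]'], { type: 'application/json' }), 'data.json');
```

Revoking the object URL afterwards avoids keeping the data alive in memory for the rest of the page's lifetime.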
Here is the quick demo of console.save with a small array passed as data.
Putting together all the pieces of the code, this is what we have:
Let’s execute our console.save() in the browser to save the data in a file. For this, you can go to a story on Medium and execute this code in the browser console.
I have shown a demo of extracting data from a single page, but the same code can be tweaked to crawl multiple stories from a publisher's home page. Take freeCodeCamp as an example: you can navigate from one story to another and come back (using the browser's back button) to the publisher home page without the page being refreshed.
Below is the bare minimum code you need to extract multiple stories from a publisher’s home page.
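
The original snippet is not reproduced here, so the following is only a sketch of the general shape such code might take: collect the unique story links from the home page, then handle them one at a time with a pause in between so the page has time to render. The selector, the function names collectLinks and visitEach, and the delay value are all assumptions; what the visit callback does (open the story, extract data, go back) depends entirely on the site.

```javascript
// Collect unique href values from all elements matching the selector.
function collectLinks(root, selector = 'a[href]') {
  const hrefs = Array.from(root.querySelectorAll(selector), (a) => a.href);
  return [...new Set(hrefs)];
}

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `visit` on each item in order, waiting between calls, and gather results.
async function visitEach(items, visit, waitMs = 1000) {
  const results = [];
  for (const item of items) {
    results.push(await visit(item));
    await sleep(waitMs);
  }
  return results;
}

// In the browser console, for example:
// const links = collectLinks(document, 'a[href*="/news/"]'); // hypothetical selector
// visitEach(links, async (href) => ({ href })).then(console.log);
```

Processing the links sequentially rather than in parallel matters here, because in-page navigation changes the one document your console code is running in.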
Let’s see the code in action for getting the profile description from multiple stories.
For any application of this type, once you have scraped the data, you can pass it to our console.save function and store it in a file.
The console.save function can be quickly attached to your console code and helps you dump the data to a file. I am not saying you have to use the console for scraping data, but sometimes it will be a much quicker approach, since we are all very familiar with working with the DOM using CSS selectors.
You can download the code from GitHub.
Thank you for reading this article! I hope it gave you a cool idea for scraping some data quickly without much setup. Hit the clap button if you enjoyed it! If you have any questions, send me an email (praveend806 [at] gmail [dot] com).

Resources to learn more about the Console:
Using the Console | Tools for Web Developers | Google Developers: Learn how to navigate the Chrome DevTools JavaScript Console. (developers.google.com)
Browser Console: The Browser Console is like the Web Console, but applied to the whole browser rather than a single content tab. (developer.mozilla.org)
Blob: A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a… (developer.mozilla.org)
Translated from: https://www.freecodecamp.org/news/how-to-use-the-browser-console-to-scrape-and-save-data-in-a-file-with-javascript-b40f4ded87ef/