
Calculating PageRanks with Apache Hadoop

Published: 2023/12/3


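I am currently following the Coursera course "Mining Massive Datasets". I have been interested in MapReduce and Apache Hadoop for some time, and with this course I hope to get more insight into when and how MapReduce can help solve real-world business problems (I described another way to do so here). The course focuses mainly on the theory behind the algorithms and less on the coding itself. The first week is about PageRank and how Google used it to rank pages. Luckily, there is a lot of material on this topic in combination with Hadoop. I ended up here and decided to take a closer look at this code.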

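What I did was take this code (fork it) and rewrite it a little. I created unit tests for the mappers and reducers as I described here. As a test case I used the example from the course. We have three web pages that link to each other and/or to themselves: A links to Y and M, Y links to A and to itself, and M links only to itself.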

This linking scheme should resolve to the following page ranks:

Y 7/33
A 5/33
M 21/33
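
These values can be reproduced with a small power iteration outside Hadoop. The sketch below is not part of the original project: the class name `PowerIterationSketch`, the matrix layout and the damping factor of 0.8 are assumptions. The link structure is taken from the test pages used in the mapper test (A links to Y and M, Y links to A and itself, M links to itself), and 0.8 is the damping value for which this iteration converges to 7/33, 5/33 and 21/33.

```java
import java.util.Arrays;

class PowerIterationSketch {
    // Column-stochastic link matrix in the order Y, A, M.
    // Entry [i][j] is the share of page j's rank that flows to page i.
    static final double[][] LINKS = {
        {0.5, 0.5, 0.0}, // Y receives half of Y and half of A
        {0.5, 0.0, 0.0}, // A receives half of Y
        {0.0, 0.5, 1.0}  // M receives half of A and all of M
    };

    // Repeatedly applies r = beta * M * r + (1 - beta) / n.
    static double[] pageRank(double[][] m, double beta, int iterations) {
        int n = m.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += m[i][j] * r[j];
                }
                next[i] = beta * sum + (1.0 - beta) / n;
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        double[] r = pageRank(LINKS, 0.8, 100); // 0.8 is an assumed damping factor
        System.out.printf("Y=%.4f A=%.4f M=%.4f%n", r[0], r[1], r[2]);
    }
}
```

Note that these ranks sum to 1, whereas the values in the Hadoop unit tests further down start from 1.0 per page and therefore sum to the number of pages.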

Since the MapReduce example code expects "Wiki page" XML as input, I created the following test set:

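The original page already explains very well how the job works as a whole, so I will only describe the unit tests I created. With the original explanation and my unit tests you should be able to work through the material and understand what happens.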

As described there, the whole job is divided into three parts:

Parsing

Calculating

Ordering

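In the parsing part, the raw XML is read, split into pages and mapped so that the output has the page as key and the pages it links to as values. The input for the unit test is therefore the three "Wiki" page XML fragments shown above, and the expected output is the title of each page paired with each page it links to. The unit test looks like this: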

package net.pascalalma.hadoop.job1;
...
public class WikiPageLinksMapperTest {

    MapDriver<LongWritable, Text, Text, Text> mapDriver;

    String testPageA = " <page>\n" +
            " <title>A</title>\n" +
            " ..." +
            " <text xml:space=\"preserve\" bytes=\"6523\">[[Y]] [[M]]</text>\n" +
            " </revision>";

    String testPageY = " <page>\n" +
            " <title>Y</title>\n" +
            " ..." +
            " <text xml:space=\"preserve\" bytes=\"6523\">[[A]] [[Y]]</text>\n" +
            " </revision>\n" +
            " </page>";

    String testPageM = " <page>\n" +
            " <title>M</title>\n" +
            " ..." +
            " <text xml:space=\"preserve\" bytes=\"6523\">[[M]]</text>\n" +
            " </revision>\n" +
            " </page>";

    @Before
    public void setUp() {
        WikiPageLinksMapper mapper = new WikiPageLinksMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        // One input record per wiki page
        mapDriver.withInput(new LongWritable(1), new Text(testPageA));
        mapDriver.withInput(new LongWritable(2), new Text(testPageM));
        mapDriver.withInput(new LongWritable(3), new Text(testPageY));

        // One output record per (page, linked page) pair
        mapDriver.withOutput(new Text("A"), new Text("Y"));
        mapDriver.withOutput(new Text("A"), new Text("M"));
        mapDriver.withOutput(new Text("Y"), new Text("A"));
        mapDriver.withOutput(new Text("Y"), new Text("Y"));
        mapDriver.withOutput(new Text("M"), new Text("M"));

        // false: the order of the output records is not checked
        mapDriver.runTest(false);
    }
}
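
To make the expected records in this test easier to follow, here is a minimal sketch of how a title and its `[[...]]` links could be pulled out of one page fragment. It is plain Java without Hadoop, and it is not the project's `WikiPageLinksMapper`: the class name `WikiLinkSketch` and both regular expressions are assumptions chosen to fit the test data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkSketch {
    // Assumed patterns: the title element and wiki links of the form [[Target]]
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    private static final Pattern LINK = Pattern.compile("\\[\\[(.+?)\\]\\]");

    // Returns the page title, or null when the fragment has no title element.
    static String title(String pageXml) {
        Matcher m = TITLE.matcher(pageXml);
        return m.find() ? m.group(1) : null;
    }

    // Returns all link targets in the order they appear in the text.
    static List<String> links(String pageXml) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(pageXml);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "<page>\n<title>A</title>\n"
                + "<text xml:space=\"preserve\">[[Y]] [[M]]</text>\n</page>";
        // prints: A -> [Y, M]
        System.out.println(title(page) + " -> " + links(page));
    }
}
```

A mapper built on this idea would emit one record per link, with the title as key and the link target as value, which is the shape the test above expects.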

The output of the mapper is the input for our reducer. The unit test for the reducer looks like this:

package net.pascalalma.hadoop.job1;
...
public class WikiLinksReducerTest {

    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        WikiLinksReducer reducer = new WikiLinksReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws IOException {
        List<Text> valuesA = new ArrayList<Text>();
        valuesA.add(new Text("M"));
        valuesA.add(new Text("Y"));
        reduceDriver.withInput(new Text("A"), valuesA);
        reduceDriver.withOutput(new Text("A"), new Text("1.0\tM,Y"));
        reduceDriver.runTest();
    }
}

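As the unit test shows, we expect the reducer to reduce its input to an "initial" page rank of 1.0, followed by a comma-separated list of all pages that the (key) page links to. This is the output of the parsing phase and serves as input for the "calculate" phase.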
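
The value format checked by the reducer test can be sketched in a few lines of plain Java. The class name `InitialRankSketch` and the method `reduce` are hypothetical and only mirror the expected test output; the real `WikiLinksReducer` works on Hadoop `Text` values.

```java
import java.util.Arrays;
import java.util.List;

class InitialRankSketch {
    // Every page starts with the same rank, as in the expected test output.
    static final String INITIAL_RANK = "1.0";

    // Builds "<initial rank><TAB><link1>,<link2>,..." for one page.
    static String reduce(List<String> links) {
        return INITIAL_RANK + "\t" + String.join(",", links);
    }

    public static void main(String[] args) {
        // Same input as valuesA in the reducer test
        System.out.println(reduce(Arrays.asList("M", "Y")));
    }
}
```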
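
In the calculate part, the incoming page ranks are recalculated to implement the "power iteration" method. This step is executed multiple times to obtain an acceptable page rank for the given set of pages. As said before, the output of the previous step is the input of this step, as we can see in the unit test for this mapper: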

package net.pascalalma.hadoop.job2;
...
public class RankCalculateMapperTest {

    MapDriver<LongWritable, Text, Text, Text> mapDriver;

    @Before
    public void setUp() {
        RankCalculateMapper mapper = new RankCalculateMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(1), new Text("A\t1.0\tM,Y"));
        mapDriver.withInput(new LongWritable(2), new Text("M\t1.0\tM"));
        mapDriver.withInput(new LongWritable(3), new Text("Y\t1.0\tY,A"));

        mapDriver.withOutput(new Text("M"), new Text("A\t1.0\t2"));
        mapDriver.withOutput(new Text("A"), new Text("Y\t1.0\t2"));
        mapDriver.withOutput(new Text("Y"), new Text("A\t1.0\t2"));
        mapDriver.withOutput(new Text("A"), new Text("|M,Y"));
        mapDriver.withOutput(new Text("M"), new Text("M\t1.0\t1"));
        mapDriver.withOutput(new Text("Y"), new Text("Y\t1.0\t2"));
        mapDriver.withOutput(new Text("A"), new Text("!"));
        mapDriver.withOutput(new Text("M"), new Text("|M"));
        mapDriver.withOutput(new Text("M"), new Text("!"));
        mapDriver.withOutput(new Text("Y"), new Text("|Y,A"));
        mapDriver.withOutput(new Text("Y"), new Text("!"));

        mapDriver.runTest(false);
    }
}
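
The records expected by this mapper test can be produced by logic along the following lines. This is a sketch in plain Java, not the project's `RankCalculateMapper`: the class name `RankMapSketch` is hypothetical, and the reading of the markers ("!" flags the page as existing, "|" carries the page's own link list) is an assumption inferred from the test data.

```java
import java.util.ArrayList;
import java.util.List;

class RankMapSketch {
    // Turns one input line "<page>\t<rank>\t<links>" into key/value pairs,
    // each returned as a two-element array {key, value}.
    static List<String[]> map(String line) {
        String[] parts = line.split("\t");
        String page = parts[0];
        String rank = parts[1];
        List<String[]> out = new ArrayList<>();

        // Assumed meaning: mark the page itself as an existing page
        out.add(new String[] {page, "!"});
        if (parts.length < 3 || parts[2].isEmpty()) {
            return out; // page without outgoing links
        }

        // Send "<source page>\t<rank>\t<number of links>" to every link target
        String[] targets = parts[2].split(",");
        for (String target : targets) {
            out.add(new String[] {target, page + "\t" + rank + "\t" + targets.length});
        }

        // Pass the original link list on so the reducer can write it out again
        out.add(new String[] {page, "|" + parts[2]});
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : map("A\t1.0\tM,Y")) {
            System.out.println(kv[0] + " => " + kv[1].replace("\t", " "));
        }
    }
}
```

For the line `A\t1.0\tM,Y` this yields four records: the "!" marker for A, one rank share each for M and Y, and the "|M,Y" link list for A, which are the A-related records in the test above.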

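The mapper output is explained on the source page. The "extra" records marked with "!" and "|" are needed in the reduce step for the calculation. The unit test for the reducer looks like this: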

package net.pascalalma.hadoop.job2;
...
public class RankCalculateReduceTest {

    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        RankCalculateReduce reducer = new RankCalculateReduce();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws IOException {
        List<Text> valuesM = new ArrayList<Text>();
        valuesM.add(new Text("A\t1.0\t2"));
        valuesM.add(new Text("M\t1.0\t1"));
        valuesM.add(new Text("|M"));
        valuesM.add(new Text("!"));
        reduceDriver.withInput(new Text("M"), valuesM);

        List<Text> valuesA = new ArrayList<Text>();
        valuesA.add(new Text("Y\t1.0\t2"));
        valuesA.add(new Text("|M,Y"));
        valuesA.add(new Text("!"));
        reduceDriver.withInput(new Text("A"), valuesA);

        List<Text> valuesY = new ArrayList<Text>();
        valuesY.add(new Text("Y\t1.0\t2"));
        valuesY.add(new Text("|Y,A"));
        valuesY.add(new Text("!"));
        valuesY.add(new Text("A\t1.0\t2"));
        reduceDriver.withInput(new Text("Y"), valuesY);

        reduceDriver.withOutput(new Text("A"), new Text("0.6\tM,Y"));
        reduceDriver.withOutput(new Text("M"), new Text("1.4000001\tM"));
        reduceDriver.withOutput(new Text("Y"), new Text("1.0\tY,A"));

        reduceDriver.runTest(false);
    }
}

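As shown, the mapper output is recreated as reducer input, and we check that the reducer output matches the first iteration of the page rank calculation. Every iteration produces the same output format, but possibly with different page rank values.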
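
The arithmetic behind the expected values of the first iteration can be sketched as follows. This is not the project's `RankCalculateReduce`: the class name `RankReduceSketch`, the formula `rank = d * sum + (1 - d)` and the damping factor `d = 0.8` are assumptions, chosen because they reproduce the test values (for page A: 0.8 * (1.0 / 2) + 0.2 = 0.6).

```java
import java.util.Arrays;
import java.util.List;

class RankReduceSketch {
    // Assumed damping factor, inferred from the expected test values
    static final float DAMPING = 0.8f;

    // Sums the rank shares sent to a page and applies the damping factor.
    // Marker records ("!" and "|...") carry no rank and are skipped.
    static float newRank(List<String> values) {
        float sum = 0f;
        for (String value : values) {
            if (value.equals("!") || value.startsWith("|")) {
                continue;
            }
            // Record format: "<source page>\t<rank>\t<number of links>"
            String[] parts = value.split("\t");
            sum += Float.parseFloat(parts[1]) / Integer.parseInt(parts[2]);
        }
        return DAMPING * sum + (1 - DAMPING);
    }

    public static void main(String[] args) {
        // Same values as in the reducer test
        System.out.println("A: " + newRank(Arrays.asList("Y\t1.0\t2", "|M,Y", "!")));
        System.out.println("M: " + newRank(Arrays.asList("A\t1.0\t2", "M\t1.0\t1", "|M", "!")));
        System.out.println("Y: " + newRank(Arrays.asList("Y\t1.0\t2", "|Y,A", "!", "A\t1.0\t2")));
    }
}
```

With this formula the three ranks always add up to the number of pages (0.6 + 1.4 + 1.0 = 3 after the first iteration), so they are a scaled version of the fractions 5/33, 21/33 and 7/33 given earlier.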
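
The final step is the "ordering" part. This is quite straightforward, and so is the unit test. This part only contains a mapper, which takes the output of the previous step and reformats it into the desired form: page rank plus page, ordered by page rank. The sorting by key is done by the Hadoop framework when the mapper output is passed to the reduce step, so the ordering is not reflected in the mapper unit test. The code for this unit test is: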

package net.pascalalma.hadoop.job3;
...
public class RankingMapperTest {

    MapDriver<LongWritable, Text, FloatWritable, Text> mapDriver;

    @Before
    public void setUp() {
        RankingMapper mapper = new RankingMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(1), new Text("A\t0.454545\tM,Y"));
        mapDriver.withInput(new LongWritable(2), new Text("M\t1.90\tM"));
        mapDriver.withInput(new LongWritable(3), new Text("Y\t0.68898\tY,A"));

        // Please note that we cannot check for ordering here because that is done by Hadoop after the Map phase
        mapDriver.withOutput(new FloatWritable(0.454545f), new Text("A"));
        mapDriver.withOutput(new FloatWritable(1.9f), new Text("M"));
        mapDriver.withOutput(new FloatWritable(0.68898f), new Text("Y"));

        mapDriver.runTest(false);
    }
}

So here we only check that the mapper takes the input and formats the output correctly.
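
To illustrate what happens around this mapper, the sketch below reformats each line into a (rank, page) pair and then sorts the pairs by rank to stand in for the sort that Hadoop performs on the keys. The class name `RankingSketch` and both methods are hypothetical, and the in-memory sort is only a stand-in for the framework's shuffle phase, which sorts keys in ascending order by default.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class RankingSketch {
    // Turns "<page>\t<rank>\t<links>" into a (rank, page) pair.
    static Map.Entry<Float, String> map(String line) {
        String[] parts = line.split("\t");
        return new AbstractMap.SimpleEntry<>(Float.parseFloat(parts[1]), parts[0]);
    }

    // Maps all lines and returns the page names ordered by ascending rank.
    static List<String> orderedPages(List<String> lines) {
        List<Map.Entry<Float, String>> entries = new ArrayList<>();
        for (String line : lines) {
            entries.add(map(line));
        }
        entries.sort(Map.Entry.comparingByKey()); // stand-in for Hadoop's key sort
        List<String> pages = new ArrayList<>();
        for (Map.Entry<Float, String> entry : entries) {
            pages.add(entry.getValue());
        }
        return pages;
    }

    public static void main(String[] args) {
        // Same input lines as in the mapper test
        List<String> lines = Arrays.asList(
                "A\t0.454545\tM,Y", "M\t1.90\tM", "Y\t0.68898\tY,A");
        // prints: [A, Y, M]
        System.out.println(orderedPages(lines));
    }
}
```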

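This concludes the walkthrough of the unit tests. With this project you should be able to run the tests yourself and gain more insight into how the original code works. It certainly helped me to understand it!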

The complete version of the code, including the unit tests, can be found here.

Translated from: https://www.javacodegeeks.com/2015/02/calculate-pageranks-apache-hadoop.html
