4 Developing MapReduce Applications

Published: 2023/11/30, 豆豆

System Parameter Configuration

A Configuration object is built from resources. Each resource is an XML file that holds a series of property name/value pairs, for example:

configuration-default.xml

configuration-site.xml

Configuration conf = new Configuration();
conf.addResource("configuration-default.xml");
conf.addResource("configuration-site.xml");

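Properties from resources added later override the values defined in resources added earlier, unless the earlier property is marked final.

By default, Hadoop configures itself from two resources, loaded in order: core-default.xml and core-site.xml.

The former defines the system default properties; the latter holds site-specific overrides.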
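The override rule can be illustrated with a small plain-Java model. This is only a sketch of the rule described above, not Hadoop's Configuration class: the LayeredConfig class and the property names size and weight are hypothetical, and resources are passed as in-memory maps rather than XML files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of layered configuration resources (NOT Hadoop's Configuration class). */
class LayeredConfig {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finalKeys = new HashSet<>();

    /**
     * Adds one resource. Later resources override earlier ones,
     * except for keys that an earlier resource marked as final.
     */
    void addResource(Map<String, String> properties, Set<String> finals) {
        for (Map.Entry<String, String> e : properties.entrySet()) {
            if (finalKeys.contains(e.getKey())) {
                continue; // an earlier resource locked this key
            }
            values.put(e.getKey(), e.getValue());
            if (finals.contains(e.getKey())) {
                finalKeys.add(e.getKey());
            }
        }
    }

    String get(String name) {
        return values.get(name);
    }

    /** Builds the two-resource example: defaults first, then site overrides. */
    static LayeredConfig demo() {
        LayeredConfig conf = new LayeredConfig();
        // "defaults" resource: size is ordinary, weight is marked final (hypothetical names)
        Map<String, String> defaults = new HashMap<>();
        defaults.put("size", "10");
        defaults.put("weight", "heavy");
        Set<String> defaultFinals = new HashSet<>();
        defaultFinals.add("weight");
        conf.addResource(defaults, defaultFinals);
        // "site" resource tries to override both properties
        Map<String, String> site = new HashMap<>();
        site.put("size", "12");
        site.put("weight", "light");
        conf.addResource(site, new HashSet<>());
        return conf;
    }

    public static void main(String[] args) {
        LayeredConfig conf = demo();
        System.out.println("size = " + conf.get("size"));
        System.out.println("weight = " + conf.get("weight"));
    }
}
```

Running main prints size = 12 (the later resource wins) and weight = heavy (the final value from the earlier resource is kept).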

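Performance Tuning

Once a job produces correct results, the goal is to make it run in as little time and use as little space as possible.

Use large input files

A job over 1,000 files of 2.3 MB each ran for 33 minutes; after the files were merged into a single 2.2 GB file, the same job ran for 3 minutes.

Alternatively, use Hadoop's CombineFileInputFormat, which packs multiple files into one input split so that each map task processes more data.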
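To see why packing helps, here is a simplified plain-Java model of grouping small files into splits. It is a sketch only: the SplitPacker class is hypothetical, the 128 MB split limit is an assumed value, and the greedy strategy is a stand-in for Hadoop's real implementation, which also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of packing many small files into fewer input splits. */
class SplitPacker {
    /**
     * Greedily groups file sizes (in bytes) into splits of at most maxSplitBytes.
     * A file larger than the limit gets a split of its own.
     */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current); // current split is full, start a new one
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** 1,000 files of 2.3 MB each, packed under an assumed 128 MB limit. */
    static int demoSplitCount() {
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            sizes.add(2_300_000L);
        }
        return pack(sizes, 128_000_000L).size();
    }

    public static void main(String[] args) {
        System.out.println("splits without packing: 1000");
        System.out.println("splits with packing: " + demoSplitCount());
    }
}
```

With these assumed numbers, 55 files fit into each split, so the 1,000 files collapse into 19 splits instead of at least one split (and one map task) per file.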

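Compress files

Compressing the map output has several benefits: it reduces the space needed to store files, speeds up transfer over the network, and shortens the time spent moving data between memory and disk.

Set mapred.compress.map.output to true to compress the map output.

Set mapred.map.output.compression.codec to choose the compression codec.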
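The space saving is easy to observe outside Hadoop. The sketch below uses only the JDK's gzip stream on made-up word-count style records; it does not go through Hadoop's codec classes, and the MapOutputCompressionDemo class and its sample data are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Shows how repetitive key/value text shrinks under gzip (JDK only, no Hadoop). */
class MapOutputCompressionDemo {
    /** Builds sample "word<TAB>1" records similar to word-count map output. */
    static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "map", "reduce", "shuffle"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses a byte array with gzip and returns the compressed bytes. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the gzip stream flushes the trailer into the buffer
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes: " + raw.length);
        System.out.println("gzip bytes: " + packed.length);
    }
}
```

Real map output is rarely this repetitive, so the ratio on an actual job will differ; the point is only that fewer bytes have to be written to disk and shipped to the reducers.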

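Modify job properties

Edit the following properties in the configuration files under the conf directory:

mapred.tasktracker.map.tasks.maximum

mapred.tasktracker.reduce.tasks.maximum

These set the number of map and reduce task slots on each TaskTracker; both default to 2.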

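MapReduce Workflows

When the processing becomes more complex, it can be expressed with more elaborate map and reduce functions, or even with additional MapReduce jobs.

More complex map and reduce functions

A basic MapReduce job merely extends the base classes Mapper and Reducer and overrides their core method, map or reduce.

The other methods of these base classes are described below; they make it possible to write map and reduce functions with richer behavior and finer control.

The setup function

Its source code is: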

/**
 * Called once at the start of the task.
 */
protected void setup(Context context) throws IOException, InterruptedException {
    // NOTHING
}

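This function is called once, when the task starts.

Each task uses the Map or Reduce class as its processing body and an input split as its input; once its own split has been processed, the task is destroyed.

setup is called exactly once, after the task starts and before any data is processed, whereas the overridden map function is called once for every key/value pair in the input and the overridden reduce function once for every key.

Work that the map or reduce function would otherwise repeat on every call can be moved into setup;

global variables that the map or reduce function may need during processing can be initialized there;

global variables can be read from the job information;

the start of the task can be monitored.

The cleanup function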

/**
 * Called once at the end of the task.
 */
protected void cleanup(Context context) throws IOException, InterruptedException {
    // NOTHING
}

cleanup is similar to setup, except that it runs just before the task is destroyed.

The run function

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

This function is the entry point that drives the map or reduce function.

Override it if you need more complete control over the map or reduce phase.
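The interplay of setup, map, cleanup, and run can be tried out without a cluster. The following is a minimal sketch with stand-in classes, not Hadoop's Mapper: MiniContext, MiniMapper, KeywordCounter, and the property name demo.keyword are all hypothetical. It mirrors the shape of the run method quoted above: setup reads a global value once, map is called per record, and cleanup emits a single result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the task lifecycle: setup once, map per record, cleanup once. */
class LifecycleDemo {
    /** Hypothetical miniature context: job settings in, output records out. */
    static class MiniContext {
        final Map<String, String> conf = new HashMap<>();
        final List<String> output = new ArrayList<>();
    }

    /** Mirrors the shape of the run() method quoted above. */
    static abstract class MiniMapper {
        protected void setup(MiniContext context) { }
        protected abstract void map(String record, MiniContext context);
        protected void cleanup(MiniContext context) { }

        public void run(List<String> split, MiniContext context) {
            setup(context);
            for (String record : split) {
                map(record, context);
            }
            cleanup(context);
        }
    }

    /** Counts records containing a keyword that is read once from the job settings. */
    static class KeywordCounter extends MiniMapper {
        private String keyword;
        private int matches;

        @Override
        protected void setup(MiniContext context) {
            keyword = context.conf.get("demo.keyword"); // global value, read once
            matches = 0;
        }

        @Override
        protected void map(String record, MiniContext context) {
            if (record.contains(keyword)) {
                matches++;
            }
        }

        @Override
        protected void cleanup(MiniContext context) {
            context.output.add(keyword + "\t" + matches); // emit one total per task
        }
    }

    static List<String> demo() {
        MiniContext context = new MiniContext();
        context.conf.put("demo.keyword", "error");
        List<String> split = Arrays.asList("error: disk full", "ok", "error: timeout");
        new KeywordCounter().run(split, context);
        return context.output;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The task emits a single record, the keyword with a count of 2, rather than one record per input line. In a real Mapper the same pattern would read the value from context.getConfiguration() in setup.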

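Sharing Global Data in MapReduce

1. Reading and writing HDFS files

This is done with Hadoop's Java API.

Note that concurrent writes from multiple map or reduce tasks can conflict and overwrite existing data.

Advantage: both reads and writes are supported, and the approach is intuitive.

Disadvantage: sharing even a small amount of global data requires I/O, which consumes system resources and increases the cost of completing the job.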

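2. Setting job configuration properties

When the job is set up, use set(String name, String value) of the Configuration class to store simple global data in the job's configuration properties.

Then, inside a task, use get(String name) of the Configuration class to retrieve that data.

Advantage: simple, with little resource overhead.

Disadvantage: poorly suited to large amounts of shared data.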

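3. Using DistributedCache

DistributedCache is a read-only facility that provides cached files to an application; it can cache text files, archives, and jar files.

To use it, specify the URL of a local or HDFS file in the job configuration to register the file as a shared cache file.

After the job starts and before its tasks start, the MapReduce framework copies the cache files that may be needed to the local disk of each node that runs a task.

Advantage: each shared file is copied only once per job after the job starts, which suits large amounts of shared data.

Disadvantage: read-only.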

// Configuration
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/myapp/lookup"), conf);

// Usage inside the Mapper:
public static class Map extends Mapper<...> {
    private Path[] localArchives;
    private Path[] localFiles;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        localArchives = DistributedCache.getLocalCacheArchives(conf);
        localFiles = DistributedCache.getLocalCacheFiles(conf);
    }

    @Override
    public void map(K key, V value, Context context) throws IOException, InterruptedException {
        // Use the data obtained from the cached files;
        // k and v stand for the output key and value
        context.write(k, v);
    }
}
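Once the framework has copied a cached file to the node, reading it is ordinary local file I/O. The sketch below shows one common way to use such a file: load it into memory once, then consult it for every record. It uses only the JDK; the LookupJoinDemo class, the tab-separated file format, and the sample contents are assumptions, and in a real job the path would come from localFiles in setup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Loads a local lookup file once and uses it per record (JDK only, no Hadoop). */
class LookupJoinDemo {
    private final Map<String, String> lookup = new HashMap<>();

    /** Plays the role of setup(): parse "code<TAB>name" lines into memory. */
    void loadLookup(Path localFile) {
        try {
            for (String line : Files.readAllLines(localFile, StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Plays the role of map(): resolve a code to its name without any file I/O. */
    String enrich(String code) {
        return lookup.getOrDefault(code, "UNKNOWN");
    }

    /** Writes a small temporary lookup file, loads it, and resolves one code. */
    static String demo(String code) {
        try {
            Path file = Files.createTempFile("lookup", ".tsv");
            Files.write(file, Arrays.asList("CN\tChina", "FR\tFrance"), StandardCharsets.UTF_8);
            LookupJoinDemo task = new LookupJoinDemo();
            task.loadLookup(file);
            Files.delete(file);
            return task.enrich(code);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("FR"));
        System.out.println(demo("XX"));
    }
}
```

Running main prints France and then UNKNOWN. Loading in setup rather than in map matters because map runs once per record, while the file only needs to be parsed once per task.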

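Chaining MapReduce Jobs

When a problem cannot be solved by a single MapReduce job, several jobs have to be arranged in a workflow so that they cooperate to complete a complex task automatically, without the user having to start each job by hand.

1. Linear MapReduce job flow

The simplest approach is to define several jobs in a fixed order: each job takes the output of the previous job as its input, processes it, and passes its own output on to the next job.

This is straightforward to implement: arrange the launch code so that each job starts only after the previous job has finished, and set each job's input path to the previous job's output path.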

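2. Complex MapReduce job flows

The linear approach is still not enough for some complex tasks.

For example, Job3 may need to combine the outputs of Job1 and Job2. In that case Job3 can start only after both Job1 and Job2 have finished, while Job1 and Job2 do not depend on each other.

For such cases the MapReduce framework provides an API for organizing jobs into complex flows: the ControlledJob and JobControl classes, both in the org.apache.hadoop.mapreduce.lib.jobcontrol package.

The procedure is:

First, configure each job as usual.

Next, wrap each configured job in its own ControlledJob object.

Then declare the dependencies with ControlledJob's addDependingJob().

After that, create a JobControl object and add all the jobs to it with addJob().

Finally, start the job flow with the JobControl object's run method.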
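The scheduling rule behind these steps (a job starts only when everything it depends on has finished) can be modeled in a few lines of plain Java. This is a sketch of the rule only, not JobControl's implementation: the JobFlowDemo class is hypothetical, jobs are represented by name, and "running" a job just records it as finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of dependency-ordered job execution (not Hadoop's JobControl). */
class JobFlowDemo {
    // job name -> names of the jobs it depends on, in registration order
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    void addJob(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    /** Repeatedly runs every job whose dependencies have all finished. */
    List<String> run() {
        Set<String> finished = new LinkedHashSet<>();
        while (finished.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> job : dependencies.entrySet()) {
                if (!finished.contains(job.getKey()) && finished.containsAll(job.getValue())) {
                    finished.add(job.getKey()); // a real flow would launch the job here
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("circular or missing dependency");
            }
        }
        return new ArrayList<>(finished);
    }

    static List<String> demo() {
        JobFlowDemo flow = new JobFlowDemo();
        flow.addJob("job3", "job1", "job2"); // registered first, but must run last
        flow.addJob("job1");
        flow.addJob("job2");
        return flow.run();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running main prints [job1, job2, job3]: job3 waits for both of its dependencies even though it was registered first. A linear flow is the special case in which every job depends on exactly the one before it.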

3. Adding pre-processing and post-processing to a job

This is done with the ChainMapper and ChainReducer classes in the org.apache.hadoop.mapred.lib package.

The ChainMapper class allows using multiple Mapper classes within a single Map task.

The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output.

The ChainReducer class allows chaining multiple Mapper classes after a Reducer within the Reducer task.

For each record output by the Reducer, the Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output.

The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed in a chain. This enables having reusable specialized Mappers that can be combined to perform composite operations within a single task.

Special care has to be taken when creating chains to ensure that the key/values output by a Mapper are valid for the following Mapper in the chain. It is assumed that all Mappers and the Reducer in the chain use matching output and input key and value classes, as no conversion is done by the chaining code.

Using the ChainMapper and ChainReducer classes, it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit of this pattern is a dramatic reduction in disk IO.

IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper; this is done by the addMapper call for the last mapper in the chain.

ChainMapper usage pattern:

...
conf.setJobName("chain");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

JobConf mapAConf = new JobConf(false);
...
ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, mapAConf);

JobConf mapBConf = new JobConf(false);
...
ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, mapBConf);

JobConf reduceConf = new JobConf(false);
...
ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, reduceConf);

ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, null);

ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
    LongWritable.class, LongWritable.class, true, null);

FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);
...

JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);
...
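The piping idea itself, where each stage's output feeds the next stage in memory instead of going through disk, can be modeled with plain Java functions. This is a deliberately reduced sketch: the MapperChainDemo class is hypothetical, records are bare strings rather than key/value pairs, and every stage emits exactly one record per input, whereas a real Mapper may emit none or many.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** In-memory model of a mapper chain: each stage feeds the next without touching disk. */
class MapperChainDemo {
    /** Applies the stages in order; one record in, one record out per stage. */
    static List<String> runChain(List<String> input, List<Function<String, String>> stages) {
        List<String> output = new ArrayList<>();
        for (String record : input) {
            String current = record;
            for (Function<String, String> stage : stages) {
                current = stage.apply(current); // output of one stage is input of the next
            }
            output.add(current); // only the last stage's output is kept
        }
        return output;
    }

    static List<String> demo() {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(String::trim);        // stage A: pre-processing
        stages.add(String::toUpperCase); // stage B: transformation
        return runChain(Arrays.asList("  hadoop ", " chain"), stages);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running main prints [HADOOP, CHAIN]. Every stage here has the type String to String, which plays the role of the matching key/value classes required between neighboring Mappers; the stages themselves know nothing about being chained.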
