
MapReduce English Interview Questions


1——What is MapReduce? (How does MapReduce work?)

MapReduce is a programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

Each phase has key-value pairs as input and output, the types of which can be chosen by the programmer (see InputFormat). To implement a MapReduce job, we need to specify two functions: the map function and the reduce function.
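
As an illustration, here is a minimal sketch of the two functions for the classic word-count job, written against the newer org.apache.hadoop.mapreduce API; the class names are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: (line offset, line text) -> (word, 1)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Note how the key and value types (LongWritable, Text, IntWritable) are chosen per phase, as the paragraph above describes.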

2——Hadoop Data Types

Rather than using the built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization; these can be found in the org.apache.hadoop.io package.
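
For example, a short sketch using two of these types, IntWritable and Text, which serve as Hadoop's counterparts to Java's int and String:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Hadoop's counterpart to Java's int
        IntWritable count = new IntWritable(163);
        // Hadoop's counterpart to Java's String
        Text word = new Text("hadoop");
        System.out.println(count.get() + " " + word.toString());
    }
}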

3——Data Flow

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. (For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster, or specified when each file is created.)

4——HDFS

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

5——Streaming Data Access

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
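
To illustrate this read-many pattern, a minimal sketch that streams a file out of HDFS through the FileSystem API; the input URI passed on the command line is hypothetical:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // e.g. hdfs://namenode/user/data/input.txt (hypothetical)
        String uri = args[0];
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        InputStream in = null;
        try {
            // Open the file and read it sequentially, start to finish
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}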

6——NameNode and DataNode

An HDFS cluster has two types of node: a namenode, which is the master, and a number of datanodes, which act as the workers. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
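
As a small illustration of this division of labor, a client can ask (via the namenode's metadata) which datanodes host each block of a file, using the standard FileSystem API; the path argument is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Path taken from the command line, e.g. /user/data/input.txt (hypothetical)
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Ask for the locations of every block in the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosted on " + String.join(",", block.getHosts()));
        }
    }
}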

The main role of the secondary namenode is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.

7——Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.
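
A minimal sketch of what this looks like for Hadoop's Writable types: a hypothetical helper that captures a Writable's serialized form in a byte array:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class SerializeDemo {
    // Turn a structured object (a Writable) into a byte stream
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut); // the Writable writes its own binary representation
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new IntWritable(163));
        // An IntWritable serializes to 4 bytes, matching a Java int
        System.out.println("serialized length: " + bytes.length);
    }
}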

Reposted from: https://www.cnblogs.com/conie/p/3632429.html
