The evolution of cluster scheduler architectures (repost)

Published: 2025/4/5

Original article: http://www.firmament.io/blog/scheduler-architectures.html

Cluster schedulers are an important component of modern infrastructure, and have evolved significantly in the last few years. Their architecture has moved from monolithic designs to much more flexible, disaggregated and distributed designs. However, many current open-source offerings are either still monolithic, or otherwise lack key features. These features matter to real-world users, as they are required to achieve good utilization.


This post is the first in a series about task scheduling on large clusters, such as those operated by internet companies like Amazon, Google, Facebook, Microsoft, or Yahoo!, but increasingly elsewhere too. Scheduling is an important topic because it directly affects the cost of operating a cluster: a poor scheduler results in low utilization, which costs money as expensive machines are left idle. High utilization, however, is not sufficient on its own: antagonistic workloads interfere with other workloads unless the decisions are made carefully.

Architectural evolution

This post discusses how scheduler architectures have evolved over the last few years, and why this happened. Figure 1 visualises the different approaches: a gray square corresponds to a machine, a coloured circle to a task, and a rounded rectangle with an "S" inside corresponds to a scheduler.[0] Arrows indicate placement decisions made by schedulers, and the three colours correspond to different workloads (e.g., web serving, batch analytics, and machine learning).

(a) Monolithic scheduler. (b) Two-level scheduling. (c) Shared-state scheduling. (d) Distributed scheduling. (e) Hybrid scheduling.

Figure 1: Different cluster scheduler architectures. Gray boxes represent cluster machines, circles correspond to tasks, and Si denotes scheduler i.

Many cluster schedulers – such as most high-performance computing (HPC) schedulers, the Borg scheduler, various early Hadoop schedulers and the Kubernetes scheduler – are monolithic. A single scheduler process runs on one machine (e.g., the JobTracker in Hadoop v1, and kube-scheduler in Kubernetes) and assigns tasks to machines. All workloads are handled by the same scheduler, and all tasks run through the same scheduling logic (Figure 1a). This is simple and uniform, and has led to increasingly sophisticated schedulers being developed. As an example, see the Paragon and Quasar schedulers, which use a machine learning approach to avoid negative interference between workloads competing for resources.

Most clusters run different types of applications today (as opposed to, say, just Hadoop MapReduce jobs in the early days). However, maintaining a single scheduler implementation that handles mixed (heterogeneous) workloads can be tricky, for several reasons:

  • It is quite reasonable to expect a scheduler to treat long-running service jobs and batch analytics jobs differently.
  • Since different applications have different needs, supporting them all keeps adding features to the scheduler, increasing the complexity of its logic and implementation.
  • The order in which the scheduler processes tasks matters: queueing effects (e.g., head-of-line blocking) and backlog can build up unless the scheduler is carefully designed.

Overall, this sounds like the makings of an engineering nightmare – and the never-ending lists of feature requests that scheduler maintainers receive attest to this.[1]
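To make the head-of-line blocking problem concrete, here is a minimal sketch. It assumes a single machine with a fixed number of free CPUs and a queue of (name, cpus) pairs; both function names are hypothetical and do not correspond to any real scheduler's API.

```python
from collections import deque


def schedule_fifo(queue, free_cpus):
    """Strict FIFO: stop as soon as the task at the head does not fit."""
    pending = deque(queue)
    placed = []
    while pending and pending[0][1] <= free_cpus:
        name, cpus = pending.popleft()
        free_cpus -= cpus
        placed.append(name)
    return placed


def schedule_skipping(queue, free_cpus):
    """Skip tasks that do not fit and keep looking further down the queue."""
    placed = []
    for name, cpus in queue:
        if cpus <= free_cpus:
            free_cpus -= cpus
            placed.append(name)
    return placed


# A large task at the head of the queue blocks two small tasks behind it.
queue = [("big", 16), ("a", 1), ("b", 2)]
print(schedule_fifo(queue, free_cpus=4))
print(schedule_skipping(queue, free_cpus=4))
```

Skipping avoids the blocking, but it can starve the large task indefinitely, which is one reason why the processing order needs careful design.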

Two-level scheduling architectures address this problem by separating the concerns of resource allocation and task placement. This allows the task placement logic to be tailored towards specific applications, but also maintains the ability to share the cluster between them. The Mesos cluster manager pioneered this approach, and YARN supports a limited version of it. In Mesos, resources are offered to application-level schedulers (which may pick and choose from them), while YARN allows the application-level schedulers to request resources (and receive allocations in return).[2] Figure 1b shows the general idea: workload-specific schedulers (S0–S2) interact with a resource manager that carves out dynamic partitions of the cluster resources for each workload. This is a very flexible approach that allows for custom, workload-specific scheduling policies.
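The offer-based split between resource allocation and task placement can be sketched roughly as follows. This is a toy model under stated assumptions, not the Mesos API: the classes Offer, ResourceManager and BatchScheduler, and the "first offer that fits" policy, are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    machine: str
    cpus: float
    mem_gb: float


class ResourceManager:
    """Lower level: tracks free resources and decides what to offer."""

    def __init__(self, free):
        self.free = dict(free)  # machine -> (cpus, mem_gb)

    def make_offers(self):
        return [Offer(m, c, g)
                for m, (c, g) in sorted(self.free.items())
                if c > 0 and g > 0]

    def accept(self, offer, cpus, mem_gb):
        c, g = self.free[offer.machine]
        if cpus > c or mem_gb > g:
            raise ValueError("task does not fit in the offered resources")
        self.free[offer.machine] = (c - cpus, g - mem_gb)


class BatchScheduler:
    """Upper level: application-specific placement over the offers it sees."""

    def __init__(self, cpus, mem_gb):
        self.cpus, self.mem_gb = cpus, mem_gb

    def pick(self, offers):
        # Hypothetical policy: take the first offer that is large enough.
        for offer in offers:
            if offer.cpus >= self.cpus and offer.mem_gb >= self.mem_gb:
                return offer
        return None


rm = ResourceManager({"m1": (1.0, 2.0), "m2": (4.0, 16.0)})
sched = BatchScheduler(cpus=2.0, mem_gb=4.0)
choice = sched.pick(rm.make_offers())
if choice is not None:
    rm.accept(choice, sched.cpus, sched.mem_gb)
print(choice, rm.free)
```

Note that BatchScheduler only ever sees what make_offers returns; it has no view of resources that are in use or offered elsewhere.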

Yet, the separation of concerns in two-level architectures comes with a drawback: the application-level schedulers lose omniscience, i.e., they cannot see all the possible placement options any more.[3] Instead, they merely see those options that correspond to resources offered (Mesos) or allocated (YARN) by the resource manager component. This has several disadvantages:

  • Priority preemption (higher priority tasks kick out lower priority ones) becomes difficult to implement: in an offer-based model, the resources occupied by running tasks aren't visible to the upper-level schedulers; in a request-based model, the lower-level resource manager must understand the preemption policy (which may be application-dependent).
  • Schedulers are unable to consider interference from running workloads that may degrade resource quality (e.g., "noisy neighbours" that saturate I/O bandwidth), since they cannot see them.
  • Application-specific schedulers care about many different aspects of the underlying resources, but their only means of choosing resources is the offer/request interface with the resource manager. This interface can easily become quite complex.

Shared-state architectures address this by moving to a semi-distributed model,[4] in which multiple replicas of cluster state are independently updated by application-level schedulers, as shown in Figure 1c. After the change is applied locally, the scheduler issues an optimistically concurrent transaction to update the shared cluster state. This transaction may fail, of course: another scheduler may have made a conflicting change in the meantime.
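The optimistic transaction described above can be illustrated with a small sketch. It assumes a deliberately simplified conflict check (a per-machine version counter) and slot-counted resources; CellState and schedule are hypothetical names, and this is not how Omega, Apollo or Nomad actually detect conflicts.

```python
class CellState:
    """Shared cluster state with a version counter per machine."""

    def __init__(self, free_slots):
        self.free = dict(free_slots)
        self.version = {m: 0 for m in free_slots}

    def snapshot(self):
        # Each scheduler works on its own private copy of the state.
        return dict(self.free), dict(self.version)

    def commit(self, machine, slots, seen_version):
        # Reject the transaction if the machine changed since the snapshot
        # was taken, or if the claim no longer fits.
        if self.version[machine] != seen_version or self.free[machine] < slots:
            return False
        self.free[machine] -= slots
        self.version[machine] += 1
        return True


def schedule(cell, slots, max_retries=3):
    """Plan against a snapshot, then try to commit; retry on conflict."""
    for _ in range(max_retries):
        free, version = cell.snapshot()
        candidates = [m for m in sorted(free) if free[m] >= slots]
        if not candidates:
            return None
        machine = candidates[0]
        if cell.commit(machine, slots, version[machine]):
            return machine
    return None


cell = CellState({"m1": 2})
_, stale = cell.snapshot()                    # scheduler A takes a snapshot
print(schedule(cell, 2))                      # scheduler B commits first
print(cell.commit("m1", 1, stale["m1"]))      # A's stale transaction fails
```

In a real system the conflict check could be finer-grained, so that two schedulers claiming different resources on the same machine do not conflict.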

The most prominent examples of shared-state designs are Omega at Google, and Apollo at Microsoft, as well as the Nomad container scheduler by HashiCorp. All of these materialise the shared cluster state in a single location: the "cell state" in Omega, the "resource monitor" in Apollo, and the "plan queue" in Nomad.[5] Apollo differs from the other two as its shared-state is read-only, and the scheduling transactions are submitted directly to the cluster machines. The machines themselves check for conflicts and accept or reject the changes. This allows Apollo to make progress even if the shared-state is temporarily unavailable.[6]

A "logical" shared-state design can also be achieved without materialising the full cluster state anywhere. In this approach (somewhat similar to what Apollo does), each machine maintains its own state and sends updates to different interested agents such as schedulers, machine health monitors, and resource monitoring systems. Each machine's local view of its state now forms a "shard" of the global shared-state.

However, shared-state architectures have some drawbacks, too: they must work with stale information (unlike a centralized scheduler), and may experience degraded scheduler performance under high contention (although this can apply to other architectures as well).

Fully-distributed architectures take the disaggregation even further: they have no coordination between schedulers at all, and use many independent schedulers to service the incoming workload, as shown in Figure 1d. Each of these schedulers works purely with its local, partial, and often out-of-date view of the cluster. Jobs can typically be submitted to any scheduler, and each scheduler may place tasks anywhere in the cluster. Unlike with two-level schedulers, there are no partitions that each scheduler is responsible for. Instead, the overall schedule and resource partitioning are emergent consequences of statistical multiplexing and randomness in workload and scheduler decisions – similar to shared-state schedulers, albeit without any central control at all.

The recent distributed scheduler movement probably started with the Sparrow paper, although the underlying concept (power of multiple random choices) first appeared in 1996. The key premise of Sparrow is a hypothesis that the tasks we run on clusters are becoming ever shorter in duration, supported by an argument that fine-grained tasks have many benefits. Consequently, the authors assume that tasks are becoming more numerous, meaning that a higher decision throughput must be supported by the scheduler. Since a single scheduler may not be able to keep up with this throughput (assumed to be a million tasks per second!), Sparrow spreads the load across many schedulers.
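The "power of multiple random choices" idea can be sketched as follows: probe d workers chosen at random and enqueue the task at the one with the shortest queue. This is a minimal illustration of the underlying concept only, not Sparrow's actual algorithm; place_task and max_queue are hypothetical helper names.

```python
import random


def place_task(queue_lengths, d=2, rng=random):
    """Probe d distinct random workers and enqueue at the least loaded one."""
    probed = rng.sample(range(len(queue_lengths)), d)
    best = min(probed, key=lambda w: queue_lengths[w])
    queue_lengths[best] += 1
    return best


def max_queue(num_workers, num_tasks, d, seed=0):
    """Longest worker queue after placing num_tasks tasks with d probes each."""
    rng = random.Random(seed)
    queues = [0] * num_workers
    for _ in range(num_tasks):
        place_task(queues, d, rng)
    return max(queues)


# d=1 is purely random placement; d=2 typically yields a noticeably
# shorter longest queue for the same load.
print(max_queue(100, 1000, d=1), max_queue(100, 1000, d=2))
```

No coordination between schedulers is needed here: each placement only uses the queue lengths of the few workers it probed.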

This makes perfect sense, and the lack of central control can be conceptually appealing; it also suits some workloads very well – more on this in a future post. For the moment, it suffices to note that since the distributed schedulers are uncoordinated, they apply significantly simpler logic than advanced monolithic, two-level, or shared-state schedulers. For example:

  • Distributed schedulers are typically based on a simple "slot" concept that chops each machine into n uniform slots, and places up to n parallel tasks. This glosses over the fact that tasks' resource requirements are not uniform.
  • They also use worker-side queues with simple service disciplines (e.g., FIFO in Sparrow), which restricts scheduling flexibility, as the scheduler can merely choose at which machine to enqueue a task.
  • Distributed schedulers have difficulty enforcing global invariants (e.g., fairness policies or strict priority precedence), since there is no central control.
  • Since they are designed for rapid decisions based on minimal knowledge, distributed schedulers cannot support or afford complex or application-specific scheduling policies. Avoiding interference between tasks, for example, becomes tricky.

Hybrid architectures are a recent (mostly academic) invention that seeks to address these drawbacks of fully distributed architectures by combining them with monolithic or shared-state designs. The way this typically works – e.g., in Tarcil, Mercury, and Hawk – is that there really are two scheduling paths: a distributed one for part of the workload (e.g., very short tasks, or low-priority batch workloads), and a centralized one for the rest. Figure 1e illustrates this design. The behaviour of each constituent part of a hybrid scheduler is identical to the part's architecture described above. In practice, however, no hybrid schedulers have been deployed in production settings yet, as far as I know.
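The two scheduling paths of a hybrid design amount to a routing decision made before any placement happens. The sketch below assumes that an estimated runtime and a numeric priority are known per task; the Task fields, the cutoff values and choose_path are hypothetical and not taken from Tarcil, Mercury or Hawk.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_runtime_s: float
    priority: int


def choose_path(task, short_task_cutoff_s=1.0, low_priority=0):
    """Send very short or low-priority tasks down the distributed path."""
    if task.est_runtime_s <= short_task_cutoff_s or task.priority <= low_priority:
        return "distributed"
    return "centralized"


for t in (Task("query-fragment", 0.2, 5),
          Task("web-server", 3600.0, 9),
          Task("batch-cleanup", 600.0, 0)):
    print(t.name, choose_path(t))
```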

What does this mean in practice?

Discussion about the relative merits of different scheduler architectures is not merely an academic topic, although it naturally revolves around research papers. For an extensive discussion of the Borg, Mesos and Omega papers from an industry perspective, for example, see Andrew Wang's excellent blog post. Moreover, many of the systems discussed are deployed in production settings at large enterprises (e.g., Apollo at Microsoft, Borg at Google, and Mesos at Apple), and they have in turn inspired other systems that are available as open source projects.

These days, many clusters run containerised workloads, and consequently a variety of container-focused "orchestration frameworks" have appeared. These are similar to what Google and others call "cluster managers". However, there are few detailed discussions of the schedulers within these frameworks and their design principles, and they typically focus more on the user-facing scheduler APIs (e.g., this report by Armand Grillet, which compares Docker Swarm, Mesos/Marathon, and the Kubernetes default scheduler). Moreover, many users neither know what difference the scheduler architecture makes, nor which one is most suitable for their applications.

Figure 2 shows an overview of a selection of open-source orchestration frameworks, their architecture and the features supported by their schedulers. At the bottom of the table, we also include closed-source systems at Google and Microsoft for reference. The resource granularity column indicates whether the scheduler assigns tasks to fixed-size slots, or whether it allocates resources in multiple dimensions (e.g., CPU, memory, disk I/O bandwidth, network bandwidth, etc.).

|        | Framework  | Architecture         | Resource granularity | Multi-scheduler      | Pluggable logic     | Priority preemption | Re-scheduling | Oversubscription | Resource estimation | Avoid interference |
| OPEN   | Kubernetes | monolithic           | multi-dimensional    | N [v1.2, DD, Issue]  | Y [DD]              | N [Issue]           | N [Issue]     | Y [DD]           | N                   | N                  |
| OPEN   | Swarm      | monolithic           | multi-dimensional    | N                    | N                   | N [Issue]           | N             | N                | N                   | N                  |
| OPEN   | YARN       | monolithic/two-level | RAM/CPU slots        | Y                    | N [app-lvl. only]   | N [JIRA]            | N             | N [JIRA]         | N                   | N                  |
| OPEN   | Mesos      | two-level            | multi-dimensional    | Y                    | Y [framework-lvl.]  | N [JIRA]            | N             | Y [v0.23, Doc]   | N                   | N                  |
| OPEN   | Nomad      | shared-state         | multi-dimensional    | Y                    | Y                   | N [Issue]           | N [Issue]     | N [Issue]        | N                   | N                  |
| OPEN   | Sparrow    | fully-distributed    | fixed slots          | Y                    | N                   | N                   | N             | N                | N                   | N                  |
| CLOSED | Borg       | monolithic [7]       | multi-dimensional    | N [7]                | N [7]               | Y                   | Y             | Y                | Y                   | N                  |
| CLOSED | Omega      | shared-state         | multi-dimensional    | Y                    | Y                   | Y                   | Y             | Y                | Y                   | N                  |
| CLOSED | Apollo     | shared-state         | multi-dimensional    | Y                    | Y                   | Y                   | Y             | N                | N                   | N                  |

Figure 2: Architectural classification and feature matrix of widely-used orchestration frameworks, compared to closed-source systems.
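The difference behind the resource granularity column can be shown with a short sketch. It assumes resources are described as a dictionary of named dimensions, and that a slot is simply an equal share of the machine; fits and slot_capacity are hypothetical helpers, not part of any framework listed above.

```python
def fits(free, demand):
    """Multi-dimensional check: every requested dimension must be available."""
    return all(free.get(res, 0.0) >= amount for res, amount in demand.items())


def slot_capacity(machine, n_slots):
    """Fixed-slot model: each slot is an equal 1/n share of the machine."""
    return {res: total / n_slots for res, total in machine.items()}


machine = {"cpu": 8.0, "ram_gb": 32.0}
task = {"cpu": 1.0, "ram_gb": 20.0}  # memory-heavy, CPU-light

# The task fits on the machine as a whole, but not inside one of four slots.
print(fits(machine, task))
print(fits(slot_capacity(machine, 4), task))
```

A slot-based scheduler must either reject such a task or let it overrun its slot, whereas a multi-dimensional scheduler can place it and still hand out the remaining 7 CPUs to other tasks.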

One key aspect that helps determine an appropriate scheduler architecture is whether your cluster runs a heterogeneous (i.e., mixed) workload. This is the case, for example, when combining production front-end services (e.g., load-balanced web servers and memcached) with batch data analytics (e.g., MapReduce or Spark). Such combinations make sense in order to improve utilization, but the different applications have different scheduling needs. In a mixed setting, a monolithic scheduler likely results in sub-optimal assignments, since the logic cannot be diversified on a per-application basis. A two-level or shared-state scheduler will likely offer benefits here.[8]

Most user-facing service workloads run with resource allocations sized to serve peak demand expected of each container, but in practice they typically under-utilize their allocations substantially. In this situation, being able to opportunistically over-subscribe the resources with lower-priority workloads (while maintaining QoS guarantees) is the key to an efficient cluster. Mesos is currently the only open-source system that ships support for such over-subscription, although Kubernetes has a fairly mature proposal for adding it. We should expect more activity in this space in the future, since the utilization of most clusters is still substantially lower than the 60-70% reported for Google's Borg clusters. We will focus on resource estimation, over-subscription and efficient machine utilization in a future post in this series.
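As a rough illustration of opportunistic over-subscription, the sketch below estimates how much of a service's allocation could be lent to lower-priority work. The formula and the safety margin are assumptions made for this example; they are not the estimator used by Mesos, Borg or any other system mentioned here.

```python
def reclaimable(allocated, used, safety_margin=0.1):
    """Capacity that could be lent out: allocation minus padded actual usage."""
    headroom = allocated - used * (1.0 + safety_margin)
    return max(0.0, headroom)


# A service sized for a peak of 8 CPUs that currently uses 2 CPUs:
print(reclaimable(allocated=8.0, used=2.0, safety_margin=0.5))
# A service already running at its full allocation has nothing to lend:
print(reclaimable(allocated=4.0, used=4.0))
```

The lent capacity must be revocable: when the service's usage rises towards its allocation, the lower-priority tasks have to be throttled or evicted to maintain the QoS guarantees.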

Finally, specific analytics and OLAP-style applications (for example, Dremel or SparkSQL queries) can benefit from fully-distributed schedulers. However, fully-distributed schedulers (e.g., Sparrow) come with fairly restricted feature sets, and thus work best when the workload is homogeneous (i.e., all tasks run for roughly the same time), set-up times are low (i.e., tasks are scheduled to long-running workers, as with, e.g., MapReduce application-level tasks in YARN), and task churn is very high (i.e., many scheduling decisions must be made in a short time). We will talk more about these conditions and why fully-distributed schedulers – and the distributed components of hybrid schedulers – only make sense for these applications in the next blog post in this series. For now, it suffices to observe that distributed schedulers are substantially simpler than others, and do not support multiple resource dimensions, over-subscription, or re-scheduling.

Overall, the table in Figure 2 is evidence that the open-source frameworks still have some way to go until they match the feature sets of advanced but closed-source systems. This should serve as a call to action: as a result of missing features, utilization suffers, task performance is unpredictable, noisy neighbours cause pagers to go off, and elaborate hacks are required to coerce schedulers into supporting some user needs.

However, there is some good news: while many frameworks have monolithic schedulers today, many are also moving towards more flexible designs. Kubernetes already supports pluggable schedulers (the kube-scheduler pod can be replaced by another API-compatible scheduler pod) and multiple schedulers from v1.2, and has ongoing work on "extenders" to supply custom policies. Docker Swarm may – to my understanding – also gain pluggable scheduler support in the future.

What's next?

The next blog post in this series will look at the question of whether fully distributed architectures are the key innovation required to scale cluster schedulers further (spoiler: not necessarily). After that, we will also look at resource-fitting strategies (essential for good utilisation), and finally discuss how our Firmament scheduling platform combines many of the benefits of a shared-state architecture with the scheduling quality of monolithic schedulers and the speed of fully-distributed schedulers.



[0] – This figure simplifies things a bit: of course, in practice each machine runs more than one task, and many schedulers fit tasks in multiple resource dimensions, rather than into simple slots.

[1] – As an illustrative example, kube-scheduler in Kubernetes currently has outstanding feature requests for re-scheduling (pod migration), priority preemption, and resource oversubscription in its monolithic scheduler.

[2] – YARN's approach is restricted compared to Mesos because the application-level logic cannot choose resources (unless it requests much more than it needs from the resource manager); it can only place application-level "tasks" into pre-existing containers that represent cluster-level tasks.
This is a good fit for a system like Hadoop MapReduce, in which application-level tasks (maps and reduces) must be assigned to a dynamic collection of workers in an application-specific way (e.g., optimised for data locality and per-job). It is less suited to building a more general, multi-application scheduler on top – for example, a service scheduler like the "Marathon" framework for Mesos.
Monolithic schedulers like the Kubernetes one do not have an "upper half" and rely on the application to do its own work scheduling (e.g., running a Spark "worker controller" as a long-running service). Consequently, there are efforts to put Kubernetes on top of YARN via a special YARNScheduler extension – requiring two complex systems to be administered. However, there are also long-term efforts to improve native "big data" batch processing support in Kubernetes.

[3] – In the Omega paper, this problem was referred to as "information hiding".

[4] – Curiously, the literature does not appear to be in agreement about whether to consider shared-state schedulers centralized or distributed: the Hawk paper treats them as examples of distributed schedulers, while the Mercury paper refers to them as examples of a centralized architecture!

[5] – Nomad actually uses a slightly different approach to Omega and Apollo: while multiple independent schedulers exist, jobs are not submitted directly to them, but instead arrive via a centralised "evaluation broker" queue.

[6] – It's worth noting that the same optimisation – taking the shared state off the critical path to enacting scheduling decisions – can be applied to Omega, but not to Nomad (in its current design): Omega can ship deltas directly to machines and update the cell state out-of-band, while Nomad's design is premised on the leader reconciling changes in the plan queue.

[7] – The table entry reflects the original Borg, but the Borg paper and the recent ACM Queue paper note that multi-scheduler support and other features have been back-ported from Omega into Borg.

[8] – That said, having multiple schedulers is not a necessary precondition for serving mixed workloads: the Borg scheduler is a case in point that a sophisticated single scheduler can serve both long-running service and batch workloads. However, this comes at the expense of higher scheduler implementation complexity – a key motivation for Omega's multi-scheduler design.

Reposted from: https://www.cnblogs.com/davidwang456/p/6829370.html
