

A High-Performance, Highly Concurrent, Scalable, and Readable Network Server Architecture: StateThreads

Published: 2024/2/28

The translation follows the original below; I have put the code on GitHub: http://github.com/ossrs/state-threads.

State Threads for Internet Applications

Introduction

State Threads is an application library which provides a foundation for writing fast and highly scalable Internet Applications on UNIX-like platforms. It combines the simplicity of the multithreaded programming paradigm, in which one thread supports each simultaneous connection, with the performance and scalability of an event-driven state machine architecture.

1. Definitions

1.1 Internet Applications

An Internet Application (IA) is either a server or client network application that accepts connections from clients and may or may not connect to servers. In an IA the arrival or departure of network data often controls processing (that is, IA is a data-driven application). For each connection, an IA does some finite amount of work involving data exchange with its peer, where its peer may be either a client or a server. The typical transaction steps of an IA are to accept a connection, read a request, do some finite and predictable amount of work to process the request, then write a response to the peer that sent the request. One example of an IA is a Web server; the most general example of an IA is a proxy server, because it both accepts connections from clients and connects to other servers.

We assume that the performance of an IA is constrained by available CPU cycles rather than network bandwidth or disk I/O (that is, CPU is a bottleneck resource).

1.2 Performance and Scalability

The performance of an IA is usually evaluated as its throughput measured in transactions per second or bytes per second (one can be converted to the other, given the average transaction size). There are several benchmarks that can be used to measure throughput of Web serving applications for specific workloads (such as SPECweb96, WebStone, WebBench). Although there is no common definition for scalability, in general it expresses the ability of an application to sustain its performance when some external condition changes. For IAs this external condition is either the number of clients (also known as "users," "simultaneous connections," or "load generators") or the underlying hardware system size (number of CPUs, memory size, and so on). Thus there are two types of scalability: load scalability and system scalability, respectively.

The figure below shows how the throughput of an idealized IA changes with the increasing number of clients (solid blue line). Initially the throughput grows linearly (the slope represents the maximal throughput that one client can provide). Within this initial range, the IA is underutilized and CPUs are partially idle. Further increase in the number of clients leads to a system saturation, and the throughput gradually stops growing as all CPUs become fully utilized. After that point, the throughput stays flat because there are no more CPU cycles available. In the real world, however, each simultaneous connection consumes some computational and memory resources, even when idle, and this overhead grows with the number of clients. Therefore, the throughput of the real world IA starts dropping after some point (dashed blue line in the figure below). The rate at which the throughput drops depends, among other things, on application design.

We say that an application has a good load scalability if it can sustain its throughput over a wide range of loads. Interestingly, the SPECweb99 benchmark somewhat reflects the Web server's load scalability because it measures the number of clients (load generators) given a mandatory minimal throughput per client (that is, it measures the server's capacity). This is unlike SPECweb96 and other benchmarks that use the throughput as their main metric (see the figure below).


System scalability is the ability of an application to sustain its performance per hardware unit (such as a CPU) with the increasing number of these units. In other words, good system scalability means that doubling the number of processors will roughly double the application's throughput (dashed green line). We assume here that the underlying operating system also scales well. Good system scalability allows you to initially run an application on the smallest system possible, while retaining the ability to move that application to a larger system if necessary, without excessive effort or expense. That is, an application need not be rewritten or even undergo a major porting effort when changing system size.

Although scalability and performance are more important in the case of server IAs, they should also be considered for some client applications (such as benchmark load generators).

1.3 Concurrency

Concurrency reflects the parallelism in a system. The two unrelated types are virtual concurrency and real concurrency.

Virtual (or apparent) concurrency is the number of simultaneous connections that a system supports.

Real concurrency is the number of hardware devices, including CPUs, network cards, and disks, that actually allow a system to perform tasks in parallel.

An IA must provide virtual concurrency in order to serve many users simultaneously. To achieve maximum performance and scalability in doing so, the number of programming entities that an IA creates to be scheduled by the OS kernel should be kept close to (within an order of magnitude of) the real concurrency found on the system. These programming entities scheduled by the kernel are known as kernel execution vehicles. Examples of kernel execution vehicles include Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be dictated by the system size and not by the number of simultaneous connections.

2. Existing Architectures

There are a few different architectures that are commonly used by IAs. These include the Multi-Process, Multi-Threaded, and Event-Driven State Machine architectures.

2.1 Multi-Process Architecture

In the Multi-Process (MP) architecture, an individual process is dedicated to each simultaneous connection. A process performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.

User sessions in IAs are relatively independent; therefore, no synchronization between processes handling different connections is necessary. Because each process has its own private address space, this architecture is very robust. If a process serving one of the connections crashes, the other sessions will not be affected. However, to serve many concurrent connections, an equal number of processes must be employed. Because processes are kernel entities (and are in fact the heaviest ones), the number of kernel entities will be at least as large as the number of concurrent sessions. On most systems, good performance will not be achieved when more than a few hundred processes are created because of the high context-switching overhead. In other words, MP applications have poor load scalability.

On the other hand, MP applications have very good system scalability, because no resources are shared among different processes and there is no synchronization overhead.

The Apache Web Server 1.x ([Reference 1]) uses the MP architecture on UNIX systems.

2.2 Multi-Threaded Architecture

In the Multi-Threaded (MT) architecture, multiple independent threads of control are employed within a single shared address space. Like a process in the MP architecture, each thread performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.

Many modern UNIX operating systems implement a many-to-few model when mapping user-level threads to kernel entities. In this model, an arbitrarily large number of user-level threads is multiplexed onto a lesser number of kernel execution vehicles. Kernel execution vehicles are also known as virtual processors. Whenever a user-level thread makes a blocking system call, the kernel execution vehicle it is using will become blocked in the kernel. If there are no other non-blocked kernel execution vehicles and there are other runnable user-level threads, a new kernel execution vehicle will be created automatically. This prevents the application from blocking when it can continue to make useful forward progress.

Because IAs are by nature network I/O driven, all concurrent sessions block on network I/O at various points. As a result, the number of virtual processors created in the kernel grows close to the number of user-level threads (or simultaneous connections). When this occurs, the many-to-few model effectively degenerates to a one-to-one model. Again, like in the MP architecture, the number of kernel execution vehicles is dictated by the number of simultaneous connections rather than by number of CPUs. This reduces an application's load scalability. However, because kernel threads (lightweight processes) use fewer resources and are more light-weight than traditional UNIX processes, an MT application should scale better with load than an MP application.

Unexpectedly, the small number of virtual processors sharing the same address space in the MT architecture destroys an application's system scalability because of contention among the threads on various locks. Even if an application itself is carefully optimized to avoid lock contention around its own global data (a non-trivial task), there are still standard library functions and system calls that use common resources hidden from the application. For example, on many platforms thread safety of memory allocation routines (malloc(3), free(3), and so on) is achieved by using a single global lock. Another example is a per-process file descriptor table. This common resource table is shared by all kernel execution vehicles within the same process and must be protected when one modifies it via certain system calls (such as open(2), close(2), and so on). In addition to that, maintaining the caches coherent among CPUs on multiprocessor systems hurts performance when different threads running on different CPUs modify data items on the same cache line.

In order to improve load scalability, some applications employ a different type of MT architecture: they create one or more thread(s) per task rather than one thread per connection. For example, one small group of threads may be responsible for accepting client connections, another for request processing, and yet another for serving responses. The main advantage of this architecture is that it eliminates the tight coupling between the number of threads and number of simultaneous connections. However, in this architecture, different task-specific thread groups must share common work queues that must be protected by mutual exclusion locks (a typical producer-consumer problem). This adds synchronization overhead that causes an application to perform badly on multiprocessor systems. In other words, in this architecture, the application's system scalability is sacrificed for the sake of load scalability.

Of course, the usual nightmares of threaded programming, including data corruption, deadlocks, and race conditions, also make MT architecture (in any form) non-simplistic to use.

2.3 Event-Driven State Machine Architecture

In the Event-Driven State Machine (EDSM) architecture, a single process is employed to concurrently process multiple connections. The basics of this architecture are described in Comer and Stevens [Reference 2]. The EDSM architecture performs one basic data-driven step associated with a particular connection at a time, thus multiplexing many concurrent connections. The process operates as a state machine that receives an event and then reacts to it.

In the idle state the EDSM calls select(2) or poll(2) to wait for network I/O events. When a particular file descriptor is ready for I/O, the EDSM completes the corresponding basic step (usually by invoking a handler function) and starts the next one. This architecture uses non-blocking system calls to perform asynchronous network I/O operations. For more details on non-blocking I/O see Stevens [Reference 3].

To take advantage of hardware parallelism (real concurrency), multiple identical processes may be created. This is called Symmetric Multi-Process EDSM and is used, for example, in the Zeus Web Server ([Reference 4]). To more efficiently multiplex disk I/O, special "helper" processes may be created. This is called Asymmetric Multi-Process EDSM and was proposed for Web servers by Druschel and others [Reference 5].

EDSM is probably the most scalable architecture for IAs. Because the number of simultaneous connections (virtual concurrency) is completely decoupled from the number of kernel execution vehicles (processes), this architecture has very good load scalability. It requires only minimal user-level resources to create and maintain an additional connection.

Like MP applications, Multi-Process EDSM has very good system scalability because no resources are shared among different processes and there is no synchronization overhead.

Unfortunately, the EDSM architecture is monolithic rather than based on the concept of threads, so new applications generally need to be implemented from the ground up. In effect, the EDSM architecture simulates threads and their stacks the hard way.

3. State Threads Library

The State Threads library combines the advantages of all of the above architectures. The interface preserves the programming simplicity of thread abstraction, allowing each simultaneous connection to be treated as a separate thread of execution within a single process. The underlying implementation is close to the EDSM architecture as the state of each particular concurrent session is saved in a separate memory segment.

3.1 State Changes and Scheduling

The state of each concurrent session includes its stack environment (stack pointer, program counter, CPU registers) and its stack. Conceptually, a thread context switch can be viewed as a process changing its state. There are no kernel entities involved other than processes. Unlike other general-purpose threading libraries, the State Threads library is fully deterministic. The thread context switch (process state change) can only happen in a well-known set of functions (at I/O points or at explicit synchronization points). As a result, process-specific global data does not have to be protected by mutual exclusion locks in most cases. The entire application is free to use all the static variables and non-reentrant library functions it wants, greatly simplifying programming and debugging while increasing performance. This is somewhat similar to a co-routine model (co-operatively multitasked threads), except that no explicit yield is needed -- sooner or later, a thread performs a blocking I/O operation and thus surrenders control. All threads of execution (simultaneous connections) have the same priority, so scheduling is non-preemptive, like in the EDSM architecture. Because IAs are data-driven (processing is limited by the size of network buffers and data arrival rates), scheduling is non-time-slicing.

Only two types of external events are handled by the library's scheduler, because only these events can be detected by select(2) or poll(2): I/O events (a file descriptor is ready for I/O) and time events (some timeout has expired). However, other types of events (such as a signal sent to a process) can also be handled by converting them to I/O events. For example, a signal handling function can perform a write to a pipe (write(2) is reentrant/asynchronous-safe), thus converting a signal event to an I/O event.

To take advantage of hardware parallelism, as in the EDSM architecture, multiple processes can be created in either a symmetric or asymmetric manner. Process management is not in the library's scope but instead is left up to the application.

There are several general-purpose threading libraries that implement a many-to-one model (many user-level threads to one kernel execution vehicle), using the same basic techniques as the State Threads library (non-blocking I/O, event-driven scheduler, and so on). For an example, see GNU Portable Threads ([Reference 6]). Because they are general-purpose, these libraries have different objectives than the State Threads library. The State Threads library is not a general-purpose threading library, but rather an application library that targets only certain types of applications (IAs) in order to achieve the highest possible performance and scalability for those applications.

3.2 Scalability

State threads are very lightweight user-level entities, and therefore creating and maintaining user connections requires minimal resources. An application using the State Threads library scales very well with the increasing number of connections.

On multiprocessor systems an application should create multiple processes to take advantage of hardware parallelism. Using multiple separate processes is the only way to achieve the highest possible system scalability. This is because duplicating per-process resources is the only way to avoid significant synchronization overhead on multiprocessor systems. Creating separate UNIX processes naturally offers resource duplication. Again, as in the EDSM architecture, there is no connection between the number of simultaneous connections (which may be very large and changes within a wide range) and the number of kernel entities (which is usually small and constant). In other words, the State Threads library makes it possible to multiplex a large number of simultaneous connections onto a much smaller number of separate processes, thus allowing an application to scale well with both the load and system size.

3.3 Performance

Performance is one of the library's main objectives. The State Threads library is implemented to minimize the number of system calls and to make thread creation and context switching as fast as possible. For example, per-thread signal mask does not exist (unlike POSIX threads), so there is no need to save and restore a process's signal mask on every thread context switch. This eliminates two system calls per context switch. Signal events can be handled much more efficiently by converting them to I/O events (see above).

3.4 Portability

The library uses the same general, underlying concepts as the EDSM architecture, including non-blocking I/O, file descriptors, and I/O multiplexing. These concepts are available in some form on most UNIX platforms, making the library very portable across many flavors of UNIX. There are only a few platform-dependent sections in the source.

3.5 State Threads and NSPR

The State Threads library is a derivative of the Netscape Portable Runtime library (NSPR) [Reference 7]. The primary goal of NSPR is to provide a platform-independent layer for system facilities, where system facilities include threads, thread synchronization, and I/O. Performance and scalability are not the main concern of NSPR. The State Threads library addresses performance and scalability while remaining much smaller than NSPR. It is contained in 8 source files as opposed to more than 400, but provides all the functionality that is needed to write efficient IAs on UNIX-like platforms.

                                      NSPR          State Threads
Lines of code                         ~150,000      ~3,000
Dynamic library size (debug version):
  IRIX                                ~700 KB       ~60 KB
  Linux                               ~900 KB       ~70 KB

Conclusion

State Threads is an application library which provides a foundation for writing Internet Applications. To summarize, it has the following advantages:

It allows the design of fast and highly scalable applications. An application will scale well with both load and number of CPUs.

It greatly simplifies application programming and debugging because, as a rule, no mutual exclusion locking is necessary and the entire application is free to use static variables and non-reentrant library functions.

The library's main limitation:

All I/O operations on sockets must use the State Thread library's I/O functions because only those functions perform thread scheduling and prevent the application's processes from blocking.

References

  • Apache Software Foundation, http://www.apache.org.
  • Douglas E. Comer, David L. Stevens, Internetworking With TCP/IP, Vol. III: Client-Server Programming And Applications, Second Edition, Ch. 8, 12.
  • W. Richard Stevens, UNIX Network Programming, Second Edition, Vol. 1, Ch. 15.
  • Zeus Technology Limited, http://www.zeus.co.uk.
  • Peter Druschel, Vivek S. Pai, Willy Zwaenepoel, Flash: An Efficient and Portable Web Server. In Proceedings of the USENIX 1999 Annual Technical Conference, Monterey, CA, June 1999.
  • GNU Portable Threads, http://www.gnu.org/software/pth/.
  • Netscape Portable Runtime, http://www.mozilla.org/docs/refList/refNSPR/.
  • Other resources covering various architectural issues in IAs:

  • Dan Kegel, The C10K problem, http://www.kegel.com/c10k.html.
  • James C. Hu, Douglas C. Schmidt, Irfan Pyarali, JAWS: Understanding High Performance Web Systems, http://www.cs.wustl.edu/~jxh/research/research.html.
  • Portions created by SGI are Copyright © 2000 Silicon Graphics, Inc. All rights reserved.

    A Network Architecture Library: StateThreads

    Introduction

    StateThreads is a C library for network programming. It provides a foundation for writing fast, highly concurrent, and readable network programs on UNIX-like platforms. It combines the simplicity of the multithreaded programming style, in which one process supports many simultaneous connections, with the performance and concurrency of an event-driven state machine architecture.

    (Translator's note: it offers the performance, concurrency, and robustness of EDSM together with a simple, "multi-threaded" programming style. The threads are simulated in user space within a single OS thread, using setjmp and longjmp, similar to today's coroutines and fibers.)

    1. Definitions

    1.1 Internet Applications

    An Internet Application (IA) is a network client or server program that accepts connections from clients and may also connect to other servers. In an IA, the arrival or departure of network data often drives the control flow; that is, an IA is a data-driven application. For each connection, an IA does a finite amount of work involving data exchange with its peer, which may be a client or a server. The typical transaction steps of an IA are: accept a connection, read a request, do a finite amount of work to process the request, and write a response to the peer. One example of an IA is a Web server; a more general example is a proxy server, because it both accepts client connections and connects to other servers.

    We assume that an IA's performance is constrained by the CPU rather than by network bandwidth or disk I/O; that is, the CPU is the system bottleneck.

    1.2 Performance and Scalability

    An IA's performance is usually evaluated by its throughput: transactions per second or bytes per second (one can be converted to the other, given the average transaction size). Several benchmarks measure the throughput of Web applications under specific workloads, such as SPECweb96, WebStone, and WebBench. Although there is no universal definition of scalability, in general it means a system's ability to sustain its performance when external conditions change. For IAs, the external condition is either the number of connections (concurrency) or the underlying hardware (number of CPUs, memory, and so on). There are therefore two kinds of scalability: load scalability and system scalability.

    (Translator's note: scalability asks whether the system still runs efficiently when conditions change. Load scalability asks whether the system can carry the load as concurrency grows; system scalability asks whether added CPUs and other hardware are used efficiently to deliver more capacity.)

    The figure below shows how throughput changes as the number of clients grows; the solid blue line is the idealized case. Initially throughput grows linearly, and within this range the system and CPUs are partly idle. A further increase in clients saturates the system, and throughput hits a ceiling (the throughput the fully loaded CPUs can sustain); beyond that point, throughput stays flat because the CPUs are at their limit. In practice, every connection consumes CPU and memory even when idle, and this overhead grows with the number of connections, so a real IA's throughput starts to drop after some point (the dashed blue line). Where that drop begins is determined by, among other things, the application's architecture.

    We say a system has good load scalability if it keeps working well under high load. The SPECweb99 benchmark reflects load scalability reasonably well, because it measures the maximum number of clients the system can support at a mandatory minimum per-client throughput (translator's note: the point where the gray slope crosses the blue line, labeled "Capacity" in the figure). This is unlike SPECweb96 and other benchmarks, which use throughput as their main metric (translator's note: "Max throughput" in the figure, the ceiling of the blue line).


    System scalability is an application's ability to sustain performance per hardware unit as units such as CPUs are added; in other words, good system scalability means that doubling the CPUs roughly doubles the throughput (the dashed green line in the figure). We assume the underlying operating system also scales well. Good system scalability means that an application which runs well on a small machine can also perform well when moved to a large server; that is, changing the system size does not require a rewrite or a major effort.

    (Translator's note:

    The vertical axis is throughput; the horizontal axis is the number of connections.

    The gray line ("min acceptable throughput per client") is the throughput each client needs for smooth service.

    The solid blue line is an ideal server: its capacity never becomes a problem, and it reaches the maximum throughput the fully loaded CPUs can sustain.

    The dashed blue line is a real server: each connection consumes CPU and memory, so throughput begins to fall after some critical point, and that point is determined by the architecture. A good architecture pushes the point further out and supports higher concurrency stably; a poor one may seize up as concurrency rises.

    The dashed gray lines mark the two benchmarks: SPECweb96 measures the system's maximum throughput, while SPECweb99 measures the maximum number of clients the system can support at a required minimum per-client throughput. The latter better reflects load scalability, because it measures how the system holds up across different numbers of connections.

    Load scalability refers to the maximum load the system can carry: on the horizontal axis, the point where the blue line crosses the gray line, or where the blue line begins to fall.

    System scalability refers to whether throughput also increases when server capacity grows, e.g. when CPUs are added (the green line in the figure). With good system scalability, more CPUs yield more performance; with poor system scalability, adding CPUs does not help.

    )

    Although performance and scalability matter more for servers, client applications must also consider them, for example benchmark load generators.

    1.3 Concurrency

    Concurrency reflects a system's parallelism. It comes in two kinds, virtual concurrency and real concurrency:

    Virtual concurrency is the number of simultaneous connections a system supports.

    Real concurrency is the number of hardware devices, such as CPUs, network cards, and disks, that allow the system to perform tasks in parallel.

    An IA must provide virtual concurrency to serve many users simultaneously. To achieve maximum performance, the number of programming entities an IA creates for the kernel to schedule should be kept close to (within an order of magnitude of) the system's real concurrency (translator's note: use about as many processes as there are CPUs). These kernel-scheduled programming entities are kernel execution vehicles; examples include Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be dictated by the hardware, not by the concurrency (translator's note: the number of processes should be determined by the number of CPUs, not by the number of connections).

    2. Existing Architectures

    Several architectures are commonly used by IAs (Internet Applications): the process-based architecture (Multi-Process), the thread-based architecture (Multi-Threaded), and the Event-Driven State Machine architecture.

    2.1 The Process-Based Architecture: MP

    (Translator's note: "Multi-Process" literally means multiple processes, but event-driven state machines (EDSM) also commonly use multiple processes. To distinguish the two, this translation says "process-based architecture", meaning one process per connection.)

    In the process-based (MP) architecture, a dedicated process serves each connection. A process performs all of a transaction's steps, from initialization through serving the connection, and only moves on to another connection when it is done.

    User sessions are completely independent, so no synchronization is needed between processes handling different connections. Because each process has its own address space, this architecture is very robust: if the process serving one connection crashes, the other sessions are unaffected. However, serving many concurrent connections requires an equal number of processes. Processes are kernel objects, and in fact the heaviest kind, so the kernel must hold at least as many processes as there are connections. On most systems, performance drops sharply once more than a few hundred processes exist, because of heavy context-switching overhead. In other words, the MP architecture has poor load scalability and cannot support high concurrency.

    On the other hand, the MP architecture has very good system scalability, because no resources are shared between processes and there is no synchronization overhead.

    The Apache Web Server uses the MP architecture.

    2.2 The Thread-Based Architecture: MT

    (Translator's note: "Multi-Threaded" literally means multiple threads, but the emphasis is on one thread serving one connection, so "thread-based" is more precise.)

    In the thread-based (MT) architecture, multiple independent threads share one address space. Like a process in the MP architecture, each thread serves one connection until it is finished, and only then serves another.

    Many modern UNIX operating systems implement a many-to-few model mapping user-level threads to kernel entities. In this model, an arbitrary number of user-level threads is multiplexed over a smaller number of kernel execution vehicles, which act as virtual processors. When a user-level thread makes a blocking system call, the kernel execution vehicle it uses blocks in the kernel. If no other unblocked kernel execution vehicle exists and other user-level threads are runnable, a new kernel execution vehicle is created automatically, preventing one blocked thread from blocking all the others.

    Because IAs are driven by network I/O, all concurrent sessions block on I/O at various points. The number of kernel execution vehicles therefore grows close to the number of user-level threads, i.e. the number of connections. The many-to-few model then degenerates to a one-to-one model, and, as in the MP architecture, the number of kernel execution vehicles is dictated by concurrency rather than by the number of CPUs, which reduces load scalability. Still, because kernel threads (lightweight processes) use fewer resources and are lighter than processes, the MT architecture scales somewhat better with load than MP.

    In the MT architecture, the kernel threads share one address space, and the various synchronization locks destroy system scalability. Even if the application itself carefully avoids locks around its own global data (a non-trivial task), standard library functions and system calls still lock shared resources. For example, the platform's thread-safe memory allocation routines (malloc, free, and so on) often use a single global lock. Another example is the per-process file descriptor table, which is shared by all kernel threads and must be protected during system calls such as open and close. On top of that, multiprocessor systems must keep CPU caches coherent, and performance suffers badly when threads on different CPUs modify data on the same cache line.

    To improve load scalability, a different type of MT architecture appeared: create one or more groups of threads per task rather than one thread per connection. For example, one small group of threads accepts client connections, another processes requests, and another serves responses. The main advantage is that it decouples the number of threads from the number of connections. However, the thread groups must share common work queues, which must be protected by locks (a classic producer-consumer problem). The added synchronization overhead makes the application perform poorly on multiprocessor systems. In other words, this architecture trades system scalability for load scalability (performance for concurrency).

    Of course, the usual nightmares of threaded programming, including data corruption, deadlocks, and race conditions, also make any form of the MT architecture far from simple to use.

    2.3 The Event-Driven State Machine Architecture: EDSM

    In the Event-Driven State Machine (EDSM) architecture, a single process handles many concurrent connections. Comer and Stevens [Reference 2] describe the basics of this architecture. In EDSM, each connection advances one data-driven step at a time (translator's note: e.g., receive one packet, act once), so one process multiplexes many concurrent connections. The process is designed as a state machine: each time it receives an event, it processes the event and moves to the next state.

    When idle, the EDSM calls select/poll/epoll to wait for network events. When a particular connection becomes readable or writable, the EDSM invokes the corresponding handler, then moves on to the next connection. The EDSM architecture uses non-blocking system calls to perform asynchronous network I/O; for details on non-blocking I/O see Stevens [Reference 3].

    To exploit hardware parallelism, multiple identical processes can be created; this is Symmetric Multi-Process EDSM, used for example by the Zeus Web server [Reference 4] (translator's note: a commercial high-performance server). To multiplex disk I/O more efficiently, special helper processes can be created; this is Asymmetric Multi-Process EDSM, proposed for Web servers by Druschel and others [Reference 5].

    EDSM is probably the best architecture for IAs. Because the number of concurrent connections is completely decoupled from the number of kernel processes, it has very good load scalability, and it needs only minimal user-level resources to manage each connection.

    Like the MP architecture, multi-process EDSM also has very good system scalability, because the processes share no resources and there is no lock overhead.

    Unfortunately, the EDSM architecture is monolithic rather than built on threads (translator's note: the state machine stores what is really a thread's stack and last position, resuming from that state next time, just as a thread would), so a new EDSM application generally has to implement its state machine from scratch. In effect, EDSM simulates threads in a very laborious way.

    3. State Threads Library

    The StateThreads library combines the advantages of all the architectures above. Its API preserves a thread-like programming style, allowing each concurrent connection to run in its own "thread", with all of these threads inside a single process. The underlying implementation is close to the EDSM architecture: each concurrent session's state is kept in its own memory segment.

    (Translator's note: StateThreads provides exactly the EDSM mechanism, with the state machine replaced by its "threads" (coroutines/fibers). These "threads" are implemented inside one process and one OS thread but behave like multiple threads. StateThreads thus keeps EDSM's performance and concurrency while offering MT's programmability and simple interface, doing away with hand-written EDSM state machines.)

    3.1 State Changes and Scheduling

    Each concurrent session contains its stack environment (stack pointer, program counter, CPU registers) and its stack. Conceptually, a thread context switch corresponds to the process changing its state. No kernel entities other than processes are involved (translator's note: a single OS thread simulates multiple threads). Unlike general-purpose threading libraries, the StateThreads library is fully deterministic: a thread context switch (process state change) can only happen in a well-known set of functions (at I/O points or explicit synchronization points). Consequently, process-level global data does not need lock protection, because execution is single-threaded. The entire program is free to use static variables and non-reentrant library functions, which greatly simplifies programming and debugging while improving performance. This is similar to the coroutine model (cooperatively multitasked threads), except that no explicit yield is required: sooner or later a thread calls a blocking I/O function and surrenders control. All threads (concurrent connections) have the same priority, so scheduling is non-preemptive, as in EDSM. Because IAs are data-driven (processing is governed by network buffer sizes and data arrival), scheduling is not time-sliced.

    Only two kinds of external events are handled by the library's scheduler, because only these can be detected by select/poll:

    1. I/O events: a file descriptor becomes readable or writable.

    2. Time events: a specified timeout expires.

    However, other kinds of events (such as signals sent to the process) can be handled by converting them into I/O events: for example, a signal handler can write to a pipe, turning the signal into an I/O event.

    To exploit hardware parallelism, as in EDSM, symmetric or asymmetric processes can be created. Process management is not part of the library; it is left to the application.

    Some general-purpose threading libraries implement a many-to-one model (many user-level threads on one kernel execution vehicle) using techniques similar to StateThreads (non-blocking I/O, an event-driven scheduler, and so on); see GNU Portable Threads [Reference 6]. Being general-purpose, they have different goals from StateThreads. StateThreads is not a general-purpose threading library but an application library designed for the few kinds of applications, IAs, that need the highest possible performance, concurrency, scalability, and readability.

    3.2 Scalability

    State threads are very lightweight user-level entities, so creating and maintaining user connections requires minimal resources, and an application using StateThreads performs well even at high concurrency.

    On multi-CPU systems, an application must create multiple processes to exploit hardware parallelism. Using separate processes is the only way to achieve the highest system scalability, because duplicating per-process resources is the only way to avoid the overhead of locking and synchronization. Creating separate UNIX processes naturally duplicates those resources. Again, as in EDSM, there is no connection between the number of concurrent connections and the number of kernel entities (processes and threads). In other words, the StateThreads library multiplexes a large number of connections onto a small number of separate processes, achieving both good system scalability and good load scalability.

    3.3 Performance

    High performance is one of the library's main objectives. It minimizes the number of system calls and makes thread creation and context switching as fast as possible. For example, there is no per-thread signal mask (unlike POSIX threads), so the process signal mask does not have to be saved and restored on each context switch, eliminating two system calls per switch. Signal events can be converted efficiently into I/O events (as described above).

    3.4 Portability

    The StateThreads library uses the same fundamental concepts as the EDSM architecture: non-blocking I/O, file descriptors, and I/O multiplexing. These concepts are available on most UNIX platforms, so the library is very portable across UNIX flavors, with only a few platform-dependent sections in the source.

    3.5 State Threads and NSPR

    The StateThreads library is a derivative of the Netscape Portable Runtime library (NSPR) [Reference 7]. NSPR's main goal is to provide a platform-independent layer over system facilities, including threads, thread synchronization, and I/O; performance and scalability are not its main concern. StateThreads addresses performance and scalability while being much smaller than NSPR: it consists of only 8 source files, yet provides everything needed to write efficient IAs on UNIX-like platforms:

                                          NSPR          State Threads
    Lines of code                         ~150,000      ~3,000
    Dynamic library size (debug version):
      IRIX                                ~700 KB       ~60 KB
      Linux                               ~900 KB       ~70 KB

    Conclusion

    StateThreads is an application library that provides a foundation for writing IAs. Its advantages:

    1. It enables fast and highly scalable IA designs, with good load scalability and system scalability.

    2. It greatly simplifies programming and debugging: as a rule no locks are needed, and static variables and non-reentrant library functions may be used freely.

    Its main limitation:

    1. All socket I/O must go through the library's I/O functions, because only those functions perform thread scheduling and keep the process from blocking (translator's note: if you call the OS socket I/O functions directly, the scheduler cannot intervene).


    Translator's note: I wrote a few programs with State Threads. A program that starts 10K or 30K threads:
    #include <stdio.h>
    #include <stdlib.h>

    /*
    build and execute:
        gcc -I../obj -g huge_threads.c ../obj/libst.a -o huge_threads
        ./huge_threads 10000
    10K report:
        10000 threads, running on 1 CPU 512M machine,
        CPU 6%, MEM 8.2% (~42M = 42991K = 4.3K/thread)
    30K report:
        30000 threads, running on 1 CPU 512M machine,
        CPU 3%, MEM 24.3% (4.3K/thread)
    */
    #include <st.h>

    void* do_calc(void* arg){
        int sleep_ms = (int)(long)arg * 10;

        for(;;){
            printf("in sthread #%dms\n", sleep_ms);
            st_usleep(sleep_ms * 1000);
        }

        return NULL;
    }

    int main(int argc, char** argv){
        if(argc <= 1){
            printf("Test the concurrence of state-threads!\n"
                "Usage: %s <sthread_count>\n"
                "eg. %s 10000\n", argv[0], argv[0]);
            return -1;
        }

        if(st_init() < 0){
            printf("st_init error!\n");
            return -1;
        }

        int i;
        int count = atoi(argv[1]);
        for(i = 1; i <= count; i++){
            if(st_thread_create(do_calc, (void*)(long)i, 0, 0) == NULL){
                printf("st_thread_create error!\n");
                return -1;
            }
        }

        st_thread_exit(NULL);

        return 0;
    }
