Last Tuesday you asked Ingo Molnar, Red Hat kernel hacker, about the means by which his TUX Web server recently achieved such fantastic results in SpecWeb99 . He was kind enough to respond with at-length answers addressing licensing, the reality of threads under Linux, the realism of benchmarks, and more. Thanks, Ingo!
1) TUX Architecture
"You appear to have take an "architectural" approach to designing TUX, so I have some architectural questions.
1.The choice of a kernel space implementation is probably going to be a controversial one. You suggest that HTTP is commonly used enough to go in the kernel just as TCP/IP did years ago. What performance or architectural advantages do you see to moving application protocols into the kernel that cannot be achieved in user space?
Ingo Molnar The biggest advantage i see is to have encapsulation, security and performance available to dynamic web applications *at the same time*.
There are various popular ways to create dynamic web content. Encapsulation, security is provided by CGI, various scripting and virtual machine models - unpriviledged/prototype/buggy CGIs are sufficiently isolated both from HTTP protocol details, from the webserving context, from the web-client and from each other.
But all CGI/scripting/virtual-machine models (including fast-CGI) lack the possibility of performing with 'maximum performance' if the webserver is in another process context. 'maximum performance' means that there should be only one context running (no context switching done), and the application writer should have the freedom to use C code, or even assembly code. (not that using assembly in web-applications would be too common.)
ISAPI, NSAPI dynamic webserver applications can have maximum performance because they run in the webserver's context, but they do not provide enough security - ie. a buggy ISAPI DLL which is loaded into IIS's address space can take down IIS and the whole site, including unrelated virtual sites. Not a very safe programming model. Additionally ISAPI/NSAPI modules often have to care about HTTP protocol details, and can cause HTTP-nonconform replies being generated. Additionally, debugging ISAPI/NSAPI modules is hard and a pain, and nobody does it on live servers. [i intentionally omitted Apache modules, because Apache modules are primarily used to extend the capabilities of Apache, they are not ment to provide a platform for dynamic applications.]
The TUX dynamic API is implemented through a kernel-based subsystem, which is accessible via a system-call which is available to 'unpriviledged' user-space code. There is no context switching (ie. we can have maximum performance) and still no user-space code can take the webserver down, because kernel-space is isolated from user-space. Obviously the TUX subsystem checks user-space parameters with extreme prejudice. It's a goal of TUX to also enforce RFC-conform replies being sent to the web-client - just like CGIs. Plus TUX modules can be in separate process contexts as well. (in this case there is going to be casual context-switching between these process contexts.)
TUX also has kernel-space modules - for the truly performance-hungry and kernel-savvy web coder. Some TUX features (such as CGI execution) are implemented as TUX kernel-space modules. Here are the various layers of modularity in TUX:
- 'accelerated requests', automated by TUX. (static GETs, etc.)
- 'kernel-space TUX modules'
- 'user-space TUX modules'
- 'fast socket-redirection to other webservers'
- 'external CGIs'
Ingo: This is how it works, simplified:
there is a 'HTTP request' user-space structure that is manipulated by the TUX kernel-space subsystem. Whenever a requested operation (eg. 'send data to client', or 'read web-object from disk') is about to block then TUX 'schedules' another request context and returns it to user-space. User-space can put event codes (or other private state) into the request structure, so that it can track the state of the request. This programming model is harder to code than other, more 'synchronous' solutions, but avoids the context-switch problem. Whenever there is no more work to be done by user-space, TUX suspends the process until there are new connections or any other progress.
3. Are there any plans to generalize the infrastructure elements of TUX so that other protocol servers can take advantage of the TUX architecture and also go fast as hell?"
Ingo: Yep i think so - i'm thinking about writing a FTP TUX component ...2) Relative Impact
TUX includes a variety of kernel and apache changes. Can you give a rough measure of how each of the changes improved the http performance? I'm interested in the amount of improvement as well as why it improved performance. Do those particular changes have negative impact on the performance of other applications?
Ingo: i have no exact numbers because TUX advances in an 'evolutionary' way and closely followed/follows the 2.3/2.4 kernel line - the impact of particular changes is hard to judge exactly. Here are a couple of new features of the 2.4 kernel that are used by TUX and make a visible performance impact (without trying to list all the improvements):
- the single biggest change that enabled TUX was the inclusion of the new Linux TCP/IP architecture into 2.3 early this year. This is the 'softnet' architecture which has been written by David Miller and Alexey Kuznetsov, and is a very SMP-centric rewrite of the Linux TCP/IP stack. In Windows speak: 'the Linux 2.4 TCP/IP stack is completely deserialized'.
- another important change for TUX were the many VFS cleanups and scalability improvements done by Alexander Viro and Linus.
- a patch from Manfred Spraul went into the kernel recently, the per-CPU SLAB cache (the concept is this: when freeing buffers the kernel keeps them in variable-size per-CPU pools, and if the same type of buffer gets allocated shortly afterwards then the kernel picks from the pool that belongs to that CPU. This leads to dramatically better cache-locality.) Per-CPU SLAB pools are used by TUX, obviously.
- another feature which made a noticeable difference is the I/O scheduler rewrite by Andrea Arcangeli and Jens Axboe.
- large file support (up to 2TB files) from Matti Aarnio and Ben LaHaise is crucial as well - the logfile of the 4200 connections SPECweb99 result was bigger than 5 GB.
other TUX-related "generic kernel changes" are included in the mainstream 2.4.0-test3 kernel already: IRQ affinity, process pinning, polling-idle, better IRQ scalability, highmem support.
there are other important scalability improvements all around the place in 2.4 Linux, i think it's fair to say that almost every single line in the 'main kernel' got replaced by something else.
TUX includes no Apache changes right now - the 'redirection' feature can be used to feed connections to an (unchanged) Apache setup. But enabling a future mod_tux.c was definitely a goal, to get deeper Apache integration.
I have a few questions about TUX's caching system. Before I go any further, I want to say that I'm incredibly impressed by the results. I've been following specWeb99 for a while and have been wondering when someone would manage to build a great dynamic cache like this one. I hope it'll get the wide acceptance it seems to deserve.
First, it seems that basically the entire test file set was loaded into memory ahead of time for use by TUX. [...]
Ingo: no, this was not the case actually. In the 1-CPU Dell system the fileset size was 4182 MB, RAM size was 2GB. In the 4-CPU Dell system the total fileset size was 13561 MB, RAM size was 8GB. So the IO and cache capabilities of TUX and the hardware were tested heavily. Eg. the best Windows 2000 + IIS result had a fileset size of 5241 MB, with 8GB RAM. (ie. fully cached, only logging IO.)
[...] How adaptable is TUX to more dynamic, limited-memory environments in terms of setting cache size limitations, selectivity (e.g. "cache all .GIFs, but not .html files"), and expiration/reloading algorithms?
Ingo: i do not want to raise expectations unnecesserily, but it's an integral part of TUX to perform async IO and 'cachemiss' operations effectively. Cached objects are timed out by a LRU mechanizm, so objects accessed less frequently will get deallocated if memory pressure rises.
Second, can a tux module programmer modify the basic tux commands, or do they always do the same thing? For instance, if I were adapting TUX to work with a web proxy cache, I'd want TUX_ACTION_GET_OBJECT to actually go out over the network and do a GET request if it couldn't find a requested object in the cache. You can imagine lots of other circumstances where this would come up as well. [...]
Ingo: TUX is a webserver, and has no proxy capabilities yet. It would be a reasonable and natural extension of the GET_OBJECT mechanizm to fetch objects from a cache hierarchy or from origin servers, yes.
Third, is it possible to execute more than one user-space TUX module at one time? [...]
Ingo: yes. TUX modules right now are compiled as shared libraries and thus can be loaded/unloaded runtime. An unlimited number of user-space modules can be used without performance impact. Fourth, when can we play with the code? Thanks a lot!
Ingo: SPEC rules which regulate the acceptance of not-yet-released products require us to release TUX sometime in August - the whole TUX codebase is going to be released under the GPL.
4)Integration into RedHat?
How will the TUX Webserver integrate with RedHat's Linux distributions? Will RedHat create a special distribution with an identical setup to yours?
Ingo: there are going to be RPMs which can be installed on Red Hat 6.2, plus source-code which can be used with any distribution. Since TUX is a kernel-subsystem mainly, TUX is fundamentally distribution-neutral.
Will RedHat start releasing more specialized distributions, preferably ones more suited to a secure server environment but focused on performance like your setup was?
Ingo: well i cannot comment on specific products (being a kernel guy), but we always try to maximize the security and performance of all our products. TUX itself can be safer than user-space webservers, simply due to the fact that the kernel is a much more controlled, predictable and dedicated programming environment. Nevertheless we try to minimize the amount of code put into the kernel.
5) Performance using dynamic content
How would TUX perform using CGI/Servlets/PHP/etc. compared to Apache or IIS? The ability to serve static pages fast is not that useful in the real world, as all the sites that get really big hits-per-second are those with dynamic content (Yahoo, Slashdot, Amazon.com, etc.)"
Ingo: SPECweb99 is based on real-life logged web-traffic and as a result of that it includes 30% dynamic content -- which dynamic content uses the same fileset as static replies. So the dynamic workload of SPECweb99 is neither trivial, nor isolated.
as explained earlier, TUX is designed to generate very fast dynamic content, the possibility to accelerate static content is a natural step enabled by the kernel-space HTTP protocol stack. After all the kernel does eg. TCP handshake fully in kernel-space as well, and the user gets the simpler 'connect()' functionality. Nevertheless static content is still very important, if take a look at the workload of really big sites like Yahoo, Slashdot or CNN then you'll notice that lots of content (especially content that attracts many hits) is still static. But TUX tries to provide an environment for 'next generation' web-content (ie. web applications), where dynamic output is just as natural as static.
6) Threading and Linux
Unix programmers seems to dislike using threads in their applications. After all, they can just fork(); and run along instead of using the thread functions. But, that's not important right now.
What is your opinion on the current thread implementation in the Linux kernel compared to systems designed from the ground up to support threads (like BeOS, OS/2 and Windows NT)? In which way could the kernel developers make the threads work better?"
Ingo: thats a misconception. The Linux kernel is *fundamentally* 'threaded'. Within the Linux kernel there are only threads. Full stop. Threads either share or do not share various system resources like VM (ie. page tables) or files. If a thread has 'all-private' resources then it behaves like a process. If a thread has shared resources (eg. shares files and page tables) then it's a 'thread'. Some OSs have a rigid distinction between threads and processes - Linux is more flexible, eg. you can have two threads that share all files but have private page-tables. Or you can have threads that have the same page-tables but do not share files. Within the kernel i couldnt even make a distinction between 'processes' and 'threads', because everything is a thread to the kernel.
This means that in Linux every system-call is 'thread-safe', from grounds up. You program 'threads' the same way as 'processes'. There are some popular shared-VM thread APIs, and Linux implements the pthreads API - which btw. is a user-space wrapper exposing already existing kernel-provided APIs. Just to show that the Linux kernel has only one notion for 'context of execution': under Linux the context-switch time between two 'threads' and two 'processes' is all the same: around 2 microseconds on a 500MHz PIII.
programming 'with threads' (ie.: with Linux threads that share page tables) is fundamentally more error-prone that coding isolated threads (ie. processes). This is why you see all those lazy Linux programmers using processes (ie. isolated threads) - if there is no need to share too much state, why go the error-prone path? Under Linux processes scale just as fine on SMP as threads.
the only area where 'all-shared-VM threads' are needed is where there is massive and complex interaction between threads. 98% of the programming tasks are not such. Additionally, on SMP systems threads are *fundamentally slower*, because there has to be (inevitable, hardware-mandated) synchronization between CPUs if shared VM is used.
this whole threading issue i believe comes from the fact that it's so hard and slow to program isolated threads (processes) under NT (NT processes are painfully slow to be created for example) - so all programming tasks which are performance-sensitive are forced to use all-shared-VM threads. Then this technological disadvantage of NT is spinned into a magical 'using threads is better' mantra. IMHO it's a fundamentally bad (and rude) thing to force some stupid all-shared-VM concept on all multi-context programming tasks.
for example, the submitted SPECweb99 TUX results were done in a setup where every CPU was running an isolated thread. Windows 2000 will never be able to do stuff like this without redesigning their whole OS, because processes are just so much fscked up there, and all the APIs (and programming tools) have this stupid bias towards all-shared-VM threads.
7) Kernel modules decrease portability?
You mentioned in the second Linux Today article that you intend to integrate TUX with Apache. However, Apache has always been a cross-platform server and is heavily used on *BSD and Solaris. Do you feel that this integration will undermine the portability work of the Apache team, or will it simply provide an incentive for web servers to be running Linux? [...]
Ingo: TUX is a kernel subsystem with a small amount of user-space glue code to make it easier to use the TUX system-call. I believe that integrating kernel-based HTTP protocol stacks into Apache makes sense - i dont think this will 'undermine' anything, to the contrary, it will enable similar solutions on other OSs as well.
" If you intend to encourage people to move to Linux, can a similar idea as TUX be applied to an SQL server to make up for the speed deficit between Linux SQL servers and Microsoft SQL?
Ingo: I dont care about Microsoft SQL Server being too slow. ;-)
Database servers are not networking protocols - so the concepts of TUX and SQL servers cannot be compared. While there are a couple of simple RFCs describing the HTTP protocol, SQL is a few orders more complex, has programming extensions and all sorts of proprietary flavors.
8) Load balancing
by Ex Machina
Does/Will TUX provide any sort of load balancing for a cluster of heterogenous TUX servers?
Ingo: not the initial release, but yes, it's on the feature list :-)
9) It works with Apache...
by Ex Machina
It works with Apache but is TUX generic enough to be interfaced with another server?
Ingo: yes, i think so.
10) Version 1
By An Unnamed Correspondent
Ingo: actually if you watch the development of Apache, Apache got faster with every major release. No, i dont think that additional features will slow TUX down, once the internal architecture is extensible enough there is no problem if additional features are *added*. It brings an obvious performance penalty if you *use* a given feature (eg. SSI). A webserver is badly designed if it gets slower doing the very same task.
Will there be a way to port an existing Apache configuration across to the TuX configuration? How about IIS, Netscape, Zeus, etc? Will TuX have the option of a GUI setup screen for those who don't like the command line? Will TuX have a simple installer?
Ingo: i dont know. The initial release will have no 'fancy' tools, this will implicitly 'encourage' the early adopters to be technically savvy. (and thus helping TUX development indirectly.) The initial TUX release is expected to be raw and uncut.
11) Re:Versions for other OSs?
by Jason Earl
Actually there is a specific feature that would probably make TUX incompatible with the BSDs. TUX is licensed under the GPL and the BSD maintainers would probably be very reluctant to port it to their OSes. Especially since it is possible that this would require them to release the derivative work under the GPL.
Which leads to the obvious question for Ingo. You mention a specific disclaimer that would allow the Apache to be linked with TUX, do the BSDs get the same privilege?
Ingo: TUX is not 'linked' to Apache. Apache can use the TUX system-call, and applications are 'isolated' from the GPL license of the kernel. Can BSD-licensed software use the TUX system-call? Of course!
Not that I particularly care, as I am not a BSD user, but the putting such a nifty program as TUX under the GPL is bound to cause weeping and gnashing of teeth in the BSD camp. Which brings up another question. How much pressure do you get from your BSD compatriots to release software like this under a more liberal BSD-friendly license?
Ingo: TUX has to be under the GPL because it's a Linux kernel subsystem, and because Red Hat releases all source code as open-source. That having said, i wouldnt write Linux kernel code if i didnt agree with the ethics of the GPL. Putting it simply, the GPL guarantees that all derivatives of source code that *we* have written stay under the GPL as well.
Isnt this a reasonable thing? Nobody is forced to use *our* source code as a base for their project, but if you freely chose to use our source-code, then isnt it very reasonable to ask for the same kind of openness that enabled you using this source code in the first place?
To put in another way, you get our source code only if you agree to be just as open to us as open we are to you. You always have the option to not use our source code. You are free to weigh the benefits of using our source code, against the 'disadvantage' of having to show us your redistributed improvements.
but again, the GPL is only about *our* *source* code. You are completely free to put your source code under whatever license you wish to. I respect BSD kernel hackers and closed-source kernel hackers just as much as Linux kernel hackers, based on the code they produce. It's their private matter to decide under what license they put their code. It's their godsent right to do whatever they wish to do with the source code they wrote.