Oracle Blog

Rebooting is obsolete

Wednesday Mar 16, 2011

It's the end of the day on Friday. On your laptop, in an ssh session on a work machine, you check on long.sh, which has been running all day and has another 8 or 9 hours to go. You start to close your laptop.

You freeze for a second and groan.

This was supposed to be running under a screen session. You know that if you kill the ssh connection, that'll also kill long.sh. What are you going to do? Leave your laptop for the weekend? Kill the job, losing the last 8 hours of work?

You think about what long.sh does for a minute, and breathe a sigh of relief. The output is written to a file, so you don't care about terminal output. This means you can use disown.

How does this little shell built-in let your jobs finish even when you kill the parent process (the shell inside the ssh connection)?

PROCESS STATE CODES
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S")
will display to describe the state of a process.
D Uninterruptible sleep (usually IO)
R Running or runnable (on run queue)
S Interruptible sleep (waiting for an event to complete)
T Stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z Defunct ("zombie") process, terminated but not reaped by its parent.
For BSD formats and when the stat keyword is used, additional characters may be displayed:
< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom IO)
s is a session leader
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+ is in the foreground process group

And here is a transcript of the steps to disown long.sh. To the right of each step is some useful ps output, in particular the parent process ID (PPID), what process state our long job is in (STAT), and the controlling terminal (TT). I've highlighted the interesting changes:

When we run long.sh from the command line, its parent is the shell (PID 26145 in this example). Even though it looks like it is running as we watch it in the terminal, it mostly isn't; long.sh is waiting on some resource or event, so it is in process state S for interruptible sleep. It is in fact in the foreground, so it also gets a +.

First, we suspend the program with Ctrl-Z. By ``suspend'', we mean send it the SIGTSTP signal, which is like SIGSTOP except that you can install your own signal handler for it, or ignore it. We see proof in the state change: it's now in T for stopped.

Next, bg sets our process running again, but in the background, so we get the S for interruptible sleep, but no +.

Finally, we can use disown to remove the process from the jobs list that our shell maintains. Our process has to be active when it is removed from the list or it'll get reaped when we kill the parent shell, which is why we needed the bg step.
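The steps above can be recapped as a terminal session (a sketch; job numbers, PIDs, and exact ps output will vary on your machine):

```shell
$ ./long.sh                          # running in the foreground (STAT S+)
^Z                                   # Ctrl-Z sends SIGTSTP (STAT becomes T)
[1]+  Stopped                 ./long.sh
$ bg                                 # resume in the background (STAT S, no +)
[1]+ ./long.sh &
$ disown %1                          # drop job 1 from the shell's jobs table
$ ps -o pid,ppid,stat,tt -C long.sh  # inspect state and controlling terminal
```

After the disown, exiting the shell leaves long.sh running.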

When we exit the shell, we are sending it a SIGHUP, which it propagates to all children in the jobs table**. By default, a SIGHUP terminates a process. Because we removed our job from the jobs table, it doesn't get the SIGHUP and keeps on running (STAT S). However, since its parent, the shell, died, and the shell was the session leader in charge of the controlling tty, our job doesn't have a tty anymore (TT ?). It also needs a new parent, so init, with PID 1, becomes the new parent process.

**This is not always true, as it turns out. In the bash shell, for example, there is a huponexit shell option. If this option is disabled, a SIGHUP to the shell isn't propagated to the children. This means if you have a backgrounded, active process (you followed steps 1, 2, and 3 above, or you started the process backgrounded with ``&'') and you exit the shell, you don't have to use disown for the process to keep running. You can check or toggle the huponexit shell option with the shopt shell built-in.
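Checking and toggling huponexit in bash looks like this (off is the usual default):

```shell
$ shopt huponexit        # show the current setting
huponexit      	off
$ shopt -s huponexit     # set: exiting the shell SIGHUPs running jobs
$ shopt -u huponexit     # unset: backgrounded jobs survive shell exit
```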

And that is disown in a nutshell.

What else can we learn about process states?

Dissecting disown presents enough interesting tangents about signals, process states, and job control for a small novel. Focusing on process states for this post, here are a few such tangents:

1. There are a lot of process states and modifiers. We saw some interruptible sleeps and suspended processes with disown, but what states are most common?

Using my laptop as a data source and taking advantage of ps format specifiers, we can get counts for the different process states:
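One way to get those counts:

```shell
# Tally processes by primary state letter. "h" suppresses the
# header; cut keeps just the first character of STAT (the state
# itself, dropping the <, N, s, l, and + modifiers).
ps -e h -o stat | cut -c1 | sort | uniq -c | sort -rn
```

On an idle machine, expect S to dominate, with a handful of R and T entries and (ideally) no Z.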

The numbers range from 19 (super friendly, low scheduling priority) to -20 (a total bully, high scheduling priority). The 6 processes with negative numbers are the 6 with a < process state modifier in the ``ps -e h -o stat'' output, and the 3 with positive numbers have the Ns. Most processes don't run under a special scheduling priority.
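The nice values behind those modifiers can be tallied the same way:

```shell
# Count processes at each nice value: 0 is the default; negative
# values (high priority) show up as < in STAT, positive values
# (low priority) as N.
ps -e h -o ni | sort -n | uniq -c
```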

Why is almost nothing actually running?

In the ``ps -e h -o stat'' output above, only 1 process was marked R, running or runnable. This is a multi-processor machine, and there are over 150 other processes, so why isn't something running on the other processors?

The answer is that on an unloaded system, most processes really are waiting on an event or resource, so they can't run. On the laptop where I ran these tests, uptime tells us that we have a load average under 1:

The machine has 4 processors, but on average fewer than one process is runnable at any moment, so 3 or 4 of the processors have nothing running (nothing in the R state). To get a sense of how the running processes change over time, throw the ps line under watch:

watch -n 1 "ps -e -o stat,cmd | awk '{if (\$1 ~/R/) print}'"

We get something like:

2. What about the zombies?

Noticeably absent in the process state summaries above are zombie processes (STAT Z) and processes in uninterruptible sleep (STAT D).

A process becomes a zombie when it has completed execution but hasn't been reaped by its parent. If a program produces long-lived zombies, this is usually a bug; zombies are undesirable because they take up process IDs, which are a limited resource.

I had to dig around a bit to find real examples of zombies. The winners were old barnowl zephyr clients (zephyr is a popular instant messaging system at MIT):

When you run this script, the parent dies after 60 seconds, init becomes the zombie child's new parent, and init quickly reaps the child by making a wait system call on the child's PID, which removes it from the system process table.
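A minimal shell script with that behavior -- a sketch, not the original script -- looks like:

```shell
#!/bin/sh
# "sleep 1" exits after a second, but exec has replaced this shell
# with "sleep 60", which never calls wait() -- so for the next 59
# seconds the dead child shows up in ps with STAT Z. When sleep 60
# exits, init adopts the zombie and promptly reaps it.
sleep 1 &
exec sleep 60
```

While it runs, ps shows the child in state Z with ``<defunct>'' appended to its command.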

3. What about the uninterruptible sleeps?

A process is put in an uninterruptible sleep (STAT D) when it needs to wait on something (typically I/O) and shouldn't be handling signals while waiting. This means you can't kill it, because all kill does is send it signals. This might happen in the real world if you unplug your NFS server while other machines have open network connections to it.

We can create our own uninterruptible processes of limited duration by taking advantage of the vfork system call†. vfork is like fork, except the address space is not copied from the parent into the child, in anticipation of an exec which would just throw out the copied data. Conveniently for us, when you vfork the parent waits uninterruptibly (by way of wait_on_completion) on the child's exec or exit:

We see the child (PID 1973, PPID 1972) in an interruptible sleep and the parent (PID 1972, PPID 13291 -- the shell) in an uninterruptible sleep while it waits for 60 seconds on the child.

One neat (mischievous?) thing about this script is that processes in an uninterruptible sleep contribute to the load average for a machine. So you could run this script 100 times to temporarily give a machine a load average elevated by 100, as reported by uptime.

Wednesday May 19, 2010

Wireless traffic is both interesting and delightfully accessible thanks to the broadcast nature of 802.11. I have spent many a lazy weekend afternoon watching my laptop, the TiVo, and the router chatting away in a Wireshark window.

As fun as the wireless traffic in one's house may be, there's something to be said for being able to observe a much larger ecosystem - one with more people with a more diverse set of operating systems, browsers, and intentions as they work on their wireless-enabled devices, giving rise to more interesting background and active traffic patterns and a greater set of protocols in play.

Now, it happens to be the case that sniffing other people's wireless traffic breaks a number of federal and local laws, including wiretapping laws, and while I am interested in observing other people's wireless traffic, I'm not interested in breaking the law. Fortunately, Ksplice is down the road from a wonderful school that fosters this kind of intellectual curiosity.

Some interesting results from the data set collected are summarized below. Traffic was gathered with tcpdump on my laptop as I sat in the middle of the classroom. The data was imported into Wireshark, spit back out as a CSV, and imported into a sqlite database for aggregation queries, read back into tcpdump and filtered there, or hacked up with judicious use of grep, as different needs arose.

Basic statistics

Time spent capturing: 45 minutes
Packets captured: 853436
Number of traffic sources in the room: 21
Number of distinct source and destination IPv4 and IPv6 addresses: 5117
Number of "active" traffic addresses (e.g. using HTTP or SSH, not background traffic): 581
Number of protocols represented: 48 (note that Wireshark buckets based on the top layer for a packet, so for example TCP is in this count because someone was sending actual TCP traffic without an application layer on top, and not because TCP is the transport protocol for HTTP, which is also in this count)

These protocols and how much traffic was sent over them are in the table on the left.

2.15 IPv4 packets were sent for every IPv6 packet. IPv6 was only used for background traffic, serving as the internet layer for the following protocols: DHCPv6, DNS, ICMPv6, LLMNR, MDNS, SSDP, TCP, and UDP. The TCP over IPv6 packets were all icslap, ldap, or wsdapi communications between our Windows user discussed below and his or her remote desktop. The UDP over IPv6 packets were all ws-discovery communications, part of a local multicast discovery protocol most likely being used by the Windows machines in the room.

1936.12 ICMPv6 packets were sent for every ICMP packet. The reason is that ICMPv6 does IPv6 work that is taken care of by separate protocols, like ARP and IGMP, in the IPv4 world. Looking at the ICMP and ICMPv6 packets by type:

ICMP Type/Code                                          Pkts
Dest unreachable (Host administratively prohibited)        1
Dest unreachable (Port unreachable)                       35
Echo (ping) request                                        9
Time-to-live exceeded in transit                          15

ICMPv6 Type/Code                                        Pkts
Echo request                                               8
Multicast Listener Report msg v2                        7236
Multicast listener done                                   86
Multicast listener report                                548
Neighbor advertisement                                   806
Neighbor solicitation                                 105710
Redirect                                                 353
Router advertisement                                     461
Router solicitation                                      956
Time exceeded (In-transit)                                 1
Unknown (0x40) (Unknown (0x87))                            1
Unreachable (Port unreachable)                             1

These ICMPv6 packets are mostly doing Neighbor Discovery Protocol (NDP) and Multicast Listener Discovery (MLD) work. NDP handles router and neighbor solicitation, filling a role similar to ARP and parts of ICMP under IPv4, and MLD handles multicast listener discovery, similar to IGMP under IPv4.

TCP v. UDP

Number of TCP packets: 383122
Number of UDP packets: 350067

1.09 TCP packets were sent for every UDP packet. I would have thought TCP would be a clear winner, but given that MDNS traffic, which is over UDP, makes up over 30% of the packets captured, I guess this isn't surprising. The 14% of packets unaccounted for at the transport layer are mostly ARP and ICMP traffic. See also this post.

Instant Messaging

Awesomely, AIM, Jabber, MSN Messenger, and Yahoo! Messenger were all represented in the traffic:

Protocol   Participants   # Packets
AIM                  22         580
Jabber                6        2507
YMSG                  4          35
MSNMS                 3         192

AIM is the clear favorite (at least with this small sample size). Note that Jabber has about 1/4th the AIM participants but 4x the number of packets. Either the Jabberers are extra chatty, or the fact that Jabber is an XML-based protocol inflates the size of a conversation dramatically on the wire. Note that some IM traffic (like Google Chat) might have instead been bucketed as HTTP/XML by Wireshark.

That Windows Remote Desktop Person

119489 packets, or 14% of the traffic, were between a computer in the classroom and what is with high probability a Windows machine on campus running the Microsoft Remote Desktop Protocol (see also this support page for a discussion of the protocol specifics).

Most of the traffic is T.125 payload packets. TPKTs encapsulate transport protocol data units (TPDUs) in the ISO Transport Service on top of TCP; the TPKT traffic was all "Continuation" traffic. X.224 transmits status and error codes. TCP "ms-wbt-server" traffic to port 3389 on the remote machine seals the deal on this being an RDP setup.

Security

All SSH and SSHv2 traffic was to Linerva, a public-access Linux server run by SIPB for the MIT community, except for one person talking to a personal server on campus.

Protocol   # Packets   % Packets
TLSv1          14390        81.8
SSLv3           3094        17.6
SSL              105        0.59
SSLv2              8       0.045

5 clients were involved in negotiations with SSLv2, which is insecure and disabled by default in most browsers; none of these negotiations ever got past a "Client Hello".

HTTP

I wanted to be able to answer questions like "what were the top 20 most visited websites" in this traffic capture. The proliferation of content distribution networks makes it harder to track all traffic associated with a popular website by IP addresses or hostnames. I ended up doing a crude but quick grep "Host:" pcap.plaintext | sort | uniq -c | sort -n -r on the expanded plaintext version of the data exported from Wireshark, which gives the most visited hosts based on the number of GET requests. The top 20 most visited hosts by that method were:

Rank   GETs   Host                        Rank   GETs   Host
1      336    www.blogcdn.com             11     88     newsimg.bbc.co.uk
2      229    www.blogsmithmedia.com      12     87     www.google.com
3      211    assets.speedtv.com          13     66     sih.onemadogre.com
4      167    profile.ak.fbcdn.net        14     66     ensidia.com
5      149    images.hardwareinfo.net     15     64     student.mit.edu
6      114    static.ensidia.com          16     61     s0.2mdn.net
7      113    www.facebook.com            17     57     alum.mit.edu
8      111    www.blogsmithcdn.com        18     56     www.wired.com
9      93     ad.doubleclick.net          19     56     cdn.eyewonder.com
10     90     static.mmo-champion.com     20     55     www.google-analytics.com

Alas, pretty boring. The blogcdn, blogsmith and eyewonder hosts are all for Engadget, and fbcdn is part of Facebook. I'll admit that I'd been a little hopeful that some enterprising student would try to screw up my data by scripting visits to a bunch of porn sites or something. CDNs dominate the top 20, and in fact almost all of the roughly 410 web server IP addresses gathered are for CDNs. Akamai led with 39 distinct IPs, followed by Amazon AWS with 23, and Facebook and Panther CDN with 16, with many more at lower numbers.

Wrap-up

Using the Internet means a lot more than HTTP traffic! 45 minutes of traffic capture gave us 48 protocols to explore. Most of the captured traffic was background traffic, and in particular discovery traffic local to the subnet.

Sniffing wireless traffic (legally) is a great way to learn about networking protocols and the way the Internet works in practice. It can also be incredibly useful for debugging networking problems.

Tuesday Mar 09, 2010

Startup companies are always hunting for ways to accomplish as much as possible with what they have available. Last December we realized that we had a growing queue of important engineering projects outside of our core technology that our team didn't have the time to finish anytime soon. To make matters worse, we wanted the projects completed right away, in time for our planned product launch in early February.

So what did we do? The logical solution, of course. We quadrupled the size of our company's engineering team for one month using paid student interns.

The Ksplice interns, ca. January 2010.

Now, if you happen to know Fred Brooks, please don't tell him what we did. He managed the Windows Vista of the 1960s---IBM's OS/360, a software project of unprecedented size, complexity, and lateness---and wrote up the resulting hard-earned lessons in The Mythical Man-Month, which everyone managing software projects should read. The big one is Brooks's Law: "adding manpower to a late software project makes it later". Oops.

Brooks's observation usually holds. New people take time to get up to speed on any system---both their own time, and the time of your experienced people mentoring them. They also add communication costs that grow quadratically with the size of the team.

Fortunately, Ksplice benefits from a bit of an engineering reality distortion field (our very product is supposed to be technologically impossible) and, with the right techniques, quadrupling our engineering team's size for one month worked out great. Every intern was responsible for one of the company's top unaddressed priorities for the month, and every intern successfully completed their project.

So, how do you quadruple the size of your engineering team in one month and still keep everyone productive?

At work in the Ksplice engineering office.

Tolerate a little crowding. It took a little creativity to suddenly find a dozen new workspaces in our two-room office. Fortunately, we've found that a room can always fit one more person---and by induction, you can fit as many as you need. (All those years we spent proving math theorems came in handy after all.) Seating everyone close to each other has an important advantage, too: when lots of people on your team have just started, it's handy for them to work right next to the mentors who are answering their questions and helping them ramp up on the learning curve of the organization. With the right team, the crowding can also create an energetic office environment that makes people love to come in to work. (Sometimes it gets in the way of concentration, though---that's when I put on a good pair of headphones.)

Locate next to a deep pool of hackers. OK, so we're a bit spoiled by being headquartered a few blocks away from the Massachusetts Institute of Technology. At MIT, January is set aside for students to pursue projects outside of the curriculum---perfect for hiring an intern army. Many other institutions have either a similar "January term", or a program for students to spend time working in industry during the term.

Know who the best people are and only hire them. Ksplice was born four years ago at SIPB, MIT's student computing group. When a group of students run computing services thousands of people rely on, and spend hours each week discussing, dreaming, collaborating, and learning from each other about computer systems---some of them get really good at it. Even better, everyone sees everyone else in action and knows exactly what it's like to work with them. Investing some time into getting involved with technical communities makes it possible to hire people based on personal experience with them and their work, which is so much better than hiring based on resumes and interviews. Companies like Google and Red Hat have known for years that being involved in the open source community can provide an excellent source of vetted job candidates.

Pay well. In some industries, "intern" means "unpaid"---but computer science students have plenty of options, and you want to be able to hire the best people. We looked at pay rates for jobs on campus, and pegged our rate to the high end of those.

Divide tasks to be as loosely-coupled as possible. Our internship program would never have worked if we had assigned a dozen new people to hack on our kernel code---the training time and communication costs that drive Brooks's Law would have swallowed their efforts whole. Fortunately, like any growing business, we had a constellation of tasks that lay around the edges of our core technology: infrastructure upgrades, additional layers of QA, business analytics, and new features in the management side of our product. These had manageable technical interfaces to our existing software, so our interns were able to become productive with minimal ramp-up and rely on relatively little communication to get their projects done.

Design your intern projects in advance. A key challenge when scaling up your engineering team quickly is making sure that the interfaces are all well designed and the new projects will meet the company's needs. So we spent a good deal of time getting these designs together before the interns started. We also allocated plenty of our core engineers' time for code reviews and other feedback with our interns in order to make sure their work would be maintainable after they left.

Have you achieved more in one month than anyone thought should be possible? How did you do it?

[Edited to make clear our interns were paid and to say more about how we designed their projects for high-quality output.]

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.