Me Too!

One way of telling how long a Microsoft employee has been working here is their reaction to the phrase “Bedlam DL3”. Just for grins, I was at lunch in the cafeteria with a bunch of co-workers and I blurted out, totally out of context: “Bedlam DL3”. About 3 of the old-timers in the group responded, in chorus “Me Too!”

So why does everyone know about this rather mysterious phrase?

Well, Microsoft’s a pretty big organization. We’ve got well over 100,000 mailboxes in our email infrastructure, and at times it can become rather cumbersome to manage all these. One of the developers in our Internal Technologies Group (also known as ITG, basically the MIS department at Microsoft) was working on a new tool to manage communications with the various employees at Microsoft, and as a part of this tool, he created several distribution lists. Each distribution list had about a quarter of the mailboxes in the company on it (so there were about 13,000 mailboxes on each list). For whatever reason, the distribution lists were named “Bedlam DL<n>” (maybe the tool was named Bedlam? I’m not totally sure).

Well the name of the lists certainly proved prophetic.

It all started one morning when someone looked at the list of DL’s they were on, and discovered that they were on this mysterious distribution list called “Bedlam DL3”. So they did what every person should do in that circumstance (not!).

You know what? They were right – the company’s email system did NOT deal with this gracefully.

Why? Well, you’ve got to know a bit more about how Exchange works internally.

First off, the original mail went to 13,000 users. Assuming that 1,000 of those 13,000 users replied, that means that there were 1,000 replies, each sent to those 13,000 users. And it turns out that a number of these people had their email client set to request read receipts and delivery receipts. Each read and delivery receipt causes ANOTHER email to be sent from the recipient back to the sender, once for each of the 13,000 recipients. Assuming that 20% of the 1,000 users replying had read receipts or delivery receipts set, that meant that every one of the messages that they sent caused another message to be sent for every one of the 13,000 recipients. So how many messages were sent?

First there were the basic messages – that’s 13,000,000 messages. Next there were the receipts – 200 users, 13,000 receipts – that’s an additional 2,600,000 messages. So about 15.6 MILLION messages were sent through the system. In about an hour.
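The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures assumed in the post (13,000 mailboxes, 1,000 repliers, 20% requesting receipts); the function and constant names are made up for illustration.

```python
# Back-of-the-envelope count, using only the figures assumed in the post.
LIST_SIZE = 13_000        # mailboxes on Bedlam DL3
REPLIERS = 1_000          # assumed number of people who replied to the list
RECEIPT_FRACTION = 0.20   # assumed share of repliers requesting receipts


def estimate_message_count(list_size: int, repliers: int, receipt_fraction: float) -> dict:
    """Return the reply, receipt, and total message counts."""
    replies = repliers * list_size                      # every reply fans out to the whole list
    receipt_senders = round(repliers * receipt_fraction)
    receipts = receipt_senders * list_size              # one receipt per recipient per such reply
    return {"replies": replies, "receipts": receipts, "total": replies + receipts}


if __name__ == "__main__":
    for name, value in estimate_message_count(LIST_SIZE, REPLIERS, RECEIPT_FRACTION).items():
        print(f"{name:>8}: {value:,}")
```

With the post's assumptions this gives 13,000,000 replies plus 2,600,000 receipts, for 15,600,000 messages in total.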

So at a minimum, 15,600,000 email messages will be delivered into people’s mailboxes. But Exchange can handle 15,600,000 email messages EASILY. There’s another problem that’s somewhat deeper.

An Exchange email message actually has TWO recipient lists – there’s the recipient list that the user sees in the To: line on their email message. This is called the P2 recipient list. This is the recipient list that the user typed in. There’s also a SECOND recipient list, called the P1 recipient list that contains the list of ACTUAL recipients of the message. The P1 recipient list is totally hidden from the user, it’s used by the MTA to route email messages to the correct destination server.

Internally, the P1 list is kept as the original recipient list, plus all of the users on the destination servers. As a result, the P1 list is significantly larger than the P2 list.

For the sake of argument, let’s assume that 1% of the recipients on each message (130) are on each server. So each message had 130 recipients in the P1 header, plus the original DL. Assuming 100 bytes per recipient email address, this bloats each email message by 13K. And this assumes that there are 0 bytes in the message – just the headers involve 13K.

So those roughly 15,000,000 email messages collectively consumed about 195,000,000,000 bytes of bandwidth. Yes, 195 gigabytes of bandwidth bouncing around between the email servers.
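The bandwidth estimate follows from multiplying the per-message header overhead by the message count. The sketch below assumes 130 recipients per P1 list (the number that the 13K figure implies at 100 bytes per address) and the post's rounded count of 15,000,000 messages; the names are illustrative only.

```python
# Rough header-overhead estimate; every figure is an assumption taken from the post.
BYTES_PER_ADDRESS = 100    # assumed size of one recipient address
P1_RECIPIENTS = 130        # assumed recipients carried in each message's P1 list
MESSAGES = 15_000_000      # the post's rounded message count


def header_overhead_bytes(recipients: int, bytes_per_address: int) -> int:
    """Bytes added to a single message by its P1 recipient list."""
    return recipients * bytes_per_address


def total_overhead_bytes(messages: int, recipients: int, bytes_per_address: int) -> int:
    """Header bytes moved across all messages, ignoring message bodies."""
    return messages * header_overhead_bytes(recipients, bytes_per_address)


if __name__ == "__main__":
    per_message = header_overhead_bytes(P1_RECIPIENTS, BYTES_PER_ADDRESS)
    total = total_overhead_bytes(MESSAGES, P1_RECIPIENTS, BYTES_PER_ADDRESS)
    print(f"per message: {per_message:,} bytes")
    print(f"total:       {total:,} bytes")
```

Using the unrounded 15,600,000 messages instead would push the total to a little over 200 billion bytes, so the 195 gigabyte figure is, if anything, on the low side.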

Compounding this problem was a bug in the MTA that caused it to crash, but only when it received a message with more than 8,000 recipients, and only AFTER processing the first 8,000 of them. So 8,000 of the 13,000 recipients of the message would get it and 5,000 wouldn’t. When the MTA was restarted, it would immediately start processing the messages in its queue – and since the message hadn’t been fully delivered yet, it would retry the delivery, sending to the SAME 8,000 recipients and crashing again. And because of the way the Exchange store interacts with the MTA, even if we shut down the MTA, the messages would still queue up waiting on delivery to the MTA – shutting down the MTA wouldn’t fix the problem, it would only defer it (since the message store would immediately start delivering the queued messages into the MTA the second the MTA came back up).
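The crash-and-retry loop described above can be illustrated with a toy simulation. This is not the MTA's actual code, just a minimal model under the assumption that delivery always restarts from the first recipient and crashes after the 8,000th; all names here are hypothetical.

```python
CRASH_THRESHOLD = 8_000  # recipients processed before the (modelled) bug triggers


class MtaCrash(Exception):
    """Stands in for the MTA process dying mid-delivery."""


def deliver(recipients, delivered_counts):
    """Deliver to recipients in order, crashing once the threshold is passed."""
    for index, recipient in enumerate(recipients):
        if index >= CRASH_THRESHOLD:
            raise MtaCrash(f"crashed after {index} recipients")
        delivered_counts[recipient] = delivered_counts.get(recipient, 0) + 1


def run_with_restarts(recipients, max_restarts):
    """Retry the same queued message after every crash, as the post describes."""
    delivered_counts = {}
    crashes = 0
    for _ in range(max_restarts):
        try:
            deliver(recipients, delivered_counts)
            break  # message fully delivered, so it leaves the queue
        except MtaCrash:
            crashes += 1  # message stays queued and is retried from the start
    return delivered_counts, crashes


if __name__ == "__main__":
    counts, crashes = run_with_restarts(list(range(13_000)), max_restarts=3)
    print(f"crashes: {crashes}")
    print(f"recipients reached: {len(counts)}")
    print(f"copies each of them received: {set(counts.values())}")
```

In this model every restart hands the same 8,000 people one more copy, while the remaining 5,000 never receive anything.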

So what did we do to fix it? Well, the first thing that we did was to fix the MTA. And we tried to scrub the MTA’s message queues. This helped a lot, but there were still millions of copies of this message floating around the system.

It took about 2 days of constant work before the email system recovered from this one. When it was over, the team firefighting the crisis had t-shirts made with “I survived Bedlam DL3” on the front and “Me Too! (followed by the email addresses of everyone who had replied)” on the back.

To prevent anything like this from happening again, we added a message recipient limit to Exchange – the server now has the ability to enforce a site-wide limit on the number of recipients in a single email message, which neatly prevents a repeat.
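In spirit, a recipient limit is a simple guard applied before a message is accepted. The sketch below is a hypothetical illustration, not Exchange's implementation: the function name, the exception, and the limit of 5,000 are all assumptions, and the post does not say whether the real check counts recipients before or after distribution lists are expanded.

```python
from typing import Sequence

SITE_RECIPIENT_LIMIT = 5_000  # hypothetical value; the post does not give the real one


class TooManyRecipientsError(Exception):
    """Raised when a message exceeds the site-wide recipient limit."""


def check_recipient_limit(recipients: Sequence[str], limit: int = SITE_RECIPIENT_LIMIT) -> None:
    """Reject the message up front instead of letting it fan out."""
    if len(recipients) > limit:
        raise TooManyRecipientsError(
            f"message has {len(recipients)} recipients; the limit is {limit}"
        )


if __name__ == "__main__":
    check_recipient_limit(["alice", "bob"])  # accepted silently
    try:
        check_recipient_limit([f"user{i}" for i in range(13_000)])
    except TooManyRecipientsError as error:
        print(f"rejected: {error}")
```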

I’m curious why you don’t use a more relational system for storing the e-mail content in Exchange…

Surely you could store a message sent using Exchange that went to X users on Y servers just Y times – or once if they all used the same storage?

You would only need to store the details about who it was delivered to, resulting in a performance increase of an order of magnitude, as you’d just be saying "this user got a copy of this message". You wouldn’t need to store a P1 header except for those users outside of the Exchange environment – the list of who got the message is available in the datastore already. Deleting a message becomes as easy as altering a record in the DB to say that it’s deleted and, if all references to the message are deleted, adding it to a list of messages to purge from the system whenever a purge is done.

It would result in dramatically less use of storage, less inter-server bandwidth and faster message access.

To support saving changes to the message you could easily either create a new base message or a delta based on the original – which would again be more efficient…

Umm.. That’s actually the way that the messages are stored internally. Inside the message store, each of those messages to the recipients in each store only occupies a single row in the underlying database.

But even though Exchange is a REALLY good email system, I don’t know ANYONE who would recommend that you put all 55,000 Microsoft employees on the same email server, especially back in the Exchange 5.5 days. At an absolute minimum, this single server would represent a massive single point of failure for the entire corporate email system.

There are multiple servers, and the Bedlam DL3 distribution list went to users on several servers. And as long as the message had to go to multiple servers…
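The single-instance idea discussed in this exchange (store a message once per server and only record who holds a reference to it) can be sketched as a small reference-counted store. This is a toy model with invented names, not the Exchange store's design; unlike the commenter's proposal, it purges a message immediately when the last reference goes away rather than deferring to a later purge pass.

```python
class SingleInstanceStore:
    """Toy per-server store: one body per message, plus the set of mailboxes holding it."""

    def __init__(self):
        self._bodies = {}   # message_id -> message body, stored once
        self._holders = {}  # message_id -> set of mailboxes that still reference it

    def deliver(self, message_id, body, mailboxes):
        """Store the body once and record a reference for each mailbox."""
        self._bodies.setdefault(message_id, body)
        self._holders.setdefault(message_id, set()).update(mailboxes)

    def read(self, message_id, mailbox):
        """Return the body if this mailbox still holds the message, else None."""
        if mailbox in self._holders.get(message_id, set()):
            return self._bodies[message_id]
        return None

    def delete(self, message_id, mailbox):
        """Drop one mailbox's reference; purge the body when nobody references it."""
        holders = self._holders.get(message_id)
        if holders is None:
            return
        holders.discard(mailbox)
        if not holders:
            del self._holders[message_id]
            del self._bodies[message_id]

    def stored_bodies(self):
        """Number of message bodies physically stored."""
        return len(self._bodies)


if __name__ == "__main__":
    store = SingleInstanceStore()
    store.deliver("msg-1", "Me too!", [f"user{i}" for i in range(130)])
    print(store.stored_bodies())  # one body, however many mailboxes hold it
```

As the reply points out, this only saves space within one server: a message addressed to users on several servers still has to travel to each of them.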

We had the same thing with our distribution lists a few years ago. We had about five of them for the whole organization (thousands of addresses all together).

Someone decided to send a presentation file that concerned the entire organization to all the lists. One guy failed to open the attachment (the security setting was too high) and replied to all saying "I didn’t get the file".

This was answered by a lot of people replying to all with messages like "Me too!", "Stop emailing me!", "What’s up, everyone?", "X Y, I know you’re responsible for this!" (X Y being a guy’s name (who had nothing to do with the matter)), "All of you will be fired" (This was sent by one of the high level bosses), etc.

Eventually, they removed everyone’s rights to send mail to these lists except for admins…

1) The new Exchange Permissions system (circa Office XP) has a "Do Not Forward" option but not a "Do Not Reply All" one. In my opinion the latter would be infinitely more useful at curbing embarrassment, confusion and unnecessary churn in day-to-day communications. Please add it! :-)

2) Why can’t Outlook warn me, e.g. "This message will be sent to more than 1000 users. Are you sure you want to continue?" when I click send? It would make a lot of people think twice. (Don’t answer that, I know *why*… nested DLs and private members, etc. I still want the feature though. You’re Microsoft, make it happen).

3) Why weren’t those distribution lists locked down so that only their owners could send mail to them? This is a feature in Exchange, right? And is it also a feature that only members of a DL can send mail to it? I’d hope so, but too often I see external Internet email coming to internal DLs. It seems like this functionality should be off by default.

AC: #1: Rights Management (the Exchange permissions system) isn’t about preventing people from screwing up. Exchange has had lots of other mechanisms to prevent that (like marking DL’s as restricted) since Exchange 4.0 shipped. The Exchange rights management stuff is about privacy, not about preventing user mistakes.

#2: The problem is that the Exchange client can’t know this. There are two aspects to this: First, the DL in question might be in a different forest, in which case it’s just a custom recipient in the local forest – there’s no DL membership to look at. The other reason is that DL membership can be restricted – Outlook can’t see the membership list, so it can’t tell how many users are on it.

#3: That is exactly the user error that happened – the developer writing the application forgot to lock the DL down and bedlam broke out. And reversing the defaults isn’t a good idea IMHO – that would discourage people from using DL’s and create a support nightmare – Imagine the number of calls we’d get from frustrated Exchange Administrators:

"I just created this distribution list, but my users can’t send mail to it!"

The bottom line is that there’s no good answer to this. If we WERE to change the default, then the 95% case would be made harder (administrators would have to check a "allow users to send to this DL" checkbox). And since most of the time users want to be able to post to DL’s, administrators would get in the habit of checking the box every time they created a DL.

The bottom line is that if it’s not a security risk (and this isn’t), choosing the default to be the one that users almost always want is best (IMHO).

Far be it from me to question you on a tech related matter, especially with Exchange, but #2 above doesn’t seem to ring true to me.

"#2: The problem is that the Exchange client can’t know this. There are two aspects to this: First, the DL in question might be in a different forest, in which case it’s just a custom recipient in the local forest – there’s no DL membership to look at. The other reason is that DL membership can be restricted – Outlook can’t see the membership list, so it can’t tell how many users are on it. "

If I right click on the DL I can select "properties" and view the members of the list can’t I?

Scott – yes, in many cases that’s true. But in some cases such as the two that Larry mentioned (hidden membership and cross-forest), that’s not possible. So perhaps it’s best rephrased as "you can’t *guarantee* that the client would know this."

Technically it’s certainly possible to find a way to make this work – but given enough code & testing, just about anything is :-)
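To make the "can’t guarantee" point concrete, here is a hedged sketch of what a client-side recipient count could look like. It assumes a hypothetical in-memory directory in which a list’s members are either known or marked as unavailable (hidden membership, or a cross-forest custom recipient); nothing here reflects Outlook’s or Exchange’s actual APIs. The best the client can do is report a lower bound and flag whether it is exact.

```python
def count_recipients(names, directory):
    """Return (known_mailbox_count, is_exact) for the given addressees.

    directory maps a distribution list name to its member names, or to None
    when the membership cannot be read. Names absent from directory are
    treated as individual mailboxes.
    """
    mailboxes = set()
    visited_lists = set()
    is_exact = True
    pending = list(names)
    while pending:
        name = pending.pop()
        if name not in directory:
            mailboxes.add(name)          # a plain mailbox, counted once
        elif name not in visited_lists:
            visited_lists.add(name)      # guards against nested lists that loop
            members = directory[name]
            if members is None:
                is_exact = False         # opaque list: the real size is unknown
            else:
                pending.extend(members)
    return len(mailboxes), is_exact


if __name__ == "__main__":
    directory = {
        "all-staff": ["team-a", "team-b", "carol"],
        "team-a": ["alice", "bob"],
        "team-b": None,  # hidden membership or cross-forest recipient
    }
    count, exact = count_recipients(["all-staff"], directory)
    qualifier = "exactly" if exact else "at least"
    print(f"This message will be sent to {qualifier} {count} users.")
```

A warning based on this count would stay silent in exactly the case that matters when a very large list happens to be the opaque one, which is the limitation both replies describe.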


While not as intensive nor intrusive as this, we recently ran an exercise against our users, warning them of a potential virus that was incoming, describing it as coming from "Super-User <root@mydomain.com>", the subject line, and sample message text, and advising them not to open it (and if they did open it, not to click on the attachment).

When I created the email, the return address shown (via OE) was indeed root@mydomain.com, but the reply to: address was the entire organization’s DL. (FYI: I created an HTML email that included an image from my webserver so I could read the logs to see who opened it, and included a readme.txt-space-space-space-space-space…space.html file with META redirect tags to an internal webserver’s page that said "Yeah, you shouldn’t have done that…" – and I could check the same logs to see who clicked the attachment.)

Suckers, umm, "errant users" that opened it, realized they screwed up, and wanted to chew me out, replied to the whole organization (4500+ on the DL) (rather than the fictitious "root" account) and were promptly embarrassed when everybody read their nasty comments.

A good time was had by all in our division. Senior management was not really amused, probably because they were up there in the top offenders list.

(To keep this moderately on-topic – no noticeable impact to my six Exchange 5.5 servers, as the majority of good users actually deleted the mail upon receipt, and emptied their deleted items folder as well, which is something they hardly ever do!)

A minor correction: the T-shirt was created by a victim of Bedlam and was sold to raise money for charity (it was around the time of Giving Campaign). I still have mine. And I still remember those two wonderful days where I got NO email at all. Nothing.

Please stop replying to the new mailing list you were added to to ask why you were added. If the 40+ messages in your inbox from other confused coworkers haven’t made this abundantly clear, nobody knows. And we’re all sick of hearing about it. There are over 3,000 of us. I fail to understand how this sort of thing happens. This is almost 2005! Have you never used e-mail before? Do you not understand that a ton of us are…
