The Life and Times of Jeff Squyres

Technical Archives

July 14, 2000

Zoom

Spent the remainder of the day hacking minime socket classes (partially as a result of jjc -- because it wants to use no authentication and no encryption to connect up to the web server --
but it was due to happen anyway; better sooner than later). Had to make the authentication and encryption be "mo' separate" than they originally were. This actually led to the sockets being able to accept multiple kinds of encryption and/or authentication in an IMPI kind of way (amazing how this Ph.D. thing really ties in all work that you have previously done...). Also inspired by ssh. It goes like this...

Sidenote: This is lengthy not just because I'm telling people what I'm doing, it's lengthy so that I myself have a record of what I'm doing. Will be quite helpful when I actually go to write that dissertation! Also, I'm about to effectively go on vacation for about 1.5 weeks, and my short term... what was I talking about?

Assumptions:

Authentication is a different step than encryption. First you authenticate, then you set up encryption. A successful connection must pass both steps on both sides.

Connector and acceptor potentially have different lists of authentication/encryption methodologies. Each entity has a list of auth/enc methods that it will allow -- these lists are in priority order. i.e., "strongest" methods are listed first.

The acceptor governs the authentication/encryption choice (i.e., the "server"). As a side effect of this, if the acceptor has no authentication methods defined (for example), the authentication passes (even if the connector has some authentication methods defined).

Authentication/encryption methods are indexed by their string names (makes debugging easier, for one thing).

Different authentication methods can allow for different levels of security. For example, a "shutdown" authentication is assumedly only given to root -- so that root is the only user who can shut down a minime daemon.

The overall process is just about the same for authentication as it is for encryption, so I'll just describe the authentication.

Connector makes a socket connection to the acceptor.

Acceptor spawns a thread to handle this connection and goes back to sitting on accept(). The new thread (after some internal accounting -- a phrase that will never die) sends an integer count of the number of authentication methods that it has, followed by a '\n'-delimited string of the names of the authentication methods that it has available (if there are more than 0 methods available).

Connector receives this count and [potentially] the list of possible names.

If the count is zero, both sides rule that the authentication was successful. If (count > 0), the connector goes through the acceptor's list (in order) and checks whether it has any of the same names in its list. If it does, it sends the integer index of the match back to the acceptor. If it does not, it sends back -1 and then hangs up, ruling that the authentication failed because a common authentication method could not be found.

The acceptor receives the integer. If it is -1, it hangs up and the thread handling that connection dies (similarly, if the connector hangs up without sending the -1, the acceptor can deduce that there was no match found). Otherwise, both the acceptor and connector call the accept() and connect() methods (respectively) on the selected Authentication object (sidenote: it only makes sense to have two entry points to the Authentication -- it's much easier to have a 2-way protocol where each side knows who they are so that you can have defined challengers and responders, etc.).
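The connector-side matching step above boils down to a few lines. Here's a sketch (the function name and types are mine, not from the minime code); the key property is that the acceptor's priority order wins, since we scan its list from the front:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the connector-side matching step: the acceptor's
// method names arrive in the acceptor's priority order, and the connector
// returns the index of the first one it also supports, or -1 if none match.
int pick_method(const std::vector<std::string>& acceptor_methods,
                const std::vector<std::string>& connector_methods)
{
    for (size_t i = 0; i < acceptor_methods.size(); ++i)
        for (const std::string& mine : connector_methods)
            if (acceptor_methods[i] == mine)
                return static_cast<int>(i);   // first match in acceptor's order wins
    return -1;                                // no common method: hang up
}
```

The same routine works unchanged for the encryption negotiation, since both steps use the same count-plus-names wire format.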

The contents of the Authentication::accept() and Authentication::connect() are obviously protocol-dependent (since they now own the socket, they can do whatever the heck they want). Currently, these routines simply return a bool (remember, gentle reader, they were previously set up with whatever initialization/key/etc. information when they were initialized and attached to the Socket instance) indicating success or failure. The only condition that must be enforced is that accept() and connect() both return true or both return false. It would be a Bad Thing if they returned different answers, because then one party will think that it has passed authentication successfully while the other will hang up (upon which the first party will detect this and also hang up, but it's Not the Right Thing to Do).

Once the authentication process passes, we essentially repeat the procedure for setting up encryption. So there are multiple potential points of failure: agreeing on authentication, performing the authentication, agreeing on encryption, performing the encryption setup. If any of these steps fail, the whole connection fails and both sides hang up (assumedly with some nice error message saying what went wrong).

So this is the scheme. For all that text above, it's really not that complicated. Now I gotta go implement it. Code to write.

August 12, 2000

Louisville->SBN thoughts

Had some interesting thoughts about minime yesterday while driving up from Louisville:

Currently, there are two kinds of endpoints for conduits: unix sockets and TCP sockets. How about creating a third kind -- an internal endpoint. This would allow modules within a single minime to be able to talk to each other across the already-existing conduit abstraction.

The minime boot service should be a module. I think that this has been stated before, but the ramifications of this really only occurred to me yesterday. The core minimed will be really dumb, meaning that it only knows how to:

launch itself

listen for messages (on three different fronts)

make connections (on any of the three fronts), which will include authentication and/or encryption

receive messages on these connections and give them to the module that will handle them, or handle them itself (e.g., "shutdown" messages)

shut itself down in an orderly fashion

Astute readers will realize that the previous bullet precludes routing messages across multiple minimed's -- messages always go from one minimed to another, and no further. No problem -- have a "router" module in the minimed that handles this kind of message passing (for non-fully-connected networks, for example) to get from point A to point B -- a la nsend in LAM (although implemented completely differently).

Hence, we're looking at a small minimed with probably the following core set of modules:

network minime boot: ability to launch minimed's on other hosts and build the minime mesh. Currently only planned to execute once -- not add/delete from the mesh.

init.d minime boot: ability to launch minimed at boot up time and either look around to see who else is up and join a mesh (this will be hard) or have a pre-defined mesh (a la a config file, which is much easier) that indicates what other nodes should exist, and just join that mesh. This may be repeated multiple times to add/delete nodes as nodes go up and down. Although I'm defining this to be in the core, I'll likely add this part last.

message router: transparently move user-level messages from point A to point B through some route in the minimed mesh. This will be a higher level of messages than conduits/self-executing messages. High-priority messages should use a "c2c"-like routing (which may still involve multiple hops -- i.e., from gateway to gateway, such as A->gateway1->gateway2->B -- but not have to go all the way through the mesh), while low-priority messages can go through the minime mesh (i.e., potentially more hops).

group operations: perform an operation (or set of operations?) on some group of nodes in the minime mesh, and collate the results in a scalable fashion (most likely using some kind of tree over the minime mesh).

uptime: report uptime and some other useful stats (first real useful tool and proof of concept). Will probably use the group operations module to get the uptime for all nodes in a cluster, for example.

fork/exec: run arbitrary user programs.

Obvious security questions have to be addressed, such as only allowing specified authentications to do certain actions such as shut down, run arbitrary programs, etc., etc.

August 25, 2000

Surreality or Blueberries?

Got back out here to LBL, possibly for the last time. Lummy and I are out here for some design meetings about the BLD (Berkeley Lab Distribution). Lummy and I had an uneventful trip out here yesterday, and spent a good portion of the flight chatting about Things, including SC2000, t-shirts for the lab, various future directions of the LSC and some of its members, etc., etc. Good stuff.

We had some good arguments for a few hours about BLD today (Bill Saphir, Eric Roman, Paul Hargrove, and myself). We will continue arguing on Tuesday.

Sadly, though, Bill will be leaving LBL (going to a startup). Best of luck to him -- all hail Bill!

I saw Nathan's final presentation today on what he did with cfengine this summer. We heckled him a bit and asked him a few questions that he wasn't prepared for, but all in all, it was a good presentation -- I learned some things.

Contributed a few minor fixes to the vorbis-dev list today (and had a good typo in one of them -- nice).

Found a paper that was mentioned on HPC Wire today about ye old "6 degrees of separation" issues. These kinds of things are actually closely related to network topology and are highly relevant to my dissertation. It's good stuff. It's interesting to me that these kinds of studies were first started by a psychologist -- Milgram (I'm gonna get this wrong, but it should be somewhat close: the guy who did the obedience tests that had subjects pushing a button that supposedly shocked a "patient" [although it really didn't] and caused the "patient" enormous pain -- they tested how hard they could push a subject into delivering pain to the patient. Only 1 of Milgram's subjects refused to push the button. Peter Gabriel even has a song about this: "We Do What We're Told (Milgram's 37)".). Milgram told several of his friends/colleagues to deliver a letter to some random person in the country -- someone that he was sure that they did not know. They could only send the letter to someone that they knew on a first-name basis and ask them to pass it on in a similar manner in order to get it to its final recipient.

After many such trials, it turns out that the average number of what we now call "hops" was between 5 and 6. More research in this area has led to the now popular "6 degrees of separation" perception, and "the 6 degrees of Kevin Bacon" -- which, if you think about it, is intuitively obvious. Consider: if, by some handwaving, we can assume that most humans on the planet are connected by 6 degrees, then linking any actor to Kevin Bacon -- a vastly smaller domain than all humans on the planet -- seems obviously true.

Even though I suck at math, these kinds of things interest me. I read half of a book about this. I can't remember the name of it, but it was also written at Cornell, and the paper that I read today referred to their work several times. Good stuff.

Lummy and I will be heading back to our skanky hotel soon. Outta here.

August 31, 2000

It all started with parallel bladeenc...

Great quote today from HPC Wire:

"The truth is, great software comes from great programmers, not from a large number of people slaving away. Open source can be useful. It can speed things up. But it's not new, and it's not holy water."

-- William N. Joy, chief scientist, Sun Microsystems

There are really two separate thoughts in that quote, but they're both true. More people need to recognize this.

Spent the majority of the last 24 hours writing my talk for tomorrow's LSC lunch. The reason that I've spent 24+ hours on this as opposed to 2-3 is because it has turned out to be much more interesting than I originally thought. Lummy has also found it to be very interesting; it is even possible that this will end up in my dissertation.

So tomorrow's topic actually started many months ago. I was visiting here in Berkeley back in January of this year, and only had a few MP3s loaded on my laptop from my CDs. I was out here for quite a length of time (3 weeks or so), and I was getting sick of the same old MP3s every day. So I started looking at Lummy's CDs (he wanted to rip them, too). So I volunteered to help. :-)

The problem was that it was just too damn slow. The encoding took forever. So after a few nights of hacking, I got a preliminary version of the parallel bladeenc MP3 encoder working (see the "technical details" page on that site for all the nitty gritty of how parallel bladeenc works). This did vastly improve the MP3 encoding process; we could listen to songs faster this way (and before you ask, I only kept MP3s [that were generated from Lummy's CDs] that I already own the CD of). However, I never did get the parallel encoder Right -- it's just about right, but not quite right. Without going into Big Detail, suffice it to say that there's some MP3 framing issues that Jeremy Faller and I tried to figure out and gave up in light of the fact that we had no MP3 documentation.

This brings up an important point -- even though the parallelization of bladeenc was at the very top level, I had to dive down very deep into its tangled code in order to parallelize it. Indeed, there is an odd bit reservoir that has to be drained in order to make the whole scheme work. It took a long time to figure out, and like I said, there's still some MP3 framing issues that are unresolved. To end this already-lengthy explanation: you can't diff the results of parallel bladeenc with that of serial bladeenc. Bonk.

(Some of the following may have already been in a previous journal entry -- I don't remember -- so you'll cope if you've already read it)

So I basically let that go, and stumbled across the vorbis project (I think that I saw it on slashdot or something). Vorbis intrigued me for the following reasons:

Vorbis is a totally free encoder -- it does not have the same legal issues surrounding it as MP3 encoders do.

The vorbis algorithm supposedly has better sound quality than the MP3 algorithm. I won't argue this either way -- I don't know much about these kinds of things... math, ick -- but it does sound good. :-)

The vorbis stuff is all in a library -- libvorbis.a. So writing an encoder is trivial. Indeed, they have a sample encoder in their distribution. So no diving into the source code to figure out how it works.

The vorbis library stuff specifically discretizes the steps in the encoding process; all the steps are fast, except one (hmm...).

There is an active vorbis development community. I got very little feedback from the bladeenc community. :-(

So with all this in mind, I posted to the vorbis development list asking "has anyone thought about parallelism?". After some informative discussions with the main developers, I started thinking about it (although not actively coding). After coming up with a suitable architecture, I spent a few hours one night and coded up a multi-threaded vorbis encoder. It seemed to work like a champ... but the output was not diffable with the output from the serial encoder. :-(

I queried the vorbis development list again, and found out that the library is not thread safe -- yet. There's apparently still one issue that makes it not thread safe, something that the developers eventually intend to fix. Oh well.

So then I started thinking about an MPI architecture for such a beast. Conceivably, it would be very similar to the architecture of parallel bladeenc. However, I would like to use threads if/when possible, so I started thinking about mixing MPI and threads (and not in a crass OpenMP/MPI kind of way), and how that would work, particularly in light of the fact that the 2 open source MPI implementations are not thread safe. Ugh.

However, since the vorbis stuff is neatly discretized, it really reduces down to a generic parallel master/slave problem (i.e., replace vorbis_analyze() with a generic calculate() function, and the framework is applicable to any master/slave problem). I realized this yesterday after I made an offhand remark to Lummy that I was thinking about talking about a potential parallel vorbis encoder for my LSC Friday talk.

I spent some time thinking about it (much scrap paper, test programs, etc.), and realize that this is actually quite a thorny issue. It's much more complicated (and interesting!) than one would initially think. Having a generalized master/slave solution for local and remote computation is something that no one has done yet. Yes, we all understand master/slave, but no one has published how to do this with threads and MPI. Indeed, if such a thing was coded up, it could certainly serve as a framework for any task farm kind of parallelism -- only a small number of interface functions would probably be required:

input / preprocess (on the master)

calculate (on the slaves)

postprocess / output (on the master)

registration for the marshalling/unmarshalling of data to be exchanged between the master and slave nodes
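The interface functions listed above could look something like the following sketch. All the names here are mine, not from the paper; data units are just vectors of ints for illustration, and I've left the marshal/unmarshal registration as a comment since its shape depends on the transport:

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch of the small application interface a generic
// master/slave framework would need.  A real framework would also register
// marshal/unmarshal hooks so units can travel to remote slaves over MPI.
typedef std::vector<int> DataUnit;

struct TaskFarm {
    // On the master: read/preprocess one unit of input; return false at EOF.
    std::function<bool(DataUnit&)> input;
    // On the slaves: the expensive step (e.g., replace vorbis_analyze() with
    // any calculate() and the framework fits any master/slave job).
    std::function<DataUnit(const DataUnit&)> calculate;
    // On the master: postprocess/write one finished unit.
    std::function<void(const DataUnit&)> output;
};

// Trivial serial driver: the parallel framework would overlap these stages
// with threads and MPI proxies, but the application-visible contract --
// input, calculate, output -- stays exactly the same.
inline void run_serial(TaskFarm& tf) {
    DataUnit u;
    while (tf.input(u))
        tf.output(tf.calculate(u));
}
```

The point of the serial driver is that the application never sees the parallelism: swapping run_serial() for a threaded/MPI driver requires no application changes.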

So I've spent about 24 hours of thinking and writing about this, and when I went to print it out for the first time, I was surprised to discover that I had written about 12-13 full LaTeX pages about this topic (a lot of which is pseudocode). I've talked with Lummy about this; I only realized late this morning that this is an extension of the parallel image processing toolkit (PIPT) -- it's a few levels beyond what the PIPT is, but some of the ideas in it are definitely influenced by the PIPT. Sadly, I'll only be able to deliver some excerpt of this tomorrow, but I think Lummy and Bill and I might discuss this further this evening. Andy thinks that Jeremiah and Rich might follow up on this (if I don't) for GGCL and other things.

I'm not going to go into detail here about what I have already written (perhaps I'll fill in the details here later). I'm just struck by the irony that this all started by a personal pet project and may end up as a chapter in my dissertation (we'll see, but Lummy did mention it).

Threading, in general, can get quite complicated. Having multiple independent threads that don't interact with each other is easy --
anyone with a pthreads book (or man page) can do it. But having multiple threads that have to interact with each other and share data can get arbitrarily difficult. This whole scheme pretty much is in the latter domain -- threads need to share queues and data and whatnot; there's some thorny locking issues involved (which makes it so interesting). I haven't even solved all the issues yet -- I don't know how to give some slaves small amounts of work and give other slaves large amounts of work, for example (there's some convoluted locking issues involved).

Who knows -- this might go nowhere. But it seems pretty interesting, so I wanted to get it On The Record.

September 8, 2000

If Madonna calls, I'm not here

Spent much of yesterday thinking about the generalized manager/worker problem, and spent most of today re-writing my paper about it (I had a good quote to Tracy today, "if I were a theoretician, the paper would be done". But I'm not, and there were still a number of non-trivial practical issues that concerned me. So I re-wrote it). I solved the problem with the variable rate input/output in the input data unit queue -- a cool use of a condition variable and something I call a "reservation system" (which really amounts to two queues managed by mutex and condition variable... which are more or less the same thing anyway ;-). You get all the benefits of a variable rate I/O and blocking (instead of spinning). Here's the problem:

The input thread reads in chunks of input at a time and preprocesses them (working on the idea that preprocessing takes much less time than the "real" calculation). The preprocessed data units go into a data queue. The calculate entities (which can be local threads or remote threads that are serviced by MPI proxies) remove units from the input data unit queue (which will involve an MPI_SEND/MPI_RECV pair if the thread is remote from the manager) and process them. This is the step that takes the most time -- the actual calculation. When each unit is finished processing, it is placed in the output data unit queue (which, again, will involve MPI_SEND/MPI_RECV if the thread is remote). The output thread removes the output data unit from the queue, postprocesses it (again, taking orders of magnitude less time than the actual calculation), and then writes it out to the output datafile.

Believe it or not, this is not rocket science. Pretty standard stuff, actually. Here's the problem:

When we mix local and remote threads, however, we need to give different amounts of work to threads based upon their location -- this can help hide latency for remote threads, because we give them larger amounts of work. They request work less often, which directly translates to fewer messages, and therefore less latency (latency is bad, Bad, BAD!).

Since we're dealing with multiple threads that have to share the same data structures (i.e., the input data queue), that data structure must be locked in some fashion to prevent multiple threads from entering it simultaneously. That would also be BAD! So we lock it with a mutex -- again, pretty simple. The input thread locks the mutex and adds one or more input units. It will probably be more than one, actually, for efficiency. Just like sending a smaller number of large messages is more efficient than sending a larger number of small messages -- even if the total number of data bytes is the same in both cases -- it is more efficient to lock something fewer times for longer periods than many times, each for a short period. The total lock time is the same in both cases; it's the overhead time (i.e., the time necessary to lock and unlock) that is different. But that's not the whole story -- there's concurrency as well! So it's not quite so simple; hopefully our worker threads will be busy enough with real calculations that their blocking time will be minimal in comparison.
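The "lock once per batch" idea amounts to something like this sketch (mine, not the paper's code -- and using std::mutex rather than raw pthreads for brevity): the input thread takes the queue mutex a single time to append a whole batch of preprocessed units, instead of paying the lock/unlock overhead once per unit.

```cpp
#include <deque>
#include <mutex>
#include <vector>

// Minimal sketch of a shared queue where producers pay the lock/unlock
// overhead once per batch rather than once per unit.
template <typename T>
class BatchQueue {
public:
    void enqueue_batch(const std::vector<T>& units) {
        std::lock_guard<std::mutex> guard(mtx_);   // one lock for the whole batch
        for (const T& u : units)
            q_.push_back(u);
    }
    bool try_dequeue(T& out) {
        std::lock_guard<std::mutex> guard(mtx_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::mutex mtx_;
    std::deque<T> q_;
};
```

The trade-off noted above applies directly: the longer the batch, the lower the locking overhead, but the longer any concurrent worker may block waiting for the mutex.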

And it's even more complicated than that. In order to save latency again (provided that the input units are somewhat small -- and yes, this is problem-specific), we want to only send one message to each remote node, not one message per input unit. So we have to request M input units per CPU per node. Hence, the proxy MPI thread must determine at the beginning of time how many CPUs each worker node has (no problem -- MPI_GATHER into num_cpus[]). For each remote node, the MPI proxy requests M*num_cpus[i] units, packs them into a single message, and sends them off. A similar thing happens with the output data on the way back --
the MPI proxy on the worker node packs all the output data into a single message and sends it back to the manager MPI proxy.

Anyway, back to the problem. So the input data unit queue is locked with a mutex. We want local threads to be able to retrieve X data units from the queue, but want remote data threads to retrieve Y data units. This is for several reasons:

The latency reasons that were cited above; when Y > X, the remote threads do more work than the local threads to hide latency.

When X and Y are relatively prime, given that one input unit translates to some T amount of time of computation, it can help offset the synchronizations necessary between local and remote threads. That is, it adds "jitter" to the scheduling, such that local and remote threads will be synchronizing at different times, which can reduce contention by preventing bottlenecks.

The question is how to do this? My previous scheme used a semaphore and mutex such that any thread (either a local thread or an MPI proxy thread) could remove single units at a time. This is no good, because retrieving a series of individual units may not guarantee their contiguousness -- another thread may slip in and grab a unit from the middle of your stream. I needed a way to atomically get an arbitrary number of units from the queue, and to be able to do it with blocking, not spinning (spinning == eating CPU cycles, and therefore taking them from someone else; blocking == allowing the thread to be swapped out until some external event wakes it up).

So here's what happens: the input thread works as before, putting in Q input units at a time (where Q >= 1). It then broadcasts to the condition variable (RTFM Tanenbaum). The calculate and MPI proxy threads are a bit more complicated. They get the mutex and put (threadID, num_requested_units) at the tail of the reservation queue (this queue is different than the input data unit queue, but is protected by the same mutex). If the thread is actually at the head of the reservation queue, it checks to see if there are already enough units in the input data unit queue. If so, it takes them, removes itself from the reservation queue, and unlocks the mutex. If there aren't enough input data units, or this thread is not at the head of the reservation queue, it goes to sleep on the condition variable.

The input thread's broadcast to the condition variable wakes up all the calculate threads that are waiting on it (if any) and they all check to see if a) they are at the head of the reservation queue, and b) if there are enough input data units to service their request. If a thread finds that both a) and b) are true, it services itself, removes itself from the reservation queue, and broadcasts to the condition variable again to wake up the next thread in line. And so on.
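Here's a compressed sketch of the reservation system (again, mine, not the paper's code; I'm using std::condition_variable, and a predicate-checking wait stands in for the explicit "wake up, check head-of-queue and unit count, maybe sleep again" loop described above):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Sketch of the "reservation system": a thread that wants n units joins a
// FIFO reservation queue, then blocks on a condition variable until it is at
// the head AND n units are available -- at which point it atomically removes
// a contiguous run of n units.  Both queues share one mutex.
class ReservedQueue {
public:
    void put(const std::vector<int>& units) {
        std::lock_guard<std::mutex> guard(mtx_);
        for (int u : units) data_.push_back(u);
        cond_.notify_all();                    // broadcast; waiters recheck
    }
    // Block until this caller can atomically remove n contiguous units.
    // (The requested count lives in the caller here; the paper's version
    // stores the (threadID, count) pair in the reservation queue.)
    std::vector<int> take(int id, size_t n) {
        std::unique_lock<std::mutex> lock(mtx_);
        reservations_.push_back(id);
        cond_.wait(lock, [&] {
            return reservations_.front() == id && data_.size() >= n;
        });
        std::vector<int> out(data_.begin(), data_.begin() + n);
        data_.erase(data_.begin(), data_.begin() + n);
        reservations_.pop_front();
        cond_.notify_all();                    // let the next-in-line recheck
        return out;
    }
private:
    std::mutex mtx_;
    std::condition_variable cond_;
    std::deque<int> data_;                     // input data units
    std::deque<int> reservations_;             // thread ids, FIFO order
};
```

Because removal happens in one critical section while the taker is at the head of the FIFO, no other thread can grab a unit out of the middle of its run -- which is exactly the contiguity guarantee the old semaphore scheme couldn't give. And the wait blocks rather than spins.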

Of course, it's not quite as clean as this -- there are other sticky issues, like the input queue being drained (i.e., the input is exhausted) when there aren't enough units left to fulfill a thread's request, etc. So the extra logic gets a bit corn-fuzing.

Interestingly, the output thread protection is much simpler -- the output thread comes in and takes as many contiguous output units off the output queue as possible. It gets more complicated when we can have an arbitrary number of input and output files being processed simultaneously; dequeuing from the output queue becomes an absolute nightmare. Consider just one issue: if we're processing B input files into B output files, all B input files will be read fairly quickly. Their data will be processed in order, and the processing will take some time. But we want the output files to be ready upon output of the final output unit in each file -- i.e., we have to close() the file. Hence, since only the input thread knows when the input has been exhausted, it has to submit some kind of sentinel value into the input data unit queue for a given output file that filters all the way through the pipeline to the output thread such that it says, "when you get all the output data units from this file, you can close it."

Pair that with the fact that the output data units will be coming in a random order from the calculation threads, since they may all be operating at different speeds (indeed, since we're using MPI, the remote worker nodes may not even be the same kind of machine as the manager machine). So some kind of order has to be associated with the input/output data (i.e., sequence numbers). The output thread has to re-order the output units, postprocess them, and then write them out to disk. And it has to know when to close() the file, since most POSIX systems work with write-on-close semantics.
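The re-ordering step is the classic sequence-number reorder buffer. A sketch (my names; int stands in for an output data unit): finished units arrive tagged with sequence numbers in whatever order the workers produce them, and the writer emits them strictly in sequence, buffering gaps until they fill.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Reorder buffer for the output thread: units arrive out of order, get
// buffered by sequence number, and are released in strict sequence order.
class Reorderer {
public:
    // Returns the units that became writable (in order) after this arrival.
    std::vector<int> arrive(size_t seq, int unit) {
        pending_[seq] = unit;
        std::vector<int> ready;
        while (!pending_.empty() && pending_.begin()->first == next_) {
            ready.push_back(pending_.begin()->second);
            pending_.erase(pending_.begin());
            ++next_;
        }
        return ready;
    }
private:
    std::map<size_t, int> pending_;   // seq -> unit, kept sorted by seq
    size_t next_ = 0;                 // next sequence number to write
};
```

The end-of-file sentinel described above would ride through this same path: when the unit carrying the sentinel for a given file is released in order, the output thread knows every earlier unit for that file has already been written, and it can close() the file.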

Whew!

The paper is up to 18 pages now.

I'll probably spend more time re-writing the paper again. It seems that I get all the way through it and then realize one more critical threading/synchronization issue that throws the whole thing off, and makes me start the pipeline from scratch all the way back at the input stage. Still, it's all good, because it's way cool stuff. I've redesigned the pipeline from input to output 3-4 times now, each time better than before. :-)

I talked to Lummy about the paper today, and told him that I'd like to have something for him (and possibly Jeremy, Rich, and maybe even Jeremiah) to read in the next day or two.

September 23, 2000

Bring it on!!

The threaded version of the booter, indeed, seems to improve performance. Again, these are not on unloaded machines, so we can't say for 100% sure, but it certainly seems like it (I know pine will display this table badly; deal):

Number of nodes     2-way     3-way     4-way     5-way
 32                 0:37.5    0:29.1    0:28.3    0:21.2
147                 1:01      0:48.5    0:55.4    0:43.4

(same conditions as before, AFS-cached, etc., etc.)

We have weirdness with the trinary and quad trees in the 147 again. :-( I'm still guessing that there are some strategically "bad" (i.e., heavily loaded) machines in the mix that are causing this. Indeed, it seems to "hang" on the last few nodes on the 4-way in the 147 tests. But again, the only real way to test this would be with a large number of unloaded nodes. :-\

Flying monkeys

Archiving some more test results...

Per Lummy's suggestion, I have compared lamboot vs. a serial ring-like boot of several different sizes to compare the two different topologies. My hypothesis was that they would be roughly equivalent -- the rsh latency would dominate any bookkeeping and efficiency of the two codes.

I used the threaded scaleboot version -- not that it mattered, 'cause there would only be one thread/child anyway. Here's the results:

                    Number of nodes
Program             8         32        147
lamboot             0:23.1    3:18      15:xx
ring boot           0:22.6    3:15      15:06

I unfortunately forgot to run /bin/time on the biggest lamboot, so I could only go off the timestamps from my unix prompt. Doh...

Also, with all this big testing with lamboot, I am soooo glad that I wrote lamhalt (to replace wipe) -- it takes down a running LAM by simply sending messages to all the lamds, as opposed to doing a whole new set of rsh's to each machine to kill the daemons.

As Arun says, "'wipe' sounds silly and doesn't have the syllable 'lam' in it." lamhalt rocks.

Yeah, ok, I'm still behind a 1.5Mbps DSL line. So what. (actually, I'm streaming MP3s around here behind my firewall, so ethernet collisions were becoming a bit of a problem in terms of performance)

I got version one of my scalable booting working. It does an n-ary tree-based boot across a group of machines. Seems to work pretty well, but is not 100% bug-free yet. That is, it still hangs sometimes -- I think it is because it has done an rsh to a remote node and the rsh fails. Here are some preliminary results (times are min:sec):

Number of nodes     lamboot     Binary tree     Trinary tree     Quad tree
 16                  1:02       0:12.8          0:09.8           0:06.4
 32                  3:07       0:46.5          0:29.3           0:27.5
147                 14:06       1:28            1:00             1:07

Pretty good looking so far. Some notes...

All results were with the binary already AFS cached.

The 16 node tests were conducted on unloaded machines. The 32 and 147 node tests contained nodes that were in use, some of which were heavily loaded (shh!). So these numbers are not perfect. But they are a good ballpark.

The difference between 3 and 4 children is sometimes small. This can make sense -- consider the 32 node case. With 3 children each, the farthest leaf from the root will be 3 hops. With 4 children each, it is the same. Hence, with each of 3 and 4 children, we still have the same number of "timesteps".
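The "same number of timesteps" observation is easy to check mechanically: the depth of a complete k-ary boot tree over n nodes is the smallest d such that 1 + k + k^2 + ... + k^d >= n. A quick sketch:

```cpp
// Depth of a complete k-ary tree needed to cover n nodes: the smallest d
// such that 1 + k + k^2 + ... + k^d >= n.  Each level is one rsh "timestep".
int tree_depth(int n, int k) {
    int depth = 0;
    long covered = 1;   // nodes reachable within 'depth' levels
    long level = 1;     // nodes at the current level
    while (covered < n) {
        level *= k;
        covered += level;
        ++depth;
    }
    return depth;
}
```

For 32 nodes, 3 children covers 1+3+9+27 = 40 nodes at depth 3, and 4 children covers 1+4+16+64 = 85 at depth 3 -- the same depth, hence roughly the same boot time.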

Also, the algorithm is sub-optimal, particularly where there are heavily loaded hosts. I believe that this explains why 3 and 4 children on the 147 test results seem weird (it seems that some of the key parent nodes in the 4-child tests were heavily loaded -- I checked). This is not conclusive proof -- I would need a large number of unloaded machines to be able to test this theory. :-( See below.

I had a brief discussion last night with Lummy about this. I presented some timings of lamboot vs. the tree boots. He wants me to run a ring boot as well, and compare. I initially didn't see why he wanted me to do this -- indeed, I thought that it would be the same as lamboot (and I'm still not convinced that it's not the same -- the majority of time is dominated by rsh latency), but he made the good point that I don't have any numbers to back this theory up. As such, I don't know for a fact that they're not different. And I do agree, they are different topologies, so there could be a difference. They're different code bases, too, so subtle differences could mean a lot (although the scaleboot stuff derived from the inetexec.c that is central to LAM's lamboot). I'll code up the ring and see what happens...

The current implementation essentially works like this:

Invoke the program with the -master switch and provide a hostfile.

The program figures out that it is the master, and decides a) that it has no parent, and b) reads in the hostfile.

Switching into "parent" mode, it does what I call "multi-rsh" for its number of children (default is 2, but can be overridden on the command line). i.e., it fork/exec's rsh's into the background to the children's hosts. This is more complicated than it sounds...

The multi-rsh routine is given a list of username/hostname pairs, and a list of argv to execute on each.

First, you have to send an "echo $SHELL" command to the remote host to see what the user's shell is.

When that comes back, if they are running Bourne shell (and you'd be surprised at how many people do...), the Real argv (denoted by foo) has to be surrounded with "( . ./.profile ; foo )" so that their .profile will be executed, and paths will be setup, etc., etc. Goofy, but true.

Once this is determined, fork/exec the rsh with the real command to be executed.

Keep in mind that there are multiple rsh's fork/execed into the background simultaneously; they all have to be tracked by watching their stdout and stderr to determine where they are.

Additionally, when an "echo $SHELL" finishes, it has to be replaced with the real argv and re-launched.

This results in one big-ol' state machine. It's somewhat hairy, but once I figured out some nice abstractions in C++, it worked out ok.
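The shell-detection wrapping step might look something like this (a sketch; `wrap_for_shell` and the exact quoting are my illustration of the description above, not the actual scaleboot code):

```cpp
#include <string>

// Sketch of the command-wrapping step: if the remote user's login shell
// is a Bourne shell, wrap the real command so that their .profile gets
// sourced first (Bourne-ish shells don't read a startup file for
// non-interactive rsh commands, so paths etc. would otherwise be unset).
std::string wrap_for_shell(const std::string& shell, const std::string& cmd) {
    if (shell == "/bin/sh" || shell == "/sbin/sh")
        return "( . ./.profile ; " + cmd + " )";
    return cmd;  // csh/tcsh/etc. read their rc files on their own
}
```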

After all the commands are executed, the parent waits for its children to contact it (we passed some command line parameters to each child indicating the parent's IP address and the port number that it was waiting on). This means sitting in accept() N times, waiting for each of the N children to connect.

As each child connects, give them a list of (M - N) / N username/hostname pairs to boot (where M == total number of hosts that this parent has to boot).

The children go off and do their thing, potentially booting grandchildren.

As each child finishes multi-rsh'ing its children (but before doing the accept()s to give its children work to do), it sends a number upstream to its parent indicating how many children were launched. These numbers all filter up to the root/master so that cumulative stats can be kept about how far along the boot is.

The cycle is broken in two conditions (they're actually the same condition, but I call it two conditions here for ease of explanation):

A child is executed who has no children. When it contacts its parent to get a list of children to boot, it will receive "0" and therefore recognize that it is a leaf in the overall boot tree. It will then send a "-1" up to the parent and close the socket.

When a child has received "-1"'s from all of its children, it will send a "-1" up to its parent and close the socket.

Hence, these "-1"s are propagated up the tree to the root/master, so that when everyone finishes booting, the master knows. It would be fairly easy to put in a fan out after this fan in and complete the barrier process so that the whole tree knows when it has booted, but it wasn't necessary for these tests.
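The work-division step -- handing each connecting child roughly (M - N) / N hosts -- can be sketched like this (names are illustrative, not the actual booter code):

```cpp
#include <string>
#include <vector>

// A parent responsible for M hosts boots N immediate children directly
// (hosts[0..N-1]); the remaining M - N hosts are dealt out round-robin,
// so each child gets a sublist of roughly (M - N) / N username/hostname
// pairs (sizes differ by at most one).
std::vector<std::vector<std::string>>
split_work(const std::vector<std::string>& hosts, int nchildren) {
    std::vector<std::vector<std::string>> sublists(nchildren);
    for (size_t i = nchildren; i < hosts.size(); ++i)
        sublists[i % nchildren].push_back(hosts[i]);
    return sublists;
}
```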

There is a limitation to this approach: we have to wait for the multi-rsh to finish before we can give work to our children. Depending on the number of children used, and depending on the relative speed of the children of a given parent, this may involve some children waiting a period of time before being given work. This is conjecture at this point, but 1) it seems reasonable, and 2) I hope to prove it with the following...

A single-threaded approach was actually fairly difficult. It involved some big select() statements, and a lot of lookups and extra bookkeeping. i.e., when select() returns, you have to scan down the list of available file descriptors, figure out which socket is ready, figure out where in the process that socket is, and then react. This created a lot of code (thank god for the STL and hash maps!). While the approach seemed to work, I think a multi-threaded approach will be much simpler in design.

With a multi-threaded design, we can have a thread for each rsh. It therefore only needs to monitor its own progress. We don't even need to have select() statements, because it's ok for each thread to block on read() statements, waiting for I/O from the remote process. I believe that the whole programming model will become significantly easier. And, as I mentioned above, there's a chance that there will be greater performance because each child will be able to go at its own speed and not be forced to wait for any of its siblings.
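A minimal sketch of the thread-per-remote-command idea, using popen() as a self-contained stand-in for the fork/exec'd rsh (this is my illustration of the design, written with modern C++ threads rather than the booter's actual code):

```cpp
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Run one command and capture its stdout with plain blocking reads --
// no select(), no shared state machine; this thread only tracks its
// own child process.
std::string run_and_capture(const std::string& cmd) {
    std::string out;
    FILE* p = popen(cmd.c_str(), "r");   // fork/exec + stdout pipe in one call
    if (!p) return out;
    char buf[256];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), p)) > 0)  // blocking read
        out.append(buf, n);
    pclose(p);
    return out;
}

// One monitor thread per command; each child proceeds at its own speed.
std::vector<std::string> multi_launch(const std::vector<std::string>& cmds) {
    std::vector<std::string> results(cmds.size());
    std::vector<std::thread> monitors;
    for (size_t i = 0; i < cmds.size(); ++i)
        monitors.emplace_back([&, i] { results[i] = run_and_capture(cmds[i]); });
    for (auto& t : monitors) t.join();   // wait until all children finish
    return results;
}
```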

So I'm off to go implement the multi-threaded approach. I should be able to scavenge parts from lamboot and the scaleboot stuff...

September 28, 2000

Be the ball

By request, I did a pseudo-release of the jeffjournal client and server today to a limited audience. We'll see if it works out for them.

Dog just installed the patches for the Solaris Forte6 compiler today, and I gave it a whirl. Initial impressions:

It's slow.

They still didn't fix the linker bug. Compiling minime (which uses STL heavily) still gets all the same STL linker errors. <sigh>

They seem to have fixed much of the Memory Badness with using bcheck in C++, but I still get a fairly lengthy "blocks in use" report at the end of my run. <sigh> At least these are not potentially fatal errors, though...

Jeremy claims that they don't have iterator_traits. I hope he's wrong...

Trippy message from running a multithreaded program through the new bcheck:

October 3, 2000

San Dimas High School football RULES!

I hit a problem the other day: how to tell when a machine is down? i.e., given a random IP name/address, how do you tell that it is down without a really lengthy timeout?

For example, try sshing to a machine that is down. Not one that doesn't have ssh enabled -- that rejection is almost immediate. And not to an IP that doesn't exist, or isn't reachable by you -- those rejections are also immediate. But ssh to a machine that is currently powered off, or not connected to the net.

It can take a long time to time out.

A Solaris 7 machine takes almost 4 minutes (3:45) for a telnet to time out to a host that isn't there. It takes 15 minutes for ssh to time out (again, on Solaris). Quick testing showed that the majority of the 3:45 time was spent inside a single connect() call.

But a Linux machine takes 3 seconds for telnet to time out to a host that isn't there. What's it doing differently? How can it tell so quickly that the machine is not there?

I ran my connect() test on both Solaris and Linux, and the results were identical to telnet -- Solaris sits for a long time on connect(), and then eventually times out. Linux only sits for a few seconds in connect() and then returns with a "no route to host" error.

Hmm. If connect() does not report the same error in the same way across multiple OS's, how do I do this? Indeed, Linux's behavior is great -- but what do I do on Solaris (and anyone else who doesn't return in 3 seconds)?

I got to thinking about the problem, and decided to look at some network and hacking tools. ping was my first stop. ping works in interesting ways. I didn't realize that it had its own protocol stack (like TCP and UDP). It works like this: you open an ICMP socket (you don't bind it to a port). From that socket, you send packets to the ping recipient. The ICMP stack on the other side will reply right back to you. Here's the catch: all ICMP replies come to a single point -- so if you have multiple ping programs running simultaneously, they'll see each other's ping replies (makes sense, if you think about it). Hence, you have to put some encoding in the payloads of the ping requests (which the remote ICMP stack will echo right back at you) to know which requests are yours and which you can discard.

Hence, here's a nice way that you can tell if a machine is up --
send it an ICMP packet. If you don't get one back in a relatively short timeout (probably even user-runtime-settable), rule it as "down". No problem.

Wait -- there's a catch. You have to run as root, 'cause the ICMP stuff is protected. Crap. We don't like setuid programs.

nmap was my next stop. They've got all kinds of goodies in there. SYN scans, FIN scans, etc., etc. They note, however, that many of these are not available to non-root users. Hence, they try the connect() thing as well when a non-root user tries to scan a machine. Again, Linux bails in 6 seconds saying "machine is not up" (this must be due to Linux's short connect() timeout). Solaris, however, takes much longer -- 1 minute. But it is significantly less than 3:45 that we saw in both telnet and the raw connect() call.

Some poking around in nmap revealed the following:

It's actually pretty small; only a dozen or so .c files. For something as full featured as nmap, I would have guessed that it would have been larger. Who knew?

It seems to be pretty well coded -- I could actually read the code pretty easily. They have good voodoo; color me impressed.

The non-root ping scan tries a connect(), but does it in a non-blocking way, and repeatedly uses select() to check if the connect() has finished yet. A neat trick --
this allows them to set their own timeout (evidently somewhere around a minute; I didn't bother checking what it actually was).

So I'm going to have to try this -- code up my own non-blocking connect() and put it in my threaded booter and give it a whirl. Too tired right now, though -- this will be tomorrow's activity.
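A sketch of that non-blocking connect() with a caller-chosen timeout (assuming POSIX sockets; error handling trimmed, and this is my reading of the nmap trick rather than its actual code):

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Start a non-blocking connect(), then select() on the socket with our
// own timeout instead of sitting in the OS's (possibly minutes-long)
// built-in connect() timeout.
bool try_connect(const char* ip, int port, int timeout_sec) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;
    fcntl(fd, F_SETFL, O_NONBLOCK);       // connect() now returns at once

    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    bool up = false;
    int rc = connect(fd, (sockaddr*)&addr, sizeof(addr));
    if (rc == 0) {
        up = true;                        // connected immediately
    } else if (errno == EINPROGRESS) {    // attempt is in flight
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        timeval tv = { timeout_sec, 0 };
        if (select(fd + 1, nullptr, &wfds, nullptr, &tv) > 0) {
            int err = 0;                  // writable: did it actually succeed?
            socklen_t len = sizeof(err);
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
            up = (err == 0);
        }
        // select() returning 0 means *our* timeout expired: rule it "down".
    }
    // any other errno (e.g. immediate ECONNREFUSED): host is up but
    // port is closed, or it's simply unreachable -- either way, no connection.
    close(fd);
    return up;
}
```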

"I'd actually like to see a non-blocking MPI_WAIT." -- Shane Herbert, MPI-2 Forum.

October 11, 2000

Fuzzy dice on a motorcycle

I've fallen behind on my journal entries. Cope.

Brief synopsis on dissertation stuff: got the threaded boot "fully" working. It still sometimes hangs in a very large boot (e.g., .helios scale boots, ~148 hosts or so) near the end. I suspect that someone drops out of the mesh before the boot finishes, but I haven't had a controlled failure yet to check the logs and see what is really going on. Additionally, sometimes a given node drops out when we use arbitrarily large numbers of children (e.g., at 12 children, foster.helios.nd.edu somehow decides that it doesn't want to boot). I don't know if this is an artifact of foster's parent screwing up (e.g., running out of file descriptors), or if foster itself somehow legitimately got hosed. It's hard to tell, too, because all of these machines are actively being used when I do my tests :-).

I had to spend a good amount of time writing the jmesh algorithm into the code. I was using the Boost Graph Library, which was written by Rich Lee and Jeremy Siek here in the lab. However, as is always a danger with developing code, the APIs and concepts are continually changing, and the docs that I have (i.e., the book that Jeremy is writing on the BGL) are not consistent with what is available for public consumption at www.boost.org. Additionally, Jeremy's local CVS copy changes stuff even further. As a result, I spent a long time before I actually got it working. Arf.

However, I did come up with an iterative method to generate a list of edges in a jmesh that doesn't use any lookups at all
-- it just generates pairs of vertex numbers, and then we smush those into the constructor of a BGL graph. As such, it's considerably faster than the version that I wrote before -- the prior version would go to each vertex, check how many edges it had, determine if it needed more, etc., etc. This new version just does bookkeeping as it goes along with a small number of integer variables, and All is Well.

Now that I've finally got that working, I can add the stuff for all the nodes to make the connectivity implied in the jmesh, drop the boot tree connectivity, and then sit there waiting for commands. Not for now...

Unfortunately, however, Bill from NIST sent around an e-mail today saying that we'll be having a conference call about the SC'2000 IMPI demo on Friday morning. Doh!!! I haven't done bupkis on IMPI yet. I've got to do the following:

Finish implementing the attribute color stuff per IMPI section 2.5

Implement MPI_BARRIER on IMPI communicators

Make LAM/MPI compliant with the IMPI errata

Get the pmandel demo code working with a few instances of LAM

It would be Really Good to get this all working by the call on Friday so that forward progress can be claimed...

Sidenote: it's really been quite a while since I've worked on IMPI. I am finding out how much I forget about how it works. Doh!

I don't think that either project's main goal is formal python MPI bindings, but instead have some main "real" project that is [at least partly] in python, and they wanted to use MPI. I conversed with the sourceforge project author (at Lawrence Livermore); they're actively using it. I asked if there will ever be a formal release (all that's on sourceforge is CVS, not a real distribution). Haven't heard back yet.

Tony Hagale got my journal up and running. Woo hoo! Not entirely pain-free, though. Had to upgrade his C++ compiler, etc. He had some initial problems with quoting, as well. Not quite sure if that was a local configuration issue or a bug in my code ('cause it doesn't happen to me :-).

Started running MojoNation on squyres.com. Speaking from a distributed/crypto standpoint, that's some really cool shit!

Much work to do to get IMPI into shape. Miles to code before I sleep. Rusty will be here all day tomorrow; he's giving a talk on MPICH's daemon, and then Lummy and I are going to the LaSalle grill with him for dinner (yummy). Should be quite interesting.

February 19, 2001

Look Dave, no strings.

Ugh. I've spent the past few days fighting the return semantics of rsh and ssh.
In trying to make the tree-based booter industrial strength by putting it into LAM, I found out that not all rsh implementations are created equal. Grrrr...

It seems that some versions of rsh pretend to close stderr, but will in fact actually send things across it later. i.e., read() will return 0, but then will later return a positive number and have valid bytes in the buffer.

ARRGHHH!!

There are also some mysterious things happening that I don't fully understand yet (this only happens when you scale to above 25 nodes or so). So I finally decided that if rsh cannot be trusted, the whole framework in LAM for generic remote-launching is wrong. i.e., the whole issue is about determining if the remote program started successfully or not. How to do this in a programmatic fashion?

It currently goes like this (and rsh can be replaced with ssh or whatever):

Open two pipes

fork() a child process

Close the respective pipe ends in the parent and child processes

Tie the pipes to the stdout and stderr in the child process

The child exec()s the rsh command

The parent watches the pipes:

If something comes across stderr, our heuristic says to abort

If something comes across stdout, buffer it

When stderr and stdout close, the child is done, quit the loop

The parent calls waitpid() to wait for the child to die

If the return status of the child is not 0, abort

If we incorrectly determine that a remote program failed to start (i.e., it actually did start, but the local node thinks it didn't), the remote program gets stranded, and is left running forever because no one will ever contact it again. Among other reasons why this is bad, this is anti-social behavior.

Plus, the code is complicated as well because of all the state it has to maintain while checking multiple data sources in a non-blocking way. Ugh. And I didn't even mention how we have to check and see if the other side is running a Bourne or Korn shell...

The long and the short is that the remote agent (rsh, ssh, whatever) cannot be trusted to give reliable information. So the only thing to do is to disregard the information that it gives and determine if the remote program started correctly by a different means. One way to do that is to have the remote process call the spawning process back with a TCP socket.

If the remote process doesn't call back within a timeout period, the spawner can reason that it failed and give up on it. If the remote process starts up properly and is unable to contact the spawner (perhaps it took a long time to start, and the spawner has timed out already), it will just abort. This prevents orphaned remote processes.

Specifically, I'm looking at something like:

Parent creates listening socket for the callback

Parent launches a thread to wait for the callback on that socket

Parent makes three pipes (for stdin|out|err)

Parent fork()s a child

Parent closes appropriate ends of the pipes

Parent launches two threads to monitor the pipes

Parent launches a thread to block on waitpid()

Child closes appropriate ends of the pipes, ties the other ends to stdout|err

Child exec()'s the remote agent

Parent blocks on a queue

When either of the pipe threads wakes up on a read, it buffers the data, puts it in an event, and queues it up for the parent

Closing either of the pipes is similar -- an event is queued up for the parent followed by the thread committing suicide

When waitpid() returns, the return status is queued up in an event for the parent, and the thread commits suicide

When the listening thread succeeds on accept(), it begins the authentication/connection protocol. Upon success, it queues up an event for the parent (including the open socket file descriptor) and commits suicide.

When all the threads die, it means that the remote process has started up, the remote process has authenticated and indicated that it wants to run, a socket is still open to the remote process, the remote agent is now dead, and all threads/processes have been reaped, so the parent can now continue.
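The "parent blocks on a queue" piece is a classic producer/consumer structure. A minimal sketch in modern C++ (the Event kinds and fields are illustrative, not the actual booter's types):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// The pipe-monitor, waitpid(), and accept() threads each push a typed
// event; the parent blocks in pop() until something happens.
struct Event {
    enum Kind { StdoutData, StderrData, PipeClosed, ChildExited, Callback } kind;
    int value;   // exit status, socket fd, or byte count, as appropriate
};

class EventQueue {
    std::queue<Event> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(const Event& e) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(e); }
        cv_.notify_one();                // wake the blocked parent
    }
    Event pop() {                        // parent blocks here
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Event e = q_.front();
        q_.pop();
        return e;
    }
};
```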

In the previous scheme, the remote agent would launch the remote program. The remote program would immediately close stdin|out|err and then fork a child into the background as a user-level daemon, and then quit. This would allow the remote agent to finish normally (hah!). The child process would then continue on to do whatever it was launched to do.

In the new scheme, there is no need to have the remote agent finish until the callback to the spawner has completed and there is no more gain to having the remote agent process around anymore. i.e., in the previous (linear) scheme, it was necessary for the remote agent to quit before the next step would proceed (wait for a callback). In this scheme, they are independent events -- the remote agent quitting has little bearing on the callback since those are in different threads. Indeed, it may be advantageous to have the remote agent stick around until the callback occurs successfully to give one more way to abort the remote process if something goes wrong. That is, if something goes wrong and the callback gets mucked up, send a signal or some kind of message down the stdin pipe to the remote agent, which will get passed to the remote process that will cause the remote parent and child to abort.

Additionally, just like giving each remote process a thread to manage it, giving a thread to each of the stdout and stderr pipes eliminates the combined state machine and uses blocking reads. This makes the algorithm for monitoring the pipes much simpler. Hence, we can monitor the pipes, waitpid(), and the callback separately, and therefore greatly simplify the code (why didn't I think of this earlier?).

Jeff's law of non-blocking:

writing blocking algorithms is much simpler than writing non-blocking algorithms.

Jeff's law of blocking:

writing concurrent blocking algorithms introduces its own problems, but generally only in terms of infrastructure, and they are typically problems that are already solved.

What's even cooler is that the remote process can start up, call back the spawner, and give an "I'm ready to go" message, or a "things suck over here; I can't run so I'm going to abort" message. i.e., the remote process can decide whether it's going to run or not (e.g., check to see if the load is not too high) and send back a yay or nay to the spawner. Even cooler than that -- an integrated startup protocol allows for authentication instead of security through obscurity (security through obscurity isn't, for those of you who care!).

I'm currently in the middle of re-writing all this code (it takes time to setup the infrastructure and whatnot). The result should

February 21, 2001

I think Joe saw us in the movie theater last night

I've gotten an unexpected result from my thread booter.

I've been booting across the ND helios cluster of 161+ sparcs (some of which should fail, BTW -- at least 2-3 are down at any given time, and about 5-10 are running a different version of Solaris than the rest, such that there are shared library linker problems trying to run on them).

Even with about 10-20 nodes expected to fail, about 1/3 of them fail to boot properly on a regular basis. This is many more than expected.

The main reason is that the parent that is trying to boot them times out. i.e., if the child does not call back on the socket within N seconds, the parent decides that the remote boot must have failed (even if the boot does succeed at some point later, and the child does try to call back). The parent rules that that child is dead and moves on to the next.

The weird thing is that this was happening a large percentage of the time; much more than I expected. Worse than that, it was inconsistent -- I would get different results every time I did a helios-wide boot (even if they were only separated by only a few seconds). This is clearly not good enough.

One solution is to increase the timeout time (I was using timeout values of 5, 10, and 30 seconds -- the problem occurs with all the values). Increasing the timeout value to 120 seconds seems to make it work most of the time; most bootable helios machines actually boot properly. However, this significantly adds to the overall boot time because we now have to wait 2 minutes for each individual failure before moving on to the next child, which is undesirable.

So I think I need to change my booting algorithm yet again (this is the point of research, isn't it?).

Still keep the basic tree-based structure.

To overcome the problem with slow children, we need a system where the work of one child can be given to another, but we need to keep this in a tree-based structure (vs. a monolithic server) so that we don't run out of resources. That is, some kind of first-come, first-served basis, since we know that if a child requests work, it is ready to go. Faster children will naturally ask for more work.

Right now, each parent node receives a list of all the children that it is responsible for booting. It divides this list up into N sub-lists (where N is the number of immediate children that it will boot), spawns a thread for each, and gives each thread one of the sub lists. This needs to change.

Instead, spawn off N threads and give them each one child to boot. The parent thread keeps the rest of the list of nodes that it is ultimately responsible for booting.

If a child fails to boot by some kind of immediate failure (e.g., a ping to that child fails), the parent can kill that thread and launch a new thread and give it the next node from its master list.

When [if] a child actually boots successfully (which is defined by the grandchild opening a socket back to the child and saying, "Ok, I'm ready -- gimme some work"), it asks the parent for a subset of nodes from the parent's pool. The parent will give a list roughly of size total_size/N so that each descendant's subtree will be about the same size, which the child then passes on to the grandchild.

Aside from the parent keeping the list of children, this is more or less how it happens now.

Here's the new part: when a parent finishes (actually, when any node finishes -- whether it was originally a parent or a leaf), it sends an "all done -- got any more work?" message to its parent.

If the parent's parent has any more work, i.e., it has some nodes left in its pool because one or more of its children were slow to boot, it will give a subset list (of about the same size as it has given out to every other node who queried) to the querying child.

If the parent's parent doesn't have any more work, it passes the request to its parent, where the same procedure is repeated. If any work is eventually found, it is sent back down the tree to the original child who queried.

As such, with this scheme, it is possible for a grandchild (or some node even further down) to steal work from a slow child. This scheme can allow for long timeouts which may be necessary (particularly in an active workstation environment), but still allow for speed in the overall boot -- we just eliminate the blocking on slow children by potentially taking away their work from them.

A side-effect of this is that the overall tree may become lop-sided. But that doesn't really matter much, because the idea here is to parallelize in time, not in space. So if we have a slightly taller-than-optimal meta-tree at the end, it doesn't matter -- the meta tree is only for booting and will be discarded anyway.
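The first-come, first-served pool might be sketched like this (illustrative names, and a modern-C++ sketch rather than the actual booter code):

```cpp
#include <algorithm>
#include <mutex>
#include <string>
#include <vector>

// The parent keeps its whole list of yet-unbooted hosts and hands a chunk
// of roughly total_size / N hosts to whichever child asks first.  Fast
// children naturally come back for more; slow children simply claim less.
class WorkPool {
    std::vector<std::string> hosts_;
    size_t chunk_;
    std::mutex m_;
public:
    WorkPool(std::vector<std::string> hosts, size_t nchildren)
        : hosts_(std::move(hosts)),
          chunk_(hosts_.size() / nchildren + 1) {}

    // Called when a child says "all done -- got any more work?".  An empty
    // result means the pool is drained and the request should be forwarded
    // up the tree to this node's own parent.
    std::vector<std::string> request_work() {
        std::lock_guard<std::mutex> lk(m_);
        size_t n = std::min(chunk_, hosts_.size());
        std::vector<std::string> sub(hosts_.end() - n, hosts_.end());
        hosts_.resize(hosts_.size() - n);
        return sub;
    }
};
```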

It's good to be a gangsta

This is particularly annoying where you close file descriptors 0, 1, and 2 and re-open them to something else, because cin/cout/cerr will still use those file descriptors, and will read/write to the new things that you opened!

I finally figured out that that was what was happening on my tree-based booters -- I had debugging cerr's that ended up writing down sockets, which caused havoc on the remote side, because it got unformatted and unexpected messages. Doh!!!

In hindsight, this completely makes sense (and may even be by design; I don't have an iostream book handy). Consider that cin/cout/cerr are not tied to the OS -- they have no way of knowing when file descriptors 0, 1, and 2 have been closed and reopened to something else. For example, cout's operator<<(...) assumedly eventually boils down to:

write(1, ...);

In which case, cout has had no indication that file descriptor 1 has been closed and re-opened into something else.

Just a point of wisdom for readers out there... it caused me three days of grief.
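A short demonstration of the gotcha (POSIX; writes a scratch file under /tmp, and the file name is just my invention):

```cpp
#include <fcntl.h>
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>

// cout writes to whatever file descriptor 1 happens to be *right now*.
// Re-open fd 1 onto a file and cout's output lands in the file; the
// iostream layer never notices the switch.
std::string cout_goes_to_fd1() {
    int saved = dup(1);                  // remember the real stdout
    int fd = open("/tmp/cout_trap.txt", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    dup2(fd, 1);                         // fd 1 now points at the file
    close(fd);

    std::cout << "where am I going?" << std::flush;   // ...into the file!

    dup2(saved, 1);                      // put the real stdout back
    close(saved);

    std::ifstream in("/tmp/cout_trap.txt");
    std::string line;
    std::getline(in, line);              // read back what cout "printed"
    return line;
}
```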

April 21, 2001

Gazizza, Bill

The exact topic of my dissertation has changed several times.

Here's what I presented to my committee last week, with their comments applied, as well as with information from my coding it up (particularly with their changes). Those who aren't geek-minded can probably ignore the rest of this message.

Fair warning: this is a pretty long journal entry!

Background and Algorithmic Overview

The idea is to have a "fully generalized manager-worker framework". However, the end result is that it's not quite the manager-worker model -- it's more like a "fully generalized, threaded distributed work farm". I started with a model for any [serial] kind of computation that looked something like this (it won't render right in pine -- deal -- go look at the web version):

If you throw some queues in there, you can duplicate and therefore parallelize the Calculate step (keep the Input and Output steps serial, because a) they'll probably take very little time, and b) any part that can be parallelized can be thrown into the Calculate step):

So what I'm doing is two things: extending this model to include threads (still relatively unexplored areas with MPI, particularly since the 2 major freeware MPI implementations have little multithreading support) and to make a distributed scatter/gather scheme.

The goal here is to present a framework (i.e., a library) to the user such that they only have to supply the Input, Calculate, and Output steps. Yes, they do have to be aware of the parallelism, but only so much so that they can make their problem decomposable. The framework takes care of all the bookkeeping. Hence, the user essentially writes three functions (actually, 3 classes, each with a virtual run()-like function, and functions to pack/unpack their input and output data. As briefly mentioned in previous journal entries, I ended up using C++ templates heavily so that the type safety would all work out).
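A serial, stripped-down sketch of what such a framework's skeleton could look like (class and function names are my invention, not the actual library's API; the real thing runs the Calculate instances in threads and uses pack/unpack functions):

```cpp
#include <queue>
#include <vector>

// The user supplies Input, Calculate, and Output pieces with run()-like
// functions; the framework owns the queues and the bookkeeping.
template <typename In, typename Out>
class Framework {
public:
    struct Input     { virtual bool next(In& item) = 0;           virtual ~Input() {} };
    struct Calculate { virtual Out run(const In& item) = 0;       virtual ~Calculate() {} };
    struct Output    { virtual void consume(const Out& item) = 0; virtual ~Output() {} };

    void execute(Input& in, Calculate& calc, Output& out) {
        std::queue<In> input_q;          // Input -> Calculate queue
        In item;
        while (in.next(item))            // serial Input step
            input_q.push(item);
        while (!input_q.empty()) {       // Calculate instances would drain
            out.consume(calc.run(input_q.front()));  // this queue in parallel
            input_q.pop();
        }
    }
};

// Tiny example: square the numbers 1..n.
struct Nums : Framework<int, int>::Input {
    int i = 0, n;
    explicit Nums(int count) : n(count) {}
    bool next(int& item) override { if (i >= n) return false; item = ++i; return true; }
};
struct Square : Framework<int, int>::Calculate {
    int run(const int& x) override { return x * x; }
};
struct Collect : Framework<int, int>::Output {
    std::vector<int> results;
    void consume(const int& o) override { results.push_back(o); }
};
```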

The target audience is people who want parallelism but don't really care how it works. That would be most engineers and scientists -- even some computer scientists! Most of these kinds of users just want to get their results, and get them faster -- they don't care how it works. After all, that's our job (as computer scientists), right?

Back to the description...

From the above picture, if we're only running on one machine (say, a 4-way SMP), the Calculate boxes (instances) will be individual threads. The Input and Output instances will be threads, too. By default, there will be one Calculate thread per CPU -- the Input and Output threads will be "extra" and cause some thrashage of CPU scheduling, but not very much -- particularly when the Calculate step is large enough to run for a while.

Note that the two queues do not have threads running in them --
those queues are just data structures with some intelligent accessor functions. The Input, Calculate, and Output threads access the queues and become a thread active in the queue. But there are no separate threads running the queues themselves.

Using threads is nice because it avoids the whole issue of extraneous memory copying and allows message passing latency hiding (even with single-threaded MPI implementation). If we used the same model with pure MPI instead of threads -- i.e., where Input, each of the Calculate instances, and the Output were all separate MPI ranks on the same node, we'd be doing sends and receives between each of the instances (the queues would possibly be located in the Input and Output ranks), which would invoke at least one memory copy (and probably more). If the input data and output data are large, this could add up to be a non-trivial portion of the wall clock execution time. Using threads within a single process, pointers to input/output data can just be passed between the Input, Calculate, and Output blocks. i.e., pass by reference instead of by value. Therefore, it makes no difference how large (or small) the input and output data is.

Extending this model to cover multiple nodes, let's throw in a definition first. The node on which the Input and Output are run is called the "Server". While it would certainly be possible to run the Input and Output phases on different nodes, this model will assume that they are on the same node, just for simplicity. It is [probably] not difficult to separate them, but this work doesn't focus on that issue. Hence, there's only one server in this model, regardless of however many nodes are involved in the computation.

To extend this model to include multiple nodes, we add a special kind of Calculate instance to the diagram from above -- a "Calculate Relay":

This RelayCalc instance has the MPI smarts to send input data to, and receive output data from, remote nodes. Notice that it just dequeues input data and enqueues output data just like the other Calculate instances. Hence, the Input and Output instances do not need to know anything special about remote access.

Also note that there will be a thread running the RelayCalc instance. One could conceivably model the relays in the queues, but this would entail having 2 relays, and would cause some issues with non-thread safe MPI implementations (although these issues arise elsewhere, anyway), and it would destroy the idea of not having threads running in the queues. While threads are nice and lightweight, we don't need to have extraneous threads running where we don't need them. Not only are they not free (in terms of resources), they do add complexity (e.g., what would threads running in the queues do?).

The RelayCalc fits in the same category as Input and Output --
it's an "extra" thread, but it is not expected to take many CPU cycles (particularly when the Calculate phase is non-trivial).

Note that there is only one RelayCalc instance, regardless of how many nodes it is relaying to. This greatly simplifies the relaying with a single threaded MPI implementation -- indeed, to have N instances of RelayCalc to relay to N remote nodes would mean that a global lock would have to be used to only allow one RelayCalc instance in MPI at any time. This would mean that all the RelayCalc instances would have to poll with functions such as MPI_TEST. And this would involve continually locking, testing, and unlocking between all the RelayCalc instances, which would certainly keep one or more CPUs busy doing message passing rather than working in the Calculate instances, which is not desirable.

Hence, there's only one RelayCalc instance that can do blocking MPI_WAITANY calls to check for messages from any of the nodes that it is expecting output data from (and checking for completion of sent messages -- see below). This will probably serialize message passing in the server, but that is to be expected with a single-threaded MPI implementation anyway.

Indeed, even if the MPI implementation were multi-threaded, there will frequently be fewer network interfaces than remote nodes (typically only one), so the network messages will likely be at least somewhat serialized anyway. The best that a multi-threaded MPI implementation could do would be to pipeline messages to different destinations across the available NICs, but that's within the MPI implementation, and not within the user's (i.e., the framework's) control. Indeed, a quality single-threaded MPI implementation can pipeline messages anyway (if non-blocking sends are used). So there's actually little gain (and potentially a lot of CPU cycles to lose) in having multiple RelayCalc instances when using a single-threaded MPI implementation -- the same end result of having multiple RelayCalc instances with a true multi-threaded MPI implementation can be achieved with a carefully coded single RelayCalc instance using non-blocking sends and receives with a single-threaded MPI implementation.

(There are a lot of finer details that I didn't cover in the previous two paragraphs; those are currently left as an exercise to the reader. :-) Read my dissertation for the full scoop.)

The Input and Output instances have been replaced by RelayIn and RelayOut instances, respectively.

As far as the Calculate instances are concerned, the model is the same -- it dequeues input, processes, and enqueues output.

The RelayIn and RelayOut instances are the MPI entry and exit points -- input data is relayed to the RelayIn instance from the RelayCalc instance on the Server, and output data is relayed back to the RelayCalc instance by RelayOut. This is why the user has to supply not only the Input, Calculate, and Output instances, but also methods to pack and unpack the input and output data -- the framework will call them automatically to send and receive the data between nodes.

But again, in terms of the Calculate phase -- nothing is different. It operates exactly as it does on the server node. The framework has just added some magic that moves the input and output data around transparently.

There are now two threads vying for control of MPI. Since we only have a single-threaded MPI implementation, we cannot have both of them making MPI calls simultaneously. The following algorithm allows both threads to "share" access to MPI in a fair manner.

In the beginning of the run, the RelayIn instance has control of MPI because we expect to receive some number of messages to seed the input queue. After those messages have been received, control of MPI is given to the RelayOut. The RelayOut will block while dequeuing output data from the output queue (since the Calculate threads started acting on the data as soon as it was put in the input queue), and then return the output data to the Server. Control is then given back to the RelayIn in order to receive more input data.

That is, the message passing happens at specific times:

Messages will only be received at the beginning of the run, or after messages have been sent back to the Server

Messages will only be sent after messages have been received and the Calculate threads have converted them to output data

Specifically, incoming and outgoing messages will occur at different (and easily categorizable) points in time. Indeed, outgoing messages will [eventually] trigger new incoming messages, and vice versa. So the simple "handoff" model of switching control of MPI between the RelayIn and RelayOut instances works nicely.
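A minimal sketch of that handoff (names are illustrative; the real implementation is in flux) is a token that records which relay currently "owns" the single-threaded MPI library:

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative sketch: RelayIn and RelayOut take turns controlling MPI.
// Only the thread that currently holds the token may make MPI calls.
class MpiHandoff {
public:
    enum Owner { RELAY_IN, RELAY_OUT };

    explicit MpiHandoff(Owner first) : owner_(first) {}

    // Block until this relay owns the (conceptual) MPI library.
    void acquire(Owner who) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [&] { return owner_ == who; });
    }

    // Give control of MPI to the other relay thread.
    void handoff() {
        std::lock_guard<std::mutex> lock(mutex_);
        owner_ = (owner_ == RELAY_IN) ? RELAY_OUT : RELAY_IN;
        cond_.notify_all();
    }

    Owner owner() const { return owner_; }

private:
    Owner owner_;
    std::mutex mutex_;
    std::condition_variable cond_;
};
```

RelayIn starts as the owner (to seed the input queue), hands off after its receives complete, and RelayOut hands control back after returning output data to the Server.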

A big performance-determining factor in MPI codes can be latency hiding, particularly in high-latency networks such as 10/100Mbps ethernet. An advantage of this model is that even with a single threaded MPI, progress can be made on message passing calls while actual calculation work is being done in other threads. This pipelined model can hide most of the latency caused by message passing.

That is, the RelayIn thread can request more input data before the Calculate threads will require it. Hence, when the Calculate threads finish one set of data, the next set is already available -- they don't have to wait for new data to arrive.

A possible method to do this is to initially send twice the amount of expected work to each node. That is, if there are N Calculate threads on a given node, send 2N input data packets. The Calculate threads will dequeue the first N input data packets, and eventually enqueue them in the output. The next N input data packets will immediately be available for the Calculate threads to dequeue and start working on.

Meanwhile, the RelayOut thread will return the output data and the RelayIn thread will [effectively] request N more input data packets. When the N input data packets arrive, they will be queued in the input queue for eventual dequeuing by the Calculate threads. This occurs while the Calculate threads are working -- the message passing latency is hidden from them.

This scheme works as long as the Calculate phase takes longer than the time necessary to send output data back to the Server and receive N new input data packets. If the Calculate phase is short, the RelayIn can initially request more than 2N input data packets, and/or be sure to use non-blocking communication to request new input data blocks so that requests back to the Server can be pipelined.
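The seeding arithmetic above can be sketched as a small function (the names and the pipeline-depth heuristic are my illustrative assumptions, not the framework's actual API):

```cpp
#include <algorithm>

// How many input data packets to seed a node with: at least twice the
// number of Calculate threads (2N, as described above), deeper if the
// Calculate phase is short relative to the round trip to the parent.
int initial_seed(int n_calc_threads, double calc_secs, double round_trip_secs) {
    int factor = 2;
    // If one Calculate phase is shorter than the round trip back to the
    // parent, deepen the pipeline so the input queue cannot drain.
    if (calc_secs > 0.0 && calc_secs < round_trip_secs) {
        factor = 1 + static_cast<int>(round_trip_secs / calc_secs) + 1;
    }
    return std::max(factor, 2) * n_calc_threads;
}
```

For a "normal" node (Calculate longer than the round trip) this is just 2N; for a fast Calculate phase it requests proportionally more so the refills stay ahead of the workers.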

To improve the scalability of the system by removing some of the bottlenecks in the scattering/gathering, non-server nodes can also have a RelayCalc instance:

This RelayCalc instance will relay input data to additional remote nodes, and gather the output data from them, just like the RelayCalc on the Server node.

The implication of having a RelayCalc step is that we can have arbitrary trees of input and output. That is, the Server is not the only node that can scatter input out to, and gather output data from, remote nodes -- arbitrary trees can be created to mimic network topology (for example). Consider the following tree:

the_big_cheese is the Server. It has two children, child_a0 and child_a1. child_a0 has only one child, but child_a1 has two children. The numbers in parentheses represent the MPI rank numbers (with respect to MPI_COMM_WORLD). Note that there is no restriction to a maximum of two children -- this is just an example. Each node also has one or more Calculate instances. So the end result can be a large, distributed farm of compute nodes.

This refines some of the previous discussion: the various Relay instances (In, Calc, Out) will actually not necessarily talk to the Server -- they'll talk to their parent, child, and parent, respectively. In some cases, the parent will be the Server. In other cases, the parent will be just another relay.

The RelayIn will now need to request enough input data packets to keep not only its local Calculate threads busy, but also all of its children. This is accomplished by doing a tree reduction during the startup of the framework that counts the total number of Calculate instances in each subtree. This will allow a RelayIn to know how many Calculate instances it needs to service. The RelayIn can then use an appropriate formula/algorithm to keep its input buffer full (as described above) before any of the local Calculate instances or the RelayCalc instance needs data.
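The reduction itself is just a leaf-to-root sum; here's a sketch of the idea (the struct is a stand-in for the example, not the framework's real types -- in the real code the sum travels via MPI messages rather than recursion on one node):

```cpp
#include <vector>

// Illustrative tree node: its own Calculate thread count plus children.
struct Node {
    int local_calc_threads;
    std::vector<Node> children;
};

// Leaf-to-root reduction: a node's total is its own Calculate count plus
// the totals reported up by each child subtree. This is the number of
// Calculate instances the node's RelayIn must keep fed.
int subtree_calc_count(const Node& n) {
    int total = n.local_calc_threads;
    for (const Node& child : n.children)
        total += subtree_calc_count(child);
    return total;
}
```

With the example tree from above (five non-server nodes plus the Server, say two Calculate threads each), the Server's RelayCalc learns it is feeding 12 Calculate instances in total.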

The astute reader will realize that there are now three threads vying for control of MPI. Therefore, the simple handoff protocol discussed above will not work (although the handoff protocol is still applicable for "leaf" nodes, where there is no RelayCalc instance). To make matters worse, both RelayIn and RelayCalc will potentially need to block in receive calls, waiting for messages to asynchronously arrive. RelayIn will only receive messages at discrete times (as discussed above), but the frequency at which RelayCalc can receive messages is determined by the node's children, and therefore could effectively be at any time. That is, since RelayCalc is driven by the actions of the node's children, while RelayIn/RelayOut are driven by the local Calculate threads (and the parent), the times at which RelayCalc will need to receive messages are not always related to when RelayIn/RelayOut will need to communicate.

Specifically, there will be times when RelayCalc needs to receive a message that is independent of what RelayIn and RelayOut are doing.

It would be easiest if RelayCalc could just block in its MPI_WAITANY while waiting on a return message from any of its children. But this would disallow any new outgoing messages from RelayOut (and therefore any new incoming messages from RelayIn). The implication is that nodes will have to wait for a message from any of their children before they can send the output data from their local Calculate threads back to their parent, and therefore before they can request any new input data.

This can be disastrous if a node's children are slower than it is. In this case, a fast node could potentially drain its entire input queue and be blocked waiting for results from any of its children before being able to ask for more input data from its parent. Even worse, this effect can daisy-chain such that slow nodes could cause the same problem in multiple [faster] parent nodes; the fast parents could all get trapped waiting for results from the one slow child.

These questions are addressed further in the following section.

This tree design will help eliminate the bottleneck of having a single Server that has to communicate with N nodes (especially as N grows large) -- the problems of serializing the message passing could easily dwarf the CPU cycles given to the Calculate instances. That is, the communication could become more costly than the Calculation.

But just having a tree structure for scattering/gathering is not sufficient. Indeed, if a leaf node (e.g., child_c0) sends an output block back to its parent, and its parent immediately sends it to its parent, etc., all the way back up to the Server, this would effectively be no different than if all N nodes were connected to the Server directly -- the Server will get a message for every single output data block. This model would then only add hops to input and output data rather than increase scalability.

Instead, the various Relay instances will gather multiple messages into a single message (or a single group of messages) before sending them up the tree. For example, a RelayOut instance can wait for output data from each of its Calculate instances before sending them back to its parent. The RelayOut instance will send all of its N messages at once in a "burst" such that its parent RelayCalc instance will be able to process them all in a short period of time and then relinquish its CPU back to a Calculate instance. I'll refer to this group of messages as a "mega message", below.

Likewise, there will need to be some flow control on the messages from RelayCalc instances. It is desirable to group together multiple "mega messages" into a single "mega mega message" in order to send larger and larger messages as output data propagates up the tree, and therefore decrease the number and frequency of messages at upper levels in the tree. Hence, the mega messages that are received by a RelayCalc must be grouped together, possibly in conjunction with the output data from the local Calculate instances, before sending to the node's parent.

But how to do this? Does the RelayCalc just wait for mega messages from all of its children before enqueuing them all to the output? It would seem simpler to just enqueue the mega messages as they come in, and when the RelayOut sees "enough" messages, it can pass a mega message of its own (potentially larger than any of the individual mega messages that it received) to its parent.

One definition for "enough" messages could be N mega messages (where N is the number of children for this node), and/or M output data enqueues (where M is the number of Calculate instances on this node). This may also be a problem-dependent value -- for example, if the Calculate process is short, "enough" messages may need to be a relatively small value.
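One possible shape for that "enough" policy, as a hedged sketch (the struct and thresholds are illustrative assumptions; as noted, the right values may be problem-dependent):

```cpp
// Illustrative flush policy for a RelayOut: buffer output until we have
// either N mega messages from children or M local output enqueues, then
// send one burst up to the parent.
struct FlushPolicy {
    int n_children;      // N: mega messages expected from children
    int m_local_calcs;   // M: output enqueues from local Calculate threads
    int mega_msgs;       // mega messages buffered so far
    int local_enqueues;  // local output data buffered so far

    void on_mega_message()  { ++mega_msgs; }
    void on_local_enqueue() { ++local_enqueues; }

    // "Enough" buffered output to send one mega message to the parent?
    bool should_flush() const {
        return mega_msgs >= n_children || local_enqueues >= m_local_calcs;
    }

    // Called after the burst has been sent up the tree.
    void reset() { mega_msgs = 0; local_enqueues = 0; }
};
```

Because each level waits for roughly a full complement before forwarding, messages grow larger and less frequent toward the root -- exactly the aggregation described above.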

This scheme will probably work well in a homogeneous world. But what if the node and its children are heterogeneous? What if some nodes are more powerful/faster than others, or if the network connections between some of the children are heterogeneous? For example, what if some of the children nodes are connected via IMPI, where network communication to them is almost guaranteed to be slower than network communication to local MPI ranks?

This heterogeneity leads right back to the problem discussed in the previous section -- slow Calculate instances can cause parent nodes to block with no more work to do, and not be able to obtain any more work because RelayCalc has not returned from waiting for results from a child.

Another alternative is to use the "separate MPI thread" approach (where all threads needing access to MPI use simple event queues to a separate MPI thread), and have the separate MPI thread use all non-blocking communication. But instead of using a blocking MPI_WAIT approach, use the non-blocking MPI_TEST polling approach. The problem with this, as discussed previously, is that this could incur an undesirably significant number of CPU cycles, and therefore detract from the main computations in the Calculate instances. If polling only happened infrequently, perhaps using a backoff method (finitely bounded, of course), this might be acceptable.

Note that there will be one "incoming event queue" for the MPI thread where threads can place new events for the MPI thread to handle. But there will be multiple "return event queues" where the MPI thread places the result of the incoming events -- one for each thread that enqueues incoming events.

The various threads that need to access MPI place events on the MPI thread's shared incoming event queue, and then block on (or poll) their respective return event queues to know when the event has finished. An event is a set of data necessary for an MPI send or receive.

The general idea is that the MPI thread will take events from its incoming queue and start the communication in a non-blocking manner. It will poll MPI periodically (the exact frequency of polling is discussed below) with MPI_TEST, and also check for new events on its incoming queue. As MPI indicates that events have finished, the MPI thread will place events on the relevant return event queue. The MPI thread never blocks in an MPI call; it must maintain a polling cycle of checking both the incoming event queue and MPI for event completions.

A special case, however, is that the MPI thread can block on the incoming event queue if there are no MPI events pending. This allows the MPI thread to "go to sleep" when there are no messages pending for MPI (although this will rarely happen).
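Here's a rough sketch of that queue plumbing (all names are made up for illustration -- the real event manager is still in flux, and here "completing" an event is just simulated rather than backed by MPI_TEST):

```cpp
#include <deque>
#include <map>
#include <mutex>

// Stand-in for the data needed to describe one MPI send or receive.
struct Event {
    int requester;  // id of the thread that posted the event
    int tag;        // stand-in for the message description
};

// One shared incoming queue for the MPI thread; one return queue per
// requesting thread, so each thread only watches its own completions.
class EventQueues {
public:
    // Called by worker threads (RelayIn, RelayOut, RelayCalc).
    void post(const Event& e) {
        std::lock_guard<std::mutex> lock(mutex_);
        incoming_.push_back(e);
    }

    // Called by the MPI thread: pretend the oldest event completed and
    // route the result back to the requester's private return queue.
    bool complete_one() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (incoming_.empty()) return false;
        Event e = incoming_.front();
        incoming_.pop_front();
        returns_[e.requester].push_back(e);
        return true;
    }

    // How many completed events await a given thread.
    size_t pending_returns(int requester) {
        std::lock_guard<std::mutex> lock(mutex_);
        return returns_[requester].size();
    }

private:
    std::mutex mutex_;
    std::deque<Event> incoming_;
    std::map<int, std::deque<Event>> returns_;
};
```

The per-requester return queues are the key: a thread blocks (or polls) only on its own queue, so completions for one relay never wake the others.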

The polling frequency is critical. It cannot be so high that it takes many cycles away from Calculate threads, nor can it be so low that input queues become drained or threads become otherwise blocked unduly while waiting for MPI events to complete. These conflicting goals seem to indicate that an adaptive polling frequency is necessary.

That is, it would seem that the polling frequency should be high when events are being completed, and should be low when no events are occurring. This would preserve the "bursty" model described above; when an event occurs, it is likely that more events will follow in rapid succession. When nothing is happening, it is likely that nothing will continue to happen for a [potentially long] period of time.

A backoff method fits these criteria: the sleep time between polling is initially small (perhaps even zero). Several loops are made with this small/zero value (probably with a thread yield call in each loop iteration, to allow for other threads to wake up and generate/consume MPI events). If nothing "interesting" occurs in this time, gradually increase the sleep time value. If something "interesting" does occur in this time, set the sleep time value back to the small/zero value to allow more rapid polling.

This allows the polling to occur rapidly when messages arrive or need to be sent, and slowly when no message passing is occurring (e.g., when the Calculate threads are running at full speed).

An obvious optimization to the polling model is to allow the MPI thread to loop until there are no new actions before going to sleep. Hence, if an event appears in the incoming queue, or MPI_TEST indicates that some communication has finished, the sleep time is reduced and the MPI thread polls again without sleeping.
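The backoff itself is simple enough to sketch directly (the bounds and doubling factor are illustrative choices, not anything I've settled on):

```cpp
#include <algorithm>

// Illustrative adaptive backoff for the MPI thread's polling loop:
// poll rapidly while events are flowing, sleep longer while idle,
// and stay finitely bounded.
class PollBackoff {
public:
    PollBackoff(int min_us, int max_us)
        : min_us_(min_us), max_us_(max_us), cur_us_(min_us) {}

    // Something "interesting" happened: reset to rapid polling.
    void on_activity() { cur_us_ = min_us_; }

    // Nothing happened this cycle: back off (doubling), up to the bound.
    void on_idle() {
        cur_us_ = std::min(max_us_, std::max(1, cur_us_ * 2));
    }

    // How long to sleep before the next poll.
    int sleep_us() const { return cur_us_; }

private:
    int min_us_, max_us_, cur_us_;
};
```

The MPI thread would call on_activity() whenever MPI_TEST succeeds or a new event arrives, on_idle() otherwise, and sleep for sleep_us() (yielding when it's zero) between polling cycles.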

This stuff hasn't been implemented yet; these are questions that I do not yet have definitive answers to. It is likely that my first implementation will be modeled on the backoff polling described above. We'll see how that works out.

There's a whole bit in here that I haven't really described about a primitive level of fault tolerance -- if a node disappears, all of the work that it (and all of its children) was doing will be lost, and reallocated to other workers. That is, as long as one Calculate thread remains, the entire computation will [eventually] finish, but likely at a slower rate.

The gist of this is to set the error handler MPI_ERRORS_RETURN such that MPI will not abort if an error occurs (such as a node disappearing). There's some extra bookkeeping code in the RelayCalc that keeps track of what node had what work assigned to it, and will reassign that work to another node (including its local Calculate threads) if (a) the local Calculate threads become idle, and/or (b) no remote Calculate threads remain alive.

Just to clarify: I am not trying to be fault tolerant for programmer error. If you have broken code (e.g., a seg fault), this framework will not recover from that error. This framework will also not attempt to bring nodes back that previously failed; once nodes die, they -- and all their children -- are dead for the rest of the computation. Future work may make this better, but not this version. :-)

Probably a good way to describe this fault tolerant work is: if you have a program that will run correctly when all nodes are up, your program will run correctly as long as the Input and Output threads stay alive, and at least one Calculate thread stays alive.

Implementation Details

Woof. This journal entry is already much longer than I thought it would be (we're approaching 600 lines here!), so I'll be a little sparse on the implementation details, especially since it's all in a state of flux anyway.

The implementation is currently dependent upon LAM/MPI and will not run under MPICH. This is because I make use of the MPI_COMM_SPAWN function call to spawn off all the non-server ranks. The code could be adapted to allow for using mpirun to start the entire job, but that's a "feature" that I don't intend to add yet.

Unfortunately, the only way that I could think to specify the tree used for the distribution was to specify an additional configuration file. Attempting to specify the tree on the command line resulted in a clunky interface, and arbitrarily long command lines. I used inilib to parse the file; it's a fairly simplistic, yet flexible format.

The server starts up, parses the configuration file, determines how many total non-server nodes it needs, and spawns them. The non-server nodes are spawned with an additional "-child" command line argument so that they know that they do not need to read a configuration file. Instead, they receive their configuration information from their parent in the tree.

Sidenote: what's interesting to me about using MPI_COMM_SPAWN to start all the non-server nodes is that startup protocols have to be used. I'm used to writing startup protocols to coordinate between multiple processes for socket-level (and other IPC) programs, but I've never used MPI itself for startup meta information. It just seems weird. :-)

The server sends a sub-tree of the entire tree to each of its direct children. The sub-tree that it sends is the tree with that child as the root; hence, every child of the server learns about all of its descendants, but does not learn about its siblings or their children. The process repeats -- each child then sends a sub-tree to each of its direct children, until there are no more sub-trees to send.
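The two pieces of startup bookkeeping -- counting how many processes to spawn, and handing each child its own sub-tree -- can be sketched like this (types and names are illustrative, not the real code; the real handout happens over MPI messages):

```cpp
#include <string>
#include <vector>

// Illustrative tree node read from the configuration file.
struct TreeNode {
    std::string name;
    std::vector<TreeNode> children;
};

// How many non-server processes the server must MPI_COMM_SPAWN:
// every node in the tree except the root.
int descendants(const TreeNode& n) {
    int count = 0;
    for (const TreeNode& c : n.children)
        count += 1 + descendants(c);
    return count;
}

// What the parent sends to direct child i: the tree rooted at that child.
// The child sees its own descendants, but nothing about its siblings.
TreeNode subtree_for_child(const TreeNode& parent, size_t i) {
    return parent.children.at(i);
}
```

Using the example tree from above, the server spawns five processes, and child_a1's sub-tree contains only child_c0 and child_c1.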

One of the parameters in the configuration file is "how many Calculate threads to have on this node". If this value is -1 in the configuration, the value will be determined at run time by how many CPUs the node has. As such, after the sub-trees are distributed, a global sum reduction is initiated from the leaves back to the server. In this way, each relay will learn the total number of Calculate threads that it serves.

After this, it's fairly straightforward (hah!) bookkeeping to distribute input data and gather output data (mainly as described in the algorithms discussed above). The queues actually contain quite a bit of intelligence (particularly the output queue), and are worthy of discussion. However, I'm pretty sure that I have discussed the queues in a previous journal entry (they've been implemented for quite some time now), so I won't repeat that discussion here.

Future Work

Here's a conglomerated list of items from the text above, as well as a few new items that would be desirable in future versions of this software:

When a node dies, try to bring it back.

When a node dies, allow all of its children to continue in the computation if possible. This may be as simple as the grandparent assuming control of all the children, or may entail something more complicated than a simple tree distribution (perhaps something more like a graph [perhaps a jmesh?]).

Allow the job to be started with a full MPI_COMM_WORLD -- i.e., don't rely on MPI_COMM_SPAWN to start the non-server nodes.

Multiple, different Calculate thread kernels for different kinds of input.

"Chainable" Calculate threads, such that the output of one CalculateA instance can be given to the input of a CalculateB instance.

Allow the Input and Output instances to run on arbitrary (and potentially different) nodes.

May 4, 2001

Jimmy James: Macro Business Donkey Wrestler

My DSL is still down. This sucks.

That is, it has been up periodically, but only for about 30 minutes to 2 hours at a time. The real work that I have been able to do this week is negligible. Arrggh!!!

Telocity is still blaming Bell South for this, and they're probably right. The packets either end up in a router loop right outside my DSL modem or make it down to Atlanta (which is only 1 step further) and then die. That seems to be consistent with having my local phone provider just sucking horribly. :-(

All in all, my internet uptime this week is probably well under 25%. :-(

I beefed up my monitoring script -- it runs via cron every minute and checks my connection to DNS, Notre Dame, and Excite (for some reason, I can usually reach Excite, but just about everything else is unreachable). I had to re-write it in Perl because it was becoming too complicated for a shell script.

I've never played with Perl's CPAN modules -- they're pretty cool. I was pleased to discover that they have a Ping module and several HTTP modules. The Ping module offers all three kinds: tcp, udp, and icmp. And you can do anything you want with the HTTP modules.

So I ICMP ping the Telocity DNS servers and ND, and HTTP get / from Excite (so that they don't think I'm trying to DoS them by pinging every minute... that would have to trip some kind of alarm, I'm sure! :-).

HAH! I think we just came back on the air -- 9:20am. We'll see how long this lasts....

Since I was off the air most of yesterday, I spent a little time reinstalling my laptop again. After trying to install yet another package and realizing that some component hadn't been installed, I said "screw it" and just reinstalled the whole thing, and selected "install everything". Not that that actually installs everything on the install CDs, but it does install most things that you need (but still not pine, curiously... I guess they want you to use Evolution or KMail. <shrug>).

Anyway, I got the laptop reinstalled and only had to manually install a handful of RPMs. I got everything working, and even managed to get the WEP going on my orinoco card at 11Mbps. It seems that linux distros don't do what the PCMCIA package recommends that they do (e.g., /etc/pcmcia/wireless.opts is not where the options go), but I managed to find where 'drake puts wireless options and to get it all going.

I saved instructions on what I did, because:

'drake 8.0 doesn't come with the orinoco_cs driver module compiled, although it does come with the source code (which I thought was weird). It took a bit of futzing around and some helpful suggestions from Brian to get it compiled properly.

The default wvlan_cs driver that comes with 'drake 8.0 doesn't seem to support WEP.

Others have essentially the same laptop that I do, so if you want the instructions, let me know. I have no idea if RedHat uses the same location for the wireless options, but I'll bet that if it's not, it's very similar (/etc/sysconfig/network-scripts/ifcfg-ethX).

Soon enough I'll have a new laptop and need to repeat the procedure again...

As an experiment, I plugged the audio out of my laptop into the AUX input of my stereo, downstairs.

Yes, indeed -- soon I was streaming MP3s from the server upstairs to my laptop downstairs, and out through my stereo. How cool is that?!? I was pumping out Fatboy Slim at very loud volumes.

Even cooler -- I had forgotten that I was still streaming MP3s to my desktop upstairs. It seems that the tiny little pentium that I have working as my router and my MP3 server does pretty well. Let's hear it for old technology -- it can still be the work horse for all those "little" jobs that you don't want to have to buy a new, hefty (and expensive!) machine for!

May 25, 2001

'morning sir. Are you going to introduce me to your bi-atch?

Tucson. a.k.a., "I never knew that queues could be so complicated!"

DSL went out while I was typing this. <sigh>

Looking through my log, I see that connections to ND were really spotty yesterday (indeed, I felt that heavily as I was working), including a full 20 minute outage around 2pm, a 10 minute outage on Wednesday, an outage from 4am to 7pm on Saturday (although that may well have been my router getting hosed -- when the power blinked recently, the router froze until I rebooted it), fairly crappy connectivity last Monday-Wednesday, and some sustained outages on the previous Saturday....

Overall, it's not as bad as it sounds. Sustained outages (like the one I'm having right now...) don't happen too often, and seem to usually be the fault of Bell South (packets dropping in Atlanta). Spotty connectivity does happen not infrequently, but ND might well be to blame for that, because their external router is so overwhelmed, and the internal network is, well, less than perfect (oh for the days when Shawn was running the network...). Indeed, it's quite possible that my connectivity to IU will be better than my connectivity to ND if I only have to worry about the periodic sustained outages and not spotty connectivity.

We'll see how it works out; I don't have accounts at IU yet, but that paperwork is crunching through the vast papermills... I haven't decided yet on how to change my e-mail address; I might wait until after my defense.

Wow; the MPI queue turned out to be quite complicated; I mostly worked out the model in one day, but spent all the next day working on the engine itself, and a few details (bugs) in the outer parts.

Everything seems to work except some MPI_Cancels at the end (I think the requests are already dead), and it's slow. Gonna have to revamp the enqueueing/dequeueing so that you can enqueue/dequeue lots of things at once, not just one at a time (why didn't I learn my lesson the first time?).

Enqueueing/dequeuing a list at a time really helps (std::list<>::splice() is very handy -- O(1), baby!).
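For the record, the splice trick looks like this -- moving a whole batch out of a queue's backing list in constant time, with no per-element copying (the function name is just for illustration):

```cpp
#include <list>

// Drain an entire queue (backed by std::list) into a batch in O(1).
// splice() relinks the nodes; no elements are copied or moved.
std::list<int> drain(std::list<int>& queue) {
    std::list<int> batch;
    // Move every element of `queue` onto the end of `batch`.
    batch.splice(batch.end(), queue);
    return batch;
}
```

A consumer thread can grab the lock once, drain the whole queue, release the lock, and then process the batch at its leisure -- which is exactly why list-at-a-time enqueue/dequeue helps so much.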

BRAINSTORM: don't enqueue a list (complicates matters greatly, especially w.r.t. temporary buffers and whatnot) -- just get control of MPI from the event manager so that direct sends/receives can be done. This also allows for arbitrary and potentially interactive send/recv protocols for user data. WOW -- that makes an AMAZING difference -- <1 sec vs. 45-60 seconds! (The test case is particularly painful: 1024 short messages to each slave, each message contains 1 int.)

Also had another thought -- this polling model in the MPI event manager is definitely sub-optimal (we have to periodically steal cycles from the other threads to check for MPI progress). The main reason for the polling model is that there are always pending receives (from children) -- the children may send to their parent at any time. So the parent always needs to have pending receives posted, and we have to check periodically whether any of them have finished -- hence, the polling model. But what if there was a way to block and wait for such progress? I'm talking about using a mechanism outside of MPI. That is, open a secondary socket that is used just for signaling. When a child sends a message to its parent, it does the MPI_Send, and then tweaks the socket. The parent can be blocking on a select() of all the sockets from its children. When select() indicates that one of them is ready, the parent knows to go complete the MPI_Recv. This is somewhat icky because we have to go outside MPI to do it, but it would work, and potentially could save a lot of time since the progressively-slower polling model can make receives wait an arbitrary length of time before they can actually complete. I may or may not pursue this, but I wanted to record the idea...
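A sketch of the signaling mechanism (just recording the idea -- this is untested and not in the framework; a socketpair stands in for the parent<->child connection, and the MPI_Send/MPI_Recv calls are elided):

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Child side: after the (elided) MPI_Send, tweak the socket so the
// parent wakes up.
void signal_parent(int fd) {
    char byte = 1;
    if (write(fd, &byte, 1) != 1) {
        // Signal lost; a real implementation would fall back to polling.
    }
}

// Parent side: block in select() until some child has signaled.
// Returns true when it is safe to go complete the (elided) MPI_Recv.
bool wait_for_signal(int fd) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    int rc = select(fd + 1, &readfds, nullptr, nullptr, nullptr);
    if (rc <= 0 || !FD_ISSET(fd, &readfds))
        return false;
    char byte;
    return read(fd, &byte, 1) == 1;  // drain the signal byte
}
```

In the real scheme the parent would select() across one such socket per child, so it blocks (no stolen cycles) yet still learns immediately which child's MPI_Recv to complete.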

I seem to have gotten everything working now -- for single-level only (i.e., only one RelayCalc). The current scheme won't work with multiple levels because of the way RelayCalc distributes input data and expects to collect output data, and the way that RelayOut sends back data.

Had to overhaul the EOF/EOI progression through the queues and relays a bit to make them work and to ensure that there would be no memory leaks. Children now assign their own stream IDs to each input data set; they receive a chunk of input data from the parent, give it a unique stream ID, enqueue it all, and then immediately enqueue an EOF for that stream. The parent keeps track of the "real" stream ID by associating it with the child's ID; when the child returns output data, the parent uses the ID of the child to look up the stream ID of the data that is being returned. This scheme allows multiple things:

Children can completely clean up state after each chunk of data from their parents is processed (trust me on this one), because each chunk of data from a parent is treated as a discrete, complete stream in itself.

The RelayOut can wait to send all of its output data to the parent until it gets the EOF on that stream. When it gets the EOF, it knows that the entire chunk of input data that was initially received from the parent has been processed, and it can send it all back en masse. This component (buffering output data) is needed to allow multiple levels to work, as mentioned above -- it hasn't been implemented yet.
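The per-stream buffering just described could look something like the following sketch (class and names are mine, not the actual RelayOut code): hold output chunks keyed by stream ID, and flush a stream only when its EOF arrives.

```python
# Sentinel marking end-of-stream, analogous to the EOF described above.
EOF = object()

class RelayOutBuffer:
    def __init__(self, send_to_parent):
        self._pending = {}            # stream ID -> buffered output chunks
        self._send = send_to_parent

    def enqueue(self, stream_id, item):
        if item is EOF:
            # The whole input chunk has been processed: flush everything
            # for this stream at once and discard all per-stream state.
            self._send(stream_id, self._pending.pop(stream_id, []))
        else:
            self._pending.setdefault(stream_id, []).append(item)

sent = []
buf = RelayOutBuffer(lambda sid, chunks: sent.append((sid, chunks)))
buf.enqueue(7, "a")
buf.enqueue(7, "b")
buf.enqueue(7, EOF)   # triggers one send of ["a", "b"] and cleans up
```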

Other things that are still needed:

Handling of faults -- children (and by induction, their children) will die when their parent dies. Parents need to mark a child down when it dies, and do the necessary bookkeeping to back out of any current transactions with that child, and ensure that that child will be ignored for the rest of the computation.

Startup "all at once" with no spawning model. This will be necessary for IMPI runs. This will be more software engineering than rocket science (although it won't be trivial :-\ ) -- the software has to support both models.

Support both MPI and non-MPI models. I had some preliminary infrastructure in there for that (i.e., configure/compile without MPI -- just support a single SMP), but I've long since broken it -- you currently can't compile without MPI. I likely won't fix this until after my defense...

There are 483 copies of xmms running, out of 562 total processes (85%).

June 3, 2001

Two words, Joe, "Mon ney", and lots of it.

More LAM gm RPI work.

There's a degree of urgency to this because we're asking Myricom for some cheap/free equipment, and we kinda need a working Myrinet RPI to do this. Plus, the Sandia folks have a Myrinet cluster, and it's kinda in our interest to have a working version...

Added more to the README.myri file -- it was incorrect and incomplete. For example, it didn't have anything about changing the tiny/short message lengths.

Had to bullet-proof the tiny and short message lengths (both the defaults and the user-settable sizes) to ensure that they weren't the same, and that the tiny really was less than (short + sizeof(struct c2c_envl)) -- it's a little confusing because the tiny size has to have the size of the envelope added to it. Hence, the tiny and short lengths must be at least sizeof(struct c2c_envl) apart.
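As a sanity check on that constraint, here is an illustrative validator; the envelope size constant and the function name are my invention, and this encodes only the "not equal, and at least sizeof(struct c2c_envl) apart" rule stated above:

```python
# Hypothetical envelope size in bytes (stands in for sizeof(struct c2c_envl)).
SIZEOF_C2C_ENVL = 24

def validate_lengths(tiny, short):
    """Reject tiny/short settings that are equal, or that are closer
    together than the envelope size (since the envelope gets added to
    the tiny size)."""
    if tiny == short:
        return False
    return (short - tiny) >= SIZEOF_C2C_ENVL
```

With this, a user who sets tiny and short only a few bytes apart gets rejected up front instead of corrupting buffers later.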

Cleaned up a lot of confusing "size" vs. "length" misnamed variables and whatnot in the code (these two words mean something very different in Myrinet/gm).

Found some problems with user-overridden message length sizes; initial buffers were being provided with the wrong size, so messages would never be received properly.

All these took a long time to resolve because I had to go [re]learn how the Myrinet code worked. Plus, I made assumptions about things that were supposedly already tested which ended up being broken anyway. Ugh.

However, with these bug fixes, we might be darn close to the first beta release. We'll see.

Everything works on Chiba City now; need to do some cross-checks on the Hydra and on the babel cluster.

Life is pain

June 6, 2001

Facts cited by Matthew Brock are not necessarily facts

I think every long-running program should have a unix domain socket that you can use to communicate with it. It's an inherently useful capability; you can use it to query the current status of the program, change run-time parameters, etc.

Particularly in multi-threaded apps -- you can have a thread just sitting there waiting for connections, handling the requests, etc.

All the cool kids are doing it.
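As a sketch of the idea (the socket path, the "status" protocol, and all the names here are made up for illustration): a background thread binds a Unix domain socket and answers queries, so any other process on the machine can poke the daemon.

```python
import os
import socket
import threading

SOCK_PATH = "/tmp/myprog-status.sock"   # hypothetical path

def serve_status(path, get_status, ready):
    # One-shot server for illustration; a real daemon would loop on accept().
    if os.path.exists(path):
        os.unlink(path)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    ready.set()                  # tell the main program we're accepting
    conn, _ = srv.accept()
    if conn.recv(64).strip() == b"status":
        conn.send(get_status().encode())
    conn.close()
    srv.close()

ready = threading.Event()
t = threading.Thread(target=serve_status,
                     args=(SOCK_PATH, lambda: "42 jobs running", ready),
                     daemon=True)
t.start()
ready.wait()

# Any other local process can now query the running program:
c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
c.connect(SOCK_PATH)
c.send(b"status")
reply = c.recv(64).decode()
c.close()
t.join()
```

The same channel could accept commands to change run-time parameters, not just report status.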

Been fighting with LAM's Myrinet RPI all week. Got some issues solved, but not all of them (any claims that I previously made about having it all working were later shown to be totally false. Doh). I seem to be having a problem with collectives that use long messages now. Hmm.

It seems to work in trivial cases, but if you stress test it at all (i.e., get a bunch of concurrency where multiple messages are going through the state machine simultaneously), barf-o-rama. I think it may have to do with the fact that we're using a slightly different state machine (vs. the TCP RPI that we stole it from) in the initial tcp_advmultiple() entry point, but I'm not 100% sure of that yet...

I find it very amusing that the action-packed trailers shown on TV for the "Tomb Raider" movie feature a song by Fatboy Slim named "Michael Jackson". Notable lyrics in the song (not heard in the trailer, of course) are, "Michael Jackson -- that's a cute guy!"

Learned something yesterday by accident -- tcsh's pushd command, when executed with no argument, will swap the top two elements on the stack, and change to the directory that is now on the top of the stack. That's pretty cool -- and useful!

Had to spend some quality time with RPM's yesterday to make the OSCAR LAM RPM. Learned a bit more about RPM's than I really wanted to, but I de-mystified a bunch of my prior knowledge about building RPM's.

I made the OSCAR RPM (main difference is that it is completely installed in /opt/lam-LAMVERSION rather than in /usr and friends), and setup the scripts for OSCAR to install it.

Brian is doing Great Things with C++-izing the lamd. We've been hashing out ideas (granted, he's doing all the work) via e-mail today and yesterday. That rocks.

Started using e-mail notifications of CVS commits recently. Seems to be working well. We started with a basic mail script, but I stole a perl script that the Vorbis group uses to mail CVS notifications that sends out much more information. We'll see how that works out.

It seems that this perl script was originally written by someone at Cisco. Small world.

July 29, 2001

I think I've gotten it at least a dozen times, twice of which were in Spanish.

I was just notified by the Army that I go before the promotion board for Captain in November. How funny is that?

I am immensely amused. I can't imagine that I'll get it; I'm quite sure that there are many other 1LT's ahead of me who are more deserving of a CPT slot. But I still find it damn funny.

:-)

Playing with new autoconf (ver 2.52), automake (ver 1.5-p4), and libtool (ver 1.4). Here's some things that I've learned:

It's not too painful to move to the new autoconf. There's a few macros that have to be changed (e.g., AC_INIT has a new arg list), and some of the macros that I use frequently have been deprecated in favor of new ones, but the transformations are mostly straightforward.

There are some handy new Fortran macros that will help in LAM/MPI.

Remember how the line you invoked configure with used to be in config.status? It's not there anymore. Doh!

Not to worry, though, the line that you used to invoke configure with is now in config.log (I don't know why they moved it). In general, config.log is now much easier to read, and has lots more information in it that will be valuable to both sysadmins and programmers.

"configure --help" has lots more information in it.

AC_OUTPUT is now effectively broken into multiple macros; AC_OUTPUT itself just triggers finishing the write of config.status and then runs it. AC_CONFIG_FILES, AC_CONFIG_HEADERS, AC_CONFIG_COMMANDS, and AC_CONFIG_LINKS are now how you specify the output files, output header files, commands to run in and around output time, and sym links to make.

"acconfig.h" is now obsolete. There's a handful of new "AH_" autoheader macros that you put in configure.in to put in the top and bottom portions of the header file. You also have to specify "templates" for each #define with AH_TEMPLATE. I'm not sure how I feel about that one. :-\

Some cool new macros that will rapidly become my favorites:

AC_ARG_VAR: Mark a shell variable as "precious", list it in the output of "configure --help" and save its value in config.status. Warn if the value stored in config.status doesn't match the present value (in case you run config.status at a later date).

AC_HELP_STRING: Automatically format the strings that you give to AC_ARG_WITH and AC_ARG_ENABLE to get that pesky spacing right.

AC_SUBST_FILE: Allows you to do the same thing as AC_SUBST, but substitute in the contents of a file. I see $COPYRIGHT$ potential here...

AC_CHECK_DECL: Checks to see if a symbol is declared. I've always had my own tests for this; I'll be happy to start using this macro instead of my own.

AC_CHECK_MEMBER: Check whether a struct or class has a given member or not.
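For flavor, here's a hypothetical configure.ac fragment using a couple of those macros (the variable and option names are invented for the example):

```m4
dnl Mark a precious variable: listed in "configure --help", cached
dnl in config.status, and warned about if it changes later.
AC_ARG_VAR([PERL], [Path to the perl interpreter to use])

dnl AC_HELP_STRING takes care of lining up the --help output.
AC_ARG_ENABLE([debug],
    AC_HELP_STRING([--enable-debug], [compile with debugging support]))
```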

All in all, it looks like they tried to add significant useful functionality to autoconf, so I'm overall pleased.

HOWEVER, it caused me an hour or two of frustration before I finally tracked down a problem with libtool -- there's a bug in the libtool 1.4 distribution. If you use AC_CONFIG_AUX_DIR to put all your config files in a subdirectory rather than the top-level directory (which I do), and if you use configure.ac instead of configure.in, libtool will get corn-fuzed and put ltmain.sh, config.guess, and config.sub in the top-level directory instead of your config directory. This eventually causes much badness... no need to discuss specifics here.

The problem is in the "libtoolize" script -- there's one place where "configure.in" is still hard-coded in, instead of using "$configure_ac", which is set to the Right value.

This has apparently been fixed in the libtool CVS (I checked), so the next version will have this correct. Ugh.

September 23, 2002

Spandex... it's a privilege, *not* a right

I've had an idea that's been kicking around in my head for a few years now. But I think I've started thinking about this semi-seriously recently. I certainly have no time to implement such an idea, but it's something that has intrigued me for quite a while -- it's an itch that I'd really like to scratch.

This is not necessarily a product or a get-rich-quick-killer-app, but it's something that bothers the crap outta me, and I wish I had a better tool.

The basic premise is that I'd like to have a proper knowledge management solution for my e-mail.

Mail clients today do not offer flexible enough filing systems for mail messages. Specifically, the concept of "mail folders" is no longer good enough for a society that has become highly dependent upon e-mail. Users tend to create large, complex hierarchies of folders reflecting intricate filing systems that inevitably contain inadvertent (and typically widely disparate) redundancies. For example, at the time of this writing, I have 392 folders in my personal mail store. Such complex folder hierarchies are now so commonplace that most people think that they're "good enough", when actually they just aren't aware that it can be better.

Indeed, the whole concept of an electronic file folder is modeled off the physical reality of a Manila folder in a filing cabinet. You take a memo (i.e., piece of paper), put it in a single folder, which goes in a single drawer, which goes in a single cabinet, which goes in a single row of cabinets, and so on. This means that there is one path to get to that particular piece of information. You can photocopy that paper and put it in other folders to make multiple paths to the information, but that's pretty inefficient and you have obvious problems such as what happens if someone updates the original memo? You then have to go update each copy -- which could be pretty labor-intensive.

While this is a perfectly valid and reasonable approach to filing information, I claim that such a limiting system (i.e., only one path to a given piece of information) is not necessary in the electronic world. Indeed, this limitation is based on a physical model -- why carry it over to the electronic world?

Instead, look at the collection of your mail as a knowledge repository. It contains vast stores of information. The challenge is not only to keep this information organized so that you can quickly find the data that you need, but also to be able to dynamically change the filing system as the need arises.

Tenet #1: Provide multiple paths to information.

A basic precept of many Knowledge Management (KM) solutions today is that information should be reachable by many different paths. In order to accomplish this, one must look at filing information "from the other way" -- instead of putting large amounts of information in a filing system, attach large numbers of filing systems to each piece of information.

For example, say you receive an e-mail from your friend Bob in Human Resources. Bob's mail tells you the specifics of a job opening in the Finance department that you are interested in, but also makes a friendly wager of $10 on the outcome of a football game this weekend.

How do you file this message? There's at least two different ways to look at it -- job related and personal. You obviously want to keep both pieces of information and be able to find them later. Do you file it under "job prospects" or "personal:bets with bob"? Clearly, you'd really want to file it under both.

Granted, under today's mail clients, although you can copy a message and put it in both places, the underlying assumption is that you'll normally file a message in one location -- so copying/re-filing messages is not as easy as it should be. Regardless, abstractly speaking, you've now got two copies of the information, and simply made distinct single paths to each. In a more practical sense -- you've now doubled the storage required for that message. What if Bob had attached a 2MB document detailing the Finance job? In days of shrinking IT budgets, you've just used 4MB of your valuable personal disk quota simply because you want to find the information two different ways.

Instead, it would be much more efficient (and potentially easier on the user) to file the one message in both places. You should not be penalized (in terms of space, for example) for wanting to find the same piece of information in multiple ways. Not only should it be possible, it should be trivial to file that mail in multiple places.

Multiple paths to information are also important because of the passage of time. What seems like a logical way to file information today may be completely forgotten tomorrow. So if a user files a message in many different ways today, their chances of finding it tomorrow are much greater because it may be found in multiple different places (as opposed to finding the one-and-only-one location where that information was filed).

Keep in mind that I'm talking about features that don't [yet] exist -- bear with me, and assume that the actions that I'm talking about will have a simple and easy-to-understand graphical interface for users to use.

Tenet #2: Separate the filing system from the information.

Most users' mail folders hierarchy started off as a small, simple set of folders. But it evolved over time -- new messages arrived that didn't quite fit into the existing neat and clean division of folders, so add a folder here, add another there, etc. After a while, adding folders in this piecemeal fashion results in a "big ball of mud" -- kludge after kludge after kludge inevitably results in inadvertent redundancies, ambiguities, and downright misplaced messages and folders.

A user's mail folders hierarchy becomes so large and complex that it effectively becomes a "legacy system". Even if the user wants to reorganize everything into a "better" filing system, it would take enormous amounts of time and effort to do so because each message would have to be examined, re-categorized, and then dragged-n-dropped in from the old filing system to the new. Hence, people tend to stay with their "big ball of mud" model, even if they know that it's inadequate, inefficient, or otherwise sub-optimal.

Specifically: with mail folders, the information is in the filing system rather than the other way around. Instead, the filing system should be completely divorced from the data that it contains. As with tenet number 1, information is king -- the filing system (although it can be considered information itself) should not only be dynamic and changeable, it is definitely secondary to the information itself.

Using such an approach would allow users to reorganize all of their mail without significant effort. Granted, it would still require some effort on the user's part (there's no such thing as a free lunch, after all), but the threshold of effort is significantly less if the filing system can be created and destroyed at will with no risk of loss of information. Consider: with mail folders, if you destroy the filing system, you may accidentally destroy information as well. If the filing system is totally separate from the information, then accidental data loss cannot occur.

With a separated filing system, not only can the information be reorganized on the fly, there can also be multiple simultaneous filing systems. Consider the example from above -- Bob's mail to you about a job posting in Finance and a friendly wager on this weekend's game. Taking that mail and attaching two different filing systems to it would allow you to file it under both "job prospects" and "personal:bets with bob" -- yet still only be a single message (rather than two copies of the same message in two different file folders).

Another common example is outgoing e-mail. With current mail systems, there is an arbitrary (IMHO) separation between incoming and outgoing mail. If you want to group all the messages of a given conversation together, you have to move or copy your outgoing messages out of your "sent mail" folder into the destination folder where you stored Bob's incoming messages (or some variation on this, such as CC'ing yourself, etc.). If you separate the filing system, you can simply select to see "all messages to or from Bob" -- there's no more artificial separation between incoming and outgoing e-mail. Of course, it is trivial to select one or the other if the user wants to -- outgoing mail is every message that has a "From" of their address, and incoming mail is everything else. It's just important that the possibility of combined listings becomes available under a separated filing system.

Tenet #3: Let the computer do the menial work.

E-mail has become so important to industry and society that users are often flooded with incoming mail every day. Answering and keeping up with e-mail has become a significant portion of people's jobs. This results in some of the problems described above (e.g., the "big ball of mud" approach to organizing e-mail). One way to help is to let the computer handle as much of the menial work associated with e-mail as possible.

Rules and filters are two common features in e-mail clients today. These are actually Good Things. Unfortunately, few users actually understand or use them. And even among those who do, there is always the fear that important e-mails will get lost or otherwise go unnoticed. This goes back to the fact that the mail folders and filing system is [currently] the primary concern -- the actual individual e-mails are not the focus.

Using a separated filing system with the concept of rules, filters, and scoring will help make them "easy" from the user's perspective. Specifically, a separated filing system can guarantee that no e-mail will ever be lost, and that using rules/filters/scoring will actually increase the possibility that important e-mails will be noticed. The goal should be to make it common to have lots of rules/filters -- the more, the merrier. Indeed, let the computer mark each incoming (and outgoing!) e-mail in 20 different relevant categories such that there are now 20 different ways for the user to notice that important mail, not just one (i.e., the old concept of an "inbox"). Filters can then be used to ensure that the important e-mail -- even though it shows up in 20 different categories -- is only brought to the user's attention and viewed once (rather than 20 times).

Granted, some interface work and user education will probably have to take place to make rules and filters understandable to most users, but re-orienting the filing system will guarantee that no e-mail will ever be lost due to faulty rules (something that is not necessarily true today). This may help reluctant users to "take the plunge" and actually start using rules/filters.

Of course, users should also be able to manually categorize/organize a message. Even though much of the processing can happen automatically, there will always be a need to manually classify a specific message, and/or re-designate a given message to be in a new (set of) category(ies).

Combining all these ideas together, the end result is that all messages (both incoming and outgoing) can be automatically categorized and organized when they arrive. Important messages can actually filter up to the top. Messages can be fully reorganized and recategorized on the fly. Arbitrary searches can be executed to find any given message with no searching restrictions. Searches can be performed on results of other searches. And so on.

The point is that this would be a fundamentally better filing system -- one that is flexible and powerful. Of course, it can be simplified down for those who don't want that kind of power (e.g., Gramma, who only gets 2-3 e-mails a day). Indeed, the entire "mail folder" concept can be fully emulated with the ideas described above. But for those who need it, this gives them a much better toolset to organize the information contained in their e-mail.

So let's make this a little more specific. From the user's perspective, let's define a few terms before we start talking about features and capabilities of such a system:

Category. A category is essentially what most people currently think of as a mail folder. Categories have names and are hierarchical. For example:

mailing lists

mailing lists:LAM users list

mailing lists:53 listserv

job postings

job postings:finance

job postings:finance:northeast

job postings:finance:southeast

job postings:accounting

(where the ":" character separates categories and sub-categories)

However, a huge difference between categories and mail folders is that any number of categories can be associated with each message. So Bob's e-mail to you about the job posting in finance may actually be in "job postings", "job postings:finance", and "job postings:finance:northeast" (the job is in Pennsylvania).

View. A view is essentially the result of a search. Views are named as well, but begin with the special character "#". Like categories, views are hierarchical -- sub-views are further searches on the parent view. Here's some examples of common views, and descriptions of them:

#sent mail - all messages sent by me

#sent mail:to bob - all messages sent by me to Bob

#sent mail:to bob:this week - all messages sent by me to Bob during the past 7 days

#sent mail:yesterday - everything that I sent yesterday

#mail with bob - any message to or from Bob

#yesterday - all mail sent and received yesterday

#yesterday:sent mail - same as "#sent mail:yesterday"

Again, the ":" character separates views and sub-views.

Note that views are continually updated -- they are not the results of a one-time search. So when you send a mail to Bob, it immediately shows up in #sent mail, #sent mail:to bob, and #mail with bob. This completely destroys the artificial separation between outgoing and incoming mail -- users can now view entire conversations (including their own replies) with ease.
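The "continually updated" property falls out naturally if views are SQL views over the message store. A toy illustration (schema and names invented for the example) using an in-memory SQLite database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (sender TEXT, recipient TEXT, body TEXT)")

# "#mail with bob": any message to or from Bob. A SQL view is
# re-evaluated every time it is read, so it never goes stale.
db.execute("""CREATE VIEW mail_with_bob AS
              SELECT * FROM messages
              WHERE sender = 'bob' OR recipient = 'bob'""")

db.execute("INSERT INTO messages VALUES ('bob', 'me', 'job posting')")
before = db.execute("SELECT COUNT(*) FROM mail_with_bob").fetchone()[0]

# Sending a reply shows up in the same view immediately -- no separate
# "sent mail" folder to copy anything into.
db.execute("INSERT INTO messages VALUES ('me', 'bob', 're: job posting')")
after = db.execute("SELECT COUNT(*) FROM mail_with_bob").fetchone()[0]
```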

Categories and views can be navigated and browsed just like a conventional mail folder tree. So the usage scenarios are actually fairly similar to existing mail clients.

Since nothing like this exists in current client software, remember to take it on faith for the moment that we can make a nice, easy to use interface to support all this functionality. Use your imagination. :-)

The main use of rules will be to assign categories (other actions are also possible, such as deleting). Rules can search / match any aspect of a new message (incoming or outgoing) and assign categories as appropriate. It will be common to have lots of rules. For management purposes, rules can also be named (starting with the special character "%") and be hierarchical. Here's some examples:

%bob - matches any message that has a From, To, CC, or BCC of bob@mycompany.com.

%bob:bets - any message that is to or from Bob and contains the word "bet" or "wager", assign the category "personal:bets with bob" to it.

%bob:jobs - any message that is to or from Bob and contains the words "job posting", assign the category "jobs"

%bob:jobs:finance - any message that is to or from Bob and contains the words "job posting" and "finance", assign the category "jobs:finance"

%spam - delete any message that has a subject beginning with "ADV:"

%spam from foo.com - delete any message that relays through the "foo.com" mail server
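A couple of the %bob rules above could be sketched like this (the message representation and matching logic are simplified to substring tests on a dict, purely for illustration):

```python
def involves_bob(msg):
    # %bob: any message with bob@mycompany.com in From, To, CC, or BCC.
    addrs = [msg.get(h, "") for h in ("From", "To", "CC", "BCC")]
    return any("bob@mycompany.com" in a for a in addrs)

RULES = [
    # (rule name, predicate, categories to attach)
    ("%bob:bets",
     lambda m: involves_bob(m) and ("bet" in m["Body"] or "wager" in m["Body"]),
     ["personal:bets with bob"]),
    ("%bob:jobs",
     lambda m: involves_bob(m) and "job posting" in m["Body"],
     ["jobs"]),
]

def categorize(msg):
    cats = ["inbox"]              # every new message starts in the inbox
    for _name, predicate, extra in RULES:
        if predicate(msg):
            cats.extend(extra)
    return cats

msg = {"From": "bob@mycompany.com", "To": "me@mycompany.com",
       "Body": "Here's that job posting... and a $10 wager on the game."}
```

Note that a message matching several rules simply accumulates several categories; nothing is moved or copied anywhere.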

Consider the following usage scenario: All new mail will have the default "inbox" category attached to it. User-created rules will attach additional categories to each incoming message. Finally, the user may manually assign more categories when viewing the individual message.

This means that to see new mail, the user simply views the "inbox" category. Since most users treat their inbox as "messages I have not processed yet", once the user reads and processes a message in the inbox, the user can simply detach the "inbox" category and it disappears from the "messages I have not processed" list. Note that the message still remains filed away in all of its other categories.

This will be an important distinction, actually -- the difference between deleting a message (which completely destroys the message), or removing it from a given category / view.

The capability to search (i.e., define a view) on anything is one of the key concepts of this system. Users can search on category names, any field in the message header (e.g., From, To, CC, Subject, Message-Id, etc.), and any combination thereof. For example, you don't actually need a category for "mail from Bob" -- such a view will be available because the underlying system automatically indexes on the "From" field -- you can simply have a view of the value "bob@mycompany.com" in the From field. Categories are more intended to further organize messages in addition to examining all fields in each message's header.

Most modern mail clients offer some form of search capability (ranging from very primitive keyword searches to sophisticated field text pattern matching searches), but most are still bound to the mail folders concept -- searching scopes cannot be dynamic (i.e., based on a view), and the results of a search cannot themselves be searched. Plus, they're one-time searches, not continual views into the current pool of messages.

Basically, what it comes down to is removing some of the arbitrary artificial constraints concerning the storage and retrieval of mail messages -- allow any given message to be filed in any number of ways, combined with the idea of a high degree of automation such that the incoming flood tide of mail can be automatically (and manually) organized in a dynamic manner.

Technical details

All of the above can be described with a few basic precepts:

using an RDBMS to store messages

using the full power of SQL to search for messages

indexing messages by reference, not by value

allowing an arbitrary number of user-defined, hierarchical categories to be attached to a message, and indexing on those categories

automatically indexing each message by every field in the message header

Although there are at least several mail servers that use a real RDBMS on the back-end (many of us have been conditioned to think in terms of sendmail, which uses /var/mail-style flat files -- but not all mail servers do this), this is not quite what I'm talking about. The client needs to have visibility into the message store database. So even if the server uses an RDBMS back-end, if the client connects via IMAP or POP, it won't have access to the power of the RDBMS. Hence, we need something more.

A few approaches come to mind:

Make all servers standardize on a common database schema. Then we can have open mail clients that can talk to any server (probably via some kind of ODBC connection), and life is good.

But the practical possibility of getting this to happen is slim to none. Not only because back-end RDBMS schemas are proprietary and closed (and probably rightfully so), but also because trying to get all vendors to agree on a common database schema would be next to impossible.

Make all servers standardize a common protocol to access the back-end database. Hence, open clients can connect to any server.

Although this seems tempting, recall that SQL effectively fills this requirement (tunneled over whatever network protocol is appropriate, such as some flavor of ODBC). So abstracting away the SQL while still giving all the power of searching and whatnot (that SQL is designed to do), we'd really only be going a half step above SQL itself. So while I don't want to discount this possibility (since it would be much easier to get vendors to support a protocol than to force them in a specific database schema), I think some experience needs to be gained with the whole RDBMS approach first before anyone could understand enough to design such a protocol.

Separate the mail server from the message store. An easy example of this would be to have a sendmail server with a customized mail.local (or every user has a .forward) that inserts the incoming message into a database instead of /var/mail. A separate RDBMS server can be running (and not necessarily even on the same machine as sendmail) to accept both the incoming messages, as well as listen/respond to ODBC connections from clients.

Yep -- that's right -- mail clients use ODBC to retrieve their mail. Forget opening /var/mail/username, and forget using mh-style folders. Just open up an ODBC connection (which can even be across the network -- no need for it to be local).

I'm thinking that #3 is the easiest to implement first. #2 might be possible after we understand #3 and gain some experience with database schemas that would be required to implement it.

Indeed, to implement #3, all you need is the following:

design a database schema that can handle all the requirements described above (this will actually take a considerable amount of thought and design to do properly)

an agent to insert new messages into the database (either a mail.local or an executable to be invoked from .forward), probably with a default "inbox" category attached to it

take an open source mail client, and, assuming that it has at least a semi-modular approach (and at best, a formal API) to reading/writing mail messages from/to mail folders, rip out the guts of the mail folder access routines and replace them with database calls

That's the basics. There's millions of features and details to be worked out, but that's the gist of it.
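As a very first cut at the schema precepts above (store messages by reference, attach any number of categories to each), here is a toy SQLite sketch; every table and column name here is my invention, not a proposed standard:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE messages (
        id     INTEGER PRIMARY KEY,
        sender TEXT,
        body   TEXT
    );
    CREATE TABLE categories (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE          -- e.g. 'jobs:finance:northeast'
    );
    -- The join table is what lets one stored copy of a message live
    -- under any number of categories (by reference, not by value).
    CREATE TABLE message_categories (
        message_id  INTEGER REFERENCES messages(id),
        category_id INTEGER REFERENCES categories(id)
    );
""")

db.execute("INSERT INTO messages VALUES (1, 'bob', 'finance job + $10 bet')")
for cat in ("jobs:finance", "personal:bets with bob"):
    db.execute("INSERT INTO categories (name) VALUES (?)", (cat,))
db.executemany("INSERT INTO message_categories VALUES (1, ?)",
               [(row[0],) for row in db.execute("SELECT id FROM categories")])

# One physical message, reachable through either category:
cats = [row[0] for row in db.execute("""
    SELECT c.name FROM categories c
    JOIN message_categories mc ON mc.category_id = c.id
    WHERE mc.message_id = 1 ORDER BY c.name""")]
```

Detaching the "inbox" category would just be a DELETE on the join table; the message row itself is untouched.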

And here's some random thoughts / implications of what this all could mean:

Assumedly, the back-end database can either be a per-user database or a one-database-for-all-users. It would be nice to allow it both ways. But if one or the other has to be chosen, I think the all-users DB would be much more useful and user friendly. It would also allow "public folders" kind of functionality (see below), since everyone shares the same DB message store.

Searching on Message-ID to make message threading without the artificial separation of the sent-mail folder will now actually show the whole conversation, not just the messages that people sent to you (I love this idea!).

If incoming messages automatically have an "inbox" category attached to them, users can safely detach the "inbox" category and leave the message filed away in other categories. i.e., there needs to be a clear / easy way to do this that is distinct from "delete message".

Spam busting has great potential here -- you can even filter based on any machine that the message relayed through, not just originating e-mail addresses, etc.

Think of it the other way around -- take a single message, and show its relations to other messages. For example, message X has these categories, is part of this(these) thread(s), is one of 38 messages from Bob that you received today, and is the 25th of 48 messages on the LAM listserv that you have received in the last week. And so on.

Basically -- anything you can do in SQL, you can do in a view. You can set the scope of a view to be arbitrarily large (all messages in the database) or arbitrarily small (a single message, or a single thread).

High-quality clients can still do local caching of messages (a la high-quality IMAP clients today) and views (i.e., results of searches) to improve client performance.

Key to all of this will be a simple and powerful interface. Create/edit/delete categories is simple enough. But making an interface that makes views and rules easy to create/edit/delete will be absolutely essential.

Views should be stored in the database itself. That is, whatever SQL or search string is necessary to execute the "#sent-mail" view should be stored in the database itself. Hence, if I connect with client A or with client B, I can still see the same "#sent-mail" view.
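One way to sketch server-stored views: keep the search expression itself in a table keyed by the view's name, so any client can fetch and run the same definition (the table layout and the "#sent-mail" query below are hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT);
    INSERT INTO messages VALUES
        (1, 'me@example.com',  'Re: LAM bug'),
        (2, 'bob@example.com', 'LAM bug');
    -- views are just named queries stored alongside the messages
    CREATE TABLE views (name TEXT PRIMARY KEY, query TEXT);
    INSERT INTO views VALUES
        ('#sent-mail',
         'SELECT id FROM messages WHERE sender = ''me@example.com''');
""")

def run_view(db, name):
    """Look up a view's stored query and execute it -- client A and
    client B get identical results because the definition lives in
    the database, not in the client."""
    (query,) = db.execute(
        "SELECT query FROM views WHERE name = ?", (name,)).fetchone()
    return db.execute(query).fetchall()

sent = run_view(db, "#sent-mail")
```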

"Public folders" (a la MS Exchange, or any IMAP server) can be implemented with special, reserved categories. It may be a good thing to define some "system reserved" category prefixes that cannot be defined by a user.

If a single back-end database is used to store all user messages, system administrators actually have a larger degree of control over user mail spools. Consider -- many companies have a "max e-mail age" policy, such that mails over age X should not be kept. With an RDBMS back-end, a search and removal of messages older than X is trivial.
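The age-policy purge really is a one-liner once the mail is in an RDBMS. A sketch, assuming a hypothetical received timestamp column:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, received TEXT);
    INSERT INTO messages VALUES
        (1, '1999-01-01'),      -- well past any retention window
        (2, datetime('now'));   -- fresh
""")

# "Mail over age X should not be kept": delete everything older than,
# say, 90 days in a single statement.
db.execute("DELETE FROM messages WHERE received < datetime('now', '-90 days')")

kept = [row[0] for row in db.execute("SELECT id FROM messages")]
```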

Some kind of message export from the database will probably need to be supported, such as dumping to /var/mail-style mail folders, mh-style folders, XML, or perhaps to another database.

Consider making a second ODBC connection to another server to be able to access other message stores. There are oodles of web-based listserv archives out there, why not give people raw access to a database containing the archives instead of forcing a web interface? The possibilities here are very interesting... Consider a mailing list where no mail is sent out via SMTP. Subscribers still submit mail via SMTP (i.e., conventional mail clients), but they simply make ODBC connections to "receive" mail from the list. As a subscriber, I would configure my mail client to make an ODBC connection not only to my "home" mail server, but also to the LAM listserv ODBC. Messages to the LAM list would still show up in my inbox view (if I wanted them to, that is), but they were never actually pushed via SMTP to every subscriber on the list -- they just appeared in the database, and clients pulled them. Granted, this has obvious scalability problems, so a more realistic example might be providing ODBC connections in a read-only fashion for archive searching, etc. (vs. everyday usage). But it's still interesting. :-)

The whole point here is that mail clients today are bound by artificially limiting data stores. If we remove those limiting factors and instead use a very powerful data store and start using KM kinds of tools with e-mail, the possibilities are truly interesting...

None of this is holy writ. Like I mentioned in the beginning, this was an idea brewing in my subconscious for a few years, and it only just took on words and active dialogue with others within the last week. So although this idea intrigues me greatly, if I ever get around to implementing it, it may be substantially different from what I have outlined above. :-)

January 9, 2003

So I got my new

So I got my new (RMA'ed) linksys today. Anyone want one? I'll sell it -- cheap. It's actually a fine unit -- it just doesn't do what I need. And don't try to upgrade the firmware. :-)

My D-Link router is nicer in almost all regards, except that it has 2 problems -- one minor, one major:

1. Minor: the DHCP lease times for linux boxen seem to be really weird. The dhcpcd.log file complains of infinite lease times, and the router shows the lease as expiring in 2016. This doesn't really matter at all, but it should probably be fixed.

2. Major: the wireless activity on the WAP periodically just goes catatonic. Resetting the router (either via the web control panel on a wired connection, or power cycling it) makes it come back. But that's kinda useless if you're working via wireless. :-)

I called D-Link about these issues, and they had me submit technical details on #1, and told me that they're working on #2. We'll see what happens.

January 19, 2003

They belong together like H and 2 O

We had an OSCAR working group meeting at IU this week.

My tenure as the chair of the group is now over (it was a one year elected position). Woo hoo! Freedom! :-)

Actually, it was a good position, and I enjoyed it. And I think I got a lot done. But it took a lot of time, and now I really need to be able to focus on my dissertation. And that means spending less time on OSCAR.

The meeting was good. We talked/argued/yelled/compromised/came up with good solutions. It's always fun and rewarding to work with other technical people on complex problems that require difficult solutions and lots of brain work. Good stuff.

I spent far too much of today debugging GNU Automake's depcomp script. Ugh. I typed up a lengthy / detailed bug report, but http://sources.redhat.com/ has been giving "Connection Refused" since yesterday afternoon.

At some point, I really need to edit the journal page templates to make them how I want them. I also need to make it send out mail upon new entries. I also need to make millions of dollars. I also need to conquer the world. I also need to...

February 14, 2003

Ahhh.... Nothing like that tiny new car smell

Arrggh!!

My last entry ended with how I was going to reboot my desktop because things were acting flaky. Well, I rebooted, and that's when the Badness started.

Things were now extraordinarily slow. And I could no longer see most of the network. That is, after looooong delays, I could see [some] things on my local network. But I couldn't get outside of my network at all. ifconfig showed lots of errors on the NIC.

So this started me on a huge search to see if I had somehow fried my NIC. I tried 4 different Linux kernels (each of which takes a while to compile, especially the modules), tried twiddling the parameters to my NIC, etc. This morning, I swapped out the NIC for a spare that I had lying around. No love. Great. Why didn't I do that last night?

So it's definitely a software problem. Then I looked closer at the default route.

It was wrong, by one digit.

When I replaced my DSL router/WAP, it came with a different default address than my old one, and I must have just manually changed the route last time and forgot to change the boot up default route. Arrggghh!!

After thinking about it, that totally explains the slowness -- Linux thought that networking was up, but since the incorrect router IP that I had in there did not exist, all packets would just get silently dropped rather than being rejected by some remote server. So the slowness was probably due to all kinds of processes failing with timeouts. I've seen network failures before, of course, but not like this -- usually it's pretty obvious when your default route is wrong, because someone rejects the packets and you get immediate denials. But I unfortunately picked one that didn't exist, and that led to timeouts, not denials.
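The rejected-vs-silently-dropped difference is easy to see from a socket's point of view. A small sketch (the addresses are illustrative: a gateway that doesn't exist gives you the slow timeout case, while a live host with nothing listening on a port sends back an immediate reset and fails fast):

```python
import socket
import time

def probe(host, port, timeout=1.0):
    """Attempt a TCP connect and report how it failed (or succeeded)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    start = time.monotonic()
    try:
        s.connect((host, port))
        outcome = "connected"
    except socket.timeout:
        outcome = "timeout"   # packets silently dropped: slow, confusing failure
    except OSError:
        outcome = "refused"   # active rejection (RST/ICMP): fast, obvious failure
    finally:
        s.close()
    return outcome, time.monotonic() - start

# A live host (loopback) with a closed port rejects immediately; this
# assumes nothing is actually listening on port 1.
outcome, elapsed = probe("127.0.0.1", 1)
```

A nonexistent gateway IP produces the "timeout" branch instead, after the full timeout has elapsed -- which, multiplied across every process trying to use the network, is exactly the system-wide sluggishness described above.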

Great. I wasted about 8 hours on this. Well, I'll remember this the next time it happens. Wisdom is not simply knowledge, it's knowledge + experience.

On the up side, I got an e-mail from the MoveableType folks (the blog-makers), and there's a new version out. There's a bunch of new features, one of which is the ability to have a plugin that has a lot of the features that I had back in the original JeffJournal (particularly with respect to entering in text that magically turns into HTML -- stuff like *hello* being automatically turned into hello). I've really missed those features, so I might try upgrading sometime soon...

March 27, 2003

Where art meets reality

Classic. I just ran "sudo" on a RedHat 8.0 machine for the first time ever. The first time you run sudo on a machine, you get a standard warning with two main bullets in it. With RH 8.0, I notice that they added a third bullet:

April 2, 2003

gaim is funny

I think my IM client ("gaim":http://gaim.sf.net/) played April Fool's jokes on all of us who keep up with the CVS HEAD.
The title on any window that I pop open on MSN is " is a stupidhead". The main gaim window that shows my buddy list is titled "Biatches list". All the icons are very Dali-like.
Trippy.

April 8, 2003

Unhealthy looking cross between a possum and a raccoon

I upgraded to gaim 0.60 on my laptop the other day (on my desktop, I keep up with the CVS HEAD, for no apparent reason).
I was somewhat disappointed -- the font sizes were tiny and it didn't allow me to change them. [shrug] So I downgraded back to 0.59.whatever, to find that my buddy lists had been converted and lost. Ugh! I finally got them back, but lost all my aliases. Oh well.
I'm not complaining (much), actually. gaim -- although it's lacking some notable IM features -- is actually a pretty good program, and it's impressive that its main maintainer keeps chugging out the code. Kudos to him!

There's two main features missing in the textile plugin for MT that I enjoyed with jjc -- quick shortcuts for horizontal separator bars (like you see above), and surrounding text in the <code> HTML tags.
Maybe someday I'll go look at the code and see if it's easy to add those in (I'll bet that it is :-) ).

Nic H. found a bug that's been latent in LAM since December of last year. It only surfaced last week because of a second bug, but still -- kudos to Nic! And kudos again!
Because we're going to enter our "pseudo-freeze" in the [hopefully] month before we release LAM 7.0, I think we're going to run a contest based on points for all of us finding new bugs. Something along the lines of each new bug report will earn between 1 and 5 points (depending on the level of detail and accuracy of the bug report). The winner will be the person with the most points when we finally release 7.0 (hopefully mid-May!).
The winner will get some fabulous prize. :-)

April 10, 2003

Mr. James, what did you mean when you wrote, "Bad clown making like super American car racers; I would make them sweat. War, war."

Taxes are done. We only owe a little this year, mainly due to my weird work-in-Indiana-but-live-in-Kentucky status. Coolness.

We finally cleaned up a bunch of bugs in the LAM/MPI startup protocols such that the whole test suite can run properly.
Woo hoo!

Thunder over Louisville is this weekend. We're going with a bunch of GE folks -- a ball game followed by the airshow (hmm.. I wonder if the military presence will be somewhat reduced this year...) followed by the Thunder fireworks show.

A friend just told me that his organization uses AIX for its web servers.
_Wow._
'Nuff said.

April 13, 2003

Bouncy, crunchy, flippant rubber balls

We've been having the _worst_ time with anti-virus software at my church. :-(
It's one of these all-in-one packages that scans everything -- downloaded files, e-mails, etc. Over a month ago, it stopped working: when sending certain e-mails, the whole anti-virus subsystem would just hang. It seemed directly related to the content of the e-mail (as opposed to, say, the destination address). The AV vendor (no names mentioned, but it's not Norton or McAfee :-( ) has given me a complete runaround in "helping".
Their consistent answer has been, "Please uninstall and re-install". I tell them that I did that and that it still hangs when sending certain e-mails. They said "Please uninstall and re-install". In fairness, they have come out with (so far) 2 new updates that supposedly fixed the problem. But neither have fixed it -- the software still hangs upon sending some e-mails. Their solution? (even after I've told them that the newest versions still do not work) "Please uninstall and re-install."
I'm trying to decide whether these tech support people are arrogant jerks who know nothing and also assume that the user knows nothing, or whether this is the general level of technology of Windows software (i.e., that Windows itself is so unstable that many problems -- even _repeatable_ problems -- can be solved by repeatedly uninstalling and re-installing an application, and the application itself is not at fault). It's a tough call. :-(
And note that their uninstaller does *NOT* completely remove the software from the system. You have to go manually remove all kinds of registry entries and leftover files. Half the time, their uninstaller doesn't even work _at all_ -- you have to go remove *all* files and registry entries.
We bought this software at the recommendation of someone else because it was significantly cheaper than Norton. What a mistake this has turned out to be. This person has even had the nerve to tell me that I really shouldn't complain, because my church is such a small organization and that big software companies like (our Anti-Virus Vendor) don't pay attention to the little customers like us. I was speechless when he told me this. Their software doesn't work! Of *COURSE* I have a right to complain!
I've had to give several users "Administrator" privileges so that they can disable anti-virus protection when sending e-mails. Arrggh!!

May 3, 2003

Bugs

Ran across some bugs this week:
* gaim 0.62 has sprung a new file descriptor leak. After running gaim for a long time (over a day), it starts complaining about how it has too many open files. Looking in /proc, sure enough, the fd subdirectory of the gaim process shows lots and lots of open files. gaim is quite an active project, so I'm assuming that someone else has already reported this. Let's hope 0.63 fixes this. :-)
* libtool 1.5 is not passing flags down to the C++ linker properly in all cases. One specific case that doesn't work is attempting to build a shared C++ library on Solaris with the Forte compilers in 64 bit mode. In this case, a CXXFLAGS of "-xarch=v9" is required at compile time. And since CC is used as the linker to make the resulting .so file, "-xarch=v9" is required at link time, as well. But libtool refuses to pass this flag down to the linker, resulting in CC complaining that the objects are of the wrong class (64 bit) when it is trying to make a 32 bit library. Doh!
I've reported the bug to the libtool mailing list.
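Spot-checking that kind of fd growth by hand is only a couple of lines on Linux. This sketch assumes a /proc filesystem; the /dev/null open is just to demonstrate the counter moving:

```python
import os

def open_fd_count(pid="self"):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# In a leaking program this number creeps up over time even though the
# workload is steady; here we just show that opening one file bumps it
# by exactly one.
before = open_fd_count()
f = open("/dev/null")
after = open_fd_count()
f.close()
```

Pointing the function at another process's pid (e.g. gaim's) and sampling it every few minutes would show the leak without attaching a debugger.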

June 2, 2003

Microsoft sucks

Ok, I'm really annoyed with Microsoft now.
Granted, this all started out with a stupid mistake on my part -- but now it turns out that to fix that mistake, I have to _completely reinstall a machine!_
Here's what happened...
I was working on an XP Pro box and was wondering if the problems that my user was having with her anti-virus not updating properly (she's a "limited user" on the machine -- no admin rights) were because of NTFS permissions issues. So I went about using the cacls command to change the permissions in the anti-virus software directory (yes, I know this isn't a Good Idea -- but I wanted to see if it would work; I wanted to see if NTFS permissions were the issue). Well, I borked up the syntax of the cacls command, and ended up erasing all the perms out of c:\windows (and apparently some of the subdirectories). Doh! Stupid, stupid, stupid...
But the problem gets worse. Going to another machine, I notice that there are complicated permissions on c:\windows (and friends). So I look up the help page (and "http://support.microsoft.com/":http://support.microsoft.com/), and notice that cacls _cannot set these permissions!_ That's right: you can use cacls to bork up the permissions on your machine, but you _cannot_ use it to fix them!
That is unbelievable to me. I could not find any GUI method of changing the permissions, either. Hence, I'm going to have to fully re-install the box to really fix the problem. Totally, totally lame.

August 21, 2003

There is only one unit of measurement

After I beat my head against a wall for 2 days looking for a memory bug in LAM/MPI using valgrind (a memory-checking debugger for Linux), bcheck found the error within about 3 test runs on Solaris.

Don't get me wrong -- valgrind rocks as well. It's a fabulous tool and I'm extremely glad that it's available (many thanks Julian!). But bcheck somehow provides more detailed information than valgrind provides.

...actually, I guess that's not entirely true. I was sitting here thinking about it while writing this entry and I figured out why valgrind didn't tell me the same information that bcheck did. Here's the scoop:

In this case, the problem was both a read from unallocated and a duplicate free within LAM's myrinet network device. bcheck reported these problems, but valgrind did not. Why?

It all comes back to Myrinet -- arrgh! On Linux systems, LAM/MPI has to use its own memory allocator (a derivation of the venerable ptmalloc) to be able to catch calls to sbrk() such that memory returned to the OS is guaranteed to be unpinned before it is returned. Hence, valgrind is probably not intercepting these calls because it doesn't know that LAM's replacement is the "real" free(), sbrk(), etc.

This doesn't happen on Solaris because Solaris has a bug deep within its kernel such that gm can't atomically allocate-and-pin memory, and therefore LAM/MPI doesn't need to replace malloc/free/etc. (that's the short version, omitting all the juicy details). Hence, bcheck is able to see/report on the "true" malloc/free, but valgrind isn't.

September 16, 2003

Verisign totally sucks

September 17, 2003

WOPR lives!

The various network incarnations of squyres.com will soon be moving to a new home.
Several ND grad students (and ex-ND grad students) have banded together for this endeavor. We bought an old server from dotgoneassets.com, set it up with Debian (now named WOPR), and shipped it out to a hosting service in Kansas.
It'll take a week or three before everything switches over to WOPR, but the machine has been live on the net since 14:08 CDT today.
Woo hoo!

September 24, 2003

Renice? You must mean kill -9

Bonk.
Today was the first time I've ever experienced totally random BSOD-like behavior from Linux. i.e., a total crash for no apparent reason. I was not installing, re-configuring, or tweaking anything with the kernel. Nor have I done so for quite some time (weeks). I was simply editing a C source file in emacs, when BLAM! My whole system froze and the caps lock and scroll lock LEDs started blinking.
A few quick searches showed that it appeared to be a Linux-related problem (i.e., others have run into it), and that it was a fail-stop problem. So I rebooted (sigh) and opened up my C source file to find it totally trashed. #$@%#@%#@!!!!!!
Also, my dad noticed today that lists.squyres.com was misbehaving -- things he had sent didn't seem to be getting redistributed. I logged in, and sure enough, the load was astronomically high and, well, basically nothing was happening because a) it's a 100MHz machine with b) very little RAM and c) a slow disk. So it was thrashing like crazy and no real work was happening.
Apparently what happened is that over time, with random network outages, various processes piled up until my machine reached Armageddon. Rebooting cleared it all out and the spice started flowing again.
Can't wait to transfer all this stuff to WOPR... (we're having a DNS propagation problem right now -- looks like Tucows may have screwed up our domain entry. Doh!)

October 12, 2003

A new form of spam

This new theorized alliance between hackers and spammers is quite troubling. My web log was spammed 3 times today -- 3 separate comments were posted (two of which were identical, suggesting that it was an automated agent, not a person) that contained links to porn sites.
This is a new form of spam that I haven't seen before. And it sucks.

November 9, 2003

WOPR lives!... umm... again...

I'm finally moving more and more services over to WOPR.
The [lame] web site moved a while ago; DNS hosting moved a while ago; mailing lists moved yesterday, and now I've finally moved my blog.
Woo hoo!

November 30, 2003

Outside of a dog, a man's best friend is a book. Inside of a dog, it's too dark to read.

I recently submitted "a bug":https://sourceforge.net/tracker/?func=detail&atid=100235&aid=849022&group_id=235 to the Gaim instant messenger program (the one that I use). It's a minor detail that has annoyed me for several releases now. I just got an e-mail that it's been fixed and will be included in the next release.
I actually submitted "a second bug":https://sourceforge.net/tracker/?func=detail&aid=849031&group_id=235&atid=100235 at the same time (it's a bit more significant bug than the first one), and although it's not fixed yet, it was assigned to one of the developers within 24 hours, so that's promising.
Yay for open source projects! :-)

December 29, 2003

It's like mah daddy told me once; the only thing bettah than a crawfish dinnah is *five* crawfish dinnah.

In browsing through some music samples on Amazon the other day, I discovered that Windows Media Player sucks. At least 1 in 4 times, when you click on the music sample link, the media player window pops to the front but instead of displaying the song title and starting to connect to the media, it says "hurl" and does nothing.

I've started playing with the "Zinf":http://zinf.sf.net music player on Linux (instead of XMMS). Although it was relatively annoying to compile (it has a lot of dependencies), it has a *much* better music organization structure than XMMS -- it searches and finds all your music and then lists them according to artist and album.
This is by no means a new concept -- but for someone who is used to XMMS's lack of music organization, it's great. It also has a great method of editing Ogg/MP3 tags, so I spent a little time fixing up a lot of my tags. I even discovered that a bunch of my old music was corrupted (some holes in MP3s or Oggs), so I re-ripped them.
There's still a few problems with Zinf that I'd like to see fixed (should use case-insensitive sorts, a few random crashes, better differentiating between the currently-playing playlist and other playlists that may be open, fixed width title displays, etc.), but I think I'll give this a whirl for a while instead of XMMS and see what happens.

January 3, 2004

To NTP or not to NTP

So here's an odd thing.
I mentioned a few journal items ago that I wanted to get a watch that automatically sets its time via the radio signal from Ft. Collins, CO. With the Gramma Cash(TM), I bought myself the "Casio Atomic Shock Tough Solar watch":http://www.casio.com/index.cfm?fuseaction=products.detail&Product=GW300A%2D1V that fills this requirement. Other than the fact that it relies on light to recharge its batteries, I think it's cool. It definitely sets its time every day -- I can see that it almost always has a high "time signal strength" level. It's a little bigger than my old watch (height-wise), but I'm adjusting. We'll see how it goes.
I was bored tonight, and on a whim, I compared the watch's time to the digital clock on my Verizon cell phone. My cell phone doesn't have a seconds display, but hypothetically, they should change the minutes value at more-or-less exactly the same time (as far as the eye can tell, anyway).
Shockingly enough -- they don't. My cell phone changes the minutes value almost a full second before my watch. This is really weird -- both should hypothetically be within milliseconds of "real time" because both are frequently synchronized with a central source (the watch supposedly re-syncs at least 4 times a day).
I suppose there can be multiple sources of error here:
* simple lack of precision (i.e., precision granularity on the order of, say, tens or hundreds of milliseconds) on either device
* the cell phone time is sufficiently far down from a true NTP source that it is actually hundreds of milliseconds off
* the cell phone only synchronizes once in a long time, and when I saw it noticeably different from the watch, it had just drifted a lot
* the martians are pissed off about us sending annoying probes to their planet and have decided to retaliate by skewing all of our clocks by infinitesimal amounts, thereby raising the Earth's Geek Ire Level
* propagation of the radio signal from Colorado to my location
I'm guessing that the last one (radio propagation) is probably the most likely -- the cell phone syncs to a local tower, and the signal distance isn't nearly as far. The watch has no concept of its distance from the signal source, and I'm guessing that there's no negotiation (a la NTP) for it to be able to calculate its time-to-travel from the source. Hence, I'm guessing that if I was in the immediate vicinity of Ft. Collins, Colorado, the time on my watch and my cell phone would be virtually identical.
It's still odd, though. :-)

NTP take two

Hmm. A "quick google calculator search":http://www.google.com/search?hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&q=1164.13+miles+%2F+speed+of+sound&btnG=Google+Search shows that my last theory can't be right. If the signal traveled at the speed of sound (at sea level), it would take over 1.5 hours to get from Ft. Collins to Louisville, KY (and yes, 1164.13 miles is a mapquest distance, so this assumes that the signal is taking I-25 out of Ft. Collins towards Louisville and hits no traffic -- but it's more or less a straight line, so it's a Good Enough(TM) distance to use).
But of course a radio signal doesn't travel at the speed of sound at all -- radio waves are electromagnetic, so no matter what frequency the Ft. Collins station is broadcasting at, they travel at [roughly] the speed of light. "Google shows that the signal would get from Ft. Collins to Louisville in about 6.2 milliseconds.":http://www.google.com/search?hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&q=1164.13+miles+%2F+speed+of+light&btnG=Google+Search
So propagation delay is a matter of milliseconds, not the better part of a second -- it can't be the explanation either. But I'm tired and finding that I don't care too much any more. I just wanted to post some links to google's calculator. :-)
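For the record, both travel times are easy to check without google's calculator -- a quick sketch in Python (using roughly 343 m/s for the speed of sound at sea level):

```python
# Propagation time over the Ft. Collins -> Louisville distance at the
# speed of sound vs. the speed of light (radio waves travel at c).
MILES = 1164.13               # mapquest driving distance from the entry
METERS_PER_MILE = 1609.344
SPEED_OF_SOUND = 343.0        # m/s, at sea level, ~20 C
SPEED_OF_LIGHT = 299_792_458  # m/s

distance_m = MILES * METERS_PER_MILE

sound_hours = distance_m / SPEED_OF_SOUND / 3600
light_ms = distance_m / SPEED_OF_LIGHT * 1000

print(f"at the speed of sound: {sound_hours:.2f} hours")
print(f"at the speed of light: {light_ms:.2f} milliseconds")
```

Either way, nowhere near the ~1 second difference between the watch and the cell phone.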

January 11, 2004

Wise words

January 20, 2004

GNU ddd 3.3.8: missing files

A random technical note for myself (and possible others -- google?) in the future...
GNU ddd (data display debugger) v3.3.8 is missing some files in its distribution tarball. I tried for quite a while to get it to compile, and finally gave up and posted to the bug-ddd@gnu.org list. I actually got a helpful response in a few hours:
bq. Yes, some files are missing in the latest release. It is fixed in the CVS repository. Quick workaround for the moment: get the missing files in gcc 3.3.x in "gcc/include".
Excellent!

February 17, 2004

It's raining mail

In my never-ending quest to find a great mail client, I tried Thunderbird yesterday (the Mozilla mail client).

It seemed like a fine client, but I had some major problems with it. Here are the problems that I found:

Thunderbird’s folder subscription mechanism was both erroneous and did not scale. I have hundreds of IMAP folders — selecting and unselecting all of them (or even picking which ones I want) in a tiny window with no multi-select capabilities was annoying at best. Thunderbird was also convinced that I was subscribed to many folders that no longer existed, so whenever I tried to go in there, Thunderbird would report an error from the IMAP server saying that the folder didn’t exist (duh!). The only way that I could convince Thunderbird that I wasn’t subscribed to these folders was to “delete” the folder (even though it didn’t exist). There were far too many folders like this for me to want to sit through deleting all of them. The subscription mechanism therefore seems to still need some work.

Similarly, the “which folders do I want to be available offline” mechanism doesn’t scale for exactly the same reasons. It would be really convenient, for example, if there could be a quick shortcut for “all of them”.

There seemed to be a nice flexible ruleset mechanism, but it lacked the ability to colorize entries in the index pane. This is a feature that I have grown to love in my current mail client (different kinds of mails are shown in different colors in my index).

The search capabilities were nice (search through all folders, even through the bodies of mails), and they seemed to work well. But it would be nice if you didn’t have to open a separate window to do it — there is a search box in the main window, but it can only search through subject and sender information.

There didn’t seem to be capabilities for multiple “roles” or “personalities” in addition to the ones that you had official accounts for. For example, in my current mail client, I have about a dozen roles — settings that affect the “From” line in my messages, signature, sent-mail box, etc. I have more roles than e-mail accounts for two reasons: 1) all my mail funnels down to one mail server, and 2) my mail server has several different DNS names. That’s a killer feature, and I really need it.

There was no ability to save my settings or addressbook on the server. I’ve only seen one mail client have the ability to do this (pine, my current mail client), but it’s really useful. Especially since I commonly use at least 3 different machines to read my mail, when I store all this info on the server, there’s never a need for manual synchronization (which can be a huge hassle); the most recent version is always downloaded from (and saved to) the server. I have grown totally addicted to this feature.

So I’ll stick with pine for now. I really was looking forward to being able to handle PKI certificates properly and having true IMAP disconnected operation. But oh well….

February 22, 2004

We should've gotten a live chicken

CVS commit message of the day:

It’s an uncommon practice to use strlen(“SOMETHING”) && strncmp(…) as the value for a length parameter to a strncmp. I’m therefore assuming it’s wrong, and fixing it. I’m also going to get some breakfast. I was thinking Lucky Charms, but we’re about out of that, so I may go for bacon instead. Or I could go take my shower and get some donuts. It’s a tough decision.

June 30, 2004

Lies, damn lies, and statistics

milliways.osl.iu.edu, the fourth server on the list, is my lab’s main mail server. It’s a little underpowered mail server running Sendmail and GNU Mailman. It serves all the lists for http://lists.boost.org/, which is why it’s so high in the list.

But in other news, who knew that smtp.gentoo.org and lists.gentoo.org were in Indiana? ☺

July 18, 2004

Slime-vertising

So I ran across a new trick the other day. A slimy, disgusting trick, but it was new to me.

I co-own a server with a bunch of friends that is hosted out in Kansas somewhere. We all host our personal domains out there, a few web pages, and non-work related e-mail. Each of us takes responsibility for different sub-systems on the server. I, for example, am responsible for the web server. To that end, I run a bunch of virtual servers in Apache, one for each web site. For each web site, I set up a “stats page” showing a bunch of interesting (and totally useless) stats for the site: how many hits, IP addresses of those who visit the site, by-hour breakdowns, etc. I used the freeware Webalizer for this stuff.

The other day, Kyle W., one of the other owners, sent me an e-mail asking why the hit count for his stats page alone was over 7000 for the first 2 weeks of July. Well that makes no sense whatsoever:

Kyle runs a small web site and should probably have less than 7000 hits total for his entire site for the entire month

Who on earth would look at the web stats page over 7000 times in 2 weeks?

So I went and had a look at the logs. It took me a few minutes to figure it out, but once I saw it, the pattern was obvious. Little-known fact: when you surf to a web page, your browser usually sends the page that you came from to the web server. That is, if you’re on page A, and you click on a link that takes you to page B, your web browser will automatically send A’s URL to B’s server. This is called the “HTTP referer” and it allows web site administrators to track your progress through their site, figure out what search engines have found their site, etc. This is not new — it has been in the web since the very beginning.

What I saw in the logs was that lots of random different IP addresses were hitting Kyle’s stats page every few seconds (each IP would hit Kyle’s page every 4-10 seconds) with a very specific referring URL — a porn site.

That’s right, a porn site.

So what was really happening was that lots of “zombied” machines (i.e., machines that have been taken over by a virus or a worm and used for nefarious things like this) were hitting Kyle’s page every few seconds with a porn site as the referring URL. They were subverting the real intent of the referring URL — they weren’t listing the URL that they were actually coming from, they were simply always listing the porn site URL. Put another way, they were lying about where they were coming from.

The reason why is a bit convoluted: by hitting the stats page, they were getting the referring page listed (and linked) on the stats page. That is, the stats page lists all referring URLs and how many times they were seen. This usually gives a web site administrator a good idea of where (and how often) people enter the site (e.g., from a particular search term in Google, etc.). So since our web stats page lists all referring URLs, by lying and inserting the referring URL of a porn site, they were getting us to [automatically] link to their porn site.

Additionally, Webalizer counts how many times a referring URL is seen and ranks them. So by hitting Kyle’s page with the porn referring URL 7000+ times, it was easily at the top of the referring URL stats — i.e., it was the most frequently seen referring URL.
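Webalizer’s referrer ranking is, at its core, just frequency counting of the referer field in the access log. A minimal sketch in Python (the log lines, IPs, and URLs below are made up for illustration; the real Webalizer does far more):

```python
import re
from collections import Counter

# Apache "combined" log format: the referer is the second-to-last
# quoted field on each line, followed by the quoted user agent.
LINE_RE = re.compile(r'"([^"]*)" "[^"]*"$')

def top_referers(lines, n=10):
    """Count referer URLs across log lines, most frequent first."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group(1) != "-":
            counts[m.group(1)] += 1
    return counts.most_common(n)

# Hypothetical log lines (IPs and URLs invented for this example):
logs = [
    '1.2.3.4 - - [14/Jul/2004:10:00:01 -0500] "GET /stats/ HTTP/1.1" 200 512 "http://spam.example.com/" "Mozilla/4.0"',
    '5.6.7.8 - - [14/Jul/2004:10:00:05 -0500] "GET /stats/ HTTP/1.1" 200 512 "http://spam.example.com/" "Mozilla/4.0"',
    '9.9.9.9 - - [14/Jul/2004:10:00:09 -0500] "GET /index.html HTTP/1.1" 200 1024 "http://www.google.com/search?q=kyle" "Mozilla/4.0"',
]
print(top_referers(logs))
```

Hit the page 7000 times with the same faked referer and that URL sits at the top of this ranking — which is exactly what the zombies were exploiting.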

Ok, well that’s all fine and good, but still — why go to all this trouble?

The answer lies in how search engines work. Search engines — like Google — rate the importance of a web site by how many other web sites link to it. So what was really happening here is that some porn site has figured out really creative ways to get other sites to link to it — they search out and find web stats pages. They then hit that site thousands of times and get their referring URL to show up (and make it highly ranked on that stats page). The thought is that Google (and others) will notice all these links and increase the importance of the porn site because it’s linked to by so many other sites.

Very, very slimy. I have decided to call this slime-vertising. It’s totally dishonest.

And I’m sure this tactic is a) not limited to this one porn site, and b) an entirely automated process. It was quite surprising how many machines were zombied into doing this (some of them were actually in .mil!). Our stats pages have since moved into a password-protected area on the web site, so we won’t see this problem anymore, but to those of you who have publicly-viewable stats pages, beware! This could be happening to you.
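For anyone in the same boat: locking a Webalizer output directory behind HTTP basic authentication is a one-stanza change in the Apache config (all paths and filenames below are hypothetical; adjust to your setup):

```apache
# Hypothetical paths -- adjust to your DocumentRoot and htpasswd file.
<Directory "/var/www/kyle/stats">
    AuthType Basic
    AuthName "Web stats"
    AuthUserFile /etc/apache/stats.htpasswd
    Require valid-user
</Directory>
```

Create the password file with something like `htpasswd -c /etc/apache/stats.htpasswd kyle`, and the zombies get a 401 instead of a stats page.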

Here’s another form of slime-vertising — one that has been around for quite a while: it’s not uncommon that I have to remove anonymous user postings to JeffJournal that simply contain a link to a porn site (undoubtedly put there by some autonomous bot who found my blog and noticed that it can put anonymous posts with web links in it).
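A crude first line of defense against that kind of comment spam is to flag any anonymous comment that contains a URL at all. A sketch in Python (a hypothetical helper, not actual Movable Type code):

```python
import re

# Matches the obvious ways of smuggling a link into a comment body.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)

def looks_like_link_spam(comment, anonymous=True):
    """Naive heuristic: anonymous comments containing any URL are suspect."""
    return anonymous and bool(URL_RE.search(comment))

print(looks_like_link_spam("Loved this entry!"))                       # False
print(looks_like_link_spam("Visit http://spam.example.com now!"))      # True
print(looks_like_link_spam("See www.example.com", anonymous=False))    # False
```

Obviously this throws out legitimate links from anonymous readers too; it is only meant to show how low the bar is for catching the dumbest bots.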

September 12, 2004

Spamity spam spam spam

Blog spam is pissing me off. I got 5 posts yesterday and another 5 today. While I was deleting them, I got confused and accidentally deleted a valid comment. #$%@#$%#@$!!!!! Sorry Andras. :-(

If I keep getting blog spams, I’m going to have to convert this to a you-must-register-before-you-can-post blog. Sorry folks — it’s not like there’s a million people who post to my blog, but I just don’t have the time to keep deleting these spam posts.

September 14, 2004

Spamity spam spam spam (redux)

Big props to Tony H. for pointing me to the MT-Blacklist plugin to my blog system, as well as the MT-Blacklist clearinghouse (ok, I’m shamelessly linking to both because they’re both quite worthy and quite excellent).

Should help a lot in reducing the spam, and has quelled my desire to immediately turn off un-registered comments.

September 26, 2004

Lions, tigers, and bears, oh my!

The Irish had a decent game yesterday. They kinda fell apart in the 3rd quarter, but other than that, they’re actually looking like they’re coming together as a team. Keep it up, guys!

The main purpose of this entry is fairly frivolous — I’m testing some new features in MT 3.11, and after digging through a lot of web pages, I finally found the version of MT-Blacklist that works with MT 3.11 (my blog software): 2.01b. For anyone else as mystified as me — and that’s probably most people who don’t follow the intimate details of MT and MT-Blacklist releases — you can (currently) only get 2.01b from the MT 3.1 plugin pack.

Don’t get me wrong — I’m quite happy with both MT 3.11 and MT-Blacklist 2.01b. They’re both great pieces of software! It just took a bit to find out which versions match which, and neither has a strictly consecutive version number scheme.

November 22, 2004

It's snowing at a rate of $30/hour

A quick entry today about some tech-related stuff…

I recently discovered that I had been slammed back in September — someone changed my long distance provider to Sprint without my knowledge. When I called Sprint to fix the situation, they were quite helpful — the attendant confirmed all the charges that I had seen and sent all the information to some investigatory department (I should hear back in 20 business days… seems like a long time). She said that it looked like someone had bought a Sprint cell phone at a Radio Shack and put down my home phone number as their number and asked to have their long distance switched to Sprint. I asked her if this was the newest way of slamming people or whether it looked like someone made an honest mistake. She said it looked like a mistake. Hmm. I’m dubious, because the name on the account wasn’t mine and the address was “similar but different” (i.e., on my street, but the number was backwards). I suppose that it’s certainly possible that someone randomly put down a phone number that just happened to be mine and a jumbled address that was remarkably similar to mine… perhaps even one of my neighbors on my street (she obviously couldn’t tell me the name for privacy reasons). But it still seems awfully coincidental. We’ll see what Sprint reports back to me — hopefully, at a minimum, I’ll get a refund of the costs involved with switching to Sprint (!) and the difference in their long distance rates vs. what I was paying before I was slammed (yes, I was getting worse long distance rates with Sprint).

I got a new headset for my cell phone. I like it much better than my old one. It’s a little in-ear thingy with a short, noise-cancelling boom mike (the kind that doesn’t come out beyond your cheekbone). Yummy.

We just switched my church over to a Windoze domain (they had previously been doing just peer-to-peer stuff with a central file server). All in all, it’s been a good transition, but there have been a million tiny “gotchas.” The same procedure applied to 13 machines has yielded different results on all of them — every one failed in different ways (keep in mind that they are fairly well-controlled machines; none of the users have administrative privileges). This is perhaps what I hate about Windows most: the lack of repeatability.

I’ve found at least one, and possibly two software bugs in my maxivan. The first is quite repeatable (I tried several times in the Kroger parking lot to ensure that I wasn’t imagining things). Do the following steps:

Insert key and turn on the car

Wait 4 seconds

Switch into reverse

Back up for 4 seconds (not sure if moving is actually necessary, or just being in reverse)

Switch into drive

The rear-view camera may stay on for several seconds more, but the map disclaimer will eventually come up. Click OK, and you’ll be taken to the map. From here on out, the “Audio” button is non-functional — you can’t get to the stereo screen.

I haven’t taken the time to make the second bug repeatable yet, but it has happened to me multiple times: if I turn on the car and the CD starts playing (i.e., the CD was playing the last time the car was turned on), when I go to the audio screen, I can’t use the touchscreen to move away from the CD (e.g., to switch to XM radio). I can push all the mode buttons (FM, AM, XM, etc.), and the push clearly registers on the screen (i.e., the button changes color like it was selected), but then the button unselects itself and goes back to CD (the CD is playing uninterrupted the whole time). The “mode” button on the steering wheel seems to be the only way to change out of the CD player (note: I haven’t tried with the backup stereo controls, nor the back seat stereo controls). Need a little more testing to nail this one down, actually.

I got the weirdest message from Norton Anti-Virus on my Mac the other day (many ask: “why do you bother to have anti-virus on an OSX machine, anyway?” After this message, I’m not sure!): “Norton AntiVirus AutoProtect could not continue. Please reinstall Norton AntiVirus and restart.” Here’s my response: “Dear Norton: PC users may be used to this crap, but I am not. If you stop working for no apparent reason (I hadn’t done anything to Norton when this message suddenly appeared on my screen), then I won’t use you. Buh-bye.”

November 25, 2004

Lost mail

I just found out that due to a bad setting in my OSX Mail application, I’ve lost 1.5 months of sent mail (it was automatically truncating my sent-mail folder). I have all mail that I’ve sent since the mid-90’s, except for September 1, 2004 through mid-October, 2004.

December 1, 2004

Really stupid spammers

I got over 50 spam posts to my blog yesterday. The thing is, 95% of them were from a really stupid spammer. It looks like someone is using a bot network to post to MT blogs (because I get hit with the same message from lots of different IPs), but the message doesn’t advertise anything, and doesn’t contain any URLs. It looks like the spammer left the default / template blog post message and forgot to fill in a targeted message to advertise whatever they are trying to scam / sell. Here’s the message that is being posted over and over in my blog comments:

You are invited to visit the sites about… Thanks!!

It’s amusing to think that someone is actually paying for this (i.e., spammers typically hire bot networks for this kind of thing, so the spammer is paying for this, but isn’t getting any advertising at all), but it’s also sad to think that the stuff is so easy these days that any idiot can do it.
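Since the giveaway here is one identical body arriving from many different IPs, even a trivial filter would catch this particular spammer. A sketch in Python (all names and sample data invented):

```python
from collections import defaultdict

def find_bot_spam(comments, min_ips=3):
    """Return comment bodies posted verbatim from at least min_ips distinct
    IP addresses -- the signature of template-driven bot spam."""
    ips_by_body = defaultdict(set)
    for ip, body in comments:
        ips_by_body[body.strip()].add(ip)
    return {body for body, ips in ips_by_body.items() if len(ips) >= min_ips}

# (ip, comment body) pairs; the repeated body is the one from my logs.
comments = [
    ("10.0.0.1", "You are invited to visit the sites about... Thanks!!"),
    ("10.0.0.2", "You are invited to visit the sites about... Thanks!!"),
    ("10.0.0.3", "You are invited to visit the sites about... Thanks!!"),
    ("10.0.0.4", "Great entry, Jeff!"),
]
print(find_bot_spam(comments))
```

Real anti-spam tools (MT-Blacklist included) are far more sophisticated, but this particular spammer would not have required any of that sophistication.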

December 13, 2004

Technology gives us free time

Wow — how’s this for suckage? Another volunteer was installing a pair of DSL modems at my church to connect the LAN in the main building to a building about 800+ feet away (i.e., one modem at the main building and another at the end of a dormant twisted pair that was laid out to the remote building several years ago):

One thing that was driving me nuts with the DSL installation was the fact that the head DSL modem would not work when the EH end was plugged directly into the Dell switch with a straight-through CAT-5 cable. I got it to work by plugging the cable from the Dell switch into a small hub I brought along, then plugging the DSL modem into this hub. I thought it strange that I was getting a connection from the Dell switch to my hub using a straight-thru cable. Usually one uses either a crossover cable or the “stacking” connection on the hub to connect a switch to a downstream hub. Unless, thought I, the Dell switch is auto-sensing and reverses the input & output connections to accommodate either a straight-thru or a crossover cable.

A trip to the Dell website confirmed that auto-sensing is a “feature” of this switch.

I called Black Box, who made the DSL modems, and talked to a tech about this. He said that the modem is auto-sensing, too, so two auto-sensing devices were fighting each other, never figuring out which polarity is which. Inserting my “stupid” hub in line settled the issue: both the Dell switch and the DSL modem saw a regular hub connection.

December 29, 2004

The Ballad of Sir Camcorder

Tracy and I got ourselves a camcorder for Christmas because we felt that, as responsible capitalistic American parents, we needed to spend too much and get one (all the other parents have one — shouldn’t we?). So without doing nearly enough web research, we went to Best Buy and bought a middle-of-the-road-but-still-quite-expensive digital camcorder (the Best Buy sales girl was actually quite knowledgeable and helpful).

Still, there were endless debates about what model to buy, what camcorder tapes we needed to get, what to do with the video after it was recorded, etc. It turns out that we don’t already own a DVD recorder (not even in my Mac — I didn’t get a SuperDrive; I only have a regular CDRW drive), so what kind of DVD writer did we need? And what software would we need to mix/make movies and record on DVDs?

We ended up loading up on all kinds of stuff: a camera, some extra tapes, Pinnacle [Windoze] software, and an external DVD burner. The rationale here was that Mac assumedly had good movie editing software, so it would be nice to work on either the Mac or the PC.

Can you say, “Best Buy target audience”?

Well, it ended up sucking. ☹

The camcorder is fine — as advertised, it’s a middle-of-the-road camera, and all the reviews that I found for it on the net are fairly positive (good, but no one’s trying to make production-quality movies with it). It’s the external DVD writer that I ended up in a battle of wills with.

Note to self: Refrain from entering a battle of wills with inanimate objects; you’ll lose.

I should mention that I made a fatal flaw common to Mac owners — I assumed that the stuff would “just work” and didn’t try the DVD burner until after I had spent all day mixing the video, still photos, music, etc., into a movie and was ready to burn it onto a DVD. Doh. I should have realized that the burner was not Apple hardware, and that assumption did not apply. ☹

I should also mention that the Mac software iMovie is pretty nice. I know nothing about video editing, but after messing around with it all day and reading its online help, I made a decent movie for a first-timer. Its companion package for creating and writing DVDs, iDVD, however, seems a bit less mature than iMovie and has some rough edges (and was actually the source of several of my problems, it turned out). Hopefully, the next version of OS X will have an improved version.

The first burner I got simply didn’t work with Macs at all (despite what the guy told me at Best Buy). So I returned it. Best Buy was very cool about it; it was their mistake, so they swapped it with no hassle whatsoever. I got one that supposedly did work with Macs — a Plextor 716UA. But after I plugged it in, I couldn’t get any DVD ROMs to be recognized by OS X. Hrm.

Needing a break, I started going through all the rebate paperwork from Best Buy (we got somewhere around 37 receipts and rebate forms for all the various gear that we bought). Lo and behold, for buying the camcorder, we were supposed to get a free copy of the Pinnacle software that I just bought! So I returned the software and filed for the free copy (hypothetically, we’ll see it in about 8 weeks).

Returning to the fight-the-DVD-burner project, I finally found small print on apple.com that says that iDVD (and OS X) only works with Apple SuperDrives; it doesn’t write to any other DVD writers. Arrgh!

Even the excellent Patchburn utility (which patches OS X to recognize third-party burners) couldn’t save me: I was still unable to get iDVD to write to my Plextor. It would go through all the motions and even write a little data, but then it would fail with one of a few different errors (this would even happen if I simulated writing to the DVD!). I’ve written to the Patchburn author to see if he’s interested in fixing it; we’ll see where this goes. I can successfully burn DVDs from my PC, so I know that the burner is ok. It’s something wrong in OS X / iDVD / astral alignments.

I may end up returning this DVD burner and getting a [cheaper] internal one to put in our Windoze PC (i.e., so much for using the nice iMovie software).

January 13, 2005

'tis the season for disk failures

Wow — two disk failures in totally disparate systems in a single day.

The first is the [Linux] server at my church. It’s a Dell PowerEdge server with 3 disks running in a RAID5 configuration (which I have Kim B. to thank for convincing me that “what the heck; we might as well run RAID5, right?” — woot!). Earlier this week, the server stopped responding entirely. The on-site staff rebooted it and it all came back fine, but nothing showed up in the logs as to why it was failing. So we scratched our heads and went on with life. The next day, it “half-died,” meaning that Samba kept working fine (which is the main purpose of the server), but incoming ssh connections would hang halfway through the authentication.

The on-site staff connected a monitor to the machine and saw that there were SCSI errors on the console (but not in the logs!). I got in early the next morning and found that one of the three disks was issuing media errors, so I forced it offline. RAID5 took over without missing a beat and no data was lost. Woot!

The machine is still under warranty, so Dell overnighted a new disk that should be there today. The hardware RAID is hot-swappable; I’m told that I can just plug in the new disk and it will automatically start rebuilding the RAID5.

Cool.

One of the two brand-new SCSI disks in WOPR (the server that hosts squyres.com and several other friends’ domains) also died. It’s running software RAID (RAID 0+1, IIRC), and the disk had been issuing warnings for a few days before finally failing completely last night. This caused massive badness in the machine (unresponsiveness, inability to hard reboot, etc.). Bmoore spent a good amount of time with Jason on the phone (the on-site tech); they managed to coax it back into life by somehow convincing the software RAID that the disk wasn’t there (the RAID was failing in odd ways when it thought that the disk was there, bringing the entire machine down).

I bought these two new disks in early December from a low-cost supplier (no names mentioned). Luckily, they have a 365-day warranty. The failure occurred supposedly during their business hours, but I couldn’t get anyone on the phone. According to their web pages, after I filled in a web form for warranty service, I can supposedly expect an RMA number within 2 days (!). Apparently (it’s not 100% clear from their web pages), we have to ship the disk back to them and then they’ll ship us a new one.

So I’m thinking that it’ll be at least a week before we get a new disk — squyres.com will be running without RAID backup for the entire time.

Just contrasting this with my experience from Dell, I’m probably never going to buy from these low-cost vendors again. The immediate/no-hassle/no-fuss service from Dell was worth the extra cost.

The Ballad of Sir Camcorder, II

My new LaCie external DVD burner arrived yesterday.

Oddly enough, their software installer for OS X just wouldn’t work — it would tell me that I didn’t have permission to install their software, even if I was running it as root. I called their tech support, who had me manually copy one file under /Library, and then life was good.

iDVD doesn’t support this drive (it apparently only supports the Mac SuperDrive), but at least I can use iDVD to burn to an image file and then use Disk Utility to burn to this DVD drive, which is definitely an improvement over before (where I had to convert the .img to be an ISO image and then copy it to a Windoze machine to burn it). So it’s still not optimal, but it’s good enough for me.

January 14, 2005

It's Raining Failed Disks

In a previous entry, I mentioned how I had two disk failures within 24 hours in two totally different machines. I found out that in the same week, a friend found 3 bad disks in another RAID in one of his systems.

I smell a global conspiracy!

So the new disk arrived from Dell. When I went in to install it, I was surprised to find that that disk, too, was bad (!). Specifically, it wouldn’t even start up — the RAID controller immediately identified the disk as “failed.” I called Dell tech support and spent over an hour convincing the support guy that the replacement disk was bad and that we needed [another] new one.

That’s not nearly as bad as it sounds — the Dell guy was actually quite knowledgeable, but he [rightfully] insisted that we try an exhaustive set of tests to ensure that the disk really was bad. Unfortunately, I was doing this in the evening, and it was long past the deadline for overnight delivery, so it’ll be delivered after the holiday next week.

For my own system, I submitted a web form the night of the failure to get an RMA number. I then called them first thing in the morning to see if I could convince them to send me a new disk before they received the old disk back. It took 13 minutes to look up my order (apparently I had committed the cardinal sin of having the disk shipped to where the server resides, so I didn’t have the invoice). The invoice is apparently the sole place on the planet where my customer order number exists; it wasn’t even on the receipt (which I have). Weird.

Once he found my order, I explained my situation (including telling him that I had submitted the RMA form the night before). He issued me an RMA number on the spot. After much groveling, cajoling, and begging, the guy also grudgingly agreed to send me the new disk right away. So I’m very grateful for that, but it seemed pretty odd that I had to go through so much trouble for reasonable customer service.

And then, to top it off, towards the end of the day (i.e., several hours later), I got an e-mail response to the RMA form that I submitted the night before with a new RMA number.

January 16, 2005

The results are in

February 24, 2005

My mac is haunted

In general, I’m quite happy with my Mac. It “just works” most of the time, and I don’t have to screw around with it to get it to do what I expect. However, there are a small number of things that it doesn’t do right; they’re all small enough to be annoying but not show-stoppers:

I think the filters/speakers on the left/right of my keyboard are dirty/clogged. Just moving my hands over them causes the screen to dim. Brian tells me that lots of other people have been complaining about this since the 10.3.8 update — perhaps this one will get fixed soon.

I cannot play to Airport Extremes through iTunes. I can see them listed in iTunes in the bottom right, but when I try to play to them, nothing happens — music keeps coming out of my machine, not the Airport Extreme. This has happened to me in at least 2 different places.

My Codetek virtual desktop sometimes misbehaves. It sometimes flips off to another screen and then flips back. This usually happens when I’m flipping to another screen — e.g., I’ll flip flip flip (i.e., three desktops away), but Codetek will go to the 4th desktop, and then flip back to the 3rd.

If I have a message half-composed in Mail and I switch off to another application, I’ll frequently get a popdown window saying “The message cannot be saved.” However, if I switch back to the compose window and hit Apple-S, the message saves properly in the Drafts folder.

I have the full Acrobat reader installed, but if I pull up any PDF document in it, it will not print more than 1 page. More specifically, if I print the document from within Acrobat, Acrobat goes through the motions of printing all N pages (i.e., I see the print dialog and it lists that it’s printing N pages), but only one page comes out of my printer. I actually think that this is a problem with my print setup at home — my Mac automatically finds the CUPS print server on my Linux workstation (to which there is an HP LaserJet attached). I’m guessing that there’s some kind of wonkiness there in the data that the Mac sends and CUPS is able to understand. As further proof of this, I have never been able to get documents to print in landscape mode from my Mac (I select landscape in the print setup dialog, but they come out on the printer in portrait).

Has anyone else seen these kinds of things? Any hints would be appreciated…

Updates

Here’s some more things (added after the original post) that my Mac does:

Emacs periodically freezes and I have to “Force quit” (this is the Carbon Emacs that I downloaded from Brian)

That same Emacs, when used with “-nw”, goes into a spinning loop of death and needs to be kill -9’ed from a different window

It never remembers wireless networks that don’t have WEP passwords (even though it asks me if it should remember them for the future)

MS Office will typically (but not always) launch behind all other windows, even if I’ve double clicked on (for example) Excel and sat there waiting for it (i.e., not changing the focus to anything else)

March 1, 2005

Software is great until users screw it up

Every month, our e-mail list server sends out “reminder” e-mails to everyone who is subscribed to all the lists that we provide. These e-mails are essentially, “Hey, don’t forget that you’re subscribed to this list. If you don’t want to be subscribed to this list anymore, here’s how you unsubscribe…”

Periodically, we get irate users who reply to us about various things — sometimes justified, sometimes not. Sometimes the angry replies are so misguided as to be quite amusing. Here’s one — including my reply to him — that came in this month:

On Mar 1, 2005, at 7:32 AM, L. User wrote:

I just realized for the first time, each time you send me a membership reminder, you send me my password, UN-ENCRYPTED.

Correct. This is how the GNU Mailman software works — you will find the same setup on a very large number of other mailing lists around the internet. The fact is that there is no good way to have 100% secure passwords in software that is driven by e-mail. In order to reach a wide variety of people with a huge diversity in mail clients, one cannot assume encryption. This was not our decision — please feel free to take it up with the GNU Mailman developers.

On the list subscription page, it quite clearly says:

“You may enter a privacy password below. This provides only mild security, but should prevent others from messing with your subscription. Do not use a valuable password as it will occasionally be emailed back to you in cleartext.”

Are you people nuts?

No.

I use that password for all my email accounts!!!!

If you truly are a security-conscious person, you:

a) should not do that
b) should not have admitted that in an un-encrypted e-mail

Now stop doing that, or else un-subscribe me.

If you are unhappy with the [free] service provided to you, please feel free to unsubscribe at the following URL: http….
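For what it’s worth, the sane habit on the subscriber’s side is to give every mailing list a random, throwaway password that is never reused anywhere else. A sketch using Python’s secrets module (the length is an arbitrary choice):

```python
import secrets

def throwaway_password(nbytes=9):
    """Generate a random, low-value password for a mailing-list
    subscription: something you won't mind seeing echoed back in
    cleartext by Mailman's monthly reminders."""
    return secrets.token_urlsafe(nbytes)

print(throwaway_password())
```

Then a reminder e-mail containing that password in cleartext discloses exactly nothing of value, which is the threat model Mailman’s warning text assumes.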

April 22, 2005

Finally!

I’ve been begging my editor for well over a year, and he finally managed to deliver old editions of my MPI Mechanic column in PDF form in the ClusterWorld magazine to me. I only have the first seven so far, but I hope to get current (i.e., the current publication minus about three months) and stay current in the near future.

In his defense, my editor was much more interested in getting a solid, repeatable process in place for being able to deliver the PDFs on a regular basis than a one-time shot. Which is at least some defense as to why it took so long. :-)

May 13, 2005

"world wide web in the air"

Too funny not to share — from Katie S.:

I was at a hotel last night in St. Peters, Missouri - the reason I picked the hotel was because it was new and mentioned free wireless internet on the hotel website. When we got to the hotel, my laptop was not picking up any networks. So I went to the front desk and asked what the SSID was. The guy behind the desk tried to explain to me that the “world wide web is in the air,” thus I did not need any “IDs.” I then explained to him that my computer did not see any “world wide web in the air.” Finally, the manager’s son came in and told me the SSID was DSL. This sounded fishy…after grilling the two guys a little more I found out there was no “world wide web in the air” - it was all in “plastic holes in my room.” Ahhhh… after I had them give me an RJ-45 from one of their computers (I don’t carry these anymore…but I should), I was finally able to access the web from the plastic hole. Woohoo for technology on the road!

May 30, 2005

By the way, the Latin word for "yam" is diosporia

I just bought iLife ’05, the newest version of the Mac photo / audio / movie / DVD creation software suite. It’s got some nice new features — and I’m actually pretty happy with it — but it seems to be a little buggy. I used it to make a DVD of all the little video clips that we’ve been making of the munchkins over the last few months, and made up 2 photo slideshows too. The end result was kinda nice. But I have a couple of quibbles:

It would be nice if there were tighter integration of iMovie and iDVD. As it is now, once you make the movie and export it to iDVD, you can’t really do any meaningful edits to the movie. You can’t add anything to or delete anything from the timeline, you can’t edit the DVD chapter markers, etc. An example of where this matters: you “finish” the movie, export it to iDVD, spend the several hours it takes to render the DVD, burn it to a DVD, go view it, and realize that there are now some more edits that you need to make. So you fire up iMovie again, make the changes, but then realize that you have to re-import it to iDVD, replace all your pictures and themes in iDVD, etc. #$%#$%

When you make DVD chapter marks in iMovie (in preparation for exporting to iDVD), there seem to be cases where they “stick” to an absolute point in time. That is, if you add a clip or a transition, the DVD chapter markers don’t move with the video — they stay stuck at time N, whereas the frame they used to be associated with moved to time M. This means you have to delete all the chapter marks and put them back in again, which is fairly annoying. This page seems to have a bunch of details on this — some of these issues are clearly bugs.

I found an interesting bug in the iPhoto slideshow transitions — when you use the “droplet” transition from a vertically-oriented picture to a horizontally-oriented picture, it doesn’t seem to work right. That is, before the transition starts, the vertically-oriented picture is shown on the screen with a lot of black background on the left and right of the picture (which makes sense — the picture is taller than it is wide). When the transition starts, you can see the droplet effect happening within the boundaries of the vertical picture, but the black sides are immediately replaced with the next picture (i.e., no transition).
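The chapter-marker quibble above comes down to storing markers as absolute timestamps rather than anchoring them to the clips they belong to. A toy illustration in Python (hypothetical, and certainly not Apple’s actual data model):

```python
def insert_clip_absolute(markers, insert_at, length):
    """Absolute-time markers do NOT move when a clip is inserted --
    this is the sticky-marker behavior described above."""
    return list(markers)

def insert_clip_anchored(markers, insert_at, length):
    """Anchored markers at or after the insertion point shift along
    with the video they were attached to."""
    return [m + length if m >= insert_at else m for m in markers]

markers = [10.0, 60.0, 120.0]  # marker times in seconds (made-up values)
# Insert a 15-second clip at the 30-second mark:
print(insert_clip_absolute(markers, 30.0, 15.0))  # markers unchanged
print(insert_clip_anchored(markers, 30.0, 15.0))  # later markers shift by 15
```

In the first scheme, every marker after the insertion point now points at the wrong frame, which is why the only recourse is to delete them all and re-place them by hand.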

OTOH, there’s a bunch of features that I really do like in the new version:

iDVD now recognizes my LaCie DVD burner (read: a non-Apple-native DVD burner), so I don’t have to do whacky workarounds to get my movies to burn to DVD — it works straight from iDVD.

The slideshows in iPhoto are pretty cool. Even Tracy liked them (she made the second slideshow on the Munchkin DVD).

Some of the themes in iDVD for the menu screens are pretty technically impressive. You can drag your own pictures in there and the themes move the pictures around — with lighting and shading. That’s pretty neat for a templated feature.

iDVD now shows you where it is in the rendering process. This may sound like a minor detail, but given that the process can take several hours, this is really handy.

Tracy was so impressed by the whole thing that she even brought up the topic of getting a mac for her to manage all the family pictures, movies, etc. Plus, right now, I’m doing all this stuff on my IU-owned Mac. While Lummy really doesn’t care (as long as it doesn’t interfere with my work — which it doesn’t), it would be nice to get a machine that is a) totally owned by us, b) a little faster, and c) has a much bigger hard drive. A Mac Mini or iMac might be in our future…

I had ceiling fans installed in two bedrooms this morning. The circuit breakers in the basement were poorly labeled (“bedroom”); my home office (containing all my computers) is one of the “bedrooms,” so it was a total guess as to which breaker controlled which bedroom.

Also, there is a very, very old public key of mine on the MIT PGP servers. It’s way out of date, I no longer have the secret key, and it is therefore irrecoverable. If you send me something with that key, I cannot read it. Just say no — use the key that is linked, above.

June 28, 2005

Super Karate Monkey Bummer

The signature-changing functionality of the MailEnhancer plugin for OS X Mail.app doesn’t work with Tiger (10.4.1). This is a major bummer. The sig-changing functionality was the only reason that I used MailEnhancer.

It looks like Mail.app changed its behavior to aggressively reset the signature (presumably after MailEnhancer changed it).

What’s worse is that the author has disappeared. His web page has gone 404, and no one seems to know where he is. So no one has the source code to MailEnhancer, and there are a lot of other bummed users out there.

June 30, 2005

OS X Spotlight problem: solved

So we just recently got an iMac at home. After setting it all up, I was confused and disappointed that Spotlight (OS X’s new “search your whole computer” tool) was not finding keywords from most of my e-mails. It would find one or two e-mails for keywords that should have matched dozens.

What the heck?

It took me a while to figure this out. I found help from the handy mdutil, mdfind, and mdimport man pages. The crux of the issue: Spotlight automatically ignores entire directory trees whose names begin with a period (“.”). Doh!

Let me explain…

On my mail server, all the folders under my INBOX are stored in $HOME/.elm/Mail (only the crustier Unix folks reading this will recognize how old that is — I stopped using elm many years ago, but have always left my folders down there and pointed my mail client at them. It was just easier). True to form, I put “.elm/Mail” in my IMAP directory setup on the OS X Mail client. OS X dutifully found all my folders, and everything was fine.

However, unbeknownst to me, OS X Mail was caching my e-mails locally in a directory called “.elm/Mail/…”. This makes perfect sense from OS X Mail’s point of view (allowing potentially multiple different prefixes from a single IMAP server), but it caught me by surprise. Additionally, since Spotlight ignored the entire “.elm” tree, it didn’t index any of the mails below my INBOX.

Problem solved by renaming .elm to something else on my mail server (something that does not begin with “.”). I’ve finally been dragged into the 21st century.
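The pruning rule itself is simple. Here’s a minimal Python sketch (my own illustration, not Spotlight’s actual code) of an indexer that skips dot-directories the way Spotlight does:

```python
import os

def indexable_files(root):
    """Walk a tree the way a dot-pruning indexer would: any directory
    whose name starts with "." is skipped entirely, along with
    everything underneath it."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune dot-directories in place so os.walk never descends
        # into them -- this is why everything under .elm/Mail was
        # invisible to the indexer.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.startswith("."):
                found.append(os.path.join(dirpath, name))
    return found
```

With my old layout, everything under $HOME/.elm was pruned before the indexer ever looked at a single message.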

August 28, 2005

On CUPS, IPP, OS X, and home area networking with multiple routers

It took me a while to figure this out (all of which, in hindsight, makes perfect sense, of course), so if I can save myself, or someone else, some time in the future, that’s a Good Thing…

Background:

I recently got Vonage, which, for the purposes of this conversation, means that I had to add another router into my home network. This router must go right behind the DSL modem in order to guarantee quality of service for the telephone traffic. However, this Vonage router (Linksys, in this case) does not have wireless capabilities, so this unfortunately means that my D-Link wireless/wired router must go behind the Vonage router. It looks like this: DSL modem -> Linksys (Vonage) router -> D-Link wireless/wired router -> the rest of my machines.

I have a Linux box that must be ssh-able from the outside. Hence, it really needs to hang off the Vonage router, and have port 22 port forwarded to its local IP address (more on this below — see the Hindsight section).

This same Linux box also has an HP printer hanging off it (local parallel connection) that all computers in my home should be able to print to.

I have several wireless clients in my home (laptops, handhelds, etc.).

Problems / Confusion:

The D-Link router does not allow me to disable NAT, so it must be a separate subnet from the Vonage router. This is one of the biggest problems — if I had been able to have the two routers create one logical subnet (e.g., with a default gateway from the D-Link->Vonage, and a static route from the Vonage->D-Link), everything might have worked out significantly more easily.

Part of my confusion was that my OS X clients used to just “find” the printer hanging off my Linux box when they were connected to the network. Once I went to this new configuration, my OS X clients would no longer “just find” the printer.

Analysis:

It turns out that CUPS (http://www.cups.org/) uses the Internet Printing Protocol (IPP) as the backbone for all of its printing. CUPS can also be set up to advertise its printers via IPP using UDP broadcasts (which my Mandrake 9.2 box did by default). This is how my OS X clients used to “just find” the printer before, when they were all on a single subnet. But now that they’re on different subnets, this UDP broadcast doesn’t cross the boundaries — if I connect a client to the Vonage network (i.e., the same subnet as the Linux box with the printer and CUPS server), it “just finds” the printer, as usual, and everything is fine.

However, this doesn’t help printing from the D-Link subnet (e.g., my wireless clients).

On my D-Link OS X clients, I tried manually adding an IPP printer, but that never worked. Specifically, I would enter the IP address of the Linux server (on the Linksys router) and the printer name, and then try to print something to it. OS X would try to print and then report that printing had “stopped” with no other explanation.

Looking at the CUPS logs on the Linux server, I saw that it was replying “Hey, there’s no such printer here.” Looking even closer, it looks like the clients were posting to the URI path /ipp/<printername>, which is what CUPS was insisting did not exist. Looking further back in the logs at jobs that did succeed, I saw that they had posted to the URI path /printers/<printername>. So somehow OS X was inserting the prefix /ipp instead of /printers. How to fix this?

It seems you can’t fix it from the OS X “add printer” GUI (either 10.3 or 10.4). You have to manually edit /etc/cups/printers.conf to reflect the correct URI and then kick the local cupsd (i.e., kill -1 it). On 10.3, this seems to just cause a cupsd reload; on 10.4, you may need to wait a few seconds for launchd to restart cupsd.
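The manual edit boils down to rewriting the queue URI’s path prefix. A hypothetical helper (my own sketch for illustration; the host and queue names below are made up) showing the transformation:

```python
def fix_cups_uri(uri):
    """Rewrite an IPP device URI that uses the /ipp/ path prefix
    into the /printers/ prefix that CUPS actually serves queues
    under (e.g., the DeviceURI line in /etc/cups/printers.conf).
    Hypothetical helper, not part of CUPS or OS X."""
    scheme, rest = uri.split("://", 1)
    host, path = rest.split("/", 1)
    if path.startswith("ipp/"):
        path = "printers/" + path[len("ipp/"):]
    return "%s://%s/%s" % (scheme, host, path)
```

After rewriting the URI in printers.conf, kick cupsd so it re-reads the file.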

Once this was working, I could see that print requests were correctly spanning the entire distance from an OS X client, across the wireless, across the D-Link, into the Linksys, to the Linux box, and to the CUPS server. However, jobs were still not printing. Looking at the server CUPS logs again, they were resulting in a “403” HTTP error code every time (“Access Forbidden”).

This was really weird — in my cupsd.conf file on the server, I have a block similar to:

<Location />
Order Allow,Deny
Allow from All
Deny from All
</Location>

Watching the server’s CUPS logs (using LogLevel “debug2”), I could see that it wasn’t barfing on the config file, and it was using the “/” Location for the permissions on this printer. But it stubbornly gave 403’s for all accesses until I deleted the Deny clause! Specifically, I had to do the following:

<Location />
Order Allow,Deny
Allow from All
#Deny from All
</Location>

And then it worked just fine (i.e., got “200” HTTP responses instead of “403”, and jobs would end up printing). According to Apache conventions and the CUPS documentation, my first version should have worked fine — the Allow clause should be examined first and then the Deny clause should be examined. But only by not having a Deny clause (effectively making an empty deny conditional) did it work. Just for completeness, I tried “Order Deny,Allow” and got exactly the same results (although that is what should have happened in that case). I tried many carefully-constructed cases (kicking the cupsd every time, of course), but could never get this to work properly until I commented out the Deny clause.
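For reference, here is one reading of Apache-style Order semantics, as a minimal Python sketch of my own (not CUPS source): both lists are always consulted, and the Order only controls which clause type is evaluated last — and therefore wins when both match.

```python
def access_allowed(order, allows, denies, client):
    """Sketch of Apache-style Order evaluation. allows/denies are
    lists of client names; "all" matches everyone. The clause type
    named last in Order is evaluated last and wins when both
    lists match the client."""
    allowed = client in allows or "all" in allows
    denied = client in denies or "all" in denies
    if order == "allow,deny":
        # Deny is evaluated last: a matching Deny overrides Allow
        # (and with no matches at all, the default is deny).
        return allowed and not denied
    else:  # "deny,allow"
        # Allow is evaluated last: a matching Allow overrides Deny
        # (and with no matches at all, the default is allow).
        return not denied or allowed
```

Note that under this reading, “Allow from All” plus “Deny from All” is denied under Order Allow,Deny (the Deny clause is consulted last and wins), so part of the confusion may be the Order semantics themselves, in addition to whatever the parser was doing.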

I downloaded the 1.1.19 source and had a quick gander through it. It appears to have the Right code in it for checking the Order, but it somehow appears that the parser is always reading in the file as “Deny,Allow” instead of “Allow,Deny” (I double checked that it was reading the config file that I thought it was reading by introducing syntax errors and saw that they were reported in the log). I’m not sure how this was happening, and I ran out of time before tracking down the problem in the parser. Perhaps someone else will have the time to figure this one out (I have a wholly unremarkable cupsd.conf file). And perhaps it’s fixed in later versions of CUPS (http://www.cups.org/ says that the current version is 1.1.23).

So, in hindsight, these were the three Big Things to do (or that I should have done):

Manually add the IPP printer on the OS X clients by IP address and queue name

Move everything down to the D-Link and simply set up port forwarding from the Linksys to the D-Link to the host that I need to ssh to. This should [hypothetically] work, in terms of port forwarding (i.e., it will work for any real router; I’m not sure of the exact capabilities/bugs of these two broadband routers). That would put all hosts back on one subnet, and the whole IPP-UDP-broadcast-not-spanning-multiple-subnets problem goes away. I may actually try this in the near future as it would simplify a lot of things.

I should have been able to use LPR-style printing (CUPS supports the server-side of LPR as well) — i.e., don’t rely on IPP self-advertising printers, but rather configure the clients manually to talk to the LPR server (via IP address and queue name). However, when I tried this from OS X clients, although the print job did end up issuing on the printer (LPR doesn’t support authentication / authorization, so there were no permissions issues), somehow it ended up printing a stream of postscript text instead of the actual formatted output. I’m not sure where the translation was lost (i.e., why the printer / driver didn’t realize that the job was postscript and do whatever translation was necessary), but I abandoned this attempt because I wanted to use IPP so that clients would automatically download the relevant PPD files from the CUPS server.

September 10, 2005

It's not a total loss; the phone still works

It’s amazingly difficult to find a corded (i.e., not cordless) 2 line phone for the home that doesn’t suck.

A few weeks ago, I got a second phone line at home (via Vonage), mainly for work-related stuff. Since then, I’ve been looking for a reasonable 2 line phone for my desk. I had only a few requirements:

Not cordless

Caller ID built in for both lines

Support caller ID for call waiting for both lines

A bunch of speed dial buttons (15 or so)

That wouldn’t seem too difficult to meet, right?

Wrong.

I tried several phones from the local brick-n-mortars (Best Buy, Staples, Office Depot) and I returned every single one of them. I could not believe a) how poor the selection was (there’s only a small number of corded 2-line phones), nor b) how badly they all worked. They were all major-brand phones, too, not some small company making novelty phones. The phones were either remarkably sucky (see examples below), or they were exorbitantly expensive and had way more features than I would ever need (I don’t need a phone waterproof to a depth of 50 meters [particularly with the 6 foot phone cord it came with] for my desk).

Here’s some of the things that were broken:

One phone would continually “chirp” that the second line was ringing (when the second line was not, in fact, ringing).

If I was on line 2 and line 1 rang, one phone would hang up on line 2 and answer line 1 (without me pressing anything).

Caller ID worked about 50% of the time on one phone (it would get stuck in a screen displaying “Waiting for caller information…”). I know that this was the phone’s fault because I have a small standalone caller ID device that was already showing the information at the same time as the phone would get stuck (disconnecting the standalone caller ID device made no difference).

I ended up doing some research online (yay Froogle) and found a bunch of phone sellers, but only a few that had non-PBX multi-line phones. Of these, I found what I think is the brand of phones that they use at Indiana University (which have always been reasonable). So I ordered an Aastra 2-line phone.

I received it the other day, and it seems to be working great. It’s sad that I’m excited that my phone is working — phones are supposed to be “just work” technology. But I do like some of the features it has that I haven’t had before — a voice mail light (similar to cell phones), and speed dial buttons that let you program in a name that displays on the screen for both outbound and inbound calls (i.e., it doesn’t just show names that are programmed in the directory and/or caller-ID strings).

Some sidenotes about the whole process:

Vonage seems to be working out well. Most of the time you can’t tell that it’s “special” — the phone just works and the phone quality is just fine. Periodically, I get a call with lots of echo or one that just sounds weird, but that’s fairly unusual (and it periodically happens with POTS, too, although less often). The best way that I can think to describe Vonage is slightly “less” than POTS, but waaay better than a cell phone. I’m not ready to switch my POTS home phone to a much cheaper long distance service (and use Vonage for everything, because it’s unlimited), but I’m getting close.

My Vonage phone number is actually local to Bloomington, IN (where I work). So I have a single phone on my desk that answers (and makes) calls in two different area codes. Since Cowbell South (my POTS line) doesn’t allow you to dial a 10-digit phone number for local calls, this means that any Louisville phone numbers I have in my speed dial will only work on one line or the other (depending on whether I put “1-502” on them or not). Doh!

Home LAN: followup

Following up on this entry — it seems that my “hindsight” section was correct.

The simplest solution was just to put everything on the D-Link and set up double port forwarding. That is, I set up the Linksys to forward incoming port 22 to the IP address of the D-Link. Then I set up the D-Link to forward port 22 to the IP address of my Linux box. This seems to work just fine (as it should) — I can ssh into my Linux box from the outside world.

And now all my machines are back on one subnet, so I don’t have to deal with printing and file sharing woes.
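Conceptually, the double forward is just composing two NAT rules: each hop forwards the port until a host has no rule for it. A toy Python model (host names are placeholders, not my actual configuration):

```python
def resolve_forward(rules, entry_host, port):
    """Follow port-forwarding rules hop by hop: rules maps
    (host, port) -> next inside host. The first host with no rule
    for that port is where the connection actually lands."""
    host = entry_host
    while (host, port) in rules:
        host = rules[(host, port)]
    return host

# The setup described above: the Linksys forwards port 22 to the
# D-Link, which forwards port 22 on to the Linux box.
rules = {
    ("linksys", 22): "dlink",
    ("dlink", 22): "linuxbox",
}
```

So an ssh connection arriving at the Linksys on port 22 ends up at the Linux box, while traffic on any unforwarded port stops at the first router it hits.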

September 15, 2005

Hi-Fi internet

Last week, when I was driving up to Bloomington, I noticed that the hotel referenced in this entry now has a new sign up. That is, in addition to their existing sign that says “Free Hi-Fi Internet”, they now have a new sign hung right next to it that says “Free Wi-Fi Internet.”

That’s just too funny.

So they know the original sign is wrong and got a new sign that is right. BUT they left the old sign hanging in all of its wrongful glory (and quite sun-faded, as a testament to how long it has been broadcasting their Wrongness to the passersby on I-65). Absolutely Fantastic!

I didn’t have my camera with me to take a picture, and I didn’t go to Bloomington this week (and won’t go for another 2 weeks). I hope both are still hanging the next time I drive that way — I’ll take a picture and post it here for everyone’s amusement.

October 18, 2005

Linux processor affinity: a rant

Update in September 2007: Google Analytics tells me that people are continually finding this page while searching for terms like “linux processor affinity”. You should know that I created the Portable Linux Processor Affinity project to address the problems stated in this blog entry. Please go there after reading this entry. Thanks.

This is a technical rant that can be summarized quickly: the current state of Linux processor affinity sucks.

There are essentially three different variants of the API (that I can find); which one you have depends on a combination of several factors:

your Linux distribution/vendor

what version of kernel you are using

what version of glibc you are using

Annoyingly, regardless of which variant of the API that you have on your system, the man page for sched_setaffinity(2) and sched_getaffinity(2) is the same. Specifically, it looks like this one man page has been copied everywhere and never updated to be what you actually have on your system. So you have — at best — a 1 in 3 shot of having these functions correctly documented.

int sched_setaffinity (pid_t __pid, size_t __cpusetsize, const cpu_set_t *__mask);

This appears to be in recent 2.6 kernels (confirmed in Gentoo 2.6.11). I don’t know when #1 changed into #2. However, this prototype is nice — the cpu_set_t type is accompanied by fdset-like CPU_ZERO(), CPU_SET(), CPU_ISSET(), etc. macros.

int sched_setaffinity (pid_t __pid, const cpu_set_t *__mask);

(note the missing len parameter) This is in at least some Linux distros (e.g., MDK 10.0 with a 2.6.3 kernel, and SGI Altix, even though the Altix uses a 2.4-based kernel and therefore likely back-ported the 2.5 work but modified it for their needs). Similar to #2, the cpu_set_t type is accompanied by fdset-like CPU_ZERO(), CPU_SET(), CPU_ISSET(), etc. macros.

Also note that at least some distros of Linux have a broken CPU_ZERO macro (a pair of typos in /usr/include/bits/sched.h). MDK 9.2 is the screaming example, but it’s pretty old and probably only matters because I use that as a compilation machine :-) (it also appears to have been fixed in MDK 10.0, but they also changed from #2 to #3 — arrgh!). However, there’s no way of knowing where these typos came from and if they exist elsewhere. So it seems safest to have a configure script to check for a bad CPU_ZERO macro.

Glibc itself shares a bunch of the blame. Case in point — the implementation of sched_setaffinity in Glibc 2.3.2 was essentially a stub along these lines:

int
sched_setaffinity (pid_t pid, const cpu_set_t *mask)
{
  __set_errno (ENOSYS);
  return -1;
}

Why even have the function there if all it’s going to do is return an error? It’s better to not have it at all (because we already have to have a complex configure script to figure out which one to use) than to provide one that is simply broken. Arrrggggghhhh!!

Finally, note that even the syscall() interface won’t help — apparently the back-end kernel function has changed the number and type of parameters multiple times (so that may not actually be Glibc’s fault). So there appears to be no portable way to use sched_setaffinity() and sched_getaffinity() without a complex configure script and multiple implementations in your code. That totally, totally sucks.
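The portable workaround always has the same shape: feature-test, then fall back — the same dance a configure script does, just at a different layer. Here’s a Python sketch of that shape (Python’s os module exposes sched_getaffinity only on platforms whose kernel and libc provide the call; the fallback value is my own assumption):

```python
import os

def get_affinity(pid=0):
    """Return the set of CPU indices the process may run on.
    os.sched_getaffinity() only exists where the underlying
    syscall does -- the same portability problem described above,
    surfaced one layer up -- so we feature-test at runtime."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(pid)
    # Fallback assumption: the process may run on every online CPU.
    return set(range(os.cpu_count() or 1))
```
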

This rant is therefore an open appeal for the Linux development community to get its act together and figure this darn thing out once and for all, and standardize on a single API.


October 21, 2005

Linux as a desktop... err... "needs a lot of work"

Ok, another tech rant. Sorry!

Earlier this week, I had to turn in my Mac laptop (read: my primary working device) for service — its keyboard was going bad. As a temporary replacement, I have an IBM laptop running Fedora Core 4 Linux (I could not bear the thought of using Windows for 1-3 weeks). I had used Linux on a laptop and various desktops for about a decade; I thought it should be pretty easy to adjust for the duration while my Mac is gone.

Wrong.

Linux sucks as a desktop. I don’t think I ever realized how much until I was totally spoiled by a Mac for the last 1.5 years. I spent 5+ hours yesterday morning getting a [pseudo-]reliable set of Mail, Calendar, and IM working. I’m certainly not going to claim that Macs are perfect — they’re not (far from it, actually). But they do a lot more things Right compared to most other platforms.

Don’t get me wrong — the Linux desktop is way better than it used to be. But I’ve come to realize just how far it has to go — Mac’s philosophy is to make tiny little tools and then integrate the heck outta them. For example, on a Mac, there’s an addressbook. It’s not a calendar, it’s not an e-mail client, it’s not a kitchen recipe database. It’s just an addressbook. But that one addressbook is integrated with everything, meaning that it can talk to all those other applications — Mail, Calendar, Instant Messenger, Kitchen Recipe Database, etc. In this way, Mac reflects the BSD philosophy of “one system” rather than Linux’s philosophy of “lots of parts put together.”

Why did it take 5+ hours before I got something sorta-reasonable? Here’s some points:

FC4’s “install/uninstall software” tool still resolutely shows that KDE is not installed, even though it’s running as my main desktop.

If I shut down my laptop or put it to sleep with the ethernet networking active, and then boot/restore it with no ethernet cable plugged in, I have to wait for DHCP on ethernet to time out (60+ seconds?) before it will finish booting/restoring. That’s just absurd; why doesn’t bringing up the network occur in the background?

Thunderbird refused to import my addressbook entries. That’s a total non-starter (I have hundreds of e-mail addressbook entries, and I’m not going to re-type them manually).

Thunderbird also makes you wait while it sends every single message. For someone who sends dozens of e-mails a day, that’s also a non-starter (one of my students later gave me a workaround for this; apparently you can go into an obscure panel and change some hidden setting to make it not show the progress while it’s sending).

So I switched to Evolution. It loaded up my addressbook ok; cool. But it’s slow. It doesn’t handle multiple identities without adding multiple accounts (pretty non-intuitive, if you ask me — an “account” with no incoming mail server… pretty weird).

The Evolution calendar sometimes locked up (I had to have KDE kill Evolution after waiting for 10+ minutes) when importing my .ics files from my Mac calendar.

The Evolution calendar definitely has bugs in it. Here’s a humorous example — some of the day-long events that I imported set the “mark time as busy” flag. If I disable that flag on one of these events, it automatically changes the recurrence of the event from once a year (e.g., someone’s birthday) to every day. How these two are related, I have no idea.

Evolution periodically locks up and is essentially unresponsive for minutes at a time. This can happen when I click on “reply” to a mail, or when I simply try to go to the next mail in my index (e.g., click on “reply” and don’t get a compose window for 60+ seconds). Quite frustrating, since e-mail is a central focus of my work.

I was editing my signature blocks in Evolution when it crashed. Twice. Resulting in me [somehow] sending the same message to a public mailing list twice (how does editing a signature cause re-sending of an e-mail?).

Every time you make a change in your account settings, Evolution re-scans the IMAP server for all your folders (and re-caches everything). This is painful (I have a lot of server-side folders).

Right now, Evolution is refusing to update my INBOX. The last mail it shows is from around midnight, but it’s giving some obscure IMAP error every time it checks for new mail. So I quit Evolution and restarted; voilà — problem solved. Oh, look — I suddenly have lots of mail from after midnight.

There’s a million other little usability issues. For example, clicking on an http link in mail or IM — after tweaking both the mail and IM clients — finally does bring up a new tab in my already-open web browser (which is my desired behavior), but then I have to manually switch to the browser application, which is sometimes in a different virtual desktop. One would think that when I click on a link, I want to see that link, without having to initiate one or more further actions to see it. Some of these issues are just “different” from my Mac; others show a lack of integration between the various tools.

There are probably reasons for all of these items above. Indeed, I’m quite sure that there are hard-working programmers out there working to fix all these bugs (if they aren’t already fixed; FC4 is “new”, but software projects keep evolving even after a Linux distro releases a version). And I know that no software is perfect — even my own software has bugs that we continually work to fix. OSX software has plenty of bugs too. So don’t get my rant wrong — it’s certainly not an attack on any of these projects or the people working on them.

Although the individual applications are not entirely blameless, it’s mainly the level of integration that is the problem. The distros are getting better at making it better, but they still have a ways to go (and I’m sure my rant is not news to them). I’m sure that I could have fixed many of the problems that I listed above. I could have done something different and either not had the problem or gotten a workaround (the Thunderbird about: editor is a good exanple). But my question is — why? I didn’t have these kinds of problems with my Mac because someone thought through all these application and integration issues and distilled down the information to what 90% of the world wants and/or needs. I don’t see many useless controls on my Mac simply because I don’t need them — someone else put a lot of effort into trying to figure out what people really need to do their jobs. It’s for darned sure that your Grandmother does not want to have to go into an obscure about: editor to turn off a hidden setting in Thunderbird. It raises the question of why that progress box is there in the first place — what if I frequently send large attachments? Thunderbird makes me wait there for a positive acknowledgement that the mail was sent rather than later giving me a negative acknowledgement if something went wrong. The latter allows me to be much more productive — I can actively be doing stuff before a “hey, something went wrong with the last mail you sent…” notice comes up.

Also — and this is something that all programmers should take to heart — quitting and restarting an application to fix an error is not acceptable.

In short, the state of the Linux desktop is quite frustrating. My productivity yesterday was rock bottom because I was trying to get my machine to do what I wanted it to do (but inevitably resigning myself to letting it do whatever it wanted to do). I’m sure I’ll adjust better over the next 1-3 weeks, but I can’t wait to get my Mac back, where things tend to “just work.”

October 25, 2005

More complaints about desktop Linux...

None of my USB jump drives are recognized or mounted.

Evolution is truly evil. I gave up using it; it kept failing and/or dying in strange and mysterious ways (e.g., restarting the app made everything work, but I should not have to restart the app to get new mail).

Thunderbird’s threading view is equally mysterious; why does it not show threads when “All” threads are selected? If you select any of the other threaded options, it doesn’t show you all the mail in your INBOX. Even more confusing, if you switch to another folder and then back to your inbox, even more mail disappears from the index.

Despite editing a text file and telling Thunderbird to disable the sending progress window, the compose window still remains visible (and in focus) when you “send” a message. You have to either wait for it to disappear or manually switch the focus back to the main window. Quite annoying.

I ran “yum install kmail” twice and got different results (!). Specifically, I ran it once, and it apparently updated a bunch of internal tables (“Added 79 new packages, deleted 40 old…” — the fact that there are 79 new packages in the last 6 days is somewhat frightening; all I want is stability, not bleeding edge!). It didn’t find the package I wanted, so I tried “yum install KMail”. yum then failed on the first mirror (it got an http 404!), so it moved on to another mirror. But then it said “Added 5 new packages, deleted 56 old…” This says to me that these mirrors are not in sync (and it frightens me — what just happened to all my internal yum tables?). But that’s not my problem — I can’t imagine any other command that I would expect to run twice in a row and get different results.

Thunderbird does not scroll the index when new messages arrive (especially if you have the newest messages at the bottom).

If you have the newest messages at the top and you delete a message, Thunderbird goes to the next message down in your index. For example, say you’re on message X. The next message is Y. Then message Z arrives. If you delete X, you’d expect it to go to the next message (Z), but instead it goes to Y. So you have to do 2 actions to get to Z (delete X and then select Z). Quite annoying.

November 10, 2005

Rando Techno Factoids

- www.squyres.com crashed last night for a few hours due to hyperactive spamassassins and some scripts that didn’t properly check for concurrency
- Best quote from yesterday: “If that sentence were a mineral, it would be a diamond.”
- Best error message of the day:

GFORTRAN module created from mpi.f90 on Thu Nov 10 08:05:40 2005
If you edit this, you'll get what you deserve.

November 30, 2005

Cell phone

I just recently had to get a new cell phone because my old one was dying (the microphone was giving out). The Verizon store didn’t have the one model that I was really interested in, so I settled for another model (an Audiovox phone).

After spending nearly 2 weeks with it, I really came to hate it. Here’s why:

It had no “one-beep” ring option (great for meetings; I never feel my phone when it vibrates)

It only stores 10 voice memos (this is a deal breaker for me; I record oodles of voice memos when driving back and forth to Bloomington)

When you delete a voice memo, it jumps back up to the first voice memo (quite annoying — you have to scroll all the way back down again)

It takes 7 clicks to make a voice memo

Closing the lid automatically aborts just about everything (like a voice memo — as opposed to finishing and saving a voice memo)

Dialing from the addressbook always defaults to the cell phone entry — you can’t have a different default (e.g., home phone) for each addressbook entry

This made that phone unworkable for me. So I exercised the “you can return/exchange your phone within 15 days” policy and got a Motorola phone. It was a bit more expensive, but it has a lot more of the features that I wanted:

1-touch voice memos

Voice memos are bounded by memory capacity, not the number of memos

The addressbook is a little different; you actually have a different entry for cell/home/office for each person, but I can get used to that — at least I can easily pick which one I want to call

It doesn’t have a one-beep ring, but I guess I can get used to that (I couldn’t find any phones that had a one-beep ring, actually) — it does have different “profiles” of ring style (loud, soft, etc.)

And last, I accidentally found out today that my phone actually has Bluetooth. I’ve long disdained Bluetooth because it’s fundamentally insecure, but it does allow one significant feature that I plan to use frequently — syncing my Mac addressbook to my phone. YES!!! Tiger’s iSync natively supports talking to my phone, so now all the phone numbers in my phone exactly match my Addressbook (and I can activate Bluetooth selectively on my phone so that it’s only on when I synchronize with my Mac; that’s secure enough for me).

I FINALLY have one and only one addressbook (Mac had previously allowed me to consolidate IM, E-mail, and PDA, now I finally can share it with my phone). Waaaa-hoooooooooooo!

December 2, 2005

Verizon anti-spam measures

My mail server (squyres.com) had been suffering for about a month; verizon.net would reject about 50% of the mail that was sent to it. It took us about 4-6 weeks to figure out why. It turns out that a) Verizon has rabid anti-spam measures, b) the specific measures that they take are not published (as of today, 2 Dec 2005, at least), and c) it is extremely difficult to find out why Verizon is blocking you. So I’m posting this in the hopes that it helps other, legitimate ISPs get unblocked from Verizon.

In short, here’s what you need:

An MX record for your domain

Your mail server to accept mail to the sending address for all outgoing mail

Without these two things, you’ll be blocked from sending to any verizon.net recipients. Specifically, they won’t be blocking the IP address of your server, but the incoming message will fail what’s called domain verification, and therefore they’ll reject the message with an SMTP 550 message. Their domain verification step does two things:

Check that the sending address on the incoming message has an MX record

Connect to the server listed in the MX record and start a message to the sender of the incoming message (specifically, EHLO verizon.net / MAIL FROM: <> / RCPT TO: address_from_the_incoming_message).

Yes, I know that this is above and beyond IETF RFC conditions. So does Verizon. But they do it anyway, so if you want mail delivered to them, you need to meet these conditions.
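For the curious, here’s roughly what that second (callback) step looks like in code. This is a hypothetical sketch only — the `sender_verifies` helper and its arguments are made up for illustration, and the only parts taken from Verizon’s actual behavior are the `EHLO verizon.net` / null `MAIL FROM` / `RCPT TO` sequence described above:

```python
# Hypothetical sketch of Verizon-style "domain verification", step 2:
# connect to the sender domain's MX host and see whether it would accept
# a bounce back to the sending address.  Host/address names are examples.
import smtplib

def sender_verifies(mx_host, sender, timeout=30):
    """Return True if mx_host would accept mail for the sending address."""
    try:
        with smtplib.SMTP(mx_host, 25, timeout=timeout) as smtp:
            smtp.ehlo("verizon.net")     # EHLO verizon.net
            smtp.mail("<>")              # MAIL FROM: <>  (null sender)
            code, _ = smtp.rcpt(sender)  # RCPT TO: <address from incoming msg>
            return 200 <= code < 300     # 250 means the address is accepted
    except (smtplib.SMTPException, OSError):
        return False                     # MX host unreachable => verification fails
```

If this returns False for your outgoing addresses (as it would have for our misconfigured `-bounces` addresses), Verizon rejects the message with a 550.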

For squyres.com, we have 2 external e-mail server names: squyres.com and lists.squyres.com. squyres.com has long since had an MX record, but lists.squyres.com never had one (because we didn’t need one). So any mail sent from lists.squyres.com, by default, failed Verizon’s domain verification. So we added an MX record for lists.squyres.com… but mail still kept getting rejected.

Quite embarrassingly, it turns out that we had a misconfiguration in our mail server such that the address that GNU Mailman sends mail from (i.e., <listname>-bounces@lists.squyres.com) did not accept incoming mail. Not only was this mucking up Mailman’s internal bounce processing, it also meant that even after we added an MX record, we still failed Verizon’s domain verification. Doh.

After 4-6 weeks of total non-replies from Verizon (“Check the SMTP settings in your mail client”, “We’re not blocking your IP address” [that was true, but totally unhelpful in solving the problem], …etc. My favorite was “There is nothing more that we can do to help you.”), we finally — quite by accident — got an extremely responsive support tech named Shawn T. He’s not even in the same support groups that we initially appealed to (I believe he’s in the DNS support group — a misguided front-line tech referred us to him when they heard the keyword “DNS”). Shawn got “interested” in our problem, and although he didn’t know all the answers right away, he stuck with us and figured it out. He even put us directly in touch with the anti-spam group (which, to my knowledge, is totally unheard of).

The last tech that we were on the phone with (Brandon, in the anti-spam group), was literally working on the main Verizon anti-spam gates as we were talking to him (e.g., he had to clear our IP address from the “bad” cache on all the incoming servers). He was quite helpful — and tolerant (when we discovered the fact that lists.squyres.com was rejecting mail for the <foo>-bounces@lists.squyres.com addresses).

And before you ask, no, I don’t have the contact information for any of these techs, so I can’t contact them for you, nor can I give you their phone numbers. Sorry. :-(

So if you’re an ISP and you think you’re being blocked by Verizon, first check the 2 things that I listed above. If you’re absolutely sure that those conditions are met (and be sure to wait up to 2+ days for DNS propagation if you just created an MX record), then double check them by telnetting to port 25 on your server and trying to send mail to the return address manually. If that all checks out properly, then wait 6 hours and try again — Verizon’s cache of “we rejected you” lasts for 6 hours. So if you muck up and get rejected, then you have to wait for the cache to clear out before trying again.

If all else fails, try visiting their whitelist form: http://www.verizon.net/whitelist/ This is where we started, and although it took a few weeks, we did get in touch with the Right people and found out what was required to get our mail to Verizon recipients (I told both Shawn and Brandon that they should publish these 2 domain verification conditions somewhere — the whitelist form seems like an appropriate place. They said that was a good idea; hopefully it’ll show up there someday).

So to conclude my story, major huge thanks to Shawn and Brandon from a random tiny ISP out in the internet wilderness. Once we got ahold of you, you were extremely helpful in solving our problem. I hope you get raises.

February 20, 2006

A Tale of Two Telephones

I’ve been using Vonage as a second (business) phone at home for several months now. On the whole, it’s been great — Vonage was a cheap way to add a second line (much cheaper than any of the traditional 2nd phone line routes), I had a local phone number in the city of my employer, unlimited long distance calling, etc.

The feature that I love the best, however, (and I think I’ve mentioned it here on my journal before) is their SimulRing™ thingy: you can input several different phone numbers on the Vonage web site and when someone calls your Vonage number, all the phone numbers you input will ring simultaneously. Whichever number answers the call first gets it. This is extremely handy for me. I have my cell phone in the SimulRing list, so if anyone calls my “work” phone number, it rings both on my desk and my cell phone. Since I travel a fair amount, this means that I only have to give one phone number and it’ll reach me wherever I am.

To make this all work, I have a Vonage ethernet router that my phone plugs into. Unfortunately, last Monday, it stopped working. That is: pick up the phone and get no dial tone. The internet side of the router was working just fine — I could see web pages, do e-mail, etc. — but the phone side was [seemingly] kaput. There’s a little light on the front of the router indicating that the phone is correctly configured, and it stubbornly refused to be lit. I reset the router a few times with no luck. Bonk.

So I called Vonage tech support. We went through a whole bunch of steps and the tech finally concluded that my router was dead and they would need to send me a new one. The only catch was that I had to pay $100 (plus shipping) for the new router. Yow! Since my employer pays for my Vonage line, I told them not to send it — I had to get various approvals before this could happen.

Last week was absolutely crazy, and I never got around to getting the approvals. Which turned out to be a Good Thing because yesterday (Sunday), I randomly looked at my Vonage router and whoa! The phone light was lit. I picked up my phone and lo and behold, there was a dial tone. I did absolutely nothing to fix this — I don’t know how it started working again. I’m quite sure that it was not working in the latter half of last week (e.g., I would pick up the phone and still get no dial tone). It’s still working today, too. I guess that I appeased the VOIP Gods somehow.

February 24, 2006

New job

Let’s cut to the chase: I am leaving IU and taking the position of “Chief Cat Herder of Cisco Open MPI Efforts” effective 13 March (2 weeks from now). Specifically, I will be responsible for all Open MPI development and coordination at Cisco.

Wwwoooooo hhhooooooo!!

The Open MPI project has been working hard to involve the entire HPC community — to include HPC vendors — in Open MPI over the last 2 years. Bringing Cisco into the group is an excellent technical and strategic move for the project. Since I have been working with the Open MPI core group since it was founded, I feel well-qualified to create a solid integration between the current core group and the vendors who are starting to come on board. It’ll take some work and discussion with all involved parties, but I think it will be fun, I think that the project will be better for it, and I think that the entire HPC community will benefit.

I’ve always known that I would someday leave academia and go to industry. I just never realized that it could happen so fast! Indeed, although I am tremendously excited about my new job at Cisco, I feel sadness and nostalgia at leaving Indiana University because it has been a really great place to work. Here is what I said in my letter of resignation:

It is with mixed feelings that I formally give two weeks notice of leaving my job at Indiana University.

I am excited because I am taking advantage of a great opportunity that was unexpectedly presented to me (I will be accepting a position at Cisco Systems to lead their Open MPI development efforts). I am saddened, however, because Indiana University has been a wonderful job and home to me for the past several years. I have learned so much working with the great people in the Pervasive Technology Labs, the University Information Technology Services, the Computer Science Department, the Open Systems Laboratory, and particularly with my boss, Andrew Lumsdaine, that I am at a loss for words to express my gratitude and thanks. I am truly humbled to have worked alongside people who genuinely cared about the technology, research, and human side of it all – everyone at IU made my job all that much more wonderful.

It has been a pleasure and an honor. A million thank you’s are not enough, so one will have to suffice: thank you for everything.

April 2, 2006

Outlook 2003 rants

There are good and bad things about Outlook. To be fair, let’s start with the good things:

2003 is far better than 2000. You can clearly see that MS listened to its corporate customers and made changes to Exchange/Outlook integration to make it scale much better. There’s far less chatter between Outlook and Exchange (e.g., it downloads the entire GAL — if you want — once a day), and therefore it looks / feels / acts quite a bit faster.

OWA 2003 is not even in the same class as OWA 2000 (OWA 2000 = horrid, horrid, horrid). OWA 2003 looks and feels much like Outlook itself. It was AJAX [long] before AJAX was cool (granted, it’s ActiveX, not Javascript, but the ideas and concepts are the same). Gotta hand it to MS on this one — OWA 2003 rocks as a web app.

“Search” folders to automatically display the results of a search (simple or complex). Yes, I know everyone has these nowadays, but Microsoft was actually on the leading wave of this one.

Directly related to search folders, the search capabilities are far better than OSX Mail, I have to admit (and yes, you can search multiple folders). You can do trivial searches (the default for OSX Mail), or you can select another tab and get arbitrarily complex in your search.

The groupware capabilities rock, especially when your entire company uses them heavily. Schedule, people, rooms, etc. You can look up the phone number of the phone in a conference room, for example. That’s surprisingly handy.

Someone wrote a plugin for the calendaring functionality that allows me to automatically reserve a teleconference phone bridge for an Outlook meeting (and it automatically sends around all the information to the participants about the dialin phone numbers, access code, etc.). I don’t know if it was Cisco or some 3rd party who wrote it, but that’s awesome.

But the list of things which are Bad is still fairly lengthy:

Integration with other mailboxes is better than 2000 (where it was not possible to mix POP/IMAP and Exchange), but still less than optimal. For example, IMAP’ing to another mailbox will still not result in a combined INBOX (or even a virtual “inbox”, similar to a search folder).

IMAP support is still somewhat kludgey. I have my IMAP mailbox set to automatically be included in send/receive e-mail, but it still doesn’t seem like Outlook updates it until I actually go to an IMAP folder (i.e., new messages are not displayed/downloaded until I go there).

Searching for names in the addressbook is klunky at best. There’s no searching for partial names; you have to start with the first name. Am I “Jeff” or “Jeffrey”? “Mike” or “Michael”? “Anju” or “Prabanjan”? It looks like they haven’t updated this searching capability since 2000.

When you display a 7-day week in the calendaring section, it shows up in two places: it highlights Sunday-Saturday on the mini-calendar view, but it shows Monday-Sunday in the detail view. Why are they different? It’s quite confusing when you’re trying to put events on Sundays.

Treatment of timezones is [still] abysmal. Example: I create an “all day event” in Eastern time. I travel to San Jose, and change my computer’s time zone to Pacific. I look at that all day event in Outlook and it now spans 2 days — 3am on the day of the event to 3am of the day after. Technically, I know that’s right — it’s being faithful to when the actual event occurs. But for “Johnney’s Birthday”, I just want it to show up on “April 17th”, not 3am-3am on April 17-18.
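A quick sketch of why that happens, assuming (as the behavior suggests) that the event is stored as a fixed midnight-to-midnight pair of instants in the creating zone. The dates here are illustrative, using Python’s zoneinfo:

```python
# Why a timezone-anchored "all day event" smears across two days: the
# midnight-to-midnight span in the creating zone (Eastern) is a fixed pair
# of instants, which land mid-day when rendered in another zone (Pacific).
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")
pacific = ZoneInfo("America/Los_Angeles")

# An all-day event created in Eastern time (date is illustrative).
start = datetime(2006, 4, 17, 0, 0, tzinfo=eastern)
end = start + timedelta(days=1)

# The same instants, viewed after switching the computer to Pacific:
print(start.astimezone(pacific))  # 2006-04-16 21:00:00-07:00
print(end.astimezone(pacific))    # 2006-04-17 21:00:00-07:00
# i.e., the "one day" event now straddles two calendar days.
```

A calendar that wanted the birthday to stay pinned to April 17 would have to store the event as a bare date rather than as instants.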

It is abysmally hard to reply in-line to someone’s e-mail. Outlook pretty much forces you to reply to a mail at the top, which is why you end up with e-mail threads where each individual message is 3 miles long (i.e., the entire thread is contained in each mail).

I set my Outlook to always send plain text mail. However, sometimes it still sends HTML or Rich Text mail (I don’t know/care which — it’s not plain text). I don’t know why it sometimes chooses to reply in HTML or RTF; it’s quite annoying.

Email address auto-completion is horrible. It does not auto-complete even from names in my own contact list all the time (I set my contact list to be the first data source it searches). It seems like there’s some kind of timeout — when you send to recipient X, it’ll stay in the auto-complete cache for some time period and then disappear. That’s annoying. Just because I haven’t sent to X in a week (for example) doesn’t mean I want to type his entire name every time I want to send a mail. This is such a small thing to fix, and yet it’s something I have to deal with many times a day, so it’s a big sore point.

Although Outlook mail added a “conversation view” (which is essentially a threaded index view), it still stubbornly refuses to add the “In-reply-to” header line in outgoing mails, which makes it quite difficult for most other mail clients to thread properly.

Outlook mail’s conversation view is pretty poor on threading. It groups mails together nicely, but it only threads my replies (i.e., indents them even further to show who replied to whom).

Outlook mail’s conversation view has an extra line in the index for the subject of the thread. It has a weird dichotomy of sometimes you can select that extra line, and sometimes you can’t. It’s annoying when you’re trying to use keyboard shortcuts instead of the mouse because I haven’t figured out the precise situations when it is selectable and when it is not (i.e., this affects multi-selecting of mails when you want to perform an action on a group of mails).

There’s built-in junk mail handling, but it is pretty weird. It’s not clear what you’re supposed to do if a spam ends up in your inbox (as opposed to the Junk folder). If you right click on it, there’s a greyed-out option to reclassify it as NOT spam (i.e., exactly the opposite of what I want). Reclassifying erroneously marked spam as non-spam is easy — but I want to know how to tell Outlook that a piece of mail is spam. Does dragging it to the Junk folder do it? [shrug]

Are there settings where I can change some of these behaviors? Quite possibly. But I have poked around (not exhaustively, but I have spent some time on it) and not found them. Arrgh.
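As an aside on the threading complaints above: header-based threading is a simple enough algorithm that Outlook’s refusal to emit In-reply-to is all the more puzzling. Here’s a minimal sketch of the mechanism — toy messages with made-up IDs, and real clients also walk the References header:

```python
# Minimal sketch of header-based threading (the mechanism that Outlook's
# missing "In-Reply-To" header breaks).  Messages are toy dicts, not a
# real mailbox; IDs are made up.
def build_threads(messages):
    """Map each Message-ID to the list of Message-IDs that replied to it."""
    children = {m["Message-ID"]: [] for m in messages}
    roots = []
    for m in messages:
        parent = m.get("In-Reply-To")
        if parent in children:
            children[parent].append(m["Message-ID"])
        else:
            roots.append(m["Message-ID"])  # no known parent: starts a "new" thread
    return roots, children

msgs = [
    {"Message-ID": "<a@x>"},
    {"Message-ID": "<b@x>", "In-Reply-To": "<a@x>"},
    {"Message-ID": "<c@x>", "In-Reply-To": "<b@x>"},
    {"Message-ID": "<d@x>"},  # an Outlook-style reply with the header missing
]
roots, children = build_threads(msgs)
print(roots)              # ['<a@x>', '<d@x>'] -- <d@x> wrongly starts a new thread
print(children["<a@x>"])  # ['<b@x>']
```

The last message is the Outlook case: without In-reply-to, other clients have no parent pointer to follow, so the reply falls out of the thread.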

June 3, 2006

#@$@#$

I rarely receive valid trackback pings — most of them are spam. Because of this, I rarely look closely at the notifications that I get from my blog about trackbacks, and instead just de-spam them (i.e., delete the trackbacks).

I just accidentally deleted a real trackback. Doh! He tracked back to my entry and I just deleted it. So hopefully the links here will make up for my faux pas. ☺

July 2, 2006

Excuse me, stewardess? I speak jive.

The world still loves analog watches. My old [digital] watch started inexplicably turning on its backlight randomly (and frequently), thereby draining its battery. It’s solar-rechargeable, and therefore easy to charge, but it would fully discharge about once a week (which a) requires several hours of direct, strong sunlight to recharge, and b) had never happened in the years that I have owned the watch). So it was time to replace it. I was looking for a self-time-setting digital watch (i.e., one that syncs to the Ft. Collins radio time signal) with multiple timezone support (because I travel a lot). Not difficult requirements, I thought. But apparently only Casio makes a whole line of radio-setting watches, and the majority of them are still analog. They have relatively few digital watches. Quite annoying. Why does our culture, so fascinated with all things tech, still want analog watches?

Ah, the joys of parenthood.

Kaitlyn will not hesitate to rub her nose all over your shirt, especially when she needs to clear it.

Kathryn will eat anything (when she’s in the mood). Anything.

Kaitlyn’s new favorite phrase is “No want it!”. Frequently repeated many times at high volume.

The munchkins’ latest trick is to talk to each other in their cribs for at least an hour (or three) after we put them down at night. They’ll rattle on about the events of the day, do silly tricks to amuse themselves (throwing toys, bouncing in the crib, playing with blankets, etc.), etc.

Open MPI v1.1 was released. Woo hoo!

Per above, I finally got a new Casio watch. It’s nearly the same model as my old watch, but slightly slimmer and with easier-to-activate buttons.

Sun is starting to heavily contribute to Open MPI. I think they’re going to be one of the best additions to the group yet.

July 20, 2006

MacBook Pro = happiness

I got my new MacBook Pro. Yummy!

Amusingly enough, Cisco’s internal ordering tool thingy won’t ship to my home address; it’ll only ship to the local Cisco sales office. Which I figured was fine; it would guarantee that someone would be around when it was delivered. I signed up on the UPS “e-mail me updates on this tracking number…” thingy, and was shocked to get an e-mail at about 9am on the morning of the expected delivery date saying “We tried to deliver, but no one was there.” GAAAHH!! I promptly called UPS and told them to re-route the driver back; I would go to the location to accept the delivery myself.

Props to UPS for near-real-time updating of the notification (rather than the driver having to return to the distribution point at noon or the end of the day — I got the notification e-mail within about 15 minutes of the missed delivery).

So I went to the sales office (all the sales guys had arrived by then), and was there when the driver came back an hour or so later. So I got my Mac. Much, much happiness.

I have therefore [almost] retired my IBM Thinkpad running Windoze (that I got when I joined Cisco) and migrated entirely to my MBP. When I first got my Windoze laptop, I thought I could be open minded and be able to use Windoze and be just as productive as I could with a Mac. Not so. I always felt constrained; especially since I’m a developer, I could never do any real work on my own laptop (Cygwin was a total dog, mainly because launching each process was sooooo slow!). So I’m really glad to have a Unix-based machine as my main working unit now.

I had to switch from Outlook to Entourage. Not bad, but it does have some annoying differences:

Minor: It cannot import .pst files from Outlook directly; you have to upload everything from Outlook to Exchange and then download to Entourage (which took a long, long time and required shepherding so that I didn’t go over quota).

Minor: Overall, it seems to be a bit of a hog; typing can be a bit slow (perhaps it’s the spell checker?).

Minor: The ordering of messages in a thread is not consistent; they’re not always in order. I haven’t figured out why.

Minor: Events on the calendar display do not display the color of the free/busy status of the appointment; you have to open the appointment and click in a sub-item to see what the free/busy status of the appointment is.

Minor: Dragging mails from an Exchange folder to a local folder copies them (vs. moving them). This is annoying because it typically means that you have to do 2 actions: drag and then delete.

Minor: Every time you exit Entourage, it wants to delete everything in the junk mail folder (what if I haven’t reviewed it yet?). There does not appear to be a setting to disable this behavior.

Major: To-do items are not stored on the server. This is unbelievable to me. Not only are they not backed up, they won’t appear on my Treo.

Major: You can’t change the accept/tentative/reject status of an appointment once you have accepted it. Wow!

It’s not all bad, though. I do enjoy the fact that Entourage syncs to Mac’s Addressbook and Calendar, which, in turn, I have set to sync to my .mac account (which keeps my home iMac, for example, also in sync, and therefore my bluetooth-enabled cell phone). So I really do have 1 addressbook between all my devices at home and work. Which rocks.

I’ve taken the time to update some Mac software that I was using. For example, based on George B.’s advice, I’m now using DarwinPorts instead of Fink. I’m using a different (and, IMHO, better) Apple-ified emacs. And VirtueDesktops (UPDATE: likely to be replaced someday by OSX 10.5!). And Witch (which rocks).

November 30, 2006

You'll shoot your eye out!

Cisco phones rock.

I have a Cisco IP phone on my desk at home. Since I’m a telecommuter, I live and die by my phone. I came across a few cases recently where Cisco phones simply rock:

I initiated a 3-way call with 2 non-Cisco people (i.e., I made external calls to them). We’re all chatting away when I accidentally pushed the wrong button and hung up. “Whoops,” I thought. “I’ll just call them back.” So I tried calling them both back and got their voice mail. I waited 2-3 minutes and tried again — still got voice mail. So I IM’ed them both and said, “Guys — hang up — I accidentally disconnected and need to call you back.” They both said, “Oh, we thought you went on mute, we’re still connected.” That’s cool.

A colleague of mine in San Jose called another Cisco employee down in Australia by dialing 011…(etc.). The Cisco phone system automatically recognized that he was calling a Cisco phone number and switched it to an in-system call (rather than placing an international call, which would have been both expensive and of dubious voice quality — Cisco’s PBX is VOIP, so we can just route the traffic over our own, internal networks). Granted, lots of PBX systems do this kind of thing these days, but it’s still cool.

That same colleague then made a 3-way call to join me into the conference (with the guy from Australia). When my phone rang, the caller ID said “Conference.” We chatted for a while, and then the guy in San Jose hung up. Not only was I still connected to the guy in Australia, my caller ID switched to show his name (and his caller ID switched to show my name). This means that some programmer specifically thought about this case (a caller initiates a 3-way call and then disconnects) and made the system not only keep the callers connected, but also realize that it could update the caller ID intelligently. That’s a well thought-out system.

December 10, 2006

iDVD suckage

It turns out that there is a well-known bug in iDVD (or perhaps something that affects iDVD? It’s not clear) that causes hours of delay when burning a DVD on OS X. It’s not entirely clear if the problem is in iDVD itself or whether it was introduced in an OS X update (e.g., 10.4.8). Here’s a bunch of people talking about it:

Many thanks to those who posted above — this was exactly the problem I was running into as well (spinning beachball of death during audio encoding). It was good to see that being patient would fix the problem.

I have one piece of information to add to the mix: it may not actually be the overall audio encoding that causes the long delays (!).

Let me explain.

I, too, saw many hours of no apparent activity from iDVD while encoding my 1 hour movie. I just happened to check back once when I noticed that it had started to actually encode the audio. That is, it was in the “Encoding audio” stage, and the progress bar had just started moving! And once the progress bar started moving, it completed within a minute or two. Then it progressed on to the burn stage.

So something is happening during all those hours of spinning-beachball-of-death, but I don’t think it’s the actual audio encoding itself. Indeed, audio encoding is a relatively “solved” problem these days — iTunes and iMovie have shown that Apple knows how to do this well. But the fact that the progress bar shows nothing and Force Quit shows that iDVD is “unresponsive” indicates to me that this delay may actually be a real bug — something that is supposed to be more-or-less instantaneous (i.e., perhaps step 1 in the audio encoding process, preprocessing the data, or setting up internal data structures, or …?). But instead, some bug in the coding makes the “supposed-to-be-instantaneous” process take many hours.

Shrug. Who knows?

I just wanted to let everyone know that when the audio encoding seems to actually start, it’s pretty snappy (as one would expect). Indeed, the wording of Apple’s help article (URL cited several times above) is pretty cagey: “Even though iDVD may appear to have stopped working, iDVD is probably still encoding audio” (I added the emphasis). That’s quite a statement. ☺

So I’m guessing the real problem is actually something else.

After I wrote that section, I was burning a few DVDs and decided to run the OS X profiling application Shark on iDVD. I don’t know much about Shark, so I could be totally mis-interpreting the results, but it looks like iDVD is spinning in a [wait4()] system call. I’m not sure exactly:

What it’s waiting for

Why it takes so long, or

Why it’s so CPU-intensive

But there you go. I’ll poke around some more and see if I can figure it out (e.g., I don’t know how to interpret the Shark results to see if I can extract the PID that iDVD is waiting for).

FWIW, to see if it was a CPU starvation issue (e.g., iDVD spinning hard and not letting whatever it was actually waiting for make any progress), I did some experimentation with nice values, but to no effect. Increasing the nice value of iDVD (to potentially let sub-processes make progress) didn’t help. Indeed, top shows that even with a high nice value, iDVD is just about the only process (on an otherwise dormant system) that is consuming sizeable CPU time.
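For reference, that nice-value experiment can also be done programmatically — a sketch assuming a Unix-ish system. PID 0 means the current process; to repeat the real experiment you’d substitute iDVD’s actual PID (found via `ps` or Activity Monitor) and run with sufficient privileges:

```python
# Sketch of the nice-value experiment: raise a process's nice value so it
# yields the CPU to anything it might be waiting on.  PID 0 = this process;
# substitute the target app's real PID to try it on something else.
import os

pid = 0  # demo on the current process
before = os.getpriority(os.PRIO_PROCESS, pid)
target = min(before + 10, 19)  # nice values max out at 19
os.setpriority(os.PRIO_PROCESS, pid, target)
print(os.getpriority(os.PRIO_PROCESS, pid))
```

Note that an unprivileged process can only raise a nice value, not lower it back, which is why the sketch only goes one direction.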

January 23, 2007

Windows netbios name caching

I learned something new about Windows networking today, and am putting it here in JeffJournal so that I can find the information again someday when I need it.

I got a call from my church yesterday saying that they couldn’t print to a printer share that hangs off one of the machines on their LAN. I tried a few things with them via e-mail and then said, “I’ll come over tonight and have a look.”

The machine in question is named “volunteer”.

When I got there, the machine appeared to be working fine. It could access any network share (including the main file server), it could see internet web pages, could ping other machines on the network, etc. Other machines on the LAN could ping volunteer, too, so they could see each other via ethernet (not surprising, since there’s only one network switch).

But none of the other machines could access volunteer’s shares at all. If you tried to browse to \\volunteer, windows explorer would eventually timeout and say that the network path was invalid. I rebooted every machine on the network multiple times (including volunteer), all to no avail.

So — what the heck?

I thought to myself, “it’s almost like the name ‘volunteer’ is resolving incorrectly.” So on a hunch, I googled for “clear netbios cache” and found the magic windows command “nbtstat” (I had never heard of this command before). A few queries later I found that, lo and behold, our Samba-based WINS server (run by an outside vendor) was returning the wrong IP address for the name “volunteer”. I read a few Samba docs, and poked around on our Samba server and found the wins.dat file (cache file for nmbd, the Samba WINS server). Guess what I found in there?

I’m not sure why, but every desktop machine seems to be listed 3 times in this file. But “volunteer” had 2 incorrect entries (.237) and one correct entry (.104)! Whoa!

So I killed nmbd on the Samba server, edited the wins.dat file to make all 3 entries be “.104”, restarted nmbd, and bingo! Now every machine on the network can see \\volunteer and its shares.
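If you’d rather not eyeball the file for this kind of staleness, a scan along these lines works. This is a sketch that assumes a Samba-style wins.dat line format of roughly `"NAME#typ" ttl ip flags` (which matches what I saw, but double-check against your Samba version’s docs); the sample entries are made up:

```python
# Sketch of a scanner for stale WINS entries: flag any netbios name that is
# registered under more than one IP address in a Samba-style wins.dat.
# Assumed line format: "NAME#typ" <ttl> <ip> <flags>  (verify for your Samba).
import re
from collections import defaultdict

def find_conflicts(lines):
    """Return names registered under more than one IP address."""
    ips_by_name = defaultdict(set)
    for line in lines:
        m = re.match(r'"(?P<name>[^"]+)#\w\w"\s+\d+\s+(?P<ip>[\d.]+)', line)
        if m:
            ips_by_name[m.group("name")].add(m.group("ip"))
    return {name: ips for name, ips in ips_by_name.items() if len(ips) > 1}

sample = [
    '"VOLUNTEER#00" 1169500000 192.168.1.237 66R',
    '"VOLUNTEER#03" 1169500000 192.168.1.237 66R',
    '"VOLUNTEER#20" 1169500000 192.168.1.104 66R',
]
print(find_conflicts(sample))  # VOLUNTEER shows up under both .237 and .104
```

Run against the real wins.dat (with nmbd stopped), anything this flags is a candidate for the kind of stale duplicate that bit us.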

I suspect that the volunteer machine actually did have the .237 IP address for a while (all the machines get their IP addresses via DHCP); I had just done some DHCP reallocation and consolidation that weekend (I found out that there were 3 DHCP servers running on my network — doh! One on my DSL modem, one on my linux server, and one on the Samba box — I was only aware of the one running on my linux server). So I disabled the DHCP servers on the DSL modem and my linux box, and let the vendor-run Samba box be the DHCP server (it was always the WINS server).

But anyhoo, that’s how I assume “volunteer” switched IP addresses. Why it didn’t also update the data in wins.dat properly, I don’t know. I don’t know what the fields in the wins.dat file mean; perhaps those old IP addresses would have eventually timed out…? ☹

FWIW, in the midst of figuring this all out, I killed the nmbd on the Samba server and then machines were able to find \\volunteer (after I cleared their netbios name caches via “nbtstat -R”) because the WINS server was no longer being used for resolution. Instead, the machines were falling back to broadcast-based resolution, which, while it only works within a subnet, was sufficient for them to find each other (and get current information). Restarting the nmbd server forced the machines to again get the wrong address, so this led me to poke around and find nmbd’s cache file that had the stale information.

After the fact, I found some important facts:

WINS servers can be re-seeded from a workstation’s perception of what name resolutions are. So if there’s a “bad” workstation out there with stale info, it may poison the WINS server.

The “safe” way to get all the information synchronized and current is to turn off all workstations, stop nmbd, remove wins.dat, restart nmbd, and then restart the workstations.

My private wondering: could you just kill nmbd for a few hours, let all workstations resolve current information via broadcast (which kinda depends on all workstations trying to reach each other and getting all the most recent information), and then turn nmbd back on and let it re-seed itself from the broadcasted data (either via workstations registering, a workstation re-syncing its whole table to the WINS server, or watching broadcast data)? Didn’t get a chance to try this, though.

March 24, 2007

Bugs

Here’s a list of software bugs that annoy me (started on 19 Feb, but I didn’t get around to finish the list until late March):

Microsoft Outlook 2003 sometimes decides not to enable alarms. You open Outlook and no alarms go off — ever. If you close Outlook and re-open it, the alarms [usually] re-enable themselves.

Mac OS X 10.4.8 sometimes decides not to resume from sleep properly. This morning, I opened my MacBook Pro lid and the username/password popup did not appear to let me unlock the screen. I had to hard reboot the machine. It’s rare, but it happens.

I absolutely cannot get a specific show to download from my TiVo using TiVo-to-go. All the rest download fine; it’s just that one (the last episode of Friends — my wife won’t let me delete it, so I’d just like to archive it to DVD or something).

iDVD has a bug where it sits and spins for hours when encoding the audio in a DVD (google around — this has been reported ad nauseam throughout the net, but I think most people are missing the point: “Encoding a DVD takes a long time — just be patient”. No. It’s a bug.).

I use VirtueDesktops, and in general, it’s great. But if you have many windows open for a single application and minimize one of them to the dock, then try to restore it in a different desktop, it may switch back to the desktop with the most windows open in that app before restoring it.

I just switched my work Treo 650 from Verizon to Cingular so that it would work in Europe for some recent travel that I took. The Cingular version of the Treo 650 software seems to be far less reliable than the Verizon version; it reboots at least once a week (usually spontaneously, but sometimes in the middle of a call), the buttons are decidedly slower to respond, and e-mail is quite unreliable (mail that I know is in my INBOX does not show up on the Treo, and sometimes mail that I send from my Treo does not actually get sent).

I don’t know if this is Cingular-specific, but once you create an outgoing e-mail on the Treo 650 and send it while working offline, you cannot delete it (even though it’s listed in the outgoing queue). So it will always go out once you go back online (well, it might go out, per the previous bullet…).

Andrew L. and I discovered this past week that GoDaddy.com’s account transfer and domain renewal forms do not work with the Mac OS X web browser, Safari. The problem is that they fail in a very non-intuitive way — you enter the security code that GoDaddy asks for and it comes back and tells you that the code is invalid (i.e., it’s not an outright / blatant failure). So you bang your head against this for a while and then go try Internet Exploder — and it works. Grumble.

March 31, 2007

My TiVo is dead -- long live my TiVo!

My TiVo died today.

It was one of the original TiVo-manufactured series 2’s, activated on June 6, 2002, making it 4 years, 9 months, and 25 days old. It led a good, productive life. Its disk and fan had been making increasingly louder noises over the last few weeks; today, it refused to boot after a power blip. Its front-mounted green LED winked out as a final “goodbye, and thanks for all the fish.”

It is survived by its owners, Jeff and Tracy (the munchkins only know it as the Box That Plays Music).

There were 3 Big Bummers about my TiVo’s death:

Dealing without TiVo for a while. I know that sounds callous — why should your life be so dependent upon TV? Well, that was kinda the point: my TiVo allowed me to not be dependent upon TV. It would record stuff whenever it was supposed to; I could then relax and watch shows whenever was convenient. Specifically: I never paid attention to the TV except when I wanted to; I’m barely aware of what shows are on what days of the week.

There were several shows on the hard drive that we had not watched yet. Fortunately, they’re all available online (Medium, Battlestar Galactica, Raines — all available legally, thankyouverymuch!).

I had a lifetime membership on that TiVo, meaning that I wasn’t paying anything per month for the unit. TiVo doesn’t offer lifetime memberships anymore, so getting a new unit to replace this one would mean paying a monthly fee. I’m not opposed to monthly fees, per se, but when you’ve enjoyed a service for so long without paying a monthly fee (again, completely legally!), it’s a bummer to have to start paying one.

Just for the heckuvit, I called TiVo to see if there was any way to transfer my lifetime membership. And guess what? Since this was my original TiVo unit and it was the first time it failed, they’re sending me a refurb unit (with 80 hours of storage; my prior unit only had 60) and transferring my lifetime membership to it. I’ll have to pay an “exchange fee” that is about $45 more than I would have paid for a new 180 hour unit, but that’ll pay for itself in a few months because a new unit would have incurred a monthly fee (ok, yes, I’m a total cheapskate). Woo hoo!

I suppose that this really only delays the inevitable — when the munchkins get old enough to have their own TiVo and/or so many “family shows” start sucking up disk space on the machine that we need a [much] larger capacity unit. But that’s ok; I’m happy to put that off to another day (and/or just get a big second disk to put in the current machine).

So Big Kudos to TiVo: thanks for keeping a customer happy when, by the letter of their contracts, they didn’t have to! Just another reason to love TiVo. ☺

April 21, 2007

More bugs

Our new Tivo seems to have a few minor buglets in it:

Since it has more disk space than our last Tivo, it stores many more “recommended” shows — the folder containing recommended shows is several pages long. I have noticed that if you navigate to the Nth page in the recommended folder and view the info for a given show, when you return to the recommended folder index, you’ll be back on the 1st page. That’s somewhat annoying, actually.

Periodically, hitting the 30 second skip button will seemingly have no effect. So you end up mashing it 3-4 times thinking that you just weren’t pointing the remote in the right direction. Still nothing. Some random time later (usually within 1-2 minutes), all the 30 second skips unexpectedly execute consecutively.

My sister finally got a rebate check for her Tivo (she bought it with an included rebate form). When she tried to deposit the check, it bounced! Wow! I’m not entirely clear if the rebate was from Tivo directly or some third-party reseller (like Best Buy), but it’s amazing that the rebate check would bounce.

We’re having problems with my church’s current e-mail hosting provider and are evaluating Google Apps Premier as a possible alternative. Overall the service is pretty nice, but the software seems immature/somewhat buggy. That is, some of the individual components have been around for a while (Gmail, Calendar, etc.). But Google Apps is supposed to tie them together into a single domain and facilitate sharing between the apps among the domain’s users. This integration doesn’t seem to be fully mature yet. Examples:

We’ve had many issues with trying to schedule resources (e.g., rooms) on the calendar. Sometimes rooms don’t respond to invitations. Sometimes rooms reject the meeting invitation even though they are clearly available. Sometimes rooms claim to be wholly unavailable (and you can’t even send them a meeting invitation) even though they are not reserved.

You can’t change the credit card that will be billed for the Google Apps account once you’ve set it up.

Google Apps accounts are different than regular Google accounts, and they don’t seem to be universally recognized across Google. For example, when using my regular Gmail to send to a Google Apps user, Gmail prompts me with “Invite this user to Gmail!” — shouldn’t it realize that this user is a Google Apps user, and is therefore already on Gmail? Another example: if you’re logged in to your non-Google-Apps Gmail and then go to login to Google Apps, Bad/Strange Things happen — it seems to get confused between the cookies for the two different accounts (some things work, some don’t).

The Mac OS X Mail client can get really slow when you have many thousands of e-mails in a local folder. One of my folders (where I archive everything) had 30,000+ mails in it. I noticed the following:

Mail was getting very slow. Vacuuming with sqlite3 had no effect on speeding Mail up.

Smart Folders that take their input from this mega-archive folder were regularly inconsistent. Example: I went to the same Smart Folder that sources out of the mega-archive folder 3 times in a row and saw 3 different sets of messages in the index.

The only solution I could find was to break up my mega-archive folder into smaller pieces (which was fairly annoying — it kinda defeats the point of having a mega-archive folder). Mail now is considerably faster and my Smart Folders are now consistent again.
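The vacuuming step mentioned above can be demonstrated on a throwaway database, so nothing here touches your real mail. If memory serves, Mail’s index on 10.4 is the “Envelope Index” file under ~/Library/Mail (quit Mail before touching the real file); verify the path on your own machine before trying it.

```shell
#!/bin/sh
# Demo: create a scratch SQLite database, churn rows through it, and show
# that VACUUM shrinks the file.  Purely illustrative -- substitute Mail's
# "Envelope Index" path (with Mail quit) to vacuum the real index.
db=$(mktemp /tmp/vacuum-demo.XXXXXX)
sqlite3 "$db" 'CREATE TABLE msgs (id INTEGER PRIMARY KEY, body TEXT);'

# Insert a couple thousand rows, then delete them all, which leaves
# free pages behind in the file.
i=0
while [ "$i" -lt 2000 ]; do
    echo "INSERT INTO msgs (body) VALUES ('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx');"
    i=$((i + 1))
done | sqlite3 "$db"
sqlite3 "$db" 'DELETE FROM msgs;'

before=$(wc -c < "$db")
sqlite3 "$db" 'VACUUM;'      # rebuilds the file, dropping the free pages
after=$(wc -c < "$db")
echo "before: $before bytes, after: $after bytes"
rm -f "$db"
```

(In my case, as noted above, the vacuum didn’t actually speed Mail up; splitting the folder did.)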

When I was having these problems with OS X Mail, I thought I’d try the new Thunderbird 2.0 client. Unfortunately, it was far worse than Mail at handling large-volume folders. Any time I switched into a goodly-sized folder, there was a significant delay while it loaded it up (“significant delay” = 10+ seconds). After a while, the contents of my entire inbox disappeared (which had less than 400 messages in it). Restarting Thunderbird had no effect; it said that my inbox was empty. So I went back to Mail (thankfully, my inbox had not actually been destroyed!).

June 4, 2007

Bugs bugs bugs!

I volunteer at my church to keep all their computers and internet connectivity up and running. We’ve been having significant problems with our e-mail service provider (ESP) recently. I won’t mention their name, but if you put 1 and 1 together, you may figure out who they are. We pay a monthly fee to host about a dozen e-mail accounts on an Exchange server at the ESP.

Not infrequently, we find that we cannot send mail through the ESP to various domains (small/minor domains like insightbb.com, aol.com, etc.). The reject messages indicate that our ESP has been blacklisted for sending spam. These outages — where my church cannot send e-mails to its business partners and its parishioners — usually last about a week and there’s nothing that we can do about it.

We have had terrible luck with the ESP’s tech support. Their first-line tech support is, frankly, worthless. The scripts that they are provided with do not help at all; they never address the problems that we encounter. Getting through the first-line support to the back-end support is a difficult task; second-line support is relegated to e-mail only (which usually compounds the problem, since the problems that cannot be resolved by first-line tech support are complex issues that require careful explanations and attention to detail). Their second-line tech support is therefore also usually unhelpful; they avoid answering direct questions that I ask, give incorrect answers to the questions that I ask (e.g., “Mail sent by users outside of this ESP to my e-mail address is bouncing. Why is this happening?” / “Try rebooting your computer.”), do not answer e-mails that I have sent (i.e., ignore support requests), and can sometimes be downright surly.

Every once in a while, mail to some of the accounts at our ESP bounces. Why? Who knows? I cannot get a straight answer out of tech support.

Communication from our ESP is also terrible. They never tell us when problems are actually fixed, even problems that we have reported and/or are waiting on for a resolution (e.g., when we are blacklisted).

Because of these problems, we’ve been evaluating other ESP’s. We tried Google Apps (reported on in a prior entry) and had a few of the church staff members have their e-mail forwarded at the server level from our current ESP to GA. When an outside user sends mail to one of the test users, it goes to our current ESP and then is forwarded on to GA. However, when a non-test-user church staff member sends mail to one of the test users, it does not forward; it simply terminates at the ESP. I have been completely unable to get an answer out of our ESP’s tech support as to why this is happening; it is tremendously annoying to the users who are testing GA.

Google Apps has been very responsive and helpful in tracking down the problems that we have encountered with their service. Kudos to Brett from GA tech support for taking initiative and talking with me 1-on-1 to ensure that all of our problems get resolved. Because of the responsiveness of GA and their excellent functionality, we might be switching to GA in the not-distant future, using the “standard”/free edition while they’re still fixing some issues in the calendaring functionality, and then upgrading to the “premier”/not-free version when everything is fixed/working. We’ll then be able to drop our old ESP (can’t happen soon enough, if you ask me!).

OS X’s Mail.app sometimes does not save a copy of outgoing mail in the sent mail folder on my IMAP server. Here’s a detailed description that I submitted on http://bugreport.apple.com/ (later marked as “Duplicate/3322819” by the Apple Bug Gods; I unfortunately cannot view that bug to see what it says):

Summary:

Periodically, Mail.app fails to store outgoing mail in the sent mail folder.

Steps to reproduce:

It’s difficult to say because this only happens once in a while. Most mail that I send shows up in my sent mail folder, as expected. It’s only a very small percentage of mails that don’t (maybe one in 200? I use my MBP for my job, so I send dozens of e-mails a day). It may have something to do with the fact that I have very large numbers of mails in the mailboxes on my Mac (e.g., sent mail folder has several thousand messages — several other folders on my mac have tens of thousands of messages).

I have noticed no pattern to when this problem happens (e.g., only happens when I “reply”, or only happens when I “reply all”, or only happens when I compose a new e-mail — I’ve had it happen in all of these scenarios).

Expected results:

I expect all mail that I send to show up in the sent mail folder.

Actual results:

Usually, I’ll notice at some random point later that a mail I sent is not in the sent mail folder. A small number of times, though, I have managed to catch this problem “in action”, so to speak. Just yesterday, I sent a mail and then immediately wanted to forward it to someone else. So I clicked “send” on the message window and then immediately went to my “sent mail” folder. The message that I sent was there at the bottom of the index (I have it sorted by date), but with an index number of “1” (even though it was at the bottom — the message above it was index 2200 or so). The message then disappeared out of the sent mail index a few seconds later, and appears to be totally gone (I can’t find it anywhere).

I have seen this “shows up in sent mail with an index of 1 and then disappears” behavior a small number of times, but only when I’ve immediately switched to the sent mail folder and watched the message disappear.

Regression:

I’m an OS X 10.4 user on an MBP; this problem has occurred on and off over the last year (and I shudder to think of the cases where I didn’t notice that the message wasn’t in my sent mail — it could be happening much more frequently than I thought…? I honestly don’t know because I rely on Mail.app to record all my outgoing mail for me). I always keep my MBP up-to-date on all Apple patches, so the problem has been constant from the 2nd half of 2006 until now (late May 2007).

Notes:

You will see that I have the MailTags Mail.app extension installed. I literally installed this extension very recently (last week, I think?); I have been seeing the “mail disappears from sent mail” problem much longer than that.

I also have File Vault enabled on my home directory. I think that I was seeing this problem before FV was enabled, but that was so long ago that I’m not 100% sure.

‘mbp-mail.app-losing-sent-mail.spx’ was successfully uploaded

27-May-2007 12:29 PM Jeffrey Squyres:

Thanks for responding so quickly.

My Mail.app is configured for 2 accounts, both IMAP:

1. One server is using dovecot. I have not directly observed problems with this account.
2. The second server (my work address) is MS Exchange; I’m afraid I don’t know what version offhand (I’d have to inquire with our IT department; let me know if that would be helpful).

27-May-2007 12:31 PM Jeffrey Squyres:

I’m sorry — I did not specify: yes, my sent mail folders are both accessed via IMAP (one for each respective server), and I have “store X messages on server” (for all values of X, including “sent”) for both accounts in the preferences accounts / mailbox behaviors tab. “Delete sent messages” is set to “never”.

July 29, 2007

Lights, camera, action!

For our anniversary this year, Tracy had been less-than-subtle about wanting a new camera (picture camera, not camcorder). The old one (a Pentax Optio) was fine, but suffered from two deficiencies:

It’s a few years old (and is 4MP); pictures we have obtained from friends’ cameras just “look better” (in part because they’re higher MP).

Modern cameras have a few more features that help the, er, point-n-click-challenged.

So I did a bit of research and came up with the Sony CyberShot DSC T-100 camera. It’s ~8MP, got lots of good reviews, has a cool red chassis with a nifty sliding front panel for opening/closing the camera, and a very large screen on the back for previews. This model even got the CNet editors’ choice award. So it seemed like a good bet.

Off to Froogle — it pointed me to prestigecamera.com. So I duly ordered the CyberShot off the Prestige Camera web site. I wanted to ensure that the camera would arrive in time for our anniversary, so I called the Prestige Camera 800 number to talk to a human. A smooth-talking sales guy a) assured me that it would arrive in time, and then b) convinced me to buy a bunch of extra stuff that I wasn’t initially planning on getting. It turns out that the camera you get comes with a 10 minute battery and a storage card for about 10 pictures. Amazing (i.e., disappointing). So I had to get a better battery and a larger storage card, which, bundled up with a few other goodies, made the whole thing a bit more expensive than I was planning on. But even after getting off the phone with the guy, I was overall happy with my purchases, so it was ok.

The camera arrived on a Thursday and I played around with it. Much to my dismay, I was quite unhappy with the quality of pictures that it took. I took the same pictures with my old Optio and my new CyberShot and then compared them:

Many of the Sony pictures had a decidedly yellow tint; the Optio pictures seemed to have much truer-to-life colors.

The Sony flash appears to be offset from the lens; many of the pictures that used the flash had definite shadows.

Many of the Optio pictures just looked “better” than the Sony pictures (yes, I know that’s subjective). I would look at both pictures side-by-side on my Mac; the Optio pictures just looked sharper, had better overall focus coverage of the entire frame, zoomed in smoother (which was amazing to me since the Optio is 4MP and the Sony is 8MP), etc. And yes, I verified that the Cybershot was taking pictures in the 8MP setting.

After poring through the Sony docs, I found settings to correct some of the issues, but you had to manually select them to fix each issue (and they weren’t uniform in all lighting conditions — you had to manually select various settings for each different lighting condition).

In short: for the price I paid, I was quite disappointed with the camera as a point-n-shoot device.

So I called Prestige Camera, and the good folks there agreed to do a one-for-one swap for a Canon IXY 810 IS [equivalent to the Canon PowerShot SD850 IS] for no charge, in part because I had only opened the camera; all the other packaging was intact. The Canon is actually about $15 cheaper, but given that I effectively got to “try before you buy”, I really couldn’t complain. They even overnighted me the new Canon and associated packaged equipment. So kudos to Prestige Camera for taking care of their customers!

I am much happier with the Canon — it takes high-quality pictures in many different lighting conditions and is much more of an automatic point-n-shoot than the Sony was.

This is all my $0.02. If someone shopping for a new camera finds this writeup, I hope it’s useful to you.

September 3, 2007

Apples and Oranges

I recently got iLife ‘08 for my iMac; I had previously been using iLife ‘07.

I really like iPhoto ‘08. The “events” organization is awesome. However, I did have one repeatable crash when I moved one specific event’s pictures into another event. I’ve merged/moved/edited dozens of events; it was somehow only a problem for moving pictures from this one specific event to another specific event — I got iPhoto to crash 3 times in a row. I dutifully submitted problem reports each time. (Sidenote: I just got a software update for iPhoto — v7.0.2 — I don’t know if this problem has been fixed or not)

And I love the .Mac gallery publishing in both iPhoto and iMovie. That’s where I publish all my family pictures now. Gallery was good, but this is waaay better.

I used iMovie ‘08 last night for the first time to make a home movie. Eh; it’s ok. It definitely does have some nice new features, but there are also some features that I sorely miss from iMovie HD (‘07):

Pros:

The video skimming is pretty cool/useful, but it takes some getting used to. I’m not totally used to it yet.

The ability to trivially specify which hard disk to save imported video on is great (because video sucks up sooooo much space!).

Having all your video clips in one place — and being able to share them between multiple projects — is quite handy. I had to do some whacky stuff to share clips between multiple different iMovie projects (which usually resulted in quite a lot of wasted time and disk space due to clip copying).

Trivial creation of movies to multiple different resolutions, bundled with the one-click publishing to YouTube, .Mac, etc. is wonderful.

Cons:

Video skimming can be “jumpy” if you’re on older hardware, like my iMac G5.

There is no way to fade in/out audio tracks. You can set the audio level for a clip and the background music track, but you cannot fade it in or out. It might be ok if you could do this in conjunction with Garage Band, but the majority of home movies I make are paired with audio purchased from the iTunes store, and Garage Band will not let you use those tracks (yes, I know I could burn them to a CD and then re-rip them, but I don’t want to/shouldn’t have to. I purchased them and iMovie lets me use them — why won’t Garage Band?).

I found it very useful in iMovie HD that you could see the exact time/frame number where you were editing. iMovie ‘08 no longer shows this information; it made it harder for me to exactly edit the movie like I wanted to.

You cannot meaningfully import iMovie HD (‘07) projects; all your transitions, titles, and extra audio tracks are lost. An iMovie tutorial on apple.com calls this a “feature” (“Now is the perfect time to update your old project”); I completely disagree. Luckily, I found by accident that the iLife ‘08 installation does not overwrite the old iMovie HD application; so you can still access all your old projects through the original ‘07 application. But that kinda defeats the point of iMovie ‘08’s consolidation-of-all-video-clips feature.

I definitely ran into some bugs in iMovie ‘08. Here are some examples:

Sometimes when I create a new project, it’s not possible to edit the name. I have to quit and re-launch iMovie for the new project’s title to be editable. That’s just weird.

Sometimes in the “trim clip” view, the end-of-clip grab handlebar spans two clip heights, making it difficult to grab-and-drag properly.

I’m not a big user of Garage Band or iWeb, so I can’t really comment on those.

All in all, iLife ‘08 is worth the upgrade, IMHO. The .Mac publishing alone is great. I was a little disappointed with the regression of some features in iMovie, but I’ll probably cope with a mix of using iMovie ‘07 and ‘08. Oh well.

September 10, 2007

Google Analytics

I set up Google Analytics on JeffJournal a while ago to track who’s coming here (if anyone), what they look at, etc. The majority of hits on JeffJournal are (unsurprisingly) from people searching via google. Google tells me what people were searching for when they landed on JeffJournal. Here are some of my favorites from the list, in order of frequency:

Purple (yes, just the word “purple”): 27 times

Ted Nudget: 14 times

“Get out of my chair dillhole”: 12 times

Tublecane: 4 times

Do elephants sweat?: 2 times

Insusient: 2 times

Past, present participle: 2 times

What does sagacious mean: 2 times

Winshields on 92 Saturns: 2 times

“Garelli 5000”: 1 time

“AIX sucks”: 1 time

Many of these hits come from the fairly random titling of my journal entries (it’s nice to see some other News Radio fans out there…). But it’s still amusing, nonetheless…

September 13, 2007

Mac mac mac

My sister got one of the shiny new iMacs - woot! I now have a co-rebel in the family.

She actually had the machine shipped to my house and then drove down for a visit (well, she was coming for a visit anyway). I gave her a crash course in Mac stuff. Aside from a potentially-annoying-to-install printer driver (looks like a job for me over Thanksgiving…), it seems to be working ok for her.

Louisville KY just recently got an Apple store, too. I haven’t been there yet, but Tracy walked by it the other day and said, “Oh yes, you could easily spend a lot of money in that store…”

September 22, 2007

Religious spam

Among the tech-geek volunteering that I do for my church, I administrate their e-mail listserver which they use to communicate among their various committees, support groups, and the big parishioner broadcast e-mail list that is used to send periodic parish-wide announcements.

But there’s a slimy side-effect of this volunteering: I see a fair amount of Christian-oriented spam. The spams are sent both to the lists themselves (thankfully the listserver automatically discards posts from non-members) and to the technical addresses associated with the lists (e.g., the “owner” e-mail aliases, etc.). The spams masquerade as things that a Christian church should want to send to its parishioners: “Fatima retreats,” “Cost-effective bibles,” and my personal favorite: “Hear [insert pseudo-religious name]’s message for peace.” There are even offers to enable you to spam “your important religious messages to tens of thousands of Christians.” Amazing (but sadly not surprising).

Some of the spams are clearly made by public relations professionals — slick graphic spreads featuring sincere, distinguished-looking men in religious-looking robes holding bibles and/or preaching from a pulpit. They’ve obviously got some real money behind these endeavors — many of them have real web sites with information supporting the content in their e-mail.

Although these spams masquerade as legitimate businesses and have professional appearances, they are just the same as your common word-misspelling / random-phrase anti-spam defeating / image-based 419, pharmaceutical, and stock pump-n-dump scams: it’s all about the money. You have to pay to see whatever valuable message they need to deliver to you to guarantee your salvation.

One could easily argue that it costs money to do anything in this world, even to put out the good word of your favorite religious message. And it’s quite probable that some of these messages are from real organizations who are just trying to put out the good word of their god. Fair enough. But when these announcements are sent unsolicited to “postmaster@lists.mychurch’s.domain,” they’ve lost the moral high ground: that’s clearly an attempt to drum up business. The fact that I only get this kind of spam at e-mail addresses associated with my church clearly indicates that the spammers are targeting religious organizations.

I know to ignore these scams and report them as spam to our ISP. But others don’t. Spam wrapped up in religious overtones can be a lot more attractive because it plays on an emotional response from its intended victims. How many not-internet-savvy users have fallen for these schemes? I have no idea — I freely admit that I’ve done zero research in this area; this journal entry is solely based on my opinion.

But the fact that I continue to receive these spams, some of which clearly cost a lot of money to make, is quite discouraging / saddening.

October 13, 2007

My new Blackjack phone

For about a year, I have had a Treo smartphone for work. It accessed my e-mail, could browse the internet, etc. It worked fairly well. Every once in a while, it would spontaneously reboot (maybe once every 2-3 weeks?), and its bluetooth was quite slow to pick up. But it was a generally reliable phone.

Work just swapped out the Treos for the Samsung Blackjack. Unlike the Treo, the Blackjack is based on Windows Mobile. I’ve only used the Blackjack for a few days now, but I’m less than impressed. I’m still figuring out all the things that are now “different” — but that’s not what I’m complaining about. Here’s what I don’t like:

Phone has locked up 3 times in 2 days.

It was completely locked up once such that I had to remove the battery.

It has “forgotten” my bluetooth settings 5 times such that my bluetooth headset suddenly stops working. I’ve been using the same headset with my Treo for many months, so I’m pretty sure it’s not the headset that is the problem here.

Sometimes a functional button will stop working. That is, it works fine, and then all of a sudden pressing it does nothing. If I reboot the phone, the button starts working again, so I don’t think it’s a mechanical problem.

The Treo had a better/easier interface and integration with my Exchange addressbook for SMS messages.

That’s pretty much it. I’m sure I’ll get used to the Blackjack over time and it’ll become “natural” for me to use (and the Treo will become a distant memory), but for now, I’m annoyed that it’s just as unstable as I would assume a Windows desktop would be. ☹

October 17, 2007

Unix linkers

Do you think you understand Unix/POSIX linkers? I thought I did. Then I started working on the Open MPI project. Then I realized that I didn’t have a clue how they work (e.g., do you know about OS X’s two-level and flat namespaces?).

A complex question came up recently on the Open MPI mailing list about embedding Open MPI in an R or Python language plugin. After 48 hours of extreme confusion and off-list discussions between myself and Brian B., I came up with a chart that helps lessen the confusion at least somewhat. It took me all day to write up that chart. Woof.

October 28, 2007

Fixing .mac sync problems

I’ve been having .Mac sync problems of late. I have fairly modest needs: I have two macs (MBP/work and G5 iMac/home, both running 10.4.10) and I only synchronize my addressbook and Safari bookmarks between them. For me, keeping this data in sync between the two machines is incredibly useful.

However, recently I’ve been running into a problem on my MBP — it simply wasn’t syncing (and not telling me that it wasn’t syncing). I noticed it when I added a bookmark on my home iMac, but several days later it still hadn’t shown up on my MBP. Doh! So I started digging deeper.

Upon closer investigation, I found two distinct failures on my MBP:

If I forced a manual .Mac sync, I would get an error like this sometime during the sync, and then the sync would stop:

Sync Error:
[ISyncConcreteSession pushChange:]:
you can't modify a record that doesn't exist:
<ISyncChange 0xblah>{ modify record id 'blah blah blah'
set com.apple.ical.type = local
set title = Unfiled }

(that’s not verbatim — the important part is “you can’t modify a record that doesn’t exist”)

If I went to the .Mac system prefs, I could see that I had my username/password entered correctly (because it would accurately show how many days I had left in my subscription and how much space I was currently using on my iDisk), but if I went to the “advanced” tab, it would popup a window saying:

An error occurred during this operation.
Could not retrieve .Mac configuration.

(I’m paraphrasing the first line because I never wrote it down, but I know the 2nd line is right)

And then no computers were listed in the advanced tab. Checking the same Advanced tab in the .Mac preferences on my iMac, I saw both computers listed.

So it seemed to be a problem that was local to my MBP.

I googled around a lot and trawled through the .Mac help. Most of the information that I found consisted of the following:

Back up your data, unregister the problematic computer via the .Mac system preferences, and then register it again, and/or

Use the ‘Reset sync data’ button in the .Mac system preferences

Well, I couldn’t unregister or reset the sync data on the MBP because the “Advanced” tab was greyed-out on my MBP (assumedly because of the error message that it couldn’t retrieve the .Mac configuration information for the entire Advanced tab). I tried unregistering the MBP on the working computer/iMac, but I still got the same errors on the MBP.

It seemed that the MBP thought that it was still registered, even if it wasn’t. Hrm.

I found older help posts (circa 2003-2005) that talked about removing sync history through iSync. But .Mac syncing is no longer performed through iSync, so that seemed a dead end. Indeed, I don’t use iSync for anything at all. But since I was desperate, I poked around in iSync anyway. I found the following two things in iSync preferences:

A “reset sync history” button

A master checkbox for “Enabling syncing on this computer” that specifically mentions .Mac (which seems odd, since iSync isn’t used for .Mac syncing anymore).

Doing both of these things allowed the “Advanced” tab in my .Mac system prefs to start working again. Woo hoo! I could then perform the other suggested recovery actions, such as unregistering the MBP and then re-registering it, etc. Now things seem to be working (let’s give it a week to see if it keeps syncing properly…). As I’ve been typing out this entry, I see that my new bookmarks have appeared in Safari. Woo hoo!

But since I played with both options in the iSync system prefs at the same time, I unfortunately don’t know which of the two fixed it, or if both are required. YMMV.

Hopefully, others will find this entry via googling and find it useful…

November 10, 2007

8 out of 7 people are bad at math

I Leoparded last weekend (i.e., upgraded my iMac and MBP to Leopard). A few things I have noticed:

I found a bug in OS X’s Mail client regarding plain text and rich text composing (short version: I have “plain text” set as my preferred format, but still Mail composes some mails in rich text). I filed a bug about this with Apple and they closed it as a dup and something that they’re supposedly already working on.

Spaces is “ok” (vs. great). I wouldn’t say that it’s much better than Virtue Desktops. It gets many of the same things “wrong” as Virtue; if you switch to a different application (via cmd-tab), even one that has an open window on the current Space, you may still get switched to a different Space. I once got Spaces to “lose” all the windows on space 6 (i.e., the windows were supposedly there, but Spaces wouldn’t display them anymore — the windows in Spaces 1-5 were fine. I could even make new windows in Space 6 with no problems), but I haven’t been able to repeat it, so I haven’t filed a bug with Apple.

Quick Look is great, especially for e-mail attachments. It doesn’t always do a perfect job; I’ve seen it fail to show any details on some files (e.g., even powerpoint files that it should know how to display) and I’ve seen it skip some details that are in other files (e.g., not render some of the text on a powerpoint slide). But I guess that’s ok — it’s a quick look, not a detailed examination, after all…

I’ve caused Leopard to lock up a few times (requiring a soft or hard reboot); I’m not entirely sure what I did to make that happen; I was just using the machine normally.

Twice when I’ve rebooted, Leopard has associated the wrong application for opening Powerpoint files. I had to reset it to the right application (and then make all similar files open the same way). I don’t know why it seemed to “forget” how to open the right application.

Open MPI’s build system can make Leopard’s ld throw a bus error. Awesome. Technically, our assembly isn’t exactly correct, but it’s the minimum that will compile on all the linkers that we care about. Making ld on OS X throw a bus error is new, though.

tcsh still sometimes aborts for no apparent reason (it did in Tiger, too). It has something to do with typing ctrl-C on a command line. It is very difficult to reproduce this error; I’ve not found a consistent formula to make it die (i.e., just typing ctrl-C doesn’t make it happen). It doesn’t dump a corefile in /cores, either, so there are few additional clues as to what happened.

Open MPI is included in Leopard as universal binaries for 4 architectures. Woof!

I’m not much of a designer kind of guy so I won’t comment much on the aesthetic changes Apple made, but I will say that the light blue dot on the dock indicating that an application is running is kinda hard to see sometimes.

The printing subsystem is a bit nicer than Tiger’s, but it no longer automatically finds the CUPS/IPP-advertised printer on my home LAN; I had to configure it manually. Tiger’s printing subsystem would always automatically find the printer.

The network subsystem is also a bit nicer than it was in Tiger.

DTrace looks pretty cool, but it has some differences compared to Solaris’ DTrace (making portable integration into Open MPI a bit more difficult).

I haven’t been able to make iChat AV work (audio, video, or VNC — regular chatting works fine), so I can’t comment on it. But I wasn’t able to make it work in Tiger, either — I’m guessing that there’s something weird in my network/ISP setup that is not letting the connections go through (need to setup weird port forwarding or something). I haven’t yet spent enough time with it to figure it out. Shrug.

Spotlight seems slightly faster. I’m guessing that it’s still bogged down by the few hundred thousand e-mails I have in Mail.

I love Mail’s new “export a folder” feature. It allowed me to archive off a bunch of really old mail to some permanent storage on a different server. Removing about 200k mails from Mail seemed to speed it up a bit (yes, I could have done this before by going into $HOME/Library/Mail, but I didn’t really think about it until I found the feature in Leopard Mail and thought “hey, this seems like a good idea!”).

Safari’s text search is waaaay better than the old one; I love how they visually pop the search items out at you when it finds matching text in the web page.

I really like the uniform use of the “Downloads” folder (why didn’t they do this before?); both Safari and Adium download things there by default and it makes the resulting files easy to find (without cluttering up my already-busy desktop).

So is it a huge improvement / worth it?

There are a million little things that are nice. But probably the main things that changed my day-to-day usage are Spaces (I’m trying to use that instead of Virtue Desktops) and Quick Look. Aside from 1-2 new quirks in Mail, it seems to handle very large mailboxes a bit better — and that’s important to me. I would really like to get iChat VNC working so that I can help with some of my relatives’ Macs when they have problems (this was actually the “killer feature” that made me go out and buy Leopard).

November 23, 2007

What I hate about my new cell phone

Per a prior journal entry, my work just changed from Palm Treo smartphones to Samsung Blackjack smartphones. The Blackjack is based on Windows Mobile 5. I’ve made up a list of things that I don’t like about my new phone. To be fair, the blame is equally shared by Windows Mobile 5, AT&T, Credant (the application used for locking the phone/encrypting the data), and Good Messaging (the application that ties into our back-end Exchange servers for e-mail, etc.).

For whatever reason (as compared to my first blog entry), the phone appears to be more stable now — it doesn’t lock up nearly as much, but the list of things I don’t like is still pretty long:

Windows problems:

The “done” button location is inconsistent; sometimes it’s on the left, sometimes it’s on the right.

You can create “speed dial” shortcuts (press-and-hold a number on the keypad to trigger an action), but: a) they’re not actually speed dials; they’re actions (e.g., run an application), so it took a long time to figure out that this functionality could be useful, but b) there’s no indication anywhere of what your “speed dials/actions” are after you set them. So you’d better have a good memory.

It’s a minimum of 7 clicks to get to where you can send an SMS (not counting the clicks to find the right contact) — sometimes more. Why should such a common action be so difficult?

If you cancel an SMS message, it goes to drafts. Then you have to delete it from the drafts folder (it takes 11 clicks to get to the drafts SMS folder).

There is a nice feature to turn off all transmitters (phone and bluetooth). But sometimes when you turn them back on, the bluetooth transmitter refuses to turn back on. It requires a reboot to fix this problem.

The Java on the phone is unusable; Java pops an authorization window every time an application uses HTTP or HTTPS. You cannot set up Java to say “this application is allowed to use HTTP/HTTPS forever.”

The phone plays an annoying (and very loud) noise upon startup/shutdown that you cannot turn off (I suspect this is AT&T’s doing, though — not WM5). The noise is accompanied by an animated fireball, supposedly to indicate AT&T’s blazing fast network. My wife, who heard the sound but didn’t see the accompanying graphic, said, “Did your phone just flush?”

Internet Explorer provides no way to clear the current URL. If you want to go to a new web site, you either have to go to a bookmark or you have to fully backspace out the current URL and type a new one.

I find that Internet Explorer does not render many popular web sites; it just stops in the middle of loading the page. Some sites work fine; other sites just hang.

The phone randomly reboots every once in a while. It rebooted while I was typing up this list, for example.

Good problems:

In all Good screens, a small yellow banner comes up when new mail arrives, but there’s no way to get rid of it (despite there being an “X” on the right hand side of the yellow banner, implying that you can click on it somehow). So the banner stays there until you go read the new mail.

There is no integration between Good contacts and SMS. This is highly frustrating; I get SMS messages that simply show the phone number that they’re from; it doesn’t show me who they’re from. Who remembers phone numbers these days?

Good messaging does not automatically start when the phone boots (!).

When a reminder alert appears for a to-do item, you have to clear the alert to get back to the phone (vs. leaving the alert there because you haven’t actually done the item yet — like Outlook’s alerts window).

In the inbox, there’s no way to jump to the beginning or end of the inbox (or current message) — there’s only the thumbwheel to scroll up and down (which is difficult if you have a few hundred messages in your inbox, for example).

Good only shows the last hundred messages or so in your inbox. I understand conserving resources/memory, but there’s no way for the user to control how many/how few messages appear on the phone.

Credant problems:

Credant will lock your phone while you’re on a call. Even if you want to go on/off mute, you have to unlock the phone (which can be many clicks if your PIN is lengthy). To be fair, I don’t know if this is a Credant problem or someone set a policy that Credant would do this (i.e., I don’t know if it’s a bug or a feature).

Credant is schizo about what you can/cannot do when the phone is locked. For example, you can see/clear to-do reminders when the phone is locked.

December 31, 2007

iBought iPod

I bought my first iPod yesterday: an 80GB Classic (I actually got a Nano as a parting gift from Indiana University when I left for Cisco, but my wife claimed it immediately — I’ve never seen it again).

I resisted buying an iPod for a long, long time, mainly because I don’t listen to music on headphones very much. But I do listen sometimes (especially when traveling, while on planes, etc.). However, I had a better rationalization: Tracy and I maintain all of our music on our home iMac (several hundred CD’s — all of which we own, thankyouverymuch). I’ve kept copies of select music on my work laptop for convenience (e.g., to listen while traveling). But I keep running out of disk space on my laptop — it became annoying to balance the disk space I needed for work while maintaining a decent selection of music to cycle through. A good solution seemed to be to use an iPod to hold all of our home music and listen to it through my MBP (as a bonus, it’s a backup of all of our music). The 80GB model is more than enough to hold all of our music and seems to work well.

After buying the iPod, I spent an hour or two yesterday “cleaning up” our digital music collection — finding some missing album art, ensuring consistency of artist and album names, etc. I find myself doing this every few years, especially when moving to a new technology. Just for perspective: my digital music collection started years ago (early/mid-90’s?), using Grip and LAME/Bladeenc on Solaris to rip my CD’s to MP3s (some of the MP3’s I edited yesterday still had “Created by Grip” comments in the ID3 tags — wow).

While in the Apple store, I also played with an iPod Touch — just for the heckuvit. It’s quite a yummy device. The interface for the Calendar / To-Do stuff and address book stuff is classic Apple: elegant and simple.

February 10, 2008

Popeye vs. Rambo cage match

I got my new iMac this weekend — mmm… Apple refurb store… great way to save $$$ when buying yummy “new” machines. I bought it because our old family G5 iMac was getting a bit long in the tooth; it’s excessively slow when dealing with our 3 billion digital photos in iPhoto and trying to make munchkin family iMovies. The new iMac has a most excellent 24” screen with a 2.8GHz Core 2 Duo. It’s been a looong time since I’ve had a monitor that large!

I also got Final Cut Express 4; it should be quite a few steps up from the latest generation of iMovie (I was quite disappointed in iMovie ‘08; it’s significantly “dumbed down” compared to iMovie ‘07). I didn’t get much chance to play with FCE this weekend; most of my time was spent…

Installing Windoze (and the required 100+ (!) Windoze updates — and accompanying dozen or so reboots). Yes, that’s right. Another reason I wanted to get a new iMac was to have an intel chip so that I could run virtual Windoze.

Why? To run a real version of Quicken, of course! (Quicken for Mac just sucks — don’t get me started) But let me digress once again: when wanting to run Windoze, which should one choose: Boot Camp, Parallels, or VMWare Fusion?

Boot Camp: not even a contender for me. I want OS X and Windoze to run at the same time.

Parallels: I run Parallels on my work laptop (MacBook Pro) and it works just fine. I’ve been pleased with it.

VMWare Fusion: But I’ve been hearing good things about Fusion lately.

I did some googling, but most of the “Parallels vs. Fusion” info out there is 6-12 months old, and based on Fusion betas. There’s a few recent articles, but not much at all. Since Fusion offers a free 30 day trial, I gave it a whirl. Before describing what I found, let me review my criteria:

I don’t care too much about performance differences. I’m mainly (only?) going to be running Quicken under Windoze, so if Parallels or Fusion is 5-10% faster than the other, I don’t care.

For the same reason, I also don’t care about super-duper graphics.

I don’t care about Vista support. I’ll be running XP.

After trying Fusion v1.1.1 for 24 hours, I am sorely disappointed. It is a very basic VM application and lacks a lot of features (at least compared to Parallels!). I admit it: I’m spoiled by Parallels. Here’s some random points:

Parallels 3.0 build 5584 has much better integration — its “Coherence” mode is far superior to Fusion’s “Unity” mode. For example, Fusion consistently shows overlapping windows in Exposé incorrectly, and also is slow to update / doesn’t update overlapping windows in some scenarios.

Parallels’ “smart select” is also truly cool: you can associate a file type in OS X with a Windoze application — double click on the file in OS X and it launches under Windoze [launching Windoze itself if it’s not already running, mind you] with that data file. That’s both cool and genuinely useful! Read between the lines: associate .doc files with Windows Word (ditto with the other Office file types) if you have Windoze Office but do not have Mac Office.

I also very much like being able to “natively” share Desktop / Documents / Pictures / Music between Windoze and OS X. It takes the whole “if I edit the file in one place, do I have to transfer it to the other?” issue out of the equation. Awesome.

Parallels also has a nice GUI for managing disk snapshots that Fusion lacks.

Parallels VM’s also have many more configurable options than Fusion VM’s. I like that, but I recognize that others might find Fusion’s simplicity/lack of options easier to manage.

Fusion seems fine as a VM, but appears to be missing many of the nicer features that Parallels has (which makes sense; Parallels got quite a good head start).

This unfortunately means that I’ve had to load XP twice (and all 100+ (!) Windoze updates each time). That’s why I spent lots of time installing Windoze this weekend. Sigh.

I’ve been using Mac Quicken for a few years now. It clearly does not receive the same development effort that Windoze Quicken does. There are many bugs and annoyances (which have not been fixed in multiple major Mac Quicken releases), and its capabilities are far inferior to Windows Quicken. So it was a relief to have the ability to move back to Windoze Quicken. The Mac Quicken has instructions about how to export your data into a format that the Windoze Quicken can import, so I thought I was good to go.

Unfortunately, it didn’t work out that way.

It turns out that those export-from-Mac-import-to-Windoze instructions are a few YEARS out of date (even though they are bundled in Mac Quicken 2007!). Windows Quicken no longer supports importing QIF files (despite the fact that Mac Quicken does not support exporting to anything other than QIF files). There are a few scripts and programs around the internet that supposedly help, but I haven’t been able to get Windoze Quicken 2008 to read any of my data yet.

Unfortunately, I’m outta time this weekend, so I’ll have to try again next weekend…

Grumble.

I should note that I’m quite happy with everything else with my new iMac. It’s nice and fast, the display is huge, and I was able to transfer over all my old photos and movies files in about an hour or two (yay firewire Mac-to-Mac transfers!).

February 17, 2008

Must be received within 14 days of receipt

I should report back about my entry from last week: I got all my Mac Quicken data imported into Windows Quicken after two important things:

I actually filed a tech support ticket with Quicken (I paid for support, after all…) asking how I’m supposed to migrate my Mac data to Windows. I finally got on the phone with them on Thursday and, after convincing the tech support lady that Win Quicken 2008 would not import the QIF file from Mac Quicken (#$%@#$@!!!!), she put me on hold to check other resources. She came back a few minutes later with a one-time download link for me for Windows Quicken 2004 (which does support importing QIF files). Schwing! So I installed WQ2004, imported my QIF, upgraded to WQ2008, and voila!

Well, not quite. :-) I actually had to run my QIF file through a perl script to scrub it for two things before I imported it:

Several account and category descriptions were corrupted (repeatably so — they were corrupted the same way every time I exported the QIF file) such that they contained characters above ASCII 127. I clipped that stuff out.

All the years were expressed in 2 digits, so Quicken 2004 imported them as (1900+2_digit_year). Hence, lots of my transactions were dated 1900-1908. Ick. Supposedly you should be able to open the OS X Sys Preferences/International and set the “short” date format to have 4 digits; the MQ2007 QIF export should then use that format (i.e., 4 digit years). But it didn’t seem to work for me — the QIF export always had 2 digit years. [shrug] So I set my perl scripty-foo to convert the years to 4 digits. Then everything imported to WQ2004 fine (and subsequently upgraded to WQ2008 fine).
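The gist of that scrub can be sketched like this (shown in shell rather than the perl I actually used; the file names and the two-digit-year pivot here are illustrative assumptions, since my own data only needed the 20xx side):

```shell
# Scrub a Mac Quicken QIF export so Windows Quicken 2004 will import it:
#   1) strip bytes above ASCII 127 (the corrupted description characters)
#   2) on QIF date records ("D" lines, e.g. "D12/31/07"), expand
#      two-digit years to four digits
scrub_qif() {
  LC_ALL=C tr -d '\200-\377' | awk -F/ -v OFS=/ '
    /^D/ && NF == 3 && length($3) == 2 {
      $3 = ($3 < 70 ? "20" : "19") $3   # pivot at 70 is an assumption
    }
    { print }'
}
# usage: scrub_qif < Export.qif > scrubbed.qif
```

Running the QIF export through something like that before feeding it to WQ2004 addresses both problems at once.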

Woot!

I’m still getting used to WQ2008; it’s quite different (plus, it’s in Windoze). But it already seems far more powerful than MQ2007.

February 18, 2008

It's the latency, stupid

I took our car in today for regularly-scheduled maintenance. The work was supposed to take about 2 hours, so I opted to wait at the dealer while it was being done.

I pulled out my laptop and my cryptocard, hooked up to the complimentary wifi, connected to my VPN, and was fully connected to my work. I chatted with colleagues in London and Israel and across the United States. I sent dozens of e-mails. I downloaded some data files and an update for my instant messenger program. I logged into servers in California and worked on resolving some bugs in Open MPI. I did all this without even thinking twice about it.

But for some reason, I abruptly stopped working, sat up, and looked around. I saw cars being dissected in the garage through the window. I saw an obviously newlywed couple signing papers to buy a new car. I saw a woman at the receptionist’s desk scheduling some future maintenance work on her vehicle. I saw other salespeople chatting by the water cooler.

And then it hit me: I’m sitting in a car dealer’s waiting room. And I’m fully connected to everything that I need to do. I’m talking with people on different continents. I’m working on servers thousands of miles away. Wow! Isn’t that just cool?!

We tend to take such connectivity for granted these days. But take a step back: isn’t it amazing? You can be anywhere, any place, any time, and be connected to your friends, family, and colleagues around the globe. Such things weren’t possible even a few years ago.

December 25, 2008

Tivo -> OS X Quicktime

I spent a good deal of time researching this recently, so I thought I’d put it down in an entry so that it can be found by others.

Two facts make it desirable for me to download shows from my Tivo to my OS X Leopard MacBook Pro laptop and/or iPod:

I travel not infrequently

I’m usually behind on my Tivo shows

I bought the Tivo Roxio Toast application (which is Tivo’s native download/view application), but all it does is queue up downloads and have a native MPEG 2 player to play the Tivo videos. It does not make videos playable on your iPod, or allow you to “export” them to other players.

However, I do notice (from Toast) that Tivo now supports some flavor of “resuming” downloads. I tried HTTP resume with curl at one point (manually downloading Tivo files over HTTP) and it didn’t work; perhaps one of the Tivo software updates has enabled HTTP “resume”…? Resuming downloads is very handy if you’re halfway through downloading a 1+GB file and the connection gets cut for some reason.

So I abandoned Toast and started using TivoDecode Manager (TDM) for this task. TDM downloads files from your Tivo, decodes them, and then morphs the file into a format that is both viewable in Quicktime and on iPods. Unfortunately, I grew frustrated with its slow / balky / unpredictable / buggy graphical interface.

So I started doing the following manual process instead:

Visit http://my_tivo_ip_address/

Type in “tivo” for the username and my Media Access Key (MAK) for the password.

See a list of all the shows that are currently on my Tivo.

Right click to copy the link for each show I want to copy.

Paste the link into a perl script that downloads them all for me. Why so much trouble — why not download them directly from my web browser? There’s a few reasons:

An hour long show is over a gigabyte — they’re huge files

My Tivo is on my wireless network, and while I have a pair of rockin’ Cisco Aironet 1132’s for 802.11g speeds, it still takes a long time to download each show (I never bothered to figure it out conclusively, but it seems like the download speed from the Tivo is rate-limited, either by the cheap USB wireless dongle NIC that I have on it, or by the Tivo software itself)

Tivo only lets one show be downloaded at a time

I have a Linux server at home which is powered on 24/7; it’s perfect for “in the background” downloading like this

Plus, the Linux server is on the wired side of my LAN, so the Tivo is the only entity sucking up wireless bandwidth — I don’t have both the Tivo and the downloading computer competing for wireless bandwidth

Summary: When recording at “medium” quality (which is what I have set as the default on my Tivo), it takes 45 minutes to download a 1 hour show to my Linux server (yow!). So it is much better to set up a batch script to download and let it run unattended.
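The batch script itself boils down to a loop like the following (sketched in shell rather than the perl I actually used, with a made-up MAK; the `--digest` flag reflects my understanding that the Tivo web server uses digest authentication):

```shell
# Fetch each show in turn -- the Tivo only allows one download at a time,
# so a simple sequential loop is all that's needed.
tivo_fetch_all() {   # $1 = file with one URL per line, $2 = Media Access Key
  while read -r url; do
    # username is always "tivo"; the MAK is the password
    # -C - : resume a partial download if one exists
    # -O   : save under the remote file name
    curl --digest -u "tivo:$2" -C - -O "$url"
  done < "$1"
}
# usage: tivo_fetch_all urls.txt 0123456789
```

Kicked off on the Linux server, something like this can chew through the whole queue overnight, unattended.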

I then use the TivoDecode command line tool to decode the file into an MPEG 2.
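That decode step is basically a one-liner; here is a sketch (the MAK and file names are placeholders, and the flag spellings are from memory, so check `tivodecode --help`):

```shell
# Strip the Tivo wrapper from a downloaded .TiVo file,
# leaving a plain MPEG-2 file.
decode_show() {   # $1 = Media Access Key, $2 = input .TiVo, $3 = output .mpg
  tivodecode --mak "$1" --out "$3" "$2"
}
# usage: decode_show 0123456789 show.TiVo show.mpg
```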

There are then 2 options for watching the file:

The mplayer application is a fine player that natively plays MPEG 2 videos

Or, if I want to watch it on my iPod (sometimes its just much more convenient to pull out an iPod on a plane rather than a big MacBook Pro — especially since I tend to fly cattle class…), I re-encode it to MPEG 4. The resulting MPEG 4 file can be dragged-n-dropped into iTunes and synced to my iPod in the normal way.

The recoding from MPEG 2 to MPEG 4 step is the tricky part, especially for those who are not experts in video / audio encoding technologies (like me). As I mentioned above, I used to use the TivoDecode Manager (TDM) for converting the files to MPEG 4. TDM simply invokes some other command line tools for the majority of its work, so I stole a bunch of its ideas and hacked up my own “convert” script and used the command line tools that were included in the TDM application bundle (specifically: “mencoder” from the mplayer package).

However, there were two problems with the “mencoder” that ships with TDM:

It is not compiled with MMX or SSE support (even though mencoder does support both MMX and SSE). I’m running on Intel chips that have MMX support — using the mencoder support for MMX greatly speeds up the encoding process.

It sometimes fails to convert Tivo MPEG 2 files to MPEG 4; it’ll convert the first minute or two of the file, and then abort, complaining about too many audio frames buffered.

The second one was the stickler for me — there were some shows that it just plain refused to convert. I don’t really know enough about audio / video encoding to know what the exact problem is, but it is very repeatable (with some shows). Shrug.

So I finally got some time and looked into this. Specifically, I downloaded and compiled my own mplayer suite (including mencoder) — that’s when I discovered that using MMX and/or SSE support makes a huge difference in encoding time. Be sure to have libfaac installed first so that mencoder can use the AAC audio encoder for the soundtrack. I’m sure there are other audio codecs that Quicktime supports, but I could only get videos encoded with AAC to play properly in Quicktime.

Note: I used Darwin ports to install the “faac” and “lame” packages to get mencoder (and later, ffmpeg — see below) to include both libfaac and libmp3lame encoding support.

But no matter what options I tried, I could not get mencoder to re-code my problematic MPEG 2 files to MPEG 4. Doh!

FWIW, here’s the command line that I was using for mencoder (originally taken from TDM):

The resulting ffmpeg-generated MPEG 4 files are a bit larger than the MPEG 4 files generated by mencoder, but I don’t really understand all of the ramifications of all of the command-line options that I am using. So there’s probably a good reason for that, but I don’t know what it is.

The ffmpeg encoding process is a lot faster than what I’m used to with TDM’s mencoder. I think that there are (at least) three reasons for this:

I’m using ffmpeg’s MMX support (remember that the precompiled mencoder I had did not have MMX support, even though mencoder itself does support MMX)

I did play around with command line options to get “fast” settings (this may have sacrificed some audio/video quality, but I am nowhere near an expert, so my untrained eyes/ears can’t tell the difference in the output)

I got a recent SVN checkout of ffmpeg — a much, much newer version (SVN r16304) than the mencoder version I was using
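The ffmpeg step has roughly this shape (the bitrates, frame size, and codec names below are representative assumptions, not the exact command line I settled on):

```shell
# Re-encode Tivo MPEG-2 into an iPod/Quicktime-friendly MPEG-4.
tivo_to_mp4() {   # $1 = input .mpg, $2 = output .mp4
  # libfaac for the audio track, since AAC was the only audio codec
  # I could get Quicktime to play reliably
  ffmpeg -i "$1" \
         -vcodec mpeg4 -b 1200k -s 320x240 \
         -acodec libfaac -ab 128k \
         "$2"
}
# usage: tivo_to_mp4 show.mpg show.mp4
```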

It would be really great if someone would re-do TDM “right” — have a nice integrated GUI that does all the downloading and re-encoding and importing into iTunes for you. As it stands, both Toast and TDM fail “the spouse test”: my very-intelligent-but-not-a-computer-scientist wife would be unable to reliably download files from Tivo and convert them to iPod format. And the manual process that I use fails the spouse test because it requires shell commands (that’s an automatic disqualifier). Blah.

I wish that I had the time to re-do TDM, but sadly, I do not. Any takers?

January 24, 2009

Cabbage is a dish best served cold

I back up the data on my work laptop every week. I’m quite fastidious about it because I’ve been burned by lost data before. So I usually start my backup somewhere around 6-7pm on Friday evening (I use rdiff-backup to a Linux server on my 100Mbps home LAN — yowzers). Then I top it off by rebooting my laptop either later that night or sometime Saturday (OS X is pretty stable, but I find that periodic reboots are still a Good Thing).
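The weekly ritual boils down to a single rdiff-backup invocation, something like this (the hostname and paths are made up for illustration):

```shell
# Incremental backup of the laptop's home directory to the home Linux
# server over SSH; rdiff-backup keeps reverse diffs, so older versions
# of every file stay recoverable.
backup_laptop() {
  rdiff-backup --exclude '/Users/me/Library/Caches' \
      /Users/me backupbox::/backups/mbp
}
# usage: backup_laptop   (run by hand on Friday evenings, or from cron)
```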

This week was no exception to my routine; I backed up everything Friday night. This morning, however, my MacBook Pro failed to reboot. Doh!

I obviously wasn’t afraid of losing any data since I had just backed up everything. But I wasn’t looking forward to the tedium of re-installing everything. Ugh.

Kyle told me about holding down the “T” key during an OS X boot, which enables firewire “target” mode, meaning that you can hook your laptop up to another computer (e.g., my iMac) and the disk on the MBP basically appears as an external disk. Woot! So I copied all my data — plus a pile of extra applications that would have been annoying to track down — over to my iMac (ok, I really just got yet another copy of my data over a faster network medium, but it still made me feel good). I did try to run Disk Utility on the MBP disk, but it told me the exact same thing as fsck — no love.

So I rebooted the MBP from my Leopard install DVD, took a deep breath, and… erased my entire laptop hard drive. Yow. It was surprisingly scary. Happily, the disk fully zeroed out without any errors, so I guess the disk itself is ok (SMART reports that it’s ok, too).

I’m now re-installing Leopard on the laptop and will re-copy all my data back when it finishes. I’ll still need to re-install a bunch of apps, though (e.g., those installed by Darwin ports, all the OS X updates, etc.).

I found a spare 250GB external drive lying around that I wasn’t using; I’ll now be augmenting my weekly rdiff-backup with either Time Machine or SuperDuper; I haven’t decided which yet. I’m leaning towards SuperDuper because I can still use my rdiff-backup for periodic file loss (which isn’t that common for me; most of my software development work is done remotely on Linux machines) and use SuperDuper for catastrophic disk loss.

Sigh. I thought I had myself covered for backups. But I guess not; this exercise has wasted several hours so far. But if I use SuperDuper, next time (hypothetically) it’ll only take 30-60 minutes to fully restore.

February 20, 2009

Firewire guppies

This week I got to attend a WebEx meeting (meaning: teleconference + simultaneously visiting a web site where the teleconference speaker was flipping through his slides) about how email and related services work behind the scenes at Cisco. It was actually pretty fascinating.

Everyone takes email for granted these days, but only a handful of tech geeks (relatively speaking) really understand how complicated email really is. Email is hard. Very hard. Especially when you have a single top-level domain (cisco.com) that has to span a world-wide organization. It takes some really well-thought-out architecture to make it “just work” for the tens of thousands of Cisco employees around the world.

Here’s the white paper describing how Cisco does its messaging — fascinating stuff if you’re into such things:

March 1, 2009

XM radio hates me

I’ve been an XM subscriber for several years; we even have two radios — one in our minivan and one in my office that I listen to all throughout the workday. I very much enjoy their music selection.

I got an email recently that it was getting time to renew — would I be interested in renewing for a multi-year bundle? You can save a little money by paying the balance up front (vs. monthly); I didn’t mind giving a little money to cash-strapped XM in the hopes that it will help them stay afloat. So I renewed on a multi-year plan early last week.

On Friday, I was out all morning and first used my minivan XM radio mid-afternoon. It went to the “preview” channel, meaning that it had somehow been deactivated. Blast!

I was busy with work and running errands all afternoon, so I didn’t think about this until I got home and turned on the XM radio in my office. It exhibited the same behavior: only tuning the “preview” channel, indicating that it, too, had been deactivated. Gaahhh!

I tried logging into the XM web site to see what was up, but it was refusing my login (“the XM system is having problems; please try later”). So I called XM and asked what was up. After a very confusing conversation, the support rep said that a supervisor would be calling me back 24-72 hours later (which seems like a really odd timeframe). In the meantime — no XM radio. So sad. ☹

I finally got a call Sunday evening: the minivan radio had been fixed (I tested it and it works again), but my office radio had not. I can login to the web site now, but it only shows the car radio — it doesn’t show my office radio. I’m waiting for yet another call back from XM to figure out how to fix that.

I asked the rep this evening what the cause for these problems was. She gave me a very confusing answer that led me to believe that the fact that I renewed (somehow I apparently changed plans in the process?) caused a 24-72 hour delay in processing that deactivated my radios in the meantime. Very, very frustrating.

Seriously, for a cash-strapped company like XM, they should not be offering promotions like this and then cutting off service for multiple days to long-time, loyal listeners while they “upgrade” their service.

March 5, 2009

Happiness and Sadness

I have to return my old Blackjack to Cisco for erasing and recycling. Sadness (I cannot just throw it off a roof — or perhaps something even more creative — that would have been happiness).

XM still hasn’t fixed my second radio yet — no XM radio at work since last Friday. More sadness.

XM promised me on Sunday that they would call me back about my second radio, but they hadn’t yet. So I called them today. After sitting on hold for 30+ minutes, I finally got a very friendly and helpful representative. She unfortunately could not do anything to fix my second radio (apparently it had been deactivated off my account — I have no idea why), so she promised — very earnestly — that she would bump this to her supervisor and try very hard to get him to call me back today. I do believe she was honest and genuinely trying to help (and she was actually quite apologetic and sympathetic), but I got the same speech on Sunday. So we’ll see what happens.

Perhaps because she felt bad for me, she bumped me up to a “lifetime” subscription for the Honda-built-in XM radio in our minivan. It was only a few $ more than what I had just paid to renew it for 3 years anyway, so it was a good deal. Let’s hope XM stays in business long enough to make that worthwhile! ☺

March 7, 2009

XM hates me a little less

So the XM saga is over… at least for now. My XM account has been restored and I can listen to my radio again.

As probably was to be expected, I did not get a call back from my previous call to XM support (despite promises from the helpful XM rep that I spoke to). So I called again today and sat on hold for 30+ minutes. I finally got a rep who re-activated my radio on the spot. He also gave me a month’s worth of credit for my lost time and to cover the rest of my existing subscription that should have been on my account. He also signed me up for 2 more years on my office radio (which is what I tried to do over a week ago).

Why on earth couldn’t the other XM reps that I talked to do that? Why did I have to be offline for over a week? Extremely frustrating / disappointing. ☹

I like to have my home phone, work phone, and work cell hooked up to GV for the whole “single number reach” thing (stealing a term from Cisco’s VOIP solution)

But since my work phone also has single number reach (paired with my work cell), if someone calls my GV number, then my cell phone effectively rings twice (once because GV is calling it, and once because my Cisco work phone is calling it) — which results in Badness

So if I disable my work cell in GV, then my work cell rings once (yay), but then SMS doesn’t work (bonk)

What would be great for GV would be if you could have a cell phone registered that does not receive calls but does receive SMS messages. This would be great for me, and I can imagine that others might want that as well (but probably for different reasons).

I also tried to use my GV phone number to activate a Google App Engine account recently and it didn’t work. I.e.,

April 12, 2009

Who needs a stinkin' entry title, anyway?

Per a prior entry, I got my Blackberry Bold. In general, I’m much happier with it than I ever was with my Crapjack. It has one or two annoying “features”, but in general, it’s great.

My wife has a Blackberry Curve, so I’ve been able to compare the two models pretty closely. For my needs, I’m much happier with the Bold. It’s a little bigger and thinner — meaning that it has a bigger and brighter screen (IMHO). The keys feel a bit more natural to me, too. And the GUI on the Bold is definitely more “modern”.

Of course, within two weeks of receipt, I put my Blackberry through the laundry. DOH! And I did it the day before I left for a week-long trip. Yoinks. It was quite painful to be traveling and not have a phone. So I had to buy another one. Yuck. :-( But I finally got my replacement, and I’m being much more careful with it…

PROS:

It’s much more reliable; I haven’t seen it randomly reboot, for example.

UPDATE: I’ve now seen plenty of random reboots…

The interface is much more modern (which is really a subjective point, but I still like it much better than the Crapjack interface) and is much more internally consistent than the Crapjack interface.

Easy things are easy on the Blackberry. I don’t feel like looking up my old entry, but it was many, many clicks just to send a text message on my Crapjack. From the start screen, it’s 2 clicks and 2 trackball rolls (excluding the characters typed to enter the name of the person I want to SMS) to send a text message (or call or MMS or email). Much better.

There are keyboard shortcuts in the email index screen. For example, you can jump to the top of the index, bottom of the index, to the previous/next day, etc. You don’t have to scroll scroll scrooollll to get to where you want to go.

There’s a Google Mobile app that works on the Blackberry (it didn’t work at all on the Crapjack because of weird Java issues). I love getting both my work and home email on one device.

CONS:

There’s no way to lock the keypad. It’ll “lock” the device, but then any keypress will start entering the device password.

UPDATE: Holding down the “mute” button on the top will lock the device. Handy.

It appears that the Blackberry Enterprise service installs some software on the back-end Exchange server that monitors your INBOX. Any time a message arrives or changes state (e.g., you reply to a message), it deletes the message, processes it, and then puts it back in your INBOX. Some of the processing includes making every message be a multipart-MIME message — it adds an HTML-ified version of every plain-text message, for example. This is highly annoying to me for two reasons:

I receive my work email via IMAP on the OS X Mail application. Apparently, Mail is fast enough to see the message before the Blackberry service deletes it. So a mail will arrive, appear in my INBOX, then disappear, and then re-appear. Very annoying, because it changes focus, “blinks” the index pane, etc.

I send and receive mostly plain text email (because I am a programmer and communicate mostly with other programmers). Since the Blackberry service adds an HTML version of the mail to each message, the Apple Mail application assumes that you want to see the HTML version (there’s no way to tell it “I prefer the plain text mail”). So I’ve effectively lost the ability to see plain text mails. Arrgh!
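For the curious, the transformation described above — a plain-text mail gaining an HTML sibling part — produces a multipart/alternative message. Here’s a minimal Python sketch of that structure (the message content is made up; this isn’t the actual Blackberry server code):

```python
# Sketch of the multipart/alternative structure described above.
# The message content is hypothetical; this just shows why an
# HTML-preferring client stops showing the plain-text version.
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

msg = MIMEMultipart("alternative")
msg["Subject"] = "Status"

# Alternative parts are ordered least- to most-preferred; clients
# like Apple Mail render the last alternative they can display --
# here, the HTML part that got tacked on.
msg.attach(MIMEText("Meeting at 3pm.", "plain"))
msg.attach(MIMEText("<p>Meeting at 3pm.</p>", "html"))

print(msg.get_content_type())   # multipart/alternative
```

Since the HTML part comes last, an HTML-capable client considers it the preferred rendering — which is exactly why my plain-text mails stopped displaying as plain text.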

Hopefully, the above Exchange integration item can be fixed in the not-distant future (unfortunately, I’m not holding my breath — there have been reports of Blackberry doing this for several years). Other than that, though, I’m pretty happy with my new device.

April 25, 2009

Argyle serving platter patterns

My iMac Mighty Mouse finally died. May it rest in peace.

By “died,” I mean that the horizontal / vertical scrolling button wheel thingy on the top finally quit working. I’ve read lots about this on the net — I knew it was just a matter of time before mine stopped working. From having cleaned many trackball mice, I can see how gunk just accumulates in the mechanism and that is probably what makes it stop working. Shrug.

I got a cheapie optical Logitech scroll mouse to replace it. Works fine.

In other Mac news, Anna A. face-dialed me with her iPhone the other day (meaning: it’s like pocket dialing, but with your face. She was on another call, holding the phone up to her ear and talking, and her face pushed enough buttons on the iPhone to call me. Awesome. ☺ )

September 13, 2009

Spammy spam spam

I do all my personal email via Google mail (gmail). They do a really great job of catching spam. This is pretty important to me because I’ve had my same personal email address for 10 years! (squyres.com turns 10 this Wednesday) Hence, that address turns up on lots of spammers’ lists.

One of the things Gmail does is maintain a rolling 30-day window on your spam. That is, when Gmail identifies spam, they move it to your spam folder. When that spam is 30 days old, it’s automatically deleted. Simple.

Just because I’m a curious guy, I periodically like to calculate my “spams per hour” rate (sph) — an average of how many spams I received every hour for the past 30 days. The calculation is simple: number of spams in my Gmail spam folder divided by 24 divided by 30. Or, for the mathematically inclined, “x / (24 * 30)”.
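The calculation above is trivial, but here it is as a quick Python sketch (the spam counts are example values, not real data):

```python
# "Spams per hour" (sph) over Gmail's rolling 30-day spam window.
HOURS_IN_WINDOW = 24 * 30  # = 720 hours

def sph(spam_count):
    """Average spams received per hour over the last 30 days."""
    return spam_count / HOURS_IN_WINDOW

def total_spams(sph_rate):
    """Inverse: total spams in the folder implied by a given sph."""
    return sph_rate * HOURS_IN_WINDOW

print(round(sph(2671), 2))    # 3.71
print(int(total_spams(6)))    # 4320
```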

My sph sometimes varies wildly; I have seen it as low as 2 and as high as 12. Today, my sph is 3.71. Just a few days ago it was over 6.

Consider that a change from 6 to 3.71 reflects a fairly large difference in the total number of spams:

6 sph = 4,320 spams

3.71 sph = 2,671 spams

Seeing large fluctuations like this usually means that some kind of “spam event” has occurred within the last 30 days. For example, I have a decrease of ~1,500 spams compared to earlier this week; it’s possible that some spammer got knocked off the net (or filtered) or otherwise stopped sending. It’s also possible that my address fell off the spammer lists that were active over the last 30 days.

But the optimist in me prefers to think it was the former. ;-)

When my sph reached 12 a few months ago (8,640 spams in my folder), it was a fairly dramatic ramp-up — the spammers must have been redoubling their efforts to send out huge volumes of mails and/or figured out some clever trick to avoid server-side blocking. The fall off was equally dramatic; IIRC, within the span of 1-2 days, my sph fell to 2 or 3.