
When you want to forcibly replace a symbolic link on some kind of Unix (here I'm using the version of ln from GNU coreutils), you can do it the manual way:

rm -f /path/to/symlink
ln -s /new/target /path/to/symlink

Or you can provide the -f argument to ln to have it replace the existing symlink automatically:

ln -sf /new/target /path/to/symlink

(I was hoping this would be an atomic action such that there's no brief period when /path/to/symlink doesn't exist, as when mv moves a file over top of an existing file. But it's not. Behind the scenes it tries to create the symlink, fails because a file already exists, then unlinks the existing file and finally creates the symlink.)
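If you do need an atomic swap, one common workaround is to create the new symlink under a temporary name and rename it over the old one, since rename() is atomic. Here's a small sandbox sketch (the names are made up for illustration) using GNU mv's -T option:

```shell
# Work in a scratch directory with illustrative names
cd "$(mktemp -d)"
mkdir old_target new_target
ln -s old_target current          # the existing symlink
ln -s new_target current.tmp      # build the replacement under a temp name
mv -T current.tmp current         # rename() the temp link over the old one;
                                  # "current" never disappears, even briefly
readlink current                  # now points to new_target
```

The -T tells mv to treat the destination as a normal file rather than descending into the directory it points to.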

Anyway, that's convenient, but I ran into a confusing gotcha. If the existing symlink you're trying to replace points to a directory, the above actually creates the new symlink inside the directory the old symlink points to, because ln dereferences the old link first. (Or it fails if the referent is invalid.)

To replace an existing directory symlink, use the -n argument to ln:

ln -sfn /new/target /path/to/symlink

That's what I've always wanted it to do, so I need to remember the -n.
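To see the gotcha and the fix side by side, here's a small sandbox demonstration (the directory names are invented):

```shell
cd "$(mktemp -d)"
mkdir dir1 dir2
ln -s dir1 link                   # link -> dir1
ln -sf dir2 link                  # gotcha: ln dereferences link and creates
                                  # the new symlink at dir1/dir2 instead
readlink link                     # still dir1
ln -sfn dir2 link                 # -n treats "link" itself as the target
readlink link                     # now dir2
```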

Years ago, Jon and I spoke of submitting patches to implement some form of "follow the leader" (like the children's game, but with a work-specific purpose) in GNU Screen. This was around the time he was patching screen to raise the hard-coded limit of windows allowed within a given session, which might give an idea of how much screen gets used around here (a lot).

The basic idea was that sometimes we just want to "watch" a co-worker's process as they're working on something within a shared screen session. Of course, they're going to be switching between screen windows and if they forget to announce "I've switched to screen 13!" on the phone, then one might quickly become lost. What if the cooperative work session doesn't include a phone call at all?

To the rescue, Screen within Screen.

Accidentally arriving at one screen session within another screen session is a pretty common "problem" for new screen users. However, creative use of two (or more) levels of nested screen during a shared session allows for a "poor man's" follow the leader.

If the escape sequence of the outermost screen is changed to something other than the default, then the default escape sequence will pass through and take effect on the inner screen. In this way, anyone attached to the outermost screen will be following whoever is controlling the inner screen session as they flip between windows, grep logs, launch editors and save my vegan bacon! To "break away" from the co-working session, a user would simply use the chosen non-default escape sequence of the outermost screen to create a new window or disconnect entirely.

Sound confusing? Give some of the following commands a try. You can always just close out all the windows of a screen session and eventually you'll make it back to your original shell.

first, start the outer "followme" session with a non-default escape sequence (here <CTRL> <e>, matching the examples below):

screen -e ^Ee -S followme

then, from within the "followme" session, start the inner screen where actual work will be performed:

screen -S work

get friends and co-workers (logged-in as the same user) to connect to your "followme" screen:

screen -x followme

work as normal using the default <CTRL> <a> sequences (which ought to affect the inner "work" session).

to "break away" from the "work" session, use: <CTRL> <e> sequences (which ought to affect the outer "followme" session). For example, to disconnect from the shared session, one would type: <CTRL> <e> <d>

Note: If those sharing the screen session are already acclimated to screen-within-screen, you can skip the non-default escape sequences entirely and use <CTRL> <a> <a> as the escape sequence (another <a> for every level of screen-within-screen). This also happens to be your evasion route for accidental screen-within-screen moments.

Remember that, by default, everyone who wants to share the screen must already be logged-in as the same user (without the use of sudo or su). There are methods of allowing shared screen access between users, but those are outside the scope of this post.

I recently saw a problem in which Postgres would not start up when called via the standard 'service' script, /etc/init.d/postgresql. This was on a normal Linux box, Postgres was installed via yum, and the startup script had not been altered at all. However, running this as root:

service postgresql start

...simply gave a "FAILED".

Looking into the script showed that output from the startup attempt should be going to /var/lib/pgsql/pgstartup.log. Tailing that file showed a message indicating that the postmaster could not read its postgresql.conf file due to a permission problem.

However, the postgres user can see this file, as evidenced by an su to the account and viewing the file. What's going on? Well, anytime you see something odd when using Linux, especially if permissions are involved, you should suspect SELinux. The first thing to check is if SELinux is running, and in what mode:
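For instance, a quick check (a sketch; getenforce and sestatus ship with the SELinux userland tools, so the fallback branch is only there for systems without them):

```shell
# getenforce prints Enforcing, Permissive, or Disabled
if command -v getenforce >/dev/null 2>&1; then
    mode=$(getenforce)
else
    mode="Disabled (SELinux tools not installed)"
fi
echo "SELinux mode: $mode"
# sestatus, where available, also shows the loaded policy and details
```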

Yes, it is running and, most importantly, in 'enforcing' mode. SELinux logs to /var/log/audit/ by default on most distros, although some older ones may log directly to /var/log/messages. In this case, I quickly found the problem in the logs: an AVC denial recorded when the postmaster tried to read across the symlink.

Here we see that although the postgres user owns the symlink, owns the data directory at /var/lib/pgsql/data, and owns the file in question, /var/lib/pgsql/data/postgresql.conf, the conf file is no longer really on /var/lib/pgsql, but is on /mnt/newpgdisk. SELinux did not like the fact that the postmaster process was trying to read across that symlink.

Now that we know SELinux is the problem, what can we do about it? There are four possible solutions at this point to get Postgres working again:

First, we can simply edit the PGDATA assignment within the /etc/init.d/postgresql file to point to the actual data dir, and bypass the symlink. In this case, we'd change the line as follows:

#PGDATA=/var/lib/pgsql/data
PGDATA=/mnt/newpgdisk/data

The second solution is to simply turn SELinux off. Unless you are specifically using it for something, this is the quickest and easiest solution.

The third solution is to change the SELinux mode. Switching from "enforcing" to "permissive" will keep SELinux on, but rather than denying access, it will log the attempt and still allow it to proceed. This mode is a good way to debug things while you attempt to put in new enforcement rules or change existing ones.

The fourth solution is the most correct one, but also the most difficult: carve out an SELinux exception for the new symlink. If you move things around again, you'll need to tweak the rules again, of course.
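A sketch of that fourth option, using the example path above (postgresql_db_t is the stock SELinux file context for Postgres data; run as root, and note the guard is only there so the sketch degrades gracefully on systems without the SELinux tools):

```shell
new_pgdata=/mnt/newpgdisk/data    # the real data directory in this example
if command -v semanage >/dev/null 2>&1; then
    # record the standard Postgres file context for the new location...
    semanage fcontext -a -t postgresql_db_t "${new_pgdata}(/.*)?"
    # ...then apply the labels recursively
    restorecon -R "$new_pgdata"
fi
```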

I had a flash of inspiration to write an article about external links in the world of search engine optimization. I've created many SEO reports for End Point's clients with an emphasis on the technical aspects of search engine optimization. However, at the end of each SEO report, I always like to point out that search engine performance depends on having high-quality, fresh, relevant content, and on popularity (for example, PageRank). The number of external links to a site is a large factor in its popularity, and so external links can positively influence search engine performance.

After wrapping up a report yesterday, I wondered if the external link data that I provide to our clients is meaningful to them. What is the average response when I report, "You should get high quality external links from many diverse domains"?

So, I investigated some data of well known and less well known sites to display a spectrum of external link and PageRank data. Here is the origin of some of the less well known domains referenced in the data below:

I retrieved the PageRank from a generic PageRank tool. SEOmoz was used to collect external link counts and external linking subdomains. Finally, Yahoo Site Explorer was used to retrieve external link counts to the domain in question. I chose to examine external link counts from both SEOmoz and Yahoo Site Explorer to get a better representation of the data. SEOmoz compiles its data about once a month and does not have as many URLs indexed as Yahoo, which explains why its numbers may lag behind the Yahoo Site Explorer external link counts.

Out of curiosity, I went on to plot the PageRank data vs. the log (base 10) of the other data.

PageRank vs Log of SEOmoz external link count

PageRank vs Log of SEOmoz external linking subdomain count

PageRank vs Log of Yahoo SiteExplorer external link count

PageRank is described as a theoretical probability value on a logarithmic scale and it's based on inbound links, PageRank of inbound links, and other factors such as Google visit data, search click-through rates, etc. The true popularity rank is a rank between 1 and X, where X is equal to the total number of webpages crawled by search engine A. After pages are individually ranked between 1 and X, they are scaled logarithmically between 0 and 10.

The takeaway from this data is that when an "SEO report" advises you to "get more external links", it means:

If your site has a PageRank of < 4, getting external links on the scale of hundreds may impact your existing PageRank or popularity

If your site has a PageRank of >= 4 and < 6, getting external links on the scale of thousands may impact your existing PageRank or popularity

If your site has a PageRank of >= 6 and < 8, getting external links on the scale of tens to hundreds of thousands may impact your existing PageRank or popularity

If your site has a PageRank of >= 8, you probably are already doing something right...

Furthermore, even if a site improves external link counts, other factors will play into the PageRank algorithm. Additionally, keyword relevance and popularity play key roles in search engine results.

One of the neat tricks you can do with Bucardo is an in-place upgrade of Postgres. While it still requires application downtime, you can minimize your downtime to a very, very small window by using Bucardo. We'll work through an example below, but for the impatient, the basic process is this:

Install Bucardo and add large tables to a pushdelta sync

Copy the tables to the new server (e.g. with pg_dump)

Start up Bucardo and catch things up (e.g. copy all row changes made since step 2)

Stop your application from writing to the original database

Do a final Bucardo sync, and copy over non-replicated tables

Point the application to the new server

With this, you can migrate very large databases from one server to another (or from Postgres 8.2 to 8.4, for example) with a downtime measured in minutes, not hours or days. This is possible because Bucardo supports replicating a "pre-warmed" database - one in which most of the data is already there.

Let's test out this process, using the handy pgbench utility to create a database. We'll go from PostgreSQL 8.2 (the original database, called "A") to PostgreSQL 8.4 (the new database, called "B"). The first step is to create and populate database A:
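For example (a sketch: the database name "alpha" and port 5432 are stand-ins, and the guard simply lets the commands degrade to a message where no Postgres server is reachable):

```shell
# pgbench -i builds and populates the standard tables, including "accounts"
if command -v pgbench >/dev/null 2>&1 && createdb -p 5432 alpha 2>/dev/null; then
    pgbench -i -s 10 -p 5432 alpha
    status="initialized"
else
    status="skipped (no Postgres server on port 5432)"
fi
echo "pgbench setup: $status"
```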

Next, we add a sync named "pepper", of type pushdelta (master-slave: copy changes from the source table to the target(s)). The source is our old server, named "oldalpha" by Bucardo. The target database is our new server, named "newalpha". The only table in this sync is "accounts", and we set ping to false, which means that we do NOT create a trigger on this table to signal Bucardo that a change has been made, as we will be kicking the sync manually.

At this point, the accounts table has a trigger on it that is capturing which rows have been changed. The next step is to copy the existing table from the old database to the new database. There are many ways to do this, such as a NetApp snapshot, using ZFS, etc., but we'll use the traditional way of a slow but effective pg_dump:
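One way to do that initial copy, continuing with assumed names (database "alpha", old server on port 5432, new server on port 5433); the guard just turns this into a no-op message where no server is running:

```shell
if psql -p 5432 -l >/dev/null 2>&1; then
    # dump only the big table from the old server, restoring it
    # directly into the new cluster
    pg_dump -p 5432 -t accounts alpha | psql -p 5433 alpha
    status="accounts copied"
else
    status="skipped (no server on port 5432)"
fi
echo "$status"
```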

This can take as long as it needs to. Reads and writes can still happen against the old server, and changes can be made to the accounts tables. Once that is done, here's the situation:

The old server is still in production

The new server has a full but outdated copy of 'accounts'

The new server has empty tables for everything but 'accounts'

All changes to the accounts table on the old server are being logged.

Our next step is to start up Bucardo, and let it "catch up" the new server with all changes that have occurred since we created the sync:

bucardo_ctl start

You can keep track of how far along the sync is by tailing the log file (syslog and ./log.bucardo by default) or by checking on the sync itself:

bucardo_ctl status pepper

Once it has caught up (how long depends on how busy the accounts table is, of course), the only disparity should be any rows that have changed since the sync last ran. You can kick off the sync again if you want:

bucardo_ctl kick pepper 0

The final 0 there will allow you to see when the sync has finished.

For the final step, we'll need to move the remainder of the data over. This begins our production downtime window. First, stop the app from writing to the database (reading is okay). Next, once you've confirmed nothing is making changes to the database, make a final kick:

bucardo_ctl kick pepper 0

Next, copy over the other data that was not replicated by Bucardo. This should be small tables that will copy quickly. In our case, we can do it like this:
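With the same assumed names as above, a sketch of that final copy; the remaining pgbench tables (branches, tellers, history) are tiny, and the guard again only matters where no server is reachable:

```shell
if psql -p 5432 -l >/dev/null 2>&1; then
    pg_dump -p 5432 -t branches -t tellers -t history alpha | psql -p 5433 alpha
    status="remaining tables copied"
else
    status="skipped (no server on port 5432)"
fi
echo "$status"
```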

I was recently assigned a project, Crisis Consultation Services, that required an interesting solution. The site is essentially composed of five static pages and two dynamic components.

The first integration point required PayPal payment processing. Crisis Consultation Services links to PayPal, where payment processing is completed. Upon payment completion, the user is bounced back to a static receipt page. This integration was quite simple, as PayPal provides the exact form that must be included in the static HTML.

The second integration point required a unique solution. The service offered by the static brochure site depends on the availability and schedule of the company's employees, so service availability is entirely dynamic. The obvious solution was to add dynamic functionality through which the employees could update their availability. Some of the approaches we considered were:

Could we build an app for the employees to update the availability given the budget constraints?

Could the employees use ftp or ssh to upload a single file containing details on their availability?

Are there other dynamic tools that we could use to track the availability of the consultant such as SMS or Twitter?

Initially, we investigated using Google App Engine with a Python app that retrieved the availability information from an existing tool. To keep the budget down and try to stick with a purely static site on the server, we decided to investigate using Twitter for integration. I reviewed the Twitter API and found some code snippets for integrating Twitter via JavaScript. Below are snippets and explanations of the resulting code.

First, a script that retrieves the Twitter feed is appended to the document body. In this case, the endpoint Twitter account is pinged to get the most recent comment (count=1), and the resulting callback 'twitterAfter' is made after the JSON feed has been retrieved.

Next, the 'twitterAfter' callback function is called. The callback includes logic to determine whether the consultant is available based on the most recent Twitter message. If the datetime is in the future, the consultant is not available and will become available at that datetime. If the datetime is in the past, the consultant is available and has been since that datetime.

In both the basic and advanced callback methods above, content on the page is updated to inform users of service or consultant availability. In the application of the advanced callback method, the user is notified when the consultant will be available.

The client side Twitter integration solution fit our budget and server constraints - the functionality lives entirely on the client side, so we weren't concerned about server installation, setup, or requirements. Additionally, Twitter is such a popular app that there are many convenient ways to tweet availability from a mobile environment.

Recently, I wrote up a new class and some tests to go along with it, and I was
lazy and sloppy. My class had a fairly simple implementation (mostly a set of accessors, plus a to_s method). It looked something like this:

I had been trying to determine the essential attributes of the class (e.g., what are the minimal elements of this class? should I have a base class, then sub-class it for the various differences, or should I have only a single class that contains everything I need?)

As a result of the speculative nature of the development, my tests only included a few of the attributes.

What's wrong with that?

On the surface, there is nothing technically wrong with skipping accessor tests: after all, testing each accessor individually is really testing Ruby, not the code I wrote. Another excuse I made is that testing each individually is very non-DRY - the testing code itself has lots of duplication.

The problem is that the set of tests should be considered a contract between the class writer and the outside world. By not including the correct and complete list of accessors, I left out important information; it's a check, already signed by the class developer, but with the amount left blank.

I've seen some code solve the non-DRY-ness problem like the following:

Attributes.each do |attr|
  it "should have an accessor for #{attr}" do
    ...
  end
end

That lets the testing code be nice and compact; simply load the class, then iterate over Attributes to verify that the accessors are present.

From a tests-as-contracts standpoint, though, this approach is terrible, perhaps even worse than the original, incomplete set of tests I had written. All the reader of the tests learns is that there is an array of attributes; the reader has to look at the implementation itself to see what those attributes are.

Better is to use an anonymous array in the test, duplicating the attribute list; i.e.,

[:name, :rank, :serial_number].each do |attr|
  it "should have an accessor for #{attr}" do
    ...
  end
end

That seems to be a good balance between keeping tests as contracts yet keeping them DRY.

There are a few common ways to start processes at boot time in Red Hat Enterprise Linux 5 (and thus also CentOS 5):

Standard init scripts in /etc/init.d, which are used by all standard RPM-packaged software.

Custom commands added to the /etc/rc.local script.

@reboot cron jobs (for vixie-cron, see `man 5 crontab` -- it is not supported in some other cron implementations).

Custom standalone /etc/init.d init scripts become hard to differentiate from RPM-managed scripts (not having the separation of e.g. /usr/local vs. /usr), so in most of our hosting we've avoided those unless we're packaging software as RPMs.

rc.local and @reboot cron jobs seemed fairly equivalent, with crond starting at #90 in the boot order, and rc.local at #99. Both of those come after other system services such as Postgres & MySQL have already started.

To start up processes as various users we've typically used su - $user -c "$command" in the desired order in /etc/rc.local. This was mostly for convenience in easily seeing in one place what all would be started at boot time. However, when running under SELinux this runs processes in the init_t context which usually prevents them from working properly.

The cron @reboot jobs don't have that SELinux context problem and work fine, just as if run from a login shell, so now we're using those. Of course they have the added advantage that regular users can edit the cron jobs without system administrator intervention.
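For example, a service user's crontab entry might look like this (the script path is hypothetical):

```crontab
# run once at boot, as this crontab's user, in the user's own SELinux context
@reboot /usr/local/bin/start-myapp.sh
```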

One of the ways I like to retrieve email is to use fetchmail as a POP and IMAP client with maildrop as the local delivery agent. I prefer maildrop to Postfix, Exim, or sendmail for this because it doesn't add any headers to the messages.

The only annoyance I have had is that maildrop has a hardcoded hard timeout of 5 minutes for delivering a mail message. When downloading a very long message, such as a Git commit notification of a few hundred megabytes, or a short message with an attached file of dozens of megabytes, especially over a slow network connection, this timeout prevents the complete message from being delivered.

Confusingly, a partial message will be delivered locally without warning -- with the attachment or other long message data truncated. When fetchmail receives the error status return from maildrop, it then tries again, and given similar circumstances it suffers a similar fate. In the worst case this leads to hours of clogged tubes and many partial copies of the same email message, and no other new mail.

This maildrop hard timeout is compiled in and there is no runtime option to override it. Thus it is helpful to compile a custom build from source, specifying a different timeout at configure time. In my case, I set the timeout to be 1 day:

If you choose to configure with --without-db as I do, you need to manually remove two occurrences of makedatprog from Makefile, as makedatprog is a utility only needed by DBM and won't have been compiled. Then make install as root and edit your ~/.fetchmailrc lines, adding mda "/usr/local/bin/maildrop", and restart the fetchmail daemon.

Long messages will still take a long time to deliver over a slow link, but they will at least be allowed to eventually finish this way.

We're big fans of Test Driven Development (TDD). However, a
co-worker and I encountered some obstacles
because we focused too intently on writing tests and didn't spend
enough up-front time on good, old-fashioned specifications.

We initially discussed the new system (which is a publish/subscribe
interface used to do event
management for a reasonably large system, which totals around 70K lines
of Ruby). My co-worker did most of the design and put a high-level
one-pager together to outline how things should work, wrote unit tests
and a skeleton set of classes and modules, then handed the project to
me to implement.

So far, so good. All I had to do was make
all of the tests pass, and we were finished.

We only had unit tests, no integration tests, so there was no
guarantee that once I was done coding, that the integration work would
actually solve the problem at hand. In Testing (i.e., the academic
discipline that studies testing), this is referred to as a
validation problem: we may have
a repeatable, accurate measure, but it's measuring the wrong thing.

We knew that was a weakness, but we pressed ahead anyway,
expecting to tackle that later. As an example, we identified 3
different uses of this publish/subscribe event management mechanism
that had wildly different use cases. When we discussed these
with the customer,
he clarified that one of the use cases is needed in the immediate term,
one is useful in the short term, and that the third is out of scope.
Getting that information was helpful in keeping us on track, and not
having the scope grow unmanageably.

Tests are code and no code (of sufficient size and complexity) is bug-free;
thus, tests have bugs.

When tests are the only spec, what is the best way to proceed?

The developer can assume tests are correct and Make The Tests Pass; clearly
that is not always the best approach. It's better for the developer
to exercise judgement and fix obvious errors. However, the developer's
judgement can be wrong, so the test writer needs
to pay special attention to any changes to the tests (and the need
to catch problems implies that the test designer and developer need
to be in tight communication -- don't just hand your co-worker your tests
and then go on a long vacation).

Sometimes tests aren't buggy, per se, but they may be ambiguous. Variable names may not communicate clearly. Tests may be too large and not clearly test one thing, or too broad, leaving the intended design parameters unspecified.

In our recent experience, one test had the following code (slightly
altered):

What's wrong with this? On the surface, nothing is wrong, until the
bigger picture is viewed: there is no other
mention of a listener anywhere else in the tests, high-level
design document, or code. Is a listener a subscriber? Should there be
a separate listener class somewhere? After all, mock 'foo'
often means that there should be a foo object. Perhaps the
test developer forgot to include a file (or the developer overlooked it).

What actually transpired is not so mundane, but it identified a
very different approach in testing. My colleague made the observation
that it doesn't matter if a listener is a subscriber or not for this
particular test, as it's really only a syntactic placeholder: we could
do a variable renaming for listener and that should not
change the meaning of the code.

While his observation is true and correct, it ignores Abelson
and Sussman's viewpoint that "programs must be written for other humans
to read, and only incidentally for machines to execute." As the
implementer behind the pseudo-Chinese-wall of his tests, I expected the
tests to tell me how the universe of this system should be constructed,
and the mention of a listener communicated something other than
the intended message.

Sometimes, even unit tests require extensive setup, and it can be
tempting to add in extra tests and checks which don't add a lot of
value but instead make the intent unclear, make the tests themselves
less DRY, and give yet another opportunity to introduce bugs. One
example looked something like:

Note that the purpose of the test is "Creating a subscriber entry with a method name should create a block that invokes the method name", yet the test checks the size of the subscriber entry, verifies that the event is received, and checks that the callback itself is stored via a weak reference so that it can be garbage collected. Each of those should be in a separate test. In fact, the stated goal of the test is only implicitly checked. Better is something like

which only tests that the named callback is invoked, properly setting a
sentinel value.

If tests are being used as the specification, then they can hide important details. A simple example is the type of storage to use for a particular set of values (for us, it was callbacks). Should they be in an array? A hash? A hash of arrays? Something else?

How to handle this is a little more tricky -- implementation details like this could arguably not be part of a set of tests, as the behavior is the driver, not implementation details. However, without a specification, or a design document that outlines what kinds of performance characteristics we should aim for here, the implementer has to make choices, and those choices are not necessarily what the test writer would have wanted.

There is no one right answer there: for those who want to only use tests, then the tests need to be complete and cover the implementation details. Of course, this means that if a future scaling problem requires a change in data structures, then the test will also need to be ported to the new architecture. If specifications or design documents are used, that can speed the implementer's work, but leaves open some questions of correctness (e.g., did the implementer use the right architecture in the right way).

We solved these problems (and more) in true Agile fashion: through good communication among the customer, the test designer, and the developer. But this experience reinforced for us that tests alone are insufficient, and good communication needs to be maintained throughout the development process.

Once upon a time there were still people using browsers that only supported SSLv2. It's been a long time since those browsers were current, but when running an ecommerce site you typically want to support as many users as you possibly can, so you support old stuff much longer than most people still need it.

At least 4 years ago, people began to discuss disabling SSLv2 entirely due to fundamental security flaws. See the Debian and GnuTLS discussions, and this blog post about PCI's stance on SSLv2, for example.

To politely alert people using those older browsers, yet still refusing to transport confidential information over the insecure SSLv2 and with ciphers weaker than 128 bits, we used an Apache configuration such as this:
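A sketch of such a setup, using mod_ssl's SSLRequire with the SSL_PROTOCOL and SSL_CIPHER_USEKEYSIZE variables (the directory path and error page name are placeholders, not the original site's configuration):

```apache
# Still accept the SSLv2 handshake, but refuse to serve real content
# to SSLv2 or weak-cipher connections, showing an upgrade notice instead
<Directory "/var/www/html">
    SSLRequire %{SSL_PROTOCOL} != "SSLv2" and %{SSL_CIPHER_USEKEYSIZE} >= 128
    ErrorDocument 403 /upgrade-your-browser.html
</Directory>
```

In practice you would also exempt the error page itself (e.g. with a <Files> override) so that weak clients can actually load it.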

That accepts their SSLv2 connection, but displays an error page explaining the problem and suggesting some links to free modern browsers they can upgrade to in order to use the secure part of the website in question.

Recently we've decided to drop that extra fuss and block SSLv2 entirely with Apache configuration such as this:
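For the strict approach, stock mod_ssl directives suffice; a minimal sketch:

```apache
# Refuse the SSLv2 protocol outright and drop weak/export ciphers
SSLProtocol all -SSLv2
SSLCipherSuite ALL:!aNULL:!eNULL:!EXP:!LOW
```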

The downside of that is that the SSL connection won't be allowed at all, and the browser doesn't give any indication of why or what the user should do. They would simply stare at a blank screen and presumably go away frustrated. Because of that we long considered the more polite handling shown above to be superior.

But recently, after having completely disabled SSLv2 on several sites we manage, we have gotten zero complaints from customers. Doing this also makes PCI and other security audits much simpler because SSLv2 and weak ciphers are simply not allowed at all and don't raise audit warnings.

So at long last I think we can consider SSLv2 dead, at least in our corner of the Internet!

I ran into, and found solutions for, two major gotchas targeting IE 8 with a jQuery-based (and rather JavaScript-heavy) web application.

First is to specify the 'IE 8 Standard' rendering mode by adding the following meta tag:
<meta http-equiv="X-UA-Compatible" content="IE=8">

The default rendering mode is rather glitchy and tends to produce all sorts of garbage from 'clean' HTML and JavaScript. The result renders slightly different sizes, reports incorrect values from common jQuery calls, etc.

The default rendering also caused various layout issues (CSS handling looked more like IE 6 than IE 7). Also, minor markup errors (such as an extra stray tag on one panel) caused the entire panel not to render.

Another issue is that the browser is overly lazy about invalidating the cache for AJAX-pulled content, especially (X)HTML. This means that though you think you're pulling current data, in reality it keeps feeding you the same old data. It also means that if you use the same exact URL for HTML and JSON data, you must add a parameter to avoid cache collisions. Only a 'Cache-Control: no-cache' response header seemed to make IE 8 behave properly.

On the other side, I've got a big thumbs up for jQuery. I was able to produce a skinned, fairly 'heavy' client-side application that works equally well (and looks almost the same) on Firefox, Chrome, Safari, and now IE 8.