
A while ago (years), I reluctantly set up ntp on some servers using an IP address for the time source; at the time, using a DNS name in ntp.conf wasn’t supported by the installed ntp/ntpd version and I didn’t want to go out of my way to compile it from scratch.

Today, several years later, I realized that I’ve been slowly getting bit in the butt.
Back then, those IP addresses were supposed to be rock-solid NTP references. But over the years they dropped away, and about a month ago the last of them went offline.

I would not have caught the drift (literally) if it weren’t for pt-heartbeat (formerly mk-heartbeat) and a review of our Cacti graphs.

This is what pt-heartbeat will show when your NTP service isn’t working

Usually I check them every Monday as a routine, but I’ve been so busy for the last several months that I haven’t had time to do that. I figured if it hit the fan, our alerts/thresholds would let me know – which, on a few occasions, worked as needed for an Apache server.

pt-heartbeat, a tool from Percona Toolkit, uses a common table across replicated servers; each server updates a record in that table with a datetime value.

The time difference between the value for server A and its replicated copy on server B is the ‘lag’. Lag can and should only come from temporary spikes in traffic (or intentional delaying). Needless to say, my gut sank when I saw that something weird was going on: a small, yet unrecoverable, linearly increasing lag. After SHOW SLAVE STATUS quickly confirmed that everything was up to date, it was apparent that the mechanisms involved in the graph generation were at fault. A quick side-by-side examination of server A’s and server B’s ‘heartbeat’ tables made it stand out that the times were off from each other by a few seconds.

I restarted the pt-heartbeat daemons and the issue persisted – the next culprit was simple to identify: ntp.

Output of ntpq -p quickly showed that the ntp server hadn’t synchronized for over a month.

Over the years, periodic apt upgrades had brought a newer version of ntp through the pipeline, so all that was needed was an update of ntp.conf to use a hostname (I opted for the stratum 1 server ‘time.nist.gov’).
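The change itself is a one-liner in ntp.conf (the server line below reflects my choice; any reliable pool or stratum-1 host works):

```
# /etc/ntp.conf -- use a hostname instead of a hard-coded IP
server time.nist.gov iburst
```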

Having recently been bitten by the awful default value (10) for max_connect_errors on a production server, I’m having a very hard time coming to terms with who the heck thought this would be a good way to do it.

This type of “feature” allows you to effectively DoS yourself quickly with just one misconfigured script – or Debian’s stupid debian-sys-maint account not replicating.

I’ve been thinking about how I could avoid this scenario in the future – upping the limit was a no-brainer. But another item of curiosity: how do I know what hosts are on the list?
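Both halves have an answer in SQL. Raising the limit is a SET GLOBAL away, and on MySQL 5.6+ / recent Percona Server the list itself is exposed through performance_schema (older versions only let you clear it with FLUSH HOSTS); a sketch:

```sql
-- Raise the ceiling well above the default of 10
SET GLOBAL max_connect_errors = 100000;

-- MySQL 5.6+: see which hosts are being counted, and how close they are
SELECT HOST, SUM_CONNECT_ERRORS
  FROM performance_schema.host_cache;

-- The blunt instrument: empty the cache (unblocks everything)
FLUSH HOSTS;
```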

I’ve recently worked on a customized emailing suite for a client that involves bulk email (shudder) and thought I’d do a write-up on a few things that I thought were slick.

Originally we decided to use AWS SES but were quickly kicked off of the service because my client doesn’t have clean enough email lists (a story for another day on that).

The requirement for the email suite was pretty much the same as what you’d expect from SendGrid, except the solution was a bit more catered toward my client. Dare I say I came up with an AWESOME way of dealing with creating templates? Trade secret.

Anyways – when the dust settled after the initial trials of the system, we were without a way to deliver bulk emails and track the SMTP/email bounces. After scouring for pre-canned solutions, there wasn’t a whole lot to pick from: some convoluted modifications for postfix and other knick-knacks that didn’t lend themselves to tracking the bounces effectively (or weren’t implementable in a sane amount of time).

Getting at the bounces…

At this point in the game, knowing what bounced can come from only one place: postfix’s maillog. Postfix is kind enough to record the Delivery Status Notification (‘DSN’) in an easily parsable way.

Pairing a bounce with database entries…

The application architecture called for very atomic analytics, so every email that’s sent is recorded in a database with a generated ID. This ID links click events and open reports on a per-email basis (run-of-the-mill email tracking). To make the log entries sent from the application distinguishable, I changed the message-id header to the generated SHA1 ID – this lets me pick out which message IDs are from the application, and which ones are from other sources in the log.

There’s one big problem though:

Postfix uses its own queue IDs to track emails – the first 10 hex digits of the log entry. So we have to perform a lookup as such:
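As an illustration, the two-pass lookup looks something like this (the log lines here are invented for illustration, shaped after the kv/grok patterns in the Logstash config below; real postfix lines carry more fields):

```shell
# Hypothetical maillog excerpt: the cleanup line carries our message-id,
# a later line carries the DSN, keyed only by postfix's queue ID.
cat > /tmp/maillog.sample <<'EOF'
May  1 10:00:00 mx postfix/cleanup[100]: 3C1A2B3D4E: message-id=<da39a3ee5e6b4b0d3255bfef95601890afd80709@dom>
May  1 10:00:05 mx postfix/smtp[101]: 3C1A2B3D4E: to=<user@example.com>, dsn=5.1.1, status=bounced (unknown user)
EOF

# Pass 1: find postfix's queue ID for our application-generated message-id
QID=$(grep 'message-id=<da39a3ee5e6b4b0d3255bfef95601890afd80709@dom>' /tmp/maillog.sample \
      | grep -oE '[0-9A-F]{10}' | head -1)
echo "$QID"

# Pass 2: find the DSN recorded for that queue ID (if the bounce arrived yet)
grep "$QID" /tmp/maillog.sample | grep -oE 'dsn=[0-9.]+'
```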

This is a problem because we can’t do both lookups at the same time. The time when a DSN comes in is variable – naively, this would lead to a LOT of grepping and file processing: one pass to get the queue ID, another to find the DSN – if it’s been recorded yet.

We use Logstash where I work for log delivery from our web applications. From my experience with Logstash, I knew it had exactly what I was looking for: progressive tailing of logs and so many built-in inputs, outputs and filters for this kind of work that it was a no-brainer.

So I set up Logstash and three stupid-simple scripts to implement the plan.

Hopefully it’s self-explanatory what the scripts take for input – and where they’re putting that input (see the holding-tank table above):

```
LogDSN.php      %{QID} %{dsn}
LogOutbound.php %{QID} %{message-id}
```

Setting up logstash, logical flow:

Logstash has a handy notion of ‘tags’, where you can have the system’s elements (input, output, filter) act on data fragments that were tagged when they matched a criterion.

Full config file (it’s up to you to look at the logstash docs to determine what directives like grep, kv and grok are doing)

```
input {
  file {
    format => "plain"
    path => "/var/log/maillog"
    type => "maillog"
  }
}

filter {
  kv {
    type => "maillog"
    trim => "<>"
  }
  grep {
    type => "maillog"
    match => ["status", "bounced"]
    add_tag => ["bounce"]
    drop => false
  }
  grep {
    type => "maillog"
    match => ["message-id", "[0-9a-f]{40}\@dom"]
    add_tag => ["send"]
    drop => false
  }
  grok {
    type => "maillog"
    pattern => "%{SYSLOGBASE} (?<QID>[0-9A-F]{10}): %{GREEDYDATA:message}"
  }
}

output {
  exec {
    tags => "bounce"
    command => "php -f /path/to/LogDSN.php %{QID} %{dsn} &"
  }
  exec {
    tags => "send"
    command => "php -f /path/to/LogOutbound.php %{QID} %{message-id} &"
  }
}
```

Then the third component – the result recorder – runs on a cron schedule and simply queries records where the message ID and DSN are not null. The mere presence of the message ID indicates the email was sent from the application; the DSN means there’s something for the result recorder script to act on.
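The query itself amounts to something like this (table and column names are hypothetical):

```sql
-- Rows where both halves have arrived: sent by the app AND bounced
SELECT message_id, dsn
  FROM holding_tank
 WHERE message_id IS NOT NULL
   AND dsn IS NOT NULL;
```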

* The way I implemented this would change depending on the scale – but for this particular environment, executing scripts instead of pushing messages into an AMQP channel or the like was acceptable.

Where I work we have unfortunately had to skip the 5.4 release of PHP; the release cycle between PHP 5.4 and PHP 5.5 was pretty darn fast and we never got around to replacing APC. We’ve finally got everything up to speed to adopt 5.5 when it hits stable release.

I figured I’d fill in some of the blank air by listing (even if a personal memo for myself) the things I find most exciting for the upcoming PHP 5.5 release.

Built-in opcode cache and optimizer (Zend Optimizer+). This is a biggie: APC never saw the light of day in PHP 5.4, and I suspect many are in the same position with APC as we are…

array_column()

Observation, or a gripe/complaint/whatever:

I’m not quite sure about the password_hash ‘suite’ of functionality – it’s not clear what finally made it in; I’d assume everything? People that don’t understand hashing and encryption are probably going to be confused a bit more than they were before, unless this addition is advocated heavily in the documentation of its counterparts (e.g. the md5 documentation page).

I understand that trivializing the process is beneficial to avoid self-inflicted damage; however, I always get a tad annoyed when we see things labeled as ‘STRONG ENCRYPTION’, since that term is a moving target. I long to see better implementations and standards recognition, e.g. a FIPS level.

Honorable mention

The generators addition gets an honorable mention – I think it will take some time to trend, and the use cases are probably fairly narrow: saving time over writing your own iterators.

Monolog is perhaps the most popular logging library out there for PHP at the time of this writing. It has a lot of support and a nice balance of features.

Unfortunately I have one gripe to make about the rather closed implementation of the SocketHandler, er, handler?

The problem with the Monolog SocketHandler (as of 1.4.1)

The key problem is that Monolog treats the handler as complete on the assumption that, say, a TCP port can be connected to. So you can set up your chain of handlers and processors, but for a critical application such as logging, the SocketHandler simply lets itself into the logger object without testing whether it can actually make a TCP connection.

The problem is there’s no pragmatic way to test whether a SocketHandler object can connect: there’s only an isConnected() helper method – no canConnect() or similar.

The solution…

The solution is a bit less pragmatic than I wanted, because the SocketHandler class keeps its key method, connect(), private. Thankfully the class has a ‘mock’ method to at least attempt a socket connection without affecting the state of the object. We can use that to probe whether the socket can be opened, and add the handler to our logger object accordingly.

Example

```php
<?php

class SocketHandler extends \Monolog\Handler\SocketHandler
{
    public function canConnect()
    {
        if ($this->isConnected()) {
            return true;
        }
        // Probe the socket without changing the handler's state
        if (($probe = $this->fsockopen()) && is_resource($probe)) {
            fclose($probe);
            return true;
        }
        return false;
    }
}
```

Then, you can have a little bit more assurance that you’ll get your logging to go through the handler without a nasty exception…

```php
<?php

$logger = new Logger(
    'itn',
    array(
        new StreamHandler('/path/to/logfile')
    )
);

$socketHandler = new SocketHandler('tcp://127.0.0.1:7065');

if ($socketHandler->canConnect()) {
    $logger->pushHandler($socketHandler);
} else {
    $logger->alert('SocketHandler connection failed.');
}
```

A final word…

Even with a check like this, the SocketHandler is still not immune to losing connectivity to the target socket later on. Sadly, it would be better practice to write your own logger wrapper that safely handles logging to both a file and the SocketHandler. Perhaps sometime in the future the SocketHandler can be revised to (optionally) suppress connection failures.

I’ve been saying this for a long time now: Google can’t be trusted. It’s becoming commonplace for other blogs to talk about having an exit strategy. I’ve been planning one for a while (starting with the removal of my blog from Blogspot).

The Google products I’ve unfortunately come to rely on:

Gmail (personal, and business)

Calendar

Drive

Reader

Analytics

Google Charts API

Google Web Fonts API

There’s something to be said about the flighty nature of these ‘free’ services. While Google and Yahoo have very long track records for their services, I don’t care for Yahoo’s UI, and Google is obviously the reason this is all in question anyways. Lucky for me, I have -very- few sites utilizing Google APIs, because I’ve seen this coming from a mile away. That said, replacing Google Charts will be a bear, because graphs are a PITA to code.

Gmail, Google Calendar, Reader:

To replace Gmail, Google Calendar and Google Reader, I will be using Microsoft Hosted Exchange. At $4/mo I get push email to my phone from multiple domains and aliases (which is how my email is structured anyways), peace of mind that Microsoft isn’t using my email to target advertisements at me, and Outlook (admit it, Outlook is superior at what it does). Outlook supports RSS feeds, and I don’t need to go into detail on what Exchange offers for sharing, etc.

Drive:

This one is easy, because I naturally do this anyways: Amazon S3. Dirt cheap for small file management; I don’t need the web UI to edit files, and if I did, I would upgrade to the $6/mo plan for Live 365 for online office. There’s a myriad of programs that make Amazon S3 part of your workflow: Firefox’s “S3Fox” plugin and the Cyberduck app make it painless.

Analytics:

I’m not worried, because free analytics apps are a dime a dozen and do just as good a job. Don’t believe me? Look!

Google charts API:

Time to man up and buy a Highcharts license. If you do web design you can build that price in per client, or bite the bullet and get a developer license. Remember: how much is it worth NOT having to go crawling to your client, telling them their graphs won’t work anymore because YOU used a Google API that may be retired god-knows-when?

Google web fonts API:

To be honest, the only reason this is in use is pure laziness – I’m too lazy to DOWNLOAD the stupid .woff files and host them myself. No real issue here, just some busy footwork to remove the dependency.

I came up with a cool usage for the zebra-stripe admin tool. MySQL lets you set a custom pager for MySQL CLI output, so one can simply set it to the zebra-stripe tool and get the benefit of alternated rows for better visual clarity.

Something like ‘pager /path/to/zebra’ at the mysql prompt should yield results like the image below.
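For reference, a zebra-stripe pager can be as small as an awk one-liner. This is my own sketch of the idea, not necessarily the tool pictured below: it stripes every second line with an ANSI background color.

```shell
# zebra: stripe every second input line with an ANSI background color.
# (Sketch only; the actual tool referenced above may skip a different
# number of lines or use different colors.)
zebra() {
  awk '{ if (NR % 2 == 0) printf "\033[47;30m%s\033[0m\n", $0; else print $0 }'
}

# Example: stripe three rows of output
printf 'row1\nrow2\nrow3\n' | zebra
```

Saved as an executable script, `pager /path/to/zebra` inside the mysql client pipes every result set through it.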

Zebra stripe tool used as a MySQL pager

You can always adjust the script to skip more lines before highlighting; if you’re savvy with the color codes, you can also modify it to set just the font color instead of the entire background (which may be preferable, but isn’t a ‘global’ solution, so the script doesn’t do it).

The problem with the Atlassian Fisheye starter license:

I love using Atlassian Fisheye at work. It’s a very nice frill to have for a small team, especially since it saves us time and adds a very easy, fast way to document reviews and be open about feedback.

I have one gripe however: the 10-committer limit (5 repos is bad enough). Our team has 4 developers – so we’re _technically_ 4 committers.

When we first started to use source control (Mercurial), our system setups had inconsistencies in usernames: “Justin Rovang”, “RovangJu” and “rovangju” are all treated as unique usernames. Add to that the fact that after we converted from Hg to Git, all of the email addresses associated with those turned into <devnull@localhost> courtesy of the conversion script.

Git keys unique users on username AND email address. So our new setups would be ‘Justin Rovang <justin.rovang@domain.com>’, while the converted history has ‘Justin Rovang <devnull@localhost>’. It’s easy to see how quickly even a small team could exceed that 10-committer limit in that circumstance.

So here’s the rundown. First you need to know what maps to what, so take an inventory of all of the incorrect/outdated usernames that should point to a more modern/recent one; to do that I used this one-liner:
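A one-liner along these lines produces that inventory (a sketch, demonstrated against a throwaway repo with invented identities; requires git):

```shell
# Build a throwaway repo with two spellings of the same author (illustrative)
repo=$(mktemp -d) && cd "$repo" && git init -q
git -c user.name='rovangju' -c user.email='devnull@localhost' \
    commit -q --allow-empty -m 'old identity'
git -c user.name='Justin Rovang' -c user.email='justin@example.com' \
    commit -q --allow-empty -m 'new identity'

# The inventory: every distinct author identity, with commit counts
git log --format='%aN <%aE>' | sort | uniq -c | sort -rn
```

Because %aN/%aE are mailmap-aware, re-running the same command after adding a .mailmap shows the consolidated list.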

I want to map those according to the page linked in the subtitle above; so here’s an example .mailmap entry:

```
Justin Rovang <Justin.Rovang@domain.com> rovangju <devnull@localhost>
Justin Rovang <Justin.Rovang@domain.com> Rovangju <devnull@localhost>
```

You can verify the results by running the command above again (git log --format …, etc.); you’ll see that the list has changed. This applies to -ALL- git log output, and therefore fixes the ‘10 committers’ issue I was having with Atlassian Fisheye and Crucible.

You will lose data if you don’t account for ALL of your data’s ranges up front (e.g. with a MAXVALUE catch-all partition). The reason is how online schema-change tools work: they create a ‘shadow copy’ of the table, set up a trigger and re-fill the table contents in chunks.

If you try to manually apply partitions to a table that has data from years ranging from 2005 to 2012 (and up), say something like:
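A hypothetical sketch of such a statement (table and column names invented), with the range stopping short of 2012 and no MAXVALUE catch-all:

```sql
ALTER TABLE metrics
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2005 VALUES LESS THAN (2006),
    PARTITION p2006 VALUES LESS THAN (2007),
    PARTITION p2007 VALUES LESS THAN (2008),
    PARTITION p2008 VALUES LESS THAN (2009),
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION p2011 VALUES LESS THAN (2012)
);
```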

If the table has data from 2012 or above, MySQL will give you this safety error: (errno 1526): Table has no partition for value 2012 – this is a safeguard!

Now, if you use a hot schema-change tool, that safeguard isn’t in place: the copying process will happily rebuild the table with the new partitions and completely discard everything not covered by the partition range!

Meaning, run through such a tool, the very same change will finish successfully and silently discard all data from 2012 onward.

Wait, PHP wants to array_merge an array with… itself?

Take another look at this: array_merge(array $a [, array …])

If you’re good at reading APIs, you’ll see how … odd this is. Seeing as I just got nipped in the butt by forgetting to pass another array to merge into, it’s curious why the hell it doesn’t enforce a minimum of two arguments… any guesses? Or should we chalk this up as a valid, non-nit-picky pitfall of PHP? Otherwise, what are you merging into? Doesn’t make sense…

I’ve been suckered in by the awesome jQuery Sparkline plugin – I won’t go over how it works or what it does, but rather give a quick fix for issues with how the output deals with browser resizing (long story short: it doesn’t, by default).

Here’s a visual of the issue I’m talking about with responsive/resizing layouts

With the sketchy nature of Google, I’m starting to remove as many dependencies on them as I can, starting with my blog. Part of it is just a technical test to find out how hard it would be to wean off of the free service.

Overall, there are few, if any, alternatives to WordPress that will let you import your blog from Blogger with such success. There are a few caveats I feel are worth mentioning, however…

Recently I migrated a customer that’s been on Rackspace for 6 years, and paying a handsome penny for it at that. The migration was to Amazon Web Services (AWS), and I sent a friendly reminder to the client to cancel the RS account (9 days in advance of the renewal date).

Here’s how things went down:

RS: “We require a 30-day written notice to cancel your account”.

This is on a host that is on a month-to-month basis and whose costs have been on an incline. In fact, the cost was to go up $10/month next month. (Suffice it to say, that’s not much compared to the overall monthly bill.)

So I’m thinking to myself: well, that’s a crappy “policy”. I give them a ring on behalf of my client to see if we can leverage some flexibility. I simply ask them to waive the 30-day ‘penalty’ – not even a pro-rate for the unused days of the month.

The RS rep is quick to tell me how many people call who dislike that policy and try to get somewhere with it, but they stick to it. At this point I’m thinking: wow, this will be a little challenge.

I explained that the cost was going up, which wasn’t agreeable, and therefore there was good cause to waive that type of penalty. No go.

We go in circles a bit on customer service – I blab a bit about how the competition (Linode, AWS, etc.) lets me do more than they offer, for cheaper, and without a penalty. I also note that it’s odd: in most circumstances you’d get a counter-offer from a retention specialist (“you know, we’ll knock 10% off your hosting”). Still a kind of nod-and-smile, go-screw-yourself attitude from this rep.

Then he says, “We value your feedback and it helps us become better.”
I respond: “OH REALLY? You start the call by telling me how many pissed-off people are calling about this policy, try to stick it to me as well, and then give me a line like that?”

Enough is enough – I ask for a manager and explain that I understand if he “has to say that”, but this is an unacceptable situation.

The manager’s response: they won’t deal with me; they’ll only talk to my client. (Whom they, in turn, told to hose off in the same manner.)

All I can say is this:

Rackspace prices aren’t that hot. Look elsewhere.

The fanatical support thing is cute, but the customer service was pure garbage in the above context. I’ve never been treated like that; I’ve had better luck with credit card companies and landlords than this.

There’s this 30-day thing (Beware!)

If they give you support, they’ll want the root password for your server.

Their SLA is a lie: it reports on a 30-minute interval, which means they can be down for 29 minutes at a time and not record it as downtime.

Their backup system on dedicated hosting is a bloated, untamed mess if you let them manage it; they let the ‘rack’ account on my client’s server exceed 60GB of crap that should have been cleaned up, e.g. backup software updates and provisioning/monitoring tools.

The typical responses to these new ‘hipster’ systems are usually transaction/consistency centric, as that’s where the RDBMS systems shine: they can perform wonderfully while being ACID compliant.

Or, in the case of Node, refuting the ‘Apache doesn’t have concurrency, Node is better’ arguments. I have a hunch 99% of the Node fanboys don’t have a damn clue how capable Apache itself is.

There are also things about Node.js that rub seasoned people the wrong way. Perhaps it’s the sensationalism without actually proving anything (check the first few comments), or the utter lack of security focus (that’s what bugs me). I also think it has to do with their approach to entering the market: guns blazing, criticizing other solutions and hoisting their own as THE single option, with more tenacity than is appropriate for such an immature project. Guys in the trenches can’t stand that crap; we know it’s just another tool to get the (or ‘a’) job done in a particular scenario.

But really, I think about how much time is wasted going back and forth on these subjects, so let’s stop wasting time. Be open-minded to the new technologies as tools for a particular job and stop making all-or-nothing stories out of future tech. Like it or not, we all have to share the same space.

In a clear effort to push Android and Chrome, Google is discontinuing iGoogle in November 2013. This announcement comes as an early 4th of July surprise from Google.

It’s getting really hard to trust Google, with how they bait and switch and kill projects I know are more popular than they even state.

iGoogle is hardly stale – they’ve recently dumped effort into a redesign – and this reverberates yet again how volatile things are. I hope I haven’t made a crucial mistake in using Gmail, and in instructing clients to use Google for business mail.

People who use iGoogle use it as the homepage for their browser. Am I to believe Google just wants to toss out that untapped advertisement revenue (which they never tapped)? That’s how I know it’s a play for Chrome. And unfortunately iGoogle cannot be replaced by all the widget and gadget crap that you can install into Chrome – functionally, yes, but losing a bird’s-eye view of many vectors of information on one page (a dashboard) is a very different deal.

At this point there’s no way I could possibly trust an app infrastructure to Google (given their pricing-change history). I’m at a new level of paranoia: how long until Google kills their Web Fonts service? Google Web Toolkit (GWT)? The Charts API? Blogger?

I had CPU consumption alerts fire off for ALL of my AWS instances running Percona Server (MySQL).

I couldn’t for the life of me figure it out – I re-mounted all but the root EBS volumes, restarted the services and ensured there was no legitimate load on any of them, yet MySQL stuck at 100% consumption of a single core on all of the machines. I tried everything: lsof, netstat, SHOW PROCESSLIST, re-mounting the data partitions.

WEIRD. It turns out, however, that AWS was not the cause of this one, even in light of the recent AZ issue in us-east-1.

These servers had many cores, so everything was still running fine (and it was the weekend). Bad things seem to happen at the least opportune times – like when you’re out on a date with the wife and the baby is with grandma…

At any rate, it’s amusing to watch the haters who clearly don’t understand the concept of AWS Availability Zones (AZs) scoff and huff about how they run a better data center out of their camper than AWS.

If anything, I want to see a good root cause analysis and maybe some restitution for the trouble.