devbox:~$ iptables -A OUTPUT -j DROP

So let’s say you’re running an aging version of Amazon Linux and don’t want to blow up your system by wedging in yum repos from distributions that aren’t quite in line with the CentOS-derived Amazon Linux. Instructions on the web call on CentOS users to use Fedora or RHEL yum repos; but on Amazon Linux, you’re kind of twice-removed.

So long-story short, here’s some fodder for those who want the benefits of LetsEncrypt without the fluff of a repo.

My instructions will be for Apache/HTTPD, but you’ll see the key linchpin item below.

First, start by downloading Certbot by hand:

Download Certbot:

cd /usr/local/bin
wget https://dl.eff.org/certbot-auto
chmod a+x certbot-auto

Second, back up your Apache HTTPD configuration:

Back up HTTPD:

tar -czvf /root/httpd-backup.tar.gz /etc/httpd

Third, test certbot-auto and let it ‘bootstrap’ dependencies (an error is likely here):
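On Amazon Linux, this step is also where the linchpin comes in: certbot-auto treats Amazon Linux as an "experimental" OS and, as far as I recall, refuses to run without the --debug flag. A hedged sketch (the domain is a placeholder):

```shell
# First run: bootstraps python and friends via yum, then talks to LetsEncrypt.
# --debug is the linchpin on Amazon Linux (flagged as experimental);
# -d example.com is a placeholder for your own vhost.
/usr/local/bin/certbot-auto --apache --debug -d example.com
```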

The most frequently asked question for ElasticSearch and security is “how do I require login”?

Once you’ve answered and implemented the answer to that question, a larger, truly more troublesome issue looms. The same principles used to secure ElasticSearch (typically a proxy fronted by Apache/nginx) rely on various auth techniques. If you know what you’re doing, you have different endpoints in that proxy for controlling who can do GET/POST/DELETE requests, possibly pre-determining the index and type.

While reading through the documentation, I was surprised; no, SHOCKED to see that ElasticSearch ships with a security flaw as severe as remote code execution as an intentional feature: the dynamic scripting component of a query body payload.

If you’re responsible for running ElasticSearch servers…

You must examine how queries are sent into the server. If your web developers are sending near-verbatim DSL queries to ElasticSearch without any further filtration except auth and index constriction, please read this.

A malicious person could modify the payload to read files directly off of your filesystem and serve them up through ES.

The article contains a proof of concept (POC) link – simply download and modify the file and point it to your ES server and see if you’re vulnerable.

I think in most cases, dynamic remote scripting isn’t a big deal to turn off. So I’d strongly suggest following the advice on this page:
script.disable_dynamic: true

Stay tuned for Part 2 of a more obsessive approach to securing ElasticSearch for use on public search interfaces.

A while ago (years), I reluctantly set up ntp on some servers using an IP address for the source server; at the time, using a DNS name in ntp.conf was incompatible with the ntp/ntpd version and I didn’t want to go out of my way to compile it from scratch.

Today, I realized that I’ve been slowly getting bit in the butt, several years later.
Back then, the IP addresses were supposed to be rock solid ntp references. But over the years, and finally about a month ago, they all came offline.

I would not have caught the drift if it wasn’t for my use of pt-heartbeat (mk-heartbeat) and doing a review of our cacti graphs.

This is what pt-heartbeat will show when your NTP service isn’t working

Usually I check them every Monday as a routine, but I’ve been so busy for the last several months that I haven’t had time to do that. I figured if it hits the fan, our alerts/thresholds will let me know – which, on a few occasions, worked as needed for an Apache server.

pt-heartbeat, a tool of the Percona Toolkit, maintains a common table across replicated servers; each server updates a record in that table with a datetime value.

The time difference between the value for server A and its replicated copy on server B is the ‘lag’. The lag can/should be due to temporary spikes in traffic (or intentional delaying). Needless to say, my gut sank when I saw that something weird was going on, causing a small, yet unrecoverable and linearly increasing lag time. After quickly confirming via SHOW SLAVE STATUS that everything was up to date, it was quickly apparent that the mechanisms involved in the graph generation were at fault. A quick side-by-side examination of server A’s and server B’s ‘heartbeat’ tables showed that the times were off by a few seconds from each other.

I restarted the pt-heartbeat daemons and the issue persisted – the next culprit was simple to identify: ntp.

Output of ntpq -p quickly showed that the ntp server hadn’t synchronized for over a month.

Over the years, through periodic apt upgrades, the newer version of ntp came through the pipeline, and all that was needed was an update of ntp.conf to use a new host (I opted for the stratum 1 ‘time.nist.gov’).
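For reference, the fix amounts to one line in ntp.conf plus a restart; the hostname below is the one I chose, and the sanity check is ntpq again:

```shell
# In /etc/ntp.conf, replace the dead IP-based entries with:
#   server time.nist.gov iburst
# Then restart (service is 'ntp' on Debian-family, 'ntpd' on RH-family):
service ntp restart
ntpq -p   # a '*' next to the peer means we're synchronized again
```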

Having recently been bitten by the awful default value (10) for max_connect_errors on a production server – I’m having a very hard time coming to terms with who the heck thought this would be a good way to do it.

This type of “feature” allows you to effectively DoS yourself quickly with just one misconfigured script – or Debian’s stupid debian-sys-maint account not replicating.

I’ve been thinking about how I could avoid this scenario in the future – upping the limit was a no brainer. But another item of curiosity: How do I know what hosts are on the list?
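As a sketch of where I landed (the numbers are my own choices, and the host_cache table that actually answers “who is on the list?” only exists on MySQL 5.6.5+ with performance_schema enabled):

```shell
# Raise the ceiling far above the default of 10, and clear any hosts
# that are already blocked. (Persist max_connect_errors in my.cnf too,
# so a restart doesn't undo it.)
mysql -e "SET GLOBAL max_connect_errors = 10000; FLUSH HOSTS;"

# On 5.6.5+ the host cache is finally inspectable:
mysql -e "SELECT ip, sum_connect_errors FROM performance_schema.host_cache;"
```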

I’ve recently worked on a customized emailing suite for a client that involves bulk email (shudder) and thought I’d do a write-up on a few things that I thought were slick.

Originally we decided to use AWS SES but were quickly kicked off of the service because my client doesn’t have clean enough email lists (a story for another day on that).

The requirement from the email suite was pretty much the same as what you’d expect from SendGrid except the solution was a bit more catered toward my client. Dare I say I came up with an AWESOME way of dealing with creating templates? Trade secret.

Anyways – when the dust settled after the initial trials of the system, we were without a way to deliver bulk emails and track the SMTP/email bounces. After scouring for pre-canned solutions, there wasn’t a whole lot to pick from – some convoluted modifications for postfix and other knick-knacks that didn’t lend themselves to tracking the bounces effectively (or weren’t implementable in a sane amount of time).

Getting at the bounces…

At this point in the game, knowing what bounced can come from only one place: the maillog from postfix. Postfix is kind enough to record the Delivery Status Notification (‘DSN’) in an easily parsable way.

Pairing a bounce with database entries…

The application architecture called for very atomic analytics, so every email that’s sent is recorded in a database with a generated ID. This ID links click events and open reports on a per-email basis (run-of-the-mill email tracking). To make the log entries sent from the application distinguishable, I changed the message-id header to the generated SHA1 ID – this lets me pick out which message IDs are from the application, and which ones are from other sources in the log.

There’s one big problem though:

Postfix uses its own queue IDs to track emails – the first 10 hex digits of the log entry. So we have to perform a lookup as such:
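To make the pairing concrete, here’s a hedged sketch with two made-up maillog lines: both carry the same postfix queue ID, one with our SHA1 message-id and one with the eventual DSN status:

```shell
# Two hypothetical maillog lines sharing queue ID 4A1B2C3D4E: one from
# cleanup (carries our app's SHA1 message-id), one from smtp (carries
# the bounce status).
send_line='Nov  1 10:00:01 mx postfix/cleanup[123]: 4A1B2C3D4E: message-id=<da39a3ee5e6b4b0d3255bfef95601890afd80709@dom>'
dsn_line='Nov  1 10:05:09 mx postfix/smtp[456]: 4A1B2C3D4E: to=<user@example.org>, status=bounced (host said: 550 no such user)'

# Pull the 10-hex-digit queue ID out of each line so the two can be joined
qid_send=$(echo "$send_line" | grep -oE '[0-9A-F]{10}' | head -n1)
qid_dsn=$(echo "$dsn_line" | grep -oE '[0-9A-F]{10}' | head -n1)
echo "$qid_send"   # 4A1B2C3D4E
echo "$qid_dsn"    # 4A1B2C3D4E
```

In the real pipeline this join is what the holding-tank table is for; grepping a live maillog like this on every DSN is exactly the expensive part Logstash eliminates.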

This is a problem because we can’t do both lookups at the same time. The time when a DSN comes in is variable – this would lead to a LOT of grepping and file processing: one pass to get the queue ID, another to find the DSN, if it’s been recorded yet.

We use Logstash where I work for log delivery from our web applications, and in my experience it’s a tool with a lot of potential for exactly what I was looking for: progressive tailing of logs, plus so many built-in inputs, outputs and filters for this kind of work that it was a no-brainer.

So I set up Logstash and three stupid-simple scripts to implement the plan.

Hopefully it’s self-explanatory what the scripts take for input – and where they’re putting that input (see the holding tank table above):

LogDSN.php %{QID} %{dsn}

LogOutbound.php %{QID} %{message-id}

Setting up logstash, logical flow:

Logstash has a handy notion of ‘tags’ – where you can have the system’s elements (input, output, filter) enact on data fragments tagged when they match a criteria.

Full config file (it’s up to you to look at the logstash docs to determine what directives like grep, kv and grok are doing)


input {
  file {
    format => "plain"
    path => "/var/log/maillog"
    type => "maillog"
  }
}

filter {
  kv {
    type => "maillog"
    trim => "<>"
  }
  grep {
    type => "maillog"
    match => ["status", "bounced"]
    add_tag => ["bounce"]
    drop => false
  }
  grep {
    type => "maillog"
    match => ["message-id", "[0-9a-f]{40}\@dom"]
    add_tag => ["send"]
    drop => false
  }
  grok {
    type => "maillog"
    pattern => "%{SYSLOGBASE} (?<QID>[0-9A-F]{10}): %{GREEDYDATA:message}"
  }
}

output {
  exec {
    tags => "bounce"
    command => "php -f /path/to/LogDSN.php %{QID} %{dsn} &"
  }
  exec {
    tags => "send"
    command => "php -f /path/to/LogOutbound.php %{QID} %{message-id} &"
  }
}

Then the third component – the result recorder – runs on a cron schedule and simply queries records where the message ID and DSN are not null. The mere presence of the message ID indicates the email was sent from the application; the presence of the DSN means there’s something for the result recorder script to act on.
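A hedged sketch of that cron-side query – the table and column names here are my invention, not the application’s actual schema:

```shell
# Rows where both halves have arrived: the app recorded the message ID
# at send time, and logstash later filled in the DSN for the same QID.
mysql emailer -e "
  SELECT id, message_id, dsn
    FROM email_holding_tank
   WHERE message_id IS NOT NULL
     AND dsn IS NOT NULL
     AND processed = 0;"
```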

* The way I implemented this would change depending on the scale – but for this particular environment, executing scripts instead of putting things into an AMQP channel or the likes was acceptable.

I came up with a cool usage for the zebra stripe admin tool: in MySQL you can set a custom pager for your CLI output, so one can simply set it to the zebra stripe tool and get the benefit of alternated rows for better visual clarity.

Something like ‘PAGER /path/to/zebra’ should yield you results like the image below.

Zebra stripe tool used as a MySQL pager

You can always adjust the script to skip more lines before highlighting; you can also modify it if you’re savvy to the color codes to just set the font color instead of the entire background (which may be preferable but not a ‘global’ solution so the script doesn’t do it).
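The zebra script itself isn’t shown here, but a minimal stand-in is a few lines of awk: alternate lines get an ANSI background color. The function name and color code are my own choices:

```shell
# Hypothetical minimal zebra pager: every even-numbered line gets a
# grey background (ESC[47m), reset at end of line (ESC[0m).
zebra() {
  awk '{ if (NR % 2 == 0) printf "\033[47m%s\033[0m\n", $0; else print }'
}

# Demo: stripe four rows
printf 'row1\nrow2\nrow3\nrow4\n' | zebra
```

Saved as an executable script, `pager /path/to/zebra` inside the mysql client turns it on for the session.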

You will lose data if you don’t account for ALL of your data’s ranges up front (e.g., with a MAXVALUE-oriented partition). The reason is how online schema change tools work: they create a ‘shadow copy’ of the table, set up a trigger, and re-fill the table contents in chunks.

If you try to manually apply partitions to a table that has data from years ranging from 2005–2012 (and up), say with a range partition per year that stops short of 2012, MySQL will give you this safety error: (errno 1526): Table has no partition for value 2012 – this is a safeguard!

Now, if you use a hot schema tool, the safeguard isn’t in place; the copying process will happily rebuild the data with the new partitions and completely discard everything not covered by the partition range!

Meaning, this command will successfully finish, and discard all data from 2012+:
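To illustrate with a hypothetical table (names and ranges are mine, not from the original), the safe form of the DDL ends in a MAXVALUE catch-all so no row can fall outside the declared ranges:

```shell
mysql mydb -e "
  ALTER TABLE events PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2005 VALUES LESS THAN (2006),
    -- (p2006..p2010 elided for brevity)
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION pmax  VALUES LESS THAN MAXVALUE  -- catches 2012 and up
  );"
```

Without that pmax partition, native ALTER raises errno 1526; an online schema change tool would instead silently drop the 2012+ rows during the chunked copy.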

Recently I’ve migrated a customer that’s been on Rackspace for 6 years, and paying a handsome penny for it at that. The migration was to Amazon Web Services (AWS) and I sent a friendly reminder to the client to cancel the RS account (9 days in advance to the renewal date).

Here’s how things went down:

RS: “We require a 30-day written notice to cancel your account”.

This is on a host that is on a month-to-month basis and the costs have been on an incline. In fact, the cost was to go up $10/month next month. (Suffice it to say, it’s not much compared to the overall monthly bill).

So I’m thinking to myself, well, that’s a crappy “policy”. I give them a ring on behalf of my client to see if we can leverage some flexibility. I simply ask them to waive the 30-day ‘penalty’ – not even a pro-rate for the days unused for the month.

The RS rep is quick to tell me how many people call that dislike that policy and try to get somewhere with it, but they stick to it. At this point I’m thinking, wow – this will be a little challenge.

I explained that the cost was going up and is something that wasn’t agreeable and therefore there is good cause to waive that type of penalty. No go.

We go circular a bit on customer service – I blab a bit about how the competition (Linode, AWS, etc.) lets me do more than they offer, for cheaper, and without a penalty. I also say that it’s odd that in most circumstances you’ll get a counter-offer from a retention specialist (you know, “we’ll knock 10% off your hosting”). Still kind of a nod-n-smile, go-screw-yourself attitude from this rep.

Then he says “We value your feedback and it helps us become better”.
I respond: “OH REALLY? You start the call by telling me how many pissed-off people are calling about this policy, try to stick it to me as well, and then give me a line like that?”

Enough is enough – I ask for a manager and explain that I understand if he “has to say that”, but this is an unacceptable situation.

The manager’s response: they won’t deal with me; they’ll only talk with my client (who was, in turn, told to hose off in the same manner).

All I can say is this:

Rackspace prices aren’t that hot. Look elsewhere.

The fanatical support thing is cute, but the customer service is pure garbage in the above context. I’ve never been treated like that. I’ve had better luck with credit card companies and land lords than this.

There’s this 30-day thing (Beware!)

If they give you support, they’ll want the root password for your server.

Their SLA is a lie: it reports on a 30-minute interval, which means they can be down for 29 minutes every hour and not record it as downtime.

Their backup system on dedicated hosting is a bloated, untamed mess if you let them manage it; they let the ‘rack’ account on my client’s server exceed 60GB of crap that should have been cleaned up (e.g., backup software updates, provisioning/monitoring tools).

The typical responses to these new ‘hipster’ systems are usually transaction/consistency centric – as that’s where the RDBMS systems shine – they can perform wonderfully while being ACID compliant.

Or in the case of Node, refuting the ‘Apache doesn’t have concurrency, node is better’ arguments. I have a hunch 99% of the Node fanboys don’t have a damn clue how capable Apache itself is.

There’s also things like Node.js that rub the seasoned people the wrong way – perhaps it’s the sensationalism without actually proving anything? (Check the first few comments.) Or the utter lack of security focus? (That’s what bugs me.) I also think it has to do with their approach to entering the market: guns blazing, criticizing other solutions and hoisting their own as THE single option with more tenacity than is appropriate for such an immature project. Guys in the trenches can’t stand that crap; we know it’s just another tool to get the (or ‘a’) job done in a particular scenario.

But really, I think about how much time is wasted on these subjects going back and forth, so let’s stop wasting time. Be open minded to the new technologies as tools for a particular job and stop making all or nothing stories out of future tech, like it or not – we all have to share the same space.

I had CPU consumption alerts fire off for ALL of my AWS instances running Percona Server (MySQL).

I couldn’t for the life of me figure it out – I re-mounted all but the root EBS volumes, restarted the services and ensured there was no legitimate load on all of them, yet MySQL stuck to 100% consumption on a single core on all of the machines. I tried everything, lsof, netstat, show processlist, re-mounting the data partitions.

WEIRD. It turns out, however, that AWS was not the cause of this one, even in light of the recent AZ issue in EAST-1.

This server had many cores so it was still running fine (and it was the weekend) – Seems like bad things happen at the least opportune times, like when you’re out on a date with the wife and the baby is with grandma…

At any rate, it’s amusing to watch the haters hate that clearly don’t understand the concept of AWS and the AZ (Availability Zones). Funny to watch them all scoff and huff about how they run a better data center out of their camper than AWS.

If anything, I want to see a good root cause analysis and maybe some restitution for the trouble.

While it’s an extremely simple hack and covers (dare I say) the majority of MySQL installations, let’s not forget to finish reading the entire disclosure:

From the disclosure:

But practically it’s better than it looks – many MySQL/MariaDB builds are not affected by this bug.

Whether a particular build of MySQL or MariaDB is vulnerable, depends on
how and where it was built. A prerequisite is a memcmp() that can return
an arbitrary integer (outside of -128..127 range). To my knowledge gcc builtin memcmp is safe, BSD libc memcmp is safe. Linux glibc sse-optimized memcmp is not safe, but gcc usually uses the inlined builtin version.

As far as I know, official vendor MySQL and MariaDB binaries are not
vulnerable.

Regardless, it’s a stupid simple test to see if you’re vulnerable or not so fire one up!

I just tested 5 gcc compiled hosts (mostly pre-5.5.23) and none of them were vulnerable. But regardless, maybe it’s time to re-compile ;)

I’ve gone over a similar issue like this before regarding the likes of git/hg, though those are developer tools and are less likely to be present on a production machine.

PHP 5.4 is jumping on the bandwagon to include a ‘cute’ little internal server – which is enabled by default. The ‘everything needs a standalone server’ thing is starting to get on my security nerves.

It has limited use, and most developers will have limited use for it due to its lack of mod_rewrite (and equivalent) behavior… The worst part is: you can’t disable it if you want to keep the CLI (e.g.: no pear!)
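For anyone who hasn’t met it, the server in question hangs off the CLI binary’s -S flag; the path and port here are placeholders:

```shell
# PHP 5.4's built-in development server: serve /var/www/html on
# localhost:8080 until Ctrl-C (-t sets the document root).
php -S 127.0.0.1:8080 -t /var/www/html
```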

Wish I spoke up on the list!

Anywho, here’s a hob-knobbed patch (for PHP 5.4.0RC6) that will change that for you (GNU/*nix only!). The patch adds a new configure option ‘--disable-cli-server’.

In a nutshell: with a modest-size POST, almost all PHP versions in the wild (sans 5.3.9+) are in danger of an extremely simple DoS.

The vulnerability exploits the PHP internal hash table function (responsible for managing data structures) – more specifically: the technique used to ‘hash’ (generate a key for the hash table) the key for a key=>value relationship.

Apache has a built-in limit of 8K max request length (that is, maximum size of the request URL) by default.
Can the damage from an 8K request (this affects GET) really cause the mentioned DoS attack on reasonable hardware?

Additionally – PHP has a limiter on POST data too: post_max_size.
It’s this configuration variable in particular that I think should be put in the limelight.

post_max_size is a run-time/htaccess-configurable directive that maybe we don’t respect like we should.
Often, administrators (myself included) just tell php.ini to accept a large POST size to allow form-based file uploads – it’s not uncommon to see:

upload_max_filesize = 20M

post_max_size = 21M ; or multitudes more!

– in almost any respectable setup.

Perhaps we should evaluate the underlying effects of this setting; maybe it should be something stupidly low by default (enough to allow a large WYSIWYG CMS article’s HTML and a bit more? 32K?) – and then delegate a higher limit using Apache configuration.

Caveat: these settings are PER DIR, meaning:

.htaccess use is limited – you can’t set the php_value in a .htaccess with a URL match; you’re stuck using a context-sensitive .htaccess (within a dir) or using the <Files> directive. This won’t work for people using front controllers through a single file on their websites/apps.

Modifying the actual vhost/host configuration is a sound bet – you can do Location/File matching and set these at will; for situated web apps, it may be feasible to take a whitelist or blacklist approach on uploader destinations.
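A hedged sketch of that vhost approach (the URL pattern and limits are examples, not from the original): keep the global default low and raise it only where uploads actually happen:

```apache
# php.ini keeps the stingy global default, e.g. post_max_size = 64K.
# The uploader endpoint alone gets the big limits:
<LocationMatch "^/admin/upload">
    php_value upload_max_filesize 20M
    php_value post_max_size 21M
</LocationMatch>
```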

More resources:

Here’s the video that thoroughly covers the vulnerability – I’ve shortcut it to their recommended mitigation (outside of polymorphic hashing):

You should know what SOPA is really about by reading between the lines. (Job growth? Puh-leese, the job growth from the .com boom didn’t need SOPA, thank you very much!)

Over the last few months I’ve moved a dozen domains off of GoDaddy onto others (client’s discretion). If you’re still on the fence, there’s a pretty good run-down of good alternative registrars on this blog post.

Apache Bench (AB) is a very powerful tool when used right. I use it as a guideline for how to set up my apache2/httpd.conf files.
All too often I see people boasting that they can get an outrageous number of RPS in AB (the Apache Bench tool).

“OMG, I totally get 3,000 rps in an AWS micro instance!!!” (I’ve seen this on the likes of Serverfault)

Debunking misunderstandings:
Concurrency != users on site at the same time
Requests per second != users on site at the same time

Apache Bench is meant to give a seat-of-the-pants diagnostic for the page/s you point it to.

Every page on a website is different; and may require a different number of requests to load (resources: css, js, images, html, etc).

Aspects of effective Apache benchmarking:

Concurrent users

Response time

Requests

“Concurrent users” – Have you ever stopped to ask yourself: What the hell is a user? (in the Apache sense) We don’t stop to think about them individually, we just think about them as a ‘request’ or the ‘concurrent’ part of benchmarking.

When a user loads a page, their browser may start X connections at the same time to your server to load resources. This is a complex dynamic, because browsers have a built in limit of how many concurrent requests to make to a host at a given time (a hackable aspect).

So at any given time, let’s say a user = 6 concurrent connections/requests.

“Response time” – What is an acceptable response time? What is a response? In the context of this article, it’s the round trip process involved with the transfer of a resource. This summarizes the intricacies of database queries and programming logic, etc into a measurable aspect for consideration in your benchmarking.

Is 300ms to load the HTML output of a Drupal website acceptable? 400? 500? 600? 700?

How fast does a static asset load? What is the typical size of an asset for your webpages? 10KiB? 20?

This means if a single user comes to load a page, there’s a good chance his/her browser will make 15 requests total.

Another part of this aspect is to be aware that the browser will perform these requests in a ‘trickle’ fashion: one request to get the HTML, then an instant 6 requests (browser concurrency), with the next 8 happening one at a time as concurrent connections free up.

Putting aspects together: We have to draw an understanding of how these aspects all tie together to determine the start-to-finish load of a typical page.

Let’s say a page is 15 requests (14 images/JS/CSS files plus the HTML) with an average payload of 10KB each – 150KB of data.

A user/browser makes 6 simultaneous requests (All completing at slightly different times, ideally, at the same time).

Response time is the metric we’re interested in.

The questions we ask of ab are – given the current configuration and environmental conditions:

What’s the highest level of concurrency I can support while loading a given asset in less than X milliseconds?

How many requests per second can I support at that given level of concurrency?

By attempting to answer these questions, we’ll derive the tipping point of the server – and what it’s bottlenecks are. (This article will not cover determining bottlenecks – just how to get meaning from ab)
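A typical probing run looks like this, with everything after `ab` being illustrative (URLs, counts): -c sets the concurrency, -n the total number of requests:

```shell
# Probe the generated (Drupal) page at one simulated user's worth of
# concurrency (6 connections); repeat with higher -c until response
# times degrade past your target.
ab -n 1000 -c 6 http://test-server/node/1

# Same probe against a static asset for comparison.
ab -n 1000 -c 6 http://test-server/misc/drupal.js
```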

Caveats: Naturally, these seat-of-the-pants numbers are generated on a machine on the same network with plenty of elbow room, making them best-case scenarios; in today’s world that’s close enough, given how widespread HSI is. It also assumes the network can handle the throughput of the transfers involved with a page, and that all assets are hosted on the same host – nothing cross-domain, e.g.: Google Analytics, Web Fonts, etc.

How to test: First, we need to classify the load. As mentioned a few times, there are two types of site data: static files and generated files.

Static files have extremely fast response times, generated files take longer because they’re performing logic, database work and other things.
A browser doesn’t know what to load until it retrieves the HTML file containing the references to other resources – this changes how we look at timings, and in most cases, the HTML document is generated.

First, let’s simulate a single user/browser loading a page from an idle server. To start, we must get the HTML data…

So you can see, this took 29ms, at 299KB/s to generate the HTML from Drupal and send it across the pipe.

So, here’s where things start getting interesting – 2% of the requests were slowed down to 600+ms per request. The exact cause of that is out of the scope of this article – could be IO – regardless, the numbers are still good and it’s clear that this is indicating the start of a tipping point for the host.

This is ‘fuzzy math’ because I’m not simulating the procedural/trickle loading effect of browsers mentioned above (initial burst of 6 connections, and making them busy as they become available at different times to complete the 15 requests). But instead treating it as 6 requests in sets. Someone more mathematically inspired might calculate better numbers – I use this method because it’s my “buffer factor” that I use for variables (bursts of activity, latency changes, etc).

So from this data, we can say the server can sustain 30 constant users given our assumptions.

Phase two: So now we have a ballpark figure of where our server will tip over. We’re going to perform two simultaneous ab sessions to put this to the test, this will simulate both worlds of content: generating content and loading the assets. This is the ‘honing’ phase where we dial down/up our configuration by smaller increments.

5 minutes of load at our proposed 30 users

30 concurrent connections for our generated content page.

180 concurrent connections for static data.
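Under those assumptions, the two sessions can be launched side by side; -t runs for a time limit in seconds rather than a fixed request count (URLs are illustrative):

```shell
# 5 minutes of combined load: 30 concurrent on generated content,
# 180 concurrent (30 users x 6 connections) on a static asset.
ab -t 300 -c 30  http://test-server/node/1 > generated.txt &
ab -t 300 -c 180 http://test-server/misc/drupal.js > static.txt &
wait   # let both finish before reading the reports
```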

Ding ding ding! We’ve got a tipper! OK, so as you can see, for the most part everything was fairly responsive. If the system could not keep up, most of the percentiles would have awful numbers like the ones shown in the highest 2%. However, overall these numbers should be improved upon to deal with sporadic spikes above our estimates, and other environmental factors that may have a swaying effect on our throughput.

How to interpret: Discard the 100% (longest request); at these volumes it’s irrelevant. However, the 98–99% matter: your goal should be to make the 98% fall under your target response time for the request. The 99% shows us that we’ve found our happy limit.

(Remember, at these levels there are many, many variables involved – and the server should never reach this limit; that’s why we’re finding it out!)

Let’s tone our bench down to supplement 25 users and see what happens …

Wrap up: 25 simultaneous users may not sound like a lot, but imagine a classroom of 25 students who all click to your webpage at the exact same moment – this is the kind of load we’re talking about: having every one of those machines (under correct conditions) load up the website within 1 second.

To turn that into a real requests per second: 375. (25 users @ 15 requests).

The configuration – workload (website code, images) and hardware (two 1GHz CPUs…) – is capable of performing at this constant rate; these benchmarks indicate that the hardware (or configuration) should be changed before it gets to this point, to accommodate growth. They also indicate that ~430 page loads out of 21,492 will take longer than a second. In reality, the ebbs and flows of request peaks and valleys make these less likely to happen.

As you can see, the static files are served very fast in comparison to the generated content from Drupal. If this Apache instance were backed by the likes of Varnish, the host would be revitalized to handle quite a bit more (depending on cache retention).