I’ve been in the “rename it” camp for a year, not that I’ve shared that with anyone who would listen. Zoffix provides some good reasons why that is not as easy as it sounds. Drop by his blog for a look at a possible solution.

When Custom Visuals started in 2003, we were working on projects for individual companies with little emphasis on long-term projects utilizing our own servers. That all changed when an environmental company requested a web scraping application for a budget estimation project.

The project started off as a simple request to scrape a few dozen pages from a site and create PDF files for printing. As we discussed the details during an early meeting, I indicated I could programmatically pick out the information the client wanted from the website and drop it into a database for a summary report. Until this point, they had been hiring workers from a temp agency to navigate to the web pages, then save and/or print the page for the full-time workers to review.

After a couple of days, I provided a sample report based on a small set of web pages for the client to review. That was a turning point in the project. What started as something that could be done in a week and provide a nice chunk of income ended up as multiple enhancements over several years, with a high point of navigating more than 10,000 pages daily with hundreds of database-driven reports being sent to 40 or 50 employees across multiple divisions. That chunk of income turned into a steady stream as new features and requirements were added over the years.

Along the way, we upgraded to larger disks on the server to accommodate the data being retrieved. The database storage needed for reports was increasing rapidly as early daily reports now spanned longer time periods and included additional parameters. Then we upgraded the server for more speed and kicked in more memory, as well. And we’ve done the same thing a few more times since then, too.

What started off as a quick little project turned into a nice monthly source of income, since we were navigating, downloading, analyzing, reporting and storing hundreds of GBs of data and using significant processing time to do so. Eventually, it made sense to waive the development costs and just fold new features into a monthly fee.

During the middle of this project, we decided this was the best approach for our small business to take. Rather than the feast-or-famine mode that goes along with projects in chunks, we focused on long-term projects running on our own servers to generate steady streams of income. We still happily pick up the occasional chunk, but we keep an eye out for steady streams.

There are a lot of terms that describe essentially the same process of automatically getting data from, or sending data to, a remote site. The most common phrase is software automation, but you’ll also hear software bot, software agent, virtual robot, or just bot, all referring to the same concept. And it’s a concept Custom Visuals uses all day, every day.

If this blog interests you or your business, please click our Contact page and let us know what you think. You can also find more details on our Services page.

Custom Visuals uses a variety of programs to transfer data to and from internet sites.

Simple Transfer: This is used across all industries and is most appropriate when the data exists in one or more files and the most recent file to transfer is easily discernible. Think of looking in a directory that contains the last 14 days of files, all named with a clear date in the name. Another common use is when the same file is rewritten throughout the day and you need to pick up the changes every 5, 15 or 60 minutes.
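That “newest file” selection is simple enough to sketch. Our tooling is Perl, but here is the same idea in Python, with a made-up filename pattern for illustration:

```python
import re
from datetime import datetime

def newest_dated_file(filenames):
    """Pick the most recent file from names like 'report_2016-08-17.csv'.

    The naming pattern here is an assumption for illustration; a real job
    keys off whatever date convention the remote directory actually uses.
    """
    dated = []
    for name in filenames:
        m = re.search(r"(\d{4}-\d{2}-\d{2})", name)
        if m:
            dated.append((datetime.strptime(m.group(1), "%Y-%m-%d"), name))
    return max(dated)[1] if dated else None

files = ["report_2016-08-15.csv", "report_2016-08-17.csv", "report_2016-08-16.csv"]
print(newest_dated_file(files))  # report_2016-08-17.csv
```

The rewritten-throughout-the-day case is the same loop run on a 5, 15 or 60 minute timer, comparing the file's modification time instead of its name.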

Build, Send, Receive: This is most common for our financial clients that request a set of stock/instrument prices at specific intervals throughout the day. Since there are thousands of stocks available, and different clients request both different stocks and different frequencies for those stocks throughout the day, we need to determine a subset of what is needed at any given time. That process looks like this:

Identify stocks/instruments needed by all clients at the current moment.

Group requests into a single batch, removing duplicates.

Send the batch to the pricing vendor.

Wait until vendor returns requested data.

Retrieve remote data to local system.

Distribute pricing data to appropriate clients.
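Stripped of the vendor plumbing, the grouping and distribution steps above look something like this sketch (client names, symbols and prices are all hypothetical):

```python
def build_batch(client_requests):
    """Merge per-client symbol requests into one deduplicated, sorted batch.

    client_requests maps a client name to the list of symbols it needs
    right now; the set comprehension removes duplicates across clients."""
    return sorted({sym for symbols in client_requests.values() for sym in symbols})

def distribute(prices, client_requests):
    """Hand each client only the prices it asked for."""
    return {client: {s: prices[s] for s in symbols if s in prices}
            for client, symbols in client_requests.items()}

requests = {"clientA": ["AAPL", "IBM"], "clientB": ["IBM", "MSFT"]}
print(build_batch(requests))  # ['AAPL', 'IBM', 'MSFT']

prices = {"AAPL": 99.1, "IBM": 151.2, "MSFT": 52.3}  # as returned by the vendor
print(distribute(prices, requests)["clientB"])  # {'IBM': 151.2, 'MSFT': 52.3}
```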

Extract subset: Our weather data processing can cherry-pick subsets of data from within a larger file over the internet. For US NWS GFS data, one of the core data sets used throughout weather forecasting, the need for subset extracts becomes pretty obvious when you realize how much data is at the end of the fire hose when the forecast is run every six hours:

About 340 new files for the highest resolution data of 0.25 degrees.

An average of 350 parameters per file.

An average size of 180 MB per file.

As was the case with the financial data above, different clients want different parameters for their specific needs. One client may want six different temperature values near the surface in the Gulf of Mexico for the next 24 hours, while another may want four different temperatures across the atmosphere, two wind speeds and three humidity values in the Western Atlantic Ocean for the next 12 hours. As each file becomes available we extract only the values needed, extracting a fraction of what would otherwise be over 250 GB per day.

We are always adding new automation services and would like to find out what industry you are in and what your business needs are. If you have two minutes, we would like your input to help with our Research.

If this blog describes you or your business, please click our Contact page and let my company help out.

As you may know, Custom Visuals automates as much as we possibly can as part of our core business. We have a couple thousand unattended tasks running on our servers (co-located and in-house) on behalf of our clients throughout the day (not including the tasks the operating system runs on its own behalf). Some of these are the types of tasks I’m referring to in this blog. The coma-inducing ones that are important, but suck the life out of you or your employees.

If part of your day involves performing the same 10, 20 or 50 steps on Excel spreadsheets, I want to change that into something more productive for you. One of the common processes we manage for clients involves an employee who receives a CSV file that needs to be turned into a formatted Excel file for use by colleagues or clients. Let’s consider an external file with 10,000 rows and 20 columns, and an internal file with 200 rows and 2 columns as our example set. These changes might involve some of the following steps along the way:

Executive summary: dozens of common Excel steps you wouldn’t wish on your worst competitor, but may be requiring your employee(s) to perform. Skip past the list of items for an alternative.

Download a CSV file from external website.

Locate a CSV file on internal share.

Open the external CSV file in Excel.

Create a new tab in the existing Excel file.

Open the internal CSV file in the 2nd tab.

Change back to the first tab.

Insert 3 new columns starting at the 4th column.

Type in column names for the three new columns.

Type in a VLookup in the first new column using fields from the 2nd tab.

Copy and paste the VLookup to the remaining rows in the column.

Copy/paste the VLookup column as values.

Delete the 2nd tab.

Place a red border around every cell in the column that did not have a VLookup value.

Type in a Left function in the next column to extract a subset of another column.

Copy and paste the Left function to the remaining rows in the column.

Enter a Date function in the next column to provide a formatted date from two other columns.

Copy and paste the Date function to the remaining rows in the column.

Set one column to a specific Date format.

Set one column to a specific Numeric format.

Set a second column to a different Numeric format.

Set a Total value for three columns.

Highlight all missing values in a set of columns with a yellow background.

Set the column widths of all columns to the maximum to show all contents.

Set any columns more than 60 characters wide to 60 characters.

Freeze the top row.

Filter the top row.

Create a graph of 3 non-contiguous columns.

Set X and Y axes to specified values.

Save/Close the file in Excel format.

Copy the file to an internal share.

Send an email to two email groups indicating the process is complete.
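To give a flavor of what the automated version does, here is a minimal sketch of just the VLookup-style merge step, using Python’s standard library (the column names are invented for illustration; the real jobs also apply all the formatting steps through a spreadsheet library and run on files 1,000x this size):

```python
import csv
import io

# External file: 10,000 rows in production, 3 rows here.
external_csv = "id,amount\nA1,10\nA2,20\nA3,30\n"
# Internal file: the 2-column lookup table.
internal_csv = "id,region\nA1,East\nA3,West\n"

# Build the lookup once -- this replaces typing and filling a VLookup column.
lookup = {row["id"]: row["region"]
          for row in csv.DictReader(io.StringIO(internal_csv))}

merged = []
for row in csv.DictReader(io.StringIO(external_csv)):
    # An empty value is what would get the red border in the manual steps.
    row["region"] = lookup.get(row["id"], "")
    merged.append(row)

print([r["region"] for r in merged])  # ['East', '', 'West']
```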

How long would that take you to accomplish after you got really good at it? 5 minutes? And if you had 5 files with varying sizes and steps to perform each day, how often would mistakes occur? At what point would your mind wander off to its happy place?

Would you prefer to have it done in 10 seconds, instead? Without tying up Excel, an employee or a workstation at all? We have programs processing 200,000+ rows and 50+ columns (50x more data than above) in just a couple minutes. And these programs are performing more steps, validating more data and sending summary emails with clear pass, warning and failure messages.

Of course, not all of our programs are like this. Some are simpler, some are much more complex. Some happen in a fraction of a second, others run for several hours. The point is we do a lot of it. In fact, we’ve done enough of it to become experts at processing data of all kinds.

If this piques your interest, hit our Contact page and find out if we can help.

Related Documents

Introduction

This series was created because we had a problem with a hacker who infiltrated our system and sent out emails in early 2015. After discussing it with a couple of colleagues who had dealt with similar issues, I was encouraged to share what I had created and why.

In this part of the series, I’ll provide some of the SQL code for selecting from and inserting into the database. As we process each file and identify attempted break-ins, we build up an SQL insert statement for the database. We use the multi-row insert form for MySQL to do this in a single step. MySQL allows the insert to look like this:
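As a hedged sketch of that multi-row form (the table and column names are the ones used elsewhere in this series; the use of INSERT IGNORE to skip duplicate-key rows, and the values shown, are assumptions for illustration):

```sql
-- Sketch only: one statement inserts every new attempt found in this pass.
INSERT IGNORE INTO CustomVisuals.AuthFail (IP, DT) VALUES
  (INET_ATON('114.4.137.34'), '2016-08-17 17:04:01'),
  (INET_ATON('80.90.163.52'), '2016-08-17 17:05:13'),
  (INET_ATON('80.90.163.52'), '2016-08-17 17:06:40');
```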

Recall that we are storing the integer equivalent of the dotted-quad IP addresses, so INET_ATON('114.4.137.34') will actually store 1,912,899,874 (without the commas) in the IP field of the table.

We run this every 5 minutes for the system.log and every 30 minutes for each of the email, server-side SPAM and client-side SPAM logs. Our typical usage shows the vast majority of break-in attempts showing up in the system log, which is why we run it more often than the others.

We start off each run by selecting the existing records from the database and loading them into a Perl hash (a.k.a. associative array), which is a simple table where the IP and Datetime values are joined with commas, like so: '80.90.163.52,2016-08-17 17:04:01'. The database returns about 400,000 records per second from the following query:

SELECT INET_NTOA(IP) AS IP, DT
FROM CustomVisuals.AuthFail
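Our code is Perl, but the conversion step can be sketched in Python like this (the rows are made-up examples of what the query returns):

```python
# Rows as the database would return them: (dotted-quad IP, datetime string).
rows = [
    ("80.90.163.52", "2016-08-17 17:04:01"),
    ("114.4.137.34", "2016-08-17 17:05:13"),
]

# One lookup key per record, IP and datetime joined with a comma, mirroring
# the Perl hash described above; membership tests are then O(1).
seen = {f"{ip},{dt}" for ip, dt in rows}

print("80.90.163.52,2016-08-17 17:04:01" in seen)  # True
```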

The database typically has around 40,000 records in it, so retrieving the data from the database and converting it to a Perl hash happens in well under one second. In fact the steps listed below all take place in about one second:

Launch program

Check CPU load (program will pause if CPU load is excessive)

Connect to the database

Select 40,000 records from the database

Convert database records to Perl hash

Open/read/close 10,000 lines from log file

Compare contents from log file to database

Update database with new break-in attempts

Update blacklist file with IP addresses and time to block

Update blockedHosts file with IP addresses

Run the afctl command to update the firewall

Purge records older than 90 days from database

Disconnect from the database

Email log file

After updating the database with the new break-in attempts, we build the blacklist and blockedHosts file for the firewall. The blacklist file has a simple format:

Since I originally started writing Part I of this, I’ve incorporated a few other ideas, some useful, some just for my interest. I’ve expanded the AuthFail database to have two more fields, FileType and Line. Some of the attacks are more intense than others, with multiple break-in attempts per second. With the original setup, those would be reduced to a single entry in the database. By adding the FileType and line number, I can now identify every attempt, no matter how aggressive, since each one occurs on a different line number of the log file.

I’ve recently added whois contact information in another table so I have some information if I decide to send email notifications to the owners of the sites trying to break in. This could be automated, but I don’t really expect it would have any impact on deterring the break-ins.

Out of curiosity, I’ve added a table for keeping track of the country where each IP address is located. Here’s the top 10 tally of break-ins by country as of this writing:

US: 10467

SC: 5537

CN: 3836

NL: 3572

ID: 1809

PL: 1202

GB: 1175

KR: 988

FR: 883

CA: 882

Here’s a simple graph of the number of IPs banned over the last couple of months. The information is provided in 5-minute intervals.


In this episode I’ll describe some of the details of the OS X Firewall and what was needed to make it work correctly on my system. Note that, for whatever reason, configuring the Firewall on my OS X Server is slightly different from what I believe the Apple documentation describes. I don’t want to get into the details about it, mainly because I’m not an expert on this, just an experienced geek taking care of a problem the best way I can. I’ll mention whatever is different along the way without pontificating about it.

OS X Adaptive Firewall Documentation

The majority of what I learned came from three locations, two of which were Apple sites (the discussion forums and the advanced server administration pages) and the other was krypted.com.

Note that I was never able to get the command line tool, /Applications/Server.app/Contents/ServerRoot/System/Library/CoreServices/AdaptiveFirewall.bundle/Contents/MacOS/hb_summary, to work correctly. Despite clearly blocking IP addresses, hb_summary still indicates the following results:

In the past 23 hours 59 minutes the following hosts were blocked by the Adaptive Firewall
from 2016-03-09 02:52:48 +0000
to 2016-03-10 02:52:47 +0000
Address Count(Total) Last Block Time
0 unique hosts 0 total blocks 0 overall
Count indicates the number of times a host was blocked during this
reporting period. Total indicates the total number of times this host
was blocked in the last week
See the "Security:Firewall Service" section of http://help.apple.com/advancedserveradmin/
for more information about the Adaptive Firewall.

The current Mac OS X Firewall is based on afctl, with the full path to the command being /Applications/Server.app/Contents/ServerRoot/usr/libexec/afctl. There are four options to the program that I make use of: -a [IP address], -w [IP address], -t [time in minutes] and -f. The -a option followed by an IP address adds an offending IP address to be blocked. The -t option is used to specify how long the IP address should be blocked. The -w option is used to whitelist an IP address to prevent it from ever being blocked. The -f option forces afctl to enter a running state.

I’ve added our home IP address and a couple of IP addresses that I occasionally access the server from to the whitelist to prevent locking myself out. It’s always a little embarrassing when you manage to lock yourself out of your own server when testing. Thankfully, I’m in the good company of many geeks who have accidentally done this. Tip: if you lock your home address out, you can use an iPad with a program like Panic’s Prompt to access your server with a command line and edit the whitelist or blacklist file as needed. Just turn off your Wi-Fi access from the iPad and use a cellular connection, fire up Prompt, connect to your server and make the necessary changes.

In the next episode, I’ll provide the SQL code for both selecting records prior to processing the log files, and the code to insert new records from the files into the database. I’ll also show the steps for specifying the length of time to ban each IP based on the number of attempted break-ins, and the code used to update the IP blacklist, both of which are performed by the afctl commands mentioned earlier.

There are three log files that are scanned for attacks: the system log, the mail log and the amavis log. They all have different formats, but they all contain the IP address of the machine contacting our server in the log line of the failed access attempt. The program looks for a different pattern depending on the file it is accessing.

We’ll start with the system log file. The common break-in attempts look like this in the system.log file.
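For illustration, reconstructed versions of two such lines look roughly like this (the hostnames, pids and addresses are made up, not copied from our logs):

```text
Mar  9 01:23:57 server screensharingd[1028]: Authentication: FAILED :: User Name: N/A :: Viewer Address: 80.90.163.52 :: Type: VNC DES
Mar  9 01:24:20 server sshd[4321]: Invalid user oracle from 114.4.137.34
```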

The top line is an attempt to access through VNC, the screen sharing program built into the server. The bottom line is someone trying to log in to the system using ssh with a commonly used account name (Oracle is the name of a well-known database application).

As different as those lines look, they contain a pair of common elements, namely the datetime and the IP address. The program scans lines using a regular expression (regex) to identify these attempts. If the regex matches the line, the IP address is captured at the same time. Here is the idea for the bottom line, along with an actual line it would match:
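Our program is written in Perl, but the pattern is language-agnostic, so here is a sketch of the same match in Python (the hostname and pid in the sample line are invented):

```python
import re

# "Invalid user <anything> from <dotted-quad>"; the parentheses capture the IP.
pattern = re.compile(r"Invalid user .* from (\d+\.\d+\.\d+\.\d+)")

line = "Mar  9 01:24:20 server sshd[4321]: Invalid user oracle from 114.4.137.34"
m = pattern.search(line)
print(m.group(1))  # 114.4.137.34
```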

What that says is: look for the words “Invalid user”, followed by a space, any string, a space, the word “from” and a set of four numbers separated by dots. The “.*” portion equates to any string, which in this case is the username they provided to try to break into the system. The four numbers separated by dots are commonly known as a dotted quad, which is the IP address the attempt originated from. And since the regex for the dotted quad is enclosed in parentheses, that tells the program that if the line matches the regex, it should store the dotted quad in a temporary variable.

The regex used to match the VNC line is a little more complicated, but not by much. You see the same elements where we are looking for specific words, any string, dotted quad, and more strings and specific words. Note the dotted quad is in parentheses again, so it can be stored in a temporary variable.

I also mentioned the datetime being one of the common elements in the line. That is always at the beginning of the line. There is a tiny trade-off in using a slightly more complicated regex to capture the datetime and the IP address vs. just capturing the IP address. Without getting into details, I prefer simpler regexes to match the portions of a line that vary and using a template to match portions of a line that are the same. Since the datetime is always at the beginning of the line and I only care about it if the regex is matched, it’s done as a separate step. This is just my preference, not a required way to parse the line. The template I use for matching the datetime is:

$template = "A3 x A2 x A2 x A2 x A2";
Jan 30 01:24:20

That template reads as, match 3 characters (A3), ignore one character (x), match two, ignore one, match two, ignore one, match two, ignore one, match two. So, anytime I match a regex, I will then use a template to grab the datetime. The year is missing in the log file, so we set it based on what day we are processing and finesse it when we are at year boundaries in December and January. More on that for another post.
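If Perl’s unpack is unfamiliar, the same template maps directly onto Python’s struct module:

```python
import struct

# Perl's "A3 x A2 x A2 x A2 x A2" ~= struct's "3sx2sx2sx2sx2s":
# grab 3 bytes, skip 1, then alternate grab-2 / skip-1 for the rest.
month, day, hh, mm, ss = (field.decode() for field in
                          struct.unpack("3sx2sx2sx2sx2s", b"Jan 30 01:24:20"))
print(month, day, hh, mm, ss)  # Jan 30 01 24 20
```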

The above info is for the system log, so let’s address the mail.log very briefly. Here is a failed email password attempt.
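A reconstructed example of such a line, using typical Postfix SASL-failure wording (the details are made up rather than copied from our logs):

```text
Mar  9 01:25:11 server postfix/smtpd[812]: warning: unknown[80.90.163.52]: SASL LOGIN authentication failed: UGFzc3dvcmQ6
```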

Note the similarities to the system.log. We have a string indicating an authentication failure, which includes the dotted-quad IP address and the datetime it occurred. Here is the regex we use to determine if this is a line we are going to process:
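As a sketch of that check in Python (the exact wording varies by mail server version, so both the pattern and the sample line here are illustrative, not our production code):

```python
import re

# A bracketed dotted-quad followed, somewhere later, by the failure message.
pattern = re.compile(r"\[(\d+\.\d+\.\d+\.\d+)\].*authentication failed")

line = ("Mar  9 01:25:11 server postfix/smtpd[812]: warning: "
        "unknown[80.90.163.52]: SASL LOGIN authentication failed")
m = pattern.search(line)
print(m.group(1))  # 80.90.163.52
```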

And, finally, here’s the log info for amavis, which is our email spam filtering program. We have a much longer line in the log file to parse this time. It is all on one line in the log file, but I’ve broken it into several lines here for your reading enjoyment.

While all of the additional information is interesting, we just need to know if the program identified it as SPAM, and what the IP address is. And if that matches, we’ll grab the date from the front of the line with our template mentioned earlier.
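Here is the same idea as a Python sketch (the amavis line is paraphrased from memory, so treat both the pattern and the sample as illustrative):

```python
import re

# We only care that amavis flagged the message as SPAM and where it came from.
pattern = re.compile(r"(?:Passed|Blocked) SPAM.*?\[(\d+\.\d+\.\d+\.\d+)\]")

line = ("Mar  9 01:26:02 server amavis[2211]: (02211-01) Blocked SPAM "
        "{RejectedInternal}, [114.4.137.34]:4521 "
        "<spammer@example.com> -> <user@example.com>, Hits: 22.5")
m = pattern.search(line)
print(m.group(1))  # 114.4.137.34
```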

So, that covers our logging files in a nutshell. We’ll look into the commands and configurations for the firewall in our next episode.

Early in 2015 one of the accounts on our server got hacked and was used to send out truckloads of spam emails. It took several hours for me to become aware of the issue, unfortunately. I only realized it once the number of bounces escalated to the point that our server was being blocked by other servers.

The account password was changed and some discussions were held with the user to see what the source of the unauthorized access may have been. The only thing that stood out was the user had been in several cafes with an iPhone the last couple of days, otherwise it was a pretty normal week with usage primarily at work and home. The password was a reasonable length, meaningless and random enough. My guess is the account name and password had been picked up by some sort of Wi-Fi scanner at a cafe a couple days prior to the attack beginning.

In the hours that the attack was in full force, the server’s IP address began to get banned for sending spam, which is how the issue came to my attention. That resulted in me going to several sites to indicate the problem was cleared and I was taking steps to prevent it from recurring.

The first thing I did was turn off any cleartext access to the accounts. The server has an SSL certificate, but a couple of the email accounts, including the compromised one, had cleartext as one of the acceptable connection methods. I also needed to get a lot smarter about the Adaptive Firewall (AF) included with Mac OS X Server. I have a few ebooks on Mac OS X Server that cover AF in some detail, and there are several websites, as well as Apple Discussion boards that cover the topic.

I realized AF was not running correctly. By correctly, I mean it was running, but not looking for anything. I also learned that AF’s default setting is to block an IP address for 15 minutes at a time, then allow the IP to resume accessing the server. I thought that was pretty weak, so I based the blocking time on the number of times the IP address was trying to access the server in a given time. In other words, if attacker 1 had tried 20 times in one hour, they would get a shorter timeout than attacker 2 trying 200 times in one hour. Also, the number of attacks would accumulate over time, so if attacker 1 eventually reached 200 times in 1 day or 60 days, they would eventually end up with the same timeout as attacker 2. Actually more than attacker 2, since they would have been blocked and unblocked when the timeout expired several times over the given time period.
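The escalation can be sketched as a simple mapping from accumulated attempts to block duration; the thresholds and durations below are invented for illustration, not the ones we actually use:

```python
def block_minutes(attempts):
    """Map accumulated break-in attempts to a firewall block duration.

    Thresholds and durations are illustrative; the real table is tuned
    by hand and fed to afctl's -t option."""
    if attempts >= 200:
        return 60 * 24 * 7   # a week
    if attempts >= 50:
        return 60 * 24       # a day
    if attempts >= 20:
        return 60            # an hour
    return 15                # the Adaptive Firewall default

print(block_minutes(20), block_minutes(200))  # 60 10080
```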

The implementation for this required a database and a program to access the log files where the attacks were being recorded. The database has about the simplest structure you can get, consisting of a column for the IP address and one for the DateTime of the attack attempt by that IP address. Here’s the MySQL structure statement:
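Reconstructed from the column and key description that follows, it would look something like this sketch (treat it as an approximation, not the exact original):

```sql
CREATE TABLE CustomVisuals.AuthFail (
  IP INT UNSIGNED NOT NULL,  -- dotted quad stored as an integer (INET_ATON)
  DT DATETIME    NOT NULL,   -- when the attempt occurred
  PRIMARY KEY (IP, DT),
  KEY DT (DT)                -- for queries on the datetime alone
);
```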

Note that IP is an unsigned int, not a (15 character) string. Though I’m barely aware of it on a regular basis, the “dotted quad” IP addresses are just a human-readable version of an integer, hence an unsigned int for the IP address. And DT is a datetime, of course. The primary key contains both the IP and DT columns. An additional key for DT is added for queries on the datetime alone. MySQL contains the INET_ATON and INET_NTOA translation functions, which are used to handle the IP addresses in either the dotted quad or integer formats, as needed.

In the next part of this series, I’ll set up the log file parsing program.

I gave a presentation at the Chicago Perl Mongers group on 06/25/2015 on using Perl to read and write Excel files. There is a brief discussion on the different modules for handling the 2003 and earlier .xls files, as well as the 2007 and later .xlsx files.

The talk shows how to write data with different formats for headers and data, as well as how to include Excel comments. Comments can be used to programmatically highlight items for additional information, as shown in the picture.

Finally, creating a chart using data from a random number generator was discussed. Chart parameters for the X and Y axes, such as min and max values, major and minor grids, as well as chart type and size, are covered, including the specific parameters needed for the chart shown.

As the coding geek behind Custom Visuals, I’m responsible for the application development that takes place under this roof. About half of my livelihood since 1984 has been as a small business application developer and the other half as a software engineer in larger corporations. They both have advantages and disadvantages, and I have no reason to, or intention of, putting down either one. There is one big difference that has always stood out to me and I’ve commented on it with friends and colleagues a few times in the last decade or two.

In the corporate environments I’ve worked in, there is a premium on churning out predictable code that solves problems in acceptable ways using the tools provided. Straying from the preferred path is not encouraged, which leads to a familiar-looking code base, but also to higher levels of technical stagnation. Bringing in new ways of solving problems, connecting to databases, accessing websites or even changing the logging approach to something that allows easier troubleshooting means your code sticks out like a sore thumb, since it is different from the rest of the company standards. The upside of sticking with the standards is that anyone in your immediate group, or perhaps throughout the entire organization, can more quickly follow and understand your code and make any necessary changes.

As one of the founders of Custom Visuals, I have much more freedom to explore areas that are not placed directly in front of me. Taking the programming language Perl as an example, I can decide to update to a newer version of a CPAN module without worrying how it will affect groups outside my immediate circle. By running a few tests, or leaning on something in my ever-growing test suite that exercises the new module, I can test and deploy it in a matter of minutes. This helps keep the code base modern and removes some of the cruft that inevitably creeps into projects over time.

In the larger scheme, this means I can evaluate entirely new concepts, libraries, languages, technologies, etc. and decide whether it makes it into the project or not. It also means my experience and knowledge of new areas grows, as well. While working on projects within Custom Visuals, I may take more time to explore a new concept before figuring out how to steer a project into what I already know how to use. In the short run, this likely means a single project takes a little longer to develop while some additional learning and experimenting takes place. In the longer run, this means projects have more tools available to make development quicker and more flexible.

From a broader perspective, I have found a nice balance of technical learning and basic implementation in both environments. There is always something to learn in both and there is also a chance that you will stray into the wilderness while implementing a new idea. Knowing when and how to implement new ideas is based on experience.