And now it’s all thishttp://leancrew.com/all-this/
I just said what I said and it was wrong. Or was taken wrong.Sun, 18 Feb 2018 04:19:32 +0000en-UShourly1http://wordpress.org/?v=4.0Timers, reminders, alarms—oh, my!http://leancrew.com/all-this/2018/02/timers-reminders-alarms-oh-my/
Sun, 18 Feb 2018 04:19:32 +0000http://leancrew.com/all-this/2018/02/timers-reminders-alarms-oh-my/
my last post. I was even more shocked to learn about bizarre omission in the HomePod software. I decided to dig into the many ways you can set timed alerts on your Apple devices and how the alert systems vary from device to device. It is, you will not be surprised to learn, a mess.]]>
I was shocked—shocked!—to see people disagree with my last post. I was even more shocked to learn about bizarre omission in the HomePod software. I decided to dig into the many ways you can set timed alerts on your Apple devices and how the alert systems vary from device to device. It is, you will not be surprised to learn, a mess.

Let’s start with the summary. In the table below, I’m comparing the features of the three alert types on iOS: Timers, Alarms, and Reminders. Included in the comparison is how certain features work (or don’t work) on the iPhone, iPad, Watch, Mac,1 and HomePod. Most of the entries for the HomePod are empty because I don’t have one to test, but I’ve included it because it was the device that got me started down this path. Also, there’s that software omission I want to talk about.

Timer

Alarm

Reminder

Number

1

∞︎

∞︎

Name/Description

No

Yes

Yes

Autodelete

Yes

No

Yes

Shared

iPhone

Yes

Yes

Yes

iPad

No

No

Yes

Watch

Yes

Yes

Yes

Mac

No

No

Yes

HomePod

?

?

No

Time left

iPhone

Yes

No

No

iPad

Yes

No

No

Watch

Yes

No

No

Mac

No

No

No

HomePod

?

?

?

Time of

iPhone

No

Yes

Yes

iPad

No

Yes

Yes

Watch

No

Yes

Yes

Mac

No

No

Yes

HomePod

?

?

?

Many of the entries in this table have caveats, so let’s go through it.

The number of alerts that can be set was the starting point for the last post. People want multiple timers in their HomePods. That’s great, but Apple’s never had multiple timers in any iOS device, which is why I’ve always used reminders instead. “Reminders aren’t a substitute for timers!” I’ve been told by several people. I admire your steadfast adherence to your principles, but I need a solution, not a manifesto. (We’ll get to the deficiencies of using reminders as a substitute for timers later in the post.)

Since there’s only one timer, there’s no need for it to have a name or description. So when the timer on your phone/watch/table/speaker goes off, you might have to think a bit before you remember what it’s for. Alarms and reminders don’t have this problem.

I didn’t mention alarms in my last post, but Kirk McElhearn reminded2 me of them. If you’ve only used Clock app’s UI to set an alarm, you may think you have to use a specific time (like 8:55 PM) instead of a relative time (in 20 minutes). But Siri offers another way:

One problem with using alarms as your alert system is that they don’t delete themselves when you dismiss them; they just sit there, inactive, taking up space in your list of alarms until you undertake a second action to remove them from the list. Timers delete themselves upon dismissal, which is certainly more convenient. Reminders almost delete themselves—when you mark a reminder as complete, it gets hidden in the Completed list. I take this as close enough to deletion that I gave Reminders a Yes on the Autodelete line.

One of the biggest advantages to using reminders is that they’re shared via iCloud, which also syncs them to your Mac. This is very convenient if you use reminders during the workday and allow notifications from the Reminders app, which I do. Timers and alarms are not shared; the timer you set on your phone doesn’t appear in the Clock app on your iPad or on your watch. But the watch is special because of its intimate relationship with the phone. Your watch will alert you of a timer or alarm set on your phone, even though it doesn’t appear in the watch’s Timer or Alarms app. The Mac is ignorant of all timers and alarms.

Here’s where we get to the HomePod’s software omission. Even if you set up your HomePod to access your reminders—which, I admit, you may be reluctant to do in some households—the HomePod will not alert you when a reminder comes due. I was first informed of this stunning fact by Holger Eilhard, and it’s been confirmed by others. So I guess you can create a reminder through your HomePod but not be alerted by one. For whatever that’s worth. Because I don’t think it’s worth much, I decided to put a No in the Reminder column for sharing on the HomePod.

A feature many people find essential is getting the time remaining before an alert goes off. I would like to tell these people to chill out, take a Zen approach, that “a watched pot never boils,” but that would only anger folks who seem to be a little on edge already. My blithe assertion that timed reminders is the solution to the lack of multiple timers was based too much on my own use. In the 4+ years I’ve been using reminders for timed alerts, I have never wanted to know how much time was left, but I guess the rest of the world doesn’t slavishly model itself after me.

So if you need to know the time left on an alert, the timer is your only friend. Neither alarms or reminders will give you that. Alarms and timers will give you the time an alert will go off (like 8:55 PM), but you’ll have to do the subtraction yourself, which isn’t convenient.

By the way, although I put a Yes in the “Time of” section for the Watch, my watch has never actually been able to tell me the time a reminder is due when I ask it via Siri. It definitely understands me, and it acts like it’s going to retrieve that information, but it’s never finished the job. I can, of course, see the due time of a reminder using the watch’s Reminders app.

And there are also a couple of problems with asking Siri for the time of a reminder on the phone:

The obvious problem is that the time Siri says is wrong. And it’s been wrong every time I’ve tried this over the past two days.4 For this example, the reminder was set for 3:50 PM, but Siri told me a time six hours earlier. Now, I happen to live six hours away from UTC, so my first thought was that Siri was programmed (stupidly) to respond in universal time. But then I realized the six hour difference was in the wrong direction. 3:50 PM US/Central is 9:50 PM UTC, not 9:50 AM UTC. So Siri’s answer is so bad it isn’t even wrong in an understandable way.

The less obvious problem is Siri’s characterization of my casserole reminder as the “next reminder.” Inexplicably, she uses that phrase even if the reminder you ask about isn’t the next one. Sigh.

After going through this exercise, I will continue to use timed reminders because

they work across all my devices, including the Mac;

their deficiencies regarding the time remaining don’t affect me; and

they don’t require a second action to get them out of the way when completed, unlike alarms.

I’ve said on Twitter that I think Apple intends timed reminders to be the substitute for multiple timers. I still think that, but I’m less certain now than I was a few days ago.

Update Feb 18, 2018 9:22 AM
There’s always more.

First, something I had scribbled in a note but forgot to put in the post: a timer may not sound an alert. If you like to fall asleep listening to music, you may have the Timer’s When Timer Ends setting assigned to Stop Playing.

If that’s the case, the next time you use Siri to set a timer, it won’t make a sound, which probably isn’t what you want.

Second, reader Thomas Shannon has emailed me that alarms go off only at minute markers. So if it’s 9:55:45 and you tell Siri to set an alarm for one minute, it will go off 15 seconds later. I was annoyed to hear this because I looked into this four years ago with regard to reminders and found that their alert times are not restricted to whole minutes. If you tell Siri at 9:55:45 to remind you of something in one minute, the alert goes off at 9:56:45.

I used to tell people the advantage of using Apple products was their consistency across devices and applications. I don’t do that anymore.

You’re right, the Mac isn’t an iOS device, but it does work with Reminders, which can be very handy, so I’m including it. ↩

]]>
Friendly remindershttp://leancrew.com/all-this/2018/02/friendly-reminders/
Fri, 16 Feb 2018 01:02:06 +0000http://leancrew.com/all-this/2018/02/friendly-reminders/
couple of times, going back four years. And I’ve explained the solution. Is this thing on?]]>
My vision of myself as a powerful thinkfluencer in the Apple world took a real beating this week. It seemed as if everyone who got a HomePod was complaining that it couldn’t set multiple timers. This is something I’ve written about a couple of times, going back four years. And I’ve explained the solution. Is this thing on?

Of course, four years ago, I wasn’t talking about the HomePod, I was talking about the iPhone, but the principle is the same. In iOS, the timer function is in the Clock app, and there’s only one. There’s no way to have two timers running simultaneously and no way to give your timer a name that lets you know what it’s for.

But you do have Reminders. They have names and can be set to alarm not only at an absolute time, but also at a relative time:

“Hey Siri, remind me to check the casserole in 20 minutes.”

This works on my iPhone, iPad, and Watch, and I assume—based on this article—that it would work on my HomePod if I had one. This is clearly Apple’s preferred solution to setting mulitple timers, each with a distinct name.

I understand where they’re coming from. If you’re an Amazon Echo user, you’re probably in the habit of saying something like

“Alexa, set a 20-minute timer for the casserole.”

Habits like that are hard to break, especially as you get older.1 But Apple users should be used to the idea that Apple has strong opinions about the right way to use its products and you’re usually better off not bucking the system.

You don’t like cluttering up your Reminders with hundreds of “check the casserole” and “check the tea” items? Even though you typically don’t see completed reminders? There is a solution.

In the past couple of days, the HomePod complaint industry has moved on from multiple timers to white rings. Cheaply made leather circles are already coming onto the market, but I’m going to suggest that high end furniture protection should come from lace doilies with tatting that complements the HomePod’s fabric pattern.

Myke is 30 now, so his brain has lost much of its former plasticity. ↩

]]>
LaTeX contact info through Workflowhttp://leancrew.com/all-this/2018/02/latex-contact-info-through-workflow/
Sat, 10 Feb 2018 21:00:37 +0000http://leancrew.com/all-this/2018/02/latex-contact-info-through-workflow/
I’ve been writing more on my iPad recently; not just blog posts, but reports for work, too. Because I have a lot of helper scripts and macros built up over many years of working on a Mac, writing on the iPad is still slower. But I’m gradually building up a set of iOS tools and techniques to make the process go faster. Today’s post is about a Workflow I built yesterday with advice from iOS automation experts conveyed over Twitter.

For several years, I wrote reports for work using a Markdown→LaTeX→PDF workflow. For most of those years, it was rare for me to have to edit the LaTeX before turning it into a PDF. Recently, though, that rarity has disappeared, mainly because my reports have more tables and figures of varying size that need to be carefully positioned, something that can’t be done in Markdown. A few months ago I decided it would be more efficient to just write in LaTeX from the start. This wasn’t as big a change as you might think. I used to write in LaTeX directly, and the combination of TextExpander and a few old scripts I resurrected got me back up to speed relatively quickly—on the Mac, anyway.

On iOS, most of the TextExpander snippets I built for writing in LaTeX work fine, but the helper scripts, which tend to rely on AppleScript, don’t. One of the scripts I definitely wanted an iOS counterpart for was one that extracted the contact information from a client in a particular format. In my reports, the title page usually includes section for the name, company, and address of the client. This is added in the LaTeX source code by this:

where \client is a LaTeX command I created long ago, and its argument needs the usual LaTeX double backslashes to designate line breaks. Also, ampersands, which are special characters in LaTeX, need to be escaped.

I thought I could whip something up in Workflow, but my limited understanding of Workflow isn’t conducive to whipping. When I first tried to put something together a couple of weeks ago, it looked to me as if I was going to have to painstakingly extract every piece of information from the selected contact, create variables to store them in, and then put those variables together into a new string of text. So I gave up.

Yesterday I decided to ask for help.

I would like to extract from a selected contact a standard name/address block as plain text:

As you can see, I asked for something a bit simpler than what I really wanted, and I was kind of expecting suggestions for an app that would do the trick. But I soon got a response from Ari Weinstein with a Workflow solution:

Since Ari is a co-developer of Workflow, I kind of figured he knew what he was talking about. But I didn’t, and it’s because I didn’t appreciate Workflow’s magic variables. I’ve always thought of Workflow as being almost like a functional language, where each action transforms the data passed to in and sends it along to the next action in turn. That, at least, is what I thought happened when the actions are connected by lines.

Which is why I didn’t understand Ari’s workflow at first. I figured that if it was extracting the Street Address in the second step, there’d be no way for it to get ahold of the Name and Company in the fourth step. What I didn’t appreciate was that there can be side effects the usual view of a workflow doesn’t show you. In this case, the Contact that’s selected in the first step is saved to a magic variable (called “Contact”) that remains available for use in later steps. So the third and fourth steps have access to all the Contact information even after the extraction of the Street Address in the second step.

Ari’s sample is a standard workflow that would have to be run from within Workflow itself or from a launcher app like Launch Center Pro. I was thinking about how to turn it into an Action Extension that could be called from within Contacts when I noticed I had a Twitter reply from Federico Viticci:

His suggestion is set up as an Action Extension that accepts only Contacts and extracts the info from the Workflow Input magic variable. Just what I was going to do.

“My” final workflow, called LaTeX Address, combines what I learned from Ari and Federico and adds some search-and-replace stuff to handle the LaTeX-specific parts:

The first two steps create a text variable named Ret that consists of a single line break. We’ll see why I needed it in a bit.

Steps 3–5 are the Ari/Federico mashup. I couldn’t use Federico’s suggestion to just add Workflow input:Street Address to the end of the block because my contacts usually include the country, even though the country is almost always the US, and I didn’t want that at the end of the block. At some point, I’ll improve this by writing up a filter that deletes the country line only if it’s the US, but this will do until I get another job with a non-US client.

Step 6 escapes the ampersands, and Step 7 adds the double backslashes to the ends of each line. You need four backslashes to get two in the output because regexes need two to produce one. I thought I could use \n at the end of the replacement string to get a line break, but I couldn’t get that to work. Thus, the Ret variable defined at the beginning of the workflow.

Finally, Step 8 puts the text on the clipboard, ready for pasting into a LaTeX document.

My plan is to use this extension in Split View, with my text editor, currently Textastic, on one side and Contacts on the other. When I need to insert the client info, I find it in Contacts, tap Share Contact to bring up the Sharing Sheet, and select the Run Workflow action.

This brings up the list of Workflow Action Extensions that can accept Contacts. I choose LaTeX Address from the list, switch focus back to Textastic, and paste the text block where it belongs. Boom.

I’ll try to remember to look for magic variables the next time I make a workflow. There is a trick to making them visible. When you’re editing a workflow and can insert a variable (magic or otherwise), a button with a magic wand will appear in the special keyboard row.

Tapping it will give you a new view of your workflow, with the magic variables appearing where the workflow creates them.

You don’t need to do this, as all of these variables should appear in the special keyboard row if you keep scrolling it to the right. But I find it easier to understand what they are and where they come from in this view.

Thanks to everyone who had suggestions for me, especially Ari and Federico.

]]>
My feed reading systemhttp://leancrew.com/all-this/2018/02/my-feed-reading-system/
Sun, 04 Feb 2018 18:52:25 +0000http://leancrew.com/all-this/2018/02/my-feed-reading-system/
As promised, or threatened, here’s my setup for RSS feed reading. It consists of a few scripts that run periodically throughout the day on a server I control and which is accessible to me from any browser on any device. The idea is to have a system that fits the way I read and doesn’t rely on any particular service or company. If my current web host went out of business tomorrow, I could move this system to another and be back up and running in an hour or so—less time than it would take to research and decide on a new feed reading service.

For me, this is a very long script, but most of it is just the HTML template. What getfeeds does is go through my subscription list, gather all the articles from those feeds that I haven’t already read, and generate a static HTML file with the unread articles laid out in reverse chronological order. At the end of each article, it puts a button to mark the article as read and a form for adding a link to the article to my account at Pinboard.

Start by noticing that this is a Python 2 script, so Line 2 is a comment that tells Python that UTF-8 characters will be in the source code. We’ll also run into decode/encode invocations that wouldn’t be necessary if I’d written this in Python 3. I suppose I’ll translate it at some point.

Lines 16–19 are a function for adding an article to the database of read items. This is an SQLite database that’s also kept on the server. The database has a single table whose schema consists of just two fields: the blog name and the article GUID. Each article that I’ve marked as read gets entered as a new record in the database. The addItem function runs a simple SQL insertion command via Python’s sqlite3 library.

Lines 21–27 and 29–71 define my subscriptions: two lists of feed URLs, one for JSON feeds and the other for traditional RSS/Atom feeds. A lot of these feeds have gone silent over the past year, but I remain subscribed to them in the hope that they’ll come back to life.

Line 86 sets a parameter in the feedparser library that relaxes some of the filtering that library does by default. There is some danger to this, but I’ve found that some blogs are essentially worthless if I don’t do this. The comments above Line 86 contain links to discussions of feedparser’s filtering.

Lines 89–90 connect to the database of read items (note the fake path to the database file) and create a query string that we’ll use later to determine whether an article is in the database.

Lines 94–95 initialize the list of posts that will ultimately be turned into the HTML page and the n variable that keeps track of the post count.

Lines 102–105 initialize a set of variables used to handle timezone information and the filtering of older articles that aren’t in the database of read items. As discussed in the comments above Line 102 and in my previous post, old articles that aren’t in the database can sometimes appear in a blog’s RSS feed when the blog gets updated.

Lines 108–148 assemble the unread articles from the JSON feeds. For each subscription, the feed is downloaded, converted into a dictionary, and run through to extract information on each article. Articles that are in the database of read items are ignored (Lines 120-121). Articles that aren’t in the database are appended to the posts list, unless they’re more than three days old, in which case they are added to the database of read items instead of to posts (Lines 142–146).

Much of Lines 108–148 is devoted to error handling and the normalization of disparate input into a uniform output. Each item of the posts list is a tuple with

the article date,

the blog name,

the article title,

the article URL,

the article content,

the running count of posts,

the article author, and

the article GUID.

Lines 151–199 do for RSS/Atom feeds what Lines 108–148 do for JSON feeds. The main difference is that the feedparser library is used to download and convert the feed into a dictionary.

Lines 202–203 sort the posts in reverse chronological order. This is made easy by my choice to put the article date as the first item in the tuple described above.

Lines 204–206 generate a dictionary of lists of tuples, toclinks, for the HTML page’s table of contents, which appears at the top of the page. A table of contents isn’t really necessary, but I like seeing an overview of what’s available before I start reading. The keys of the dictionary are the blog names, and each tuple in the list consists of the article’s title and its number, as given in the running post count, n. The number will be used to create internal links in the HTML page.

From this point on, it’s all HTML templating. I suppose I could’ve used one of the myriad Python libraries for this, but I didn’t feel like doing the research to figure out which would be best for my needs. The ol’ format command works pretty well.

Lines 209–225 define the template for each article. It starts with the title (which links to the original article), the date, and the author. The id attribute in the title provides the internal target for the link in the table of contents. After the post contents come two forms. The first has two hidden fields with the blog name and the article GUID and a visible button that marks the article as read. The second form has the same hidden fields, a visible text field for Pinboard tags, and button to add a link to the original article to my Pinboard list. We’ll see later how these buttons work.

Lines 227–236 concatenate all of the posts, though their template, into one long stretch of HTML that will make up the bulk of the body of the page.

Line 239 defines a template for a table of contents entry (note the internal link), and Lines 240–250 then use that template to assemble the toclinks dictionary into the HTML for the table of contents.

The last piece, Lines 253–460, assembles and outputs the final, full HTML file. It’s as long as it is because I wanted a single, self-contained file with all the CSS and JavaScript in it. I’m sure this doesn’t comport with best practices, but I’ve noticed that best practices in web programming and design change more often than I have time to keep track of. Whenever I need to change something, I know it’ll be here in getfeeds.

The CSS is in Lines 257–410 and is set up to look decent (to me) on my computer, iPad, and iPhone. There’s a lot I don’t know about responsive web design, and I’m sure it shows here.

Lines 412–426 and Lines 428–441 define the markAsRead and addToPinboad JavaScript functions, which are activated by the buttons described above. These are basic AJAX functions that do not rely on any outside library. They’re based on what I read in David Flanagan’s JavaScript: The Definitive Guide and, I suspect, a Stack Overflow page or two that I forgot to preserve the links to. There’s a decent chance they don’t work in Internet Explorer, which I will worry about in the next life.

The markAsRead function triggers this addreaditem.py script on the server:

There’s not much to this script. It uses the same addItem function we saw before and a markedItem function uses the same query we saw earlier to check if an item is in the database. Lines 23–30 get the input from the form that called it, check whether that item is already in the database, and add it if it isn’t. There’s some minimal HTML for output, but that’s of no importance. What matters is that if the script returns a success, the markAsRead function changes the color of the button from green to red and the text of the button from “Mark as read” to “Marked!”

Before:

After:

The addToPinboard JavaScript function does essentially the same thing, except it triggers this addpinboarditem.py script on the server:

This script uses the Pinboard API to add a link to the original article. Line 9 defines my Pinboard credentials. Lines 12–16 extract the article and tag information from the form. Lines 19–24 connect to Pinboard and add the item to my list. If the script returns a success, the addToPinboard function changes the color of the button from green to red and the text of the button from “Pinboard” to “Saved!”

Before:

After:

The overall system is controlled by this short shell script, runrss.sh:

Line 3 runs the getfeeds script, sending the HTML output to a temporary file. Line 4 then changes to the directory that contains the temporary file, and Line 5 renames it. The file I direct my browser to is rsspage.html. This seeming extra step with the temporary file is there because the getfeeds script takes several seconds to run, and if I sent its output directly to rsspage.html, that file would be in a weird state during that run time. I don’t want to browse the page when it isn’t finished.

Finally, runrss.sh is executed periodically throughout the day by cron. The crontab entry is

*/20 0,6-23 * * * /path/to/runrss.sh

This runs the script every 20 minutes from 6:00 am through midnight every day.

So that’s it. Three Python scripts, one of which is long but mostly HTML templating, a short shell script, and a crontab entry. Was it easier to do this than set up a Feedbin (or whatever) account? Of course not. But I won’t have to worry if I see that Feedbin’s owners have written a Medium post.

]]>
Feed reader robustificationhttp://leancrew.com/all-this/2018/02/feed-reader-robustification/
Fri, 02 Feb 2018 02:28:24 +0000http://leancrew.com/all-this/2018/02/feed-reader-robustification/
I had a bit of shock this afternoon when I opened my RSS feed reader to see if anything was new.

Not much new, but a lot that’s old. Over 1400 posts from Kieran Healy, holder of the Krzyzewski Chair in Sociological R at the second best basketball university in North Carolina and author of a much-anticipated forthcoming book on how to make good graphs.

What happened? I don’t know for sure, but something in Kieran’s site generation software decided to include every post he’s written in his blog’s RSS feed. It’s an impressive body of work, going back to 2002, but I didn’t have time during my lunch hour to read it all.

checks each article against a SQLite database of articles I’ve already read; and

adds the article to a list if it’s unread;

After going through all the subscriptions, the script sorts the unread articles in alphabetical order and arranges them in a static HTML page on my server, adding a table of contents to the top of the page. The script runs via a cron job a few times an hour from 6:00 am until midnight.

So many of Kieran’s posts appeared today because my database of read posts is relatively young and only the last dozen or so of his articles are in it. It was all the earlier ones that were on my feed reader page.

This is my fault, not Kieran’s. I knew perfectly well when I wrote my script that blogging software will sometimes regenerate its feed with all new GUIDs for each article. When this happens, it makes the articles look new to the feed reader. I’d seen this happen even back when I was using professionally written feed reading apps. What made this especially troublesome for my definitely-not-professionally-written feed reading system was that it’s not equipped with a “Mark all as read” button. Which gave me three choices:

Do the programming to add a “Mark all as read” button, something I will almost never use.

Go through and individually mark all 1400 old posts as read so they get entered into the database and don’t appear again. Fat chance.

Figure out another way to add all these posts to the database.

Change my feed reading script to just ignore articles that are more than a few days old, regardless of whether they’re in the database.

I chose #4 because it was the quickest to implement and should protect me against this kind of thing happening again. Kieran’s older posts disappeared from my feed reading page, and my blog reading went back to normal. Afterward, though, I realized that I could have implemented #3 in combination with #4, ignoring the older articles for the purposes of assembing the feed reading page but adding them to the database of read articles to give me added protection against seeing them pop up again.

I’ll try to get that working in the next day or two and then post the script in its final form. I doubt that many people really want to set up their own feed reading system, but you never know.

]]>
Subplots, axes, Matplotlib, OmniGraffle, and LaTeXiThttp://leancrew.com/all-this/2018/01/subplots-axes-matplotlib-omnigraffle-and-latexit/
Tue, 30 Jan 2018 03:36:43 +0000http://leancrew.com/all-this/2018/01/subplots-axes-matplotlib-omnigraffle-and-latexit/
Matplotlib, I usually write a short post about it to reinforce what I’ve learned and to give me a place to look it up when I need to do it again. In my section properties post from last week, I had a 2×2 set of plots that helped explain which arctangent result I wanted to choose under different circumstances.]]>
When I learn something new in Matplotlib, I usually write a short post about it to reinforce what I’ve learned and to give me a place to look it up when I need to do it again. In my section properties post from last week, I had a 2×2 set of plots that helped explain which arctangent result I wanted to choose under different circumstances.

What was new to me was the use of the pyplot.subplots function to generate both the overall figure and the grid of subplots in one fell swoop. It’s possible that this technique was new to me because the documentation for Matplotlib’s Pyplot API doesn’t contain an entry for subplots.1 I don’t remember where I first learned about it—Stack Overflow would be a good guess—but I’ve since learned that pyplot.subplots is basically a combination of pyplot.figure and Figure.subplots.

Lines 6–10 define the four functions to be plotted. The x values are the same for each and the y values are named according to the quadrant they’re going to appear in. The y values are defined so the moments and product of inertia match the annotations shown in the graph. The actual numbers used in these definitions are less important than their signs and their relative magnitudes, as the plots are intended to be generic.

Line 12 then defines the figure and the array of “axes,” where you have to remember that Matplotlib unfortunately uses that word in a way that doesn’t fit the rest of the world’s usage. In Matplotlib, “axes” is usually treated as a singular noun and refers to the area of an individual plot. After Line 12, the axarr variable is a 2×2 array of Matplotlib axes.

Lines 13–19 then define the subplot in the upper left quadrant (what you learned as Quadrant II in analytic geometry class). Line 19 turns off the usual plot frame, and Lines 17–18 ensure there are no tick marks or labels. Lines 14—15 draw the [x] and [y] axes (here I’m using the normal definition of the word). You’ll notice that I’ve drawn the [x] axis at [y = 2] instead of [y = 0]. I didn’t like the way the graphs looked with the [x] axis lower, so I moved it up. Again, this doesn’t change the meaning behind the graph because it’s generic.

The rest of the lines down through 43 are just repetitions for the the other quadrants. Finally, Line 45 saves the figure to a PDF file that looks like this:

Now it’s time to annotate the figure. In theory, I could do this in Matplotlib, but that’s a lot of programming for something that’s more visual than algorithmic. If I were making dozens of these figures, I’d probably invest the time in annotating them in Matplotlib, but for a one-off it’s much faster to do it in OmniGraffle.

I can open the PDF directly in OmniGraffle and start editing. First, I select the white background rectangle that’s usually included in files like this and delete it. It doesn’t add anything, and it’s too easy to select by mistake. Then I select all the axes (again, the usual definition) and add the arrowheads.

The Edit‣Select‣Similar Objects command is very helpful in selecting repeated elements like this.

After placing red circles at the maxima, it was time to label the axes (yes, usual definition; we’re out of Matplotlib now) and add the annotations. I made the annotations in LaTeXiT, a very nice little program for generating equations to be pasted into graphics programs. I’ve been using it for ages.

LaTeXiT cleverly ties into your existing LaTeX installation, so you can take advantage of all the packages you’re used to having available. I usually have LaTeXiT use the Arev package because I like its sans-serif look in figures.

After adding all the annotations, I export the figure from OmniGraffle as a PNG, run it through OptiPNG to save a little bandwidth, and upload it to the server. If this were a figure for a report instead of the blog, I’d export it as a PDF.

I’ve complained about Matplotlib’s documentation before, so I’ll spare you the rant this time. ↩

]]>
Canvas and my remote iPadhttp://leancrew.com/all-this/2018/01/canvas-and-my-remote-ipad/
Sat, 27 Jan 2018 15:48:07 +0000http://leancrew.com/all-this/2018/01/canvas-and-my-remote-ipad/
an episode of Canvas that gave decent explanation of why I could give up my notebook computer.]]>
My older son’s notebook computer, an Asus bought a couple of years ago, has developed a hinge problem that’s reached the point where he doesn’t want to take it to class for fear of it falling apart. After talking over his needs, we decided he could get through the semester with my old MacBook Air. So I set it up for a new user, moved all of my files to an external disk, and delivered it to him yesterday. Coincidentally, on the drive back up through central Illinois, I listened to an episode of Canvas that gave decent explanation of why I could give up my notebook computer.

You could, of course, argue the every episode of Canvas is an explanation of how you can give up your notebook computer. It’s the podcast in which Federico Viticci and Fraser Speirs cover the software and work habits that allow you to use your iOS devices (especially the iPad) to accomplish things you might otherwise think you need a “real computer” to do. But Episode 52 was especially apropos because it covered SSH clients for iOS, which are the reason I feel comfortable in my current state, without a laptop computer for the first time in maybe 25 years or more.

I held off getting an iPad for several years, not because I thought it was a toy or a “consumption only” device, but because my work habits—lots of scripting and command-line use in a multi-window environment—weren’t aligned with the iPad’s strengths. I like to think I wasn’t an anti-iPad zealot during this time. I saw it as the perfect computer for many people, including my wife. I got her an iPad 2 back in 2011; she hasn’t touched a “real computer” since.

So when Split Screen and the iPad Pro were introduced, my ears pricked up. I got the 9.7″ model in late 2016 and have been slowly figuring out how to work with it. Panic’s Prompt and, more recently, the [mosh]1 client Blink Shell are my key apps. My typical setup is to have one of them on the right in Split Screen, connected to my iMac, while I edit in Textastic on the left. This edit/test system on my iPad is very similar to the BBEdit/Terminal window arrangement I use when working on a Mac.

The irony of using a modern, highly graphical device like the iPad to handle a remote, command-line connection to another computer is not lost on me. I often think back to using the Hazeltine terminal that was in a room around the corner from my graduate school office to connect to a Cyber 175 mainframe. And when my iPad is tethered to my iPhone, it’s not unlike using the Hazeltine’s acoustic coupler.

Mosh, the mobile shell, is a secure connection like SSH, but is designed to handle more gracefully the interrupted communications common to mobile connections. ↩

]]>
Transforming section properties and principal directionshttp://leancrew.com/all-this/2018/01/transforming-section-properties-and-principal-directions/
Fri, 26 Jan 2018 02:26:45 +0000http://leancrew.com/all-this/2018/01/transforming-section-properties-and-principal-directions/
The Python section module has one last function we haven’t covered: the determination of the principal moments of inertia and the axes associated with them. We’ll start by looking at how the moments and products of inertia change with our choice of axes.]]>
The Python section module has one last function we haven’t covered: the determination of the principal moments of inertia and the axes associated with them. We’ll start by looking at how the moments and products of inertia change with our choice of axes.

The formula for area,

[A = \iint\limits_A dx \, dy]

will give the same answer regardless of where we put the origin of the [x\text{-}y] coordinate system or how we orient them. You can see this if you think of the area as being the sum of all the little [dx\, dy] squares in the cross-section.

do not depend on the position of the [x\text{-}y] origin (because [x - x_c] and [y - y_c] measure the horizontal and vertical distances away from the centroid, which is the same for any origin), but do depend on the orientation of the axes. We’ll show how this works by putting the origin at the centroid (which simplifies the math but does not make the results any less general) and comparing the moments and product of inertia for two coordinate systems, one of which is rotated relative to the other.

Note that [\theta] is the angle from the [x] axis to the [\xi] axis and is positive in the counterclockwise direction.1

Because our origin is at the centroid, [x_c = y_c = \xi_c = \eta_c = 0], and we can write the equations for the moments and products of inertia in a more compact form:

Because the tangent function repeats itself every 180°, this expression can be solved with an infinite number of values of [\theta] that are 90° apart from one another. These orientations all look basically the same, except the [\xi] and [\eta] axes swap positions and flip around. For each of them, [I_{\xi\eta} = 0].

Since we’ve written the expression for [I_{\xi\eta}] in terms of [2\theta], lets’s do the same for [I_{\xi\xi}] and [I_{\eta\eta}]. We start by recognizing the double angle formula for sine in each equation:

Now let’s look at how these moments of inertia change with [\theta]. Suppose we wanted to find the [\theta] that maximized (or minimized) the value of [I_{\xi\xi}]? We’d take the derivative of the expression for [I_{\xi\xi}] and set it to zero:

the same thing we got setting [I_{\xi\eta} = 0]. And if we took the derivative of [I_{\eta\eta}] with respect to [\theta] and set it to zero to find the maxima and minima of [I_{\eta\eta}], we’d get the same thing.

These orientations of the axes are known as the principal directions of the cross-section. They give us both a product of inertia of zero and the largest and smallest values of the moments of inertia. (If [I_{\xi\xi}] is at a maximum, then [I_{\eta\eta}] is at a minimum, and vice versa.)

The largest and smallest moments of inertia are commonly called [I_1] and [I_2], respectively. They can be calculated by substituting our solution for [\theta] back into the expressions for [I_{\xi\xi}] and [I_{\eta\eta}], but there’s some messy math along the way. It’s easier to recognize that the maximum and minimum moments of inertia are determined entirely by the second and third terms of [I_{\xi\xi}] and [I_{\eta\eta}], which are in the form

[A \cos\alpha + B \sin\alpha]

This expression can be thought of as the horizontal projection of a pair of vectors, one of length [A] at an angle [\alpha] to the horizontal and the other of length [B] at right angles to [A].

The largest value of this expression will come when the hypotenuse of the triangle, [\sqrt{A^2 + B^2}], is itself horizontal and pointing to the right. The algebraically smallest value will come when the hypotenuse is horizontal and pointing to the left.

Applying this idea to our expressions for [I_{\xi\xi}] and [I_{\eta\eta}], the larger principal moment of inertia will be

The axis associated with the larger principal moment of inertia is called the major principal axis and the axis associated with the smaller principal moment of inertia is called the minor principal axis. These are sometimes called the strong and weak axes, respectively. Whatever you call them, they’ll be 90° apart.

Now let’s look at the principal function from the section module and see how these formulas were used.

Looks like I was careless with the [(I_{yy} - I_{xx})] term and got it backward in the expression for diff, doesn’t it? Also, there seems to be a stray negative sign in the expression for theta. But the principal function does work despite these apparent errors. What we’re running into is the sometimes vexing difference between math and computation.

First, in the formulas for [I_1] and [I_2], the diff term gets squared, so flipping its sign doesn’t matter. Second, the numerical calculation of the arctangent isn’t as straightforward as you might think.

There are two arctangent functions in Python’s math library (and in the libraries of many languages):

atan takes a single argument and returns a result between [-\pi/2] and [\pi/2] (-90° and 90°, but in radians instead of degrees).

atan2 takes two arguments, the [y] and [x] components of a vector directed out from the origin at the angle of interest, and returns a result between [-\pi] and [\pi] (-180° and 180°), depending on which quadrant the vector points toward.

We can’t use atan in our code because it isn’t robust for some inputs. If we tried

python:
theta = atan(2*Ixy/(Iyy - Ixx))/2

as our formula suggests, we’d get divide-by-zero errors whenever [I_{xx} = I_{yy}]. We can’t have that because there are real cross sections of practical importance for which that’s the case. Any equal-legged angle, for example.

But atan2 can be a problem, too, because we need to distinguish between the major and minor principal axes. In particular, I decided that theta should be the angle between the [x] axis and the major principal axis. Using atan2 directly from the formula like this

python:
theta = atan2(2*Ixy, Iyy - Ixx)/2

can return an angle 90° away from what we want.

Using the inertia function developed earlier, the moments and product of inertia of the equal-legged angle we just looked at are

Ixx = 9.4405
Iyy = 9.4405
Ixy = -5.1429

Plopping these numbers into the naive formula above, we get

theta = -0.7854

or -45°. This is the angle from the [x] axis to the weak axis, not the strong axis. The correct answer is 45°, just like the blue line in the figure.

To figure out a way around this problem, let’s plot [I_{\xi\xi}] for the four cases of interest:

[I_{xy} > 0 \quad \text{and} \quad I_{yy} > I_{xx}]

[I_{xy} > 0 \quad \text{and} \quad I_{yy} < I_{xx}]

[I_{xy} < 0 \quad \text{and} \quad I_{yy} < I_{xx}]

[I_{xy} < 0 \quad \text{and} \quad I_{yy} > I_{xx}]

This will let us see what we need for all four quadrants of the atan2 function.

In each of the subplots, successive peaks and valleys of [I_{\xi\xi}] are 90° apart.

We’re looking for the maximum values of [I_{\xi\xi}] that are closest to [\theta = 0], which I’ve marked with the red dots. That means

which we can visualize this way, where the curved arrows represent the angle [2\theta] for each type of result:

By flipping the signs of both arguments of atan2, we get the sign and magnitude of theta we’re looking for. Note that the expression used in the principal function,

python:
theta = atan2(-Ixy, diff)/2

is equivalent to

python:
theta = atan2(-2*Ixy, Ixx - Iyy)/2

because of the way we defined diff. And by flipping the signs of both the numerator and denominator, we’re not changing the quotient or the definition of [\theta]. We’re just choosing which solution of

[\tan 2\theta = \frac{2 I_{xy}}{I_{yy} - I_{xx}}]

is the most useful.

If you’re mathematically inclined, you may recognize the rotation of axes as a tensor transformation and the determination of principal moments of inertia and principal directions as a eigenvalue/eigenvector problem. But writing principal in those terms would have required me to use more libraries than just math. The formulas in principal are simple, even if their derivation can take us all over the map.

Now that we’ve figured out how principal works, what good is it? It can be shown4 that when the loads on a beam are aligned with one of the principal directions, the beam will bend in that direction only. If the loading is not aligned with a principal direction, the beam will bend both in the direction of the load and in a direction perpendicular to it.

For example, if we were using the equal-legged angle above as a beam and hung a vertical downward load off of it, it would bend both downward and to the left. Not the most intuitively obvious result, but true nonetheless.

Everyone who takes an advanced strength of materials class learns the formulas for the principal moments of inertia and their directions, but there’s usually a bit of hand waving to make the math go faster. And, because the strong and weak directions are typically easy to determine by inspection, the details of picking out the correct arctangent value aren’t discussed. But there’s a richness to even the simplest mechanics, I enjoy exploring it. And since computers can’t figure things out by inspection, you can’t gloss over the details when writing a program.

In case you don’t recognize them, the Greek letters[\xi] and [\eta] are xi and eta, respectively. They’re often used for coordinate directions when [x] and [y] are already taken. You’re probably more familiar with theta, [\theta], usually the first choice to represent an angle. ↩

Don’t believe it? Well, I told you it was non-obvious. But go ahead and multiply out the right hand side and see for yourself. ↩

Yes, the product of inertia integral is definitely more complicated if you’re going to do the derivation by hand. So don’t do it by hand. Learn SymPy and you’ll be able to zip through it.

This is entirely too much like those “it can be easily shown” tricks that math textbook writers use to avoid complicated and unintuitive manipulations. If I’m going to claim you can zip through the product of inertia, I should be able to prove it. So let’s do it.

SymPy comes with the Anaconda Python distribution, and that’s how I installed it. I believe you can get it working with Apple’s system-supplied Python, but Anaconda is so helpful in getting and maintaining a numerical/scientific Python installation, I don’t see why you’d try anything else.

If you’ve ever used a symbolic math program, like Mathematica or Maple, SymPy will seem reasonably familiar to you. My main hangup is the need in SymPy to declare certain variables as symbols before doing any other work. I understand the reason for it—SymPy needs to protect symbols from being evaluated the way regular Python variables are—but I tend to forget to declare all the symbols I need and don’t realize it until an error message appears.

That one personal quirk aside, I find SymPy easy to use for the elementary math I tend to do. The functions I use most often, like diff, integrate, expand, and factor, are easy to remember, so I don’t have to continually look things up in the documentation. And the docs are well-organized when I do have to use them.

The problem we’re going to look at is the solution of this integral for a polygonal area:

[\iint\limits_A xy\; dx dy]

We’ll use Green’s theorem to turn this area integral into a path integral around the polygon’s perimeter:

[\iint\limits_A xy\; dx dy = \oint\limits_C \frac{1}{2} x^2 y\; dy]

For each side of the polygon, from point [(x_i, y_i)] to point [(x_{i+1}, y_{i+1})], the line segment defining the perimeter can be expressed in parametric form,

[x = x_i + (x_{i+1} - x_i)\;t][y = y_i + (y_{i+1} - y_i)\;t]

which means

[dy = (y_{i+1} - y_i)\; dt]

Now we’re ready to use SymPy to evaluate and simplify the integral for a single line segment. To make the typing go faster as I used SymPy, which I ran interactively in Jupyter console session, I decided to use 0 for subscript [i] and 1 for subscript [i+1]. Here’s a transcript of the session, where I’ve broken up long lines to make it easier to read:

We start by importing everything from SymPy and defining all the symbols needed. Then we define the parametric equations of the line segment in In[3] and In[4].

In[5] does a lot of work. We define the integrand inside the integrate function and tell it to integrate that expression over [t] from 0 to 1 (i.e., from [(x_0, y_0)] to [(x_1, y_1)]). Note that we didn’t need to explicitly enter the expressions for [x], [y], or [dy]; SymPy did all the substitution for us, including the differentiation.

I called the result of the integration full because it contains every term of the integration. But we learned in the last post that the leading and trailing terms get cancelled out when we sum over all the segments of the polygon. So I copied just the inner terms from full and pasted them into In[7] to define a new expression, called part.

In[8] then factors part to get a more compact expression, and In[9] converts it to a LaTeX expression, so I can render it nicely here:

SymPy didn’t do everything for us. We had to figure out the Green’s function transformation and recognize the cancellation of the leading and trailing terms of full. But it did all the boring stuff, which is its real value.

]]>
Green’s theorem and section propertieshttp://leancrew.com/all-this/2018/01/greens-theorem-and-section-properties/
Wed, 17 Jan 2018 17:07:13 +0000http://leancrew.com/all-this/2018/01/greens-theorem-and-section-properties/
last post, I presented a simple Python module with functions for calculating section properties of polygons. Now we’ll go through the derivations of the formulas used in those functions.]]>
In the last post, I presented a simple Python module with functions for calculating section properties of polygons. Now we’ll go through the derivations of the formulas used in those functions.

The basis for all the formulas is Green’s theorem, which is usually presented something like this:

where [P] and [Q] are functions of [x] and [y], [A] is the region over which the right integral is being evaluated, and [C] is the boundary of that region. The integral on the right is evaluated in accordance with the right-hand rule, i.e., counterclockwise for the usual orientation of the [x] and [y] axes.

The section properties of interest are all area integrals. We’ll use Green’s theorem to turn them into boundary integrals and then evaluate those integrals using the coordinates of the polygon’s vertices.

Area

This is the easiest one, but instead of going through the full derivation here, I’ll refer you to this excellent StackExchange page by apnorton and just hit the highlights.

A note on the indexing: The polygon has [n] vertices, which we’ll number from 0 to [n-1]. The last segment of the boundary goes from [(x_{n-1}, y_{n-1})] to [(x_0, y_0)]. To make this work with the equation, we’ll define [(x_n, y_n) = (x_0, y_0)].

We start by checking the pts list to see if the starting and ending items match. If they don’t, we copy the starting item to the end to fit the indexing convention discussed above. We then initialize some variables and execute a loop, summing terms along the way. Rewriting the loop in mathematical terms, we get

[A = \frac{1}{2} \sum_{i=0}^{n-1}\; x_i y_{i+1} - x_{i+1} y_i]

This doesn’t look like the equation derived from Green’s theorem, does it? But it’s not too hard to see that they are equivalent. Expanding out the binomial product in the earlier equation gives

[x_{i+1} y_{i+1} - x_{i+1} y_i + x_i y_{i+1} - x_i y_i]

As we loop through all the values of [i] from 0 to [n-1], the leading term of one trip through the loop will cancel the trailing term of the next trip through the loop. Here’s an example for a triangle:

After the cancellations, all that’s left are the inner terms, and that’s the formula used in the area function.

The cancellation doesn’t do much for us here, changing from two additions and one multiplication per loop to two multiplications and one addition per loop. But we’ll see this same sort of cancellation in the other section properties, and it will provide greater simplification in those.

Centroid

The centroid is essentially the average position of the area. If a sheet of material of uniform thickness and density were cut into a shape, the centroid would be the center of gravity, the balance point, of that shape. The coordinates of the centroid are defined this way:

Moments and product of inertia

You may be familiar with moments and products of inertia from dynamics, where the terms are related to the distribution of mass in a body. The moments and product of inertia we’ll be talking about here—more properly called the second moments of area—are mathematically similar and refer to the distribution of area across a planar shape.

The moments and product of inertia that matter in beam bending are taken about the centroidal axis (i.e., a set of [x] and [y] axes with the origin at the centroid of the shape). Since we don’t know where the centroid is when we set up our coordinate system, our list of vertex points don’t work off that basis. But we can still calculate the centroidal moments and product of inertia by using these formulas:

]]>
Python module for section propertieshttp://leancrew.com/all-this/2018/01/python-module-for-section-properties/
Tue, 16 Jan 2018 03:34:25 +0000http://leancrew.com/all-this/2018/01/python-module-for-section-properties/
centroid, and the moments of inertia. Most of the time, I can just look these properties up in a handbook, as I did in this post, or combine the properties of a few well-known shapes. Recently, though, I needed the section properties of an oddball shape, and my handbooks failed me.]]>
A lot of what I do at work involves analyzing the bending of beams, and that means using properties of the beams’ cross sections. The properties of greatest importance are the area, the location of the centroid, and the moments of inertia. Most of the time, I can just look these properties up in a handbook, as I did in this post, or combine the properties of a few well-known shapes. Recently, though, I needed the section properties of an oddball shape, and my handbooks failed me.

In the past, I would open a commercial program that had a section properties module, draw in the shape, and copy out the results. But my partners and I stopped paying the license for that program several years ago, so that wasn’t an option anymore. I decided to write a Python module to do the calculations and draw the cross-section.

If the cross section is a polygon, there are formulas for calculating the section properties from the coordinates of the vertices. Most of the formulas are on the aforelinked Wikipedia pages and on this very nice page from Paul Bourke of the University of Western Australia. I’ll explain how and why the formulas work in a later post; for now, we’ll just accept them. For cross sections that aren’t polygons, we can create a close approximation by fitting a series of short straight lines to any boundary curve.

The key data structure is a list of tuples,1 which represent all of the vertices of the polygon. Each tuple is a pair of (x, y) coordinates for a vertex, and the list must be arranged so the vertices are in consecutive clockwise order. This ordering is the result of Green’s theorem, which is the source of the formulas.2

and the PNG file created from Line 7, named skewed.png, looks like this

As you might expect, the x-axis is horizontal and the y-axis is vertical. In addition to the shape itself, the outline function also plots the centroid as a black dot and the orientation of the principal axes. The major axis is the thicker bluish line and the minor axis is the thinner reddish line.

The outline function is the most interesting, in that it isn’t just the transliteration of a formula into Python. Lines 92–95 extract the extreme x and y values, and Line 98 calculates the size of a whitespace border (5% of the larger dimension) to keep the frame of the plot a reasonable distance away from the shape. This also makes it easy to crop the drawing to omit the frame. The ends of the principal axes are calculated in Lines 106–110 to make their lengths 20% of the smaller dimension; the idea is to make them long enough to see but not so long as to be distracting.

As noted in the comments, I chose the axis colors from a colorblind-safe palette given by Martin Krzywinski on this page. He got the palette from a paper by Bang Wong that I didn’t feel like paying $59 for (my scholarship has its limits). To better emphasize which is the major principal axis, I made it thicker.

Mechanical and civil engineers learn how to calculate section properties early on in their undergraduate curriculum, so it’s not a particularly difficult topic, but there is a surprising depth to it. Enough depth that I plan to milk it for three more posts, which I’ll link to here when they’re done.

Strictly speaking, any data structure that indexes like a list of tuples—a list of lists, for example—would work just as well, but because coordinates are usually given as parenthesized pairs, a list of tuples seemed the most natural. ↩

]]>
A small hint of big datahttp://leancrew.com/all-this/2018/01/a-hint-of-big-data/
Sat, 06 Jan 2018 16:59:21 +0000http://leancrew.com/all-this/2018/01/a-hint-of-big-data/
Shortly before Christmas, I got a few gigabytes of test data from a client and had to make sense of it. The first step was being able to read it.

The data came from a series of sensors installed in the some equipment manufactured by the client but owned by one of its customers. It was the customer who had collected the data, and precise information about it was limited at best. Basically, all I knew going in was that I had a handful of very large files, most of them about half a gigabyte, and that they were almost certainly text files of some sort.

One of the files was much smaller than the other, only about 50 MB. I decided to start there and opened it in BBEdit, which took a little time to suck it all in but handled it flawlessly. Scrolling through it, I learned that the first several dozen lines described the data that was being collected and the units of that data. At the end of the header section was a line with just the string

[data]

and after that came line after line of numbers. Each line was about 250 characters long and used DOS-style CRLF line endings. All the fields were numeric and were separated by single spaces. The timestamp field for each data record looked like a floating point number, but after some review, I came to understand that it was an encoding of the clock time in hhmmss.ssss format. This also explained why the files were so big: the records were 0.002 seconds apart, meaning the data had been collected at 500 Hz, much faster than was necessary for the type of information being gathered.

Anyway, despite its excessive volume, the data seemed pretty straightforward, a simple format that I could do a little editing of to get it into shape for importing into Pandas. So I confidently right-clicked one of the larger files to open it in BBEdit, figuring I’d see the same thing. But BBEdit wouldn’t open it.

As the computer I was using has 32 GB of RAM, physical memory didn’t seem like the cause of this error. I had never before run into a text file that BBEdit couldn’t handle, but then I’d never tried to open a 500+ MB file before. I don’t blame BBEdit for the failure—data files like this aren’t what it was designed to edit—but it was surprising. I had to come up with Plan B.

Plan B started with running head -100 on the files to make sure they were all formatted the same way. I learned that although the lengths of the header sections were different, they were collecting same type of data and using the same space-separated format for the data itself. Also, in each file the header and data were separated by a [data] line.

The next step was stripping out the header lines and transforming the data into CSV format. Pandas can certainly read space-separated data, but I figured that as long as I had to do some editing of the files, I might as well put them into a form that lots of software can read. I considered using a pipeline of standard Unix utilities and maybe Perl to do the transformation, but settled on a writing a Python script. Even though such a script was likely to be longer than the equivalent pipeline, my familiarity with Python would make it easier to write.

(You can see from the print commands that this was done back before I switched to Python 3.)

The script, data2csv, was run from the command line like this for each data file in turn:

data2csv file01.dat > file01.csv

The script takes advantage of the way Python iterates through an open file line-by-line, keeping track of where it left off. The first loop, Lines 6–8, runs through the header lines, doing nothing and then breaking out of the loop when the [data] line is encountered.

Line 10 prints a CSV header line of my own devising. This information was in the original file, but its field names weren’t useful, so it made more sense for me to create my own.

Finally, the loop in Lines 12–13 picks up the file iteration where the previous loop left off and runs through to the end of the file, stripping off the DOS-style line endings and replacing the spaces with commas before printing each line in turn.

Even on my old 2012 iMac, this script took less than five seconds to process the large files, generating CSV files with over two million lines.

I realize my paltry half-gigabyte files don’t really qualify as big data, but they were big to me. I’m usually not foolish enough to run high frequency data collection processes on low frequency equipment for long periods of time. Since the usual definition of big data is something like “too voluminous for traditional software to handle,” and my traditional software is BBEdit, this data set fit the definition for me.