Let's say that you've got a lot of numbers that represent bitmasks of some kind, and you want to count how many times each bit is on or off across the entire set. Maybe you're analyzing game positions represented as bitboards for an AI, or trying to find certain types of weaknesses in random-number generators, like in Forge (a successor to crypto-js) or Cryptocat (read the great write-up at Sophos).

So, you write some very straight-forward code to count the bits. It grabs one bitmask. If the lowest-order bit is set, it increments the counter for that bit position. Then, it right-shifts the bitmask and moves to the counter for the next bit. Repeat that for each bit in the mask, then repeat that for each bitmask:

You run your program, and it works correctly, but it's too slow. I'll show you how to speed this up. The technique, which applies to languages like Python or Javascript as well as to C, is both crazy, and crazy-fast!

A circular sector is the part of a circle bounded by two radii and
an arc connecting them. Determining whether a point lies in a
sector is not quite as easy as rectangular bounds-checking. There
are a few different approaches: θ-comparison, line-sides, and
dot-products. Continue reading "Bounds-checking with a circular sector"

I recently finished up the main development effort of a new project. It incorporates a lot of database techniques I'd used in previous projects, and several that I hadn't.

Description

The application is written in Django, but uses SQLAlchemy to the nearly complete exclusion of Django's own ORM. The only thing that touches Django's ORM is the session module.

The view-handling functions perform first-level security checks, input scrubbing and coercion from strings to other types, and reporting errors to the user. These functions then call into logic functions, which perform most of the interaction with the database. Database interactions use SQLAlchemy as basically a wrapper over the DBAPI, using strings for queries and passing parameters positionally.

The database is PostgreSQL (which has an excellent set of features). Database views and database functions do all the heavy lifting; the logic functions themselves mostly use their own parameters to filter or sort view results (at the database), do a simple join or two, or perform data updates. The views do a lot of aggregation, including statistics, modeling, and forecasts; much of the aggregation uses window functions.

Migrations are managed by a small set of simple bash scripts called pg-migrate. Migrations have a single total order, with individual migrations identified by contiguous positive ordinals. A migration script is a single SQL file, named with its ordinal, a hyphen, and some brief descriptive text allowed in filenames, and a .sql extension. The bash scripts automatically wrap individual migrations in transaction blocks during execution.

What went well

Using database views is a huge win for abstraction.

Views are the first level of named abstraction for queries, like functions are for code in general-purpose languages. By giving names to parts of complicated queries, they become much easier to understand.

It's very easy to play with queries at psql's REPL until they're right, then freeze them into views. If you decide that a query's grouping, partitioning, or aggregating must be changed, it's often easier to modify a query than it would be to modify the equivalent nested and sequential loops in non-declarative code.

With views, program logic becomes very simple—just selecting from a view, maybe filtering and ordering, but very little post-processing.

Views don't lose optimization. The query planner essentially expands all views used in a query to derive one base-level query, then applies its optimization heuristics over that. Thus, a database can optimize across named-view boundaries. Typically, a procedural language cannot optimize across calls between different compilation units, and an object-oriented language cannot even optimize across any dynamic dispatch calls (i.e. invoking a method on an object). What might require looping over the data dozens of times in a typically-factored object-oriented solution, when translated into equivalent database views, could very well be accomplished in a single pass over the data.

Migrations were also a big success. Deploys were quick—I could do several a day. Many deploys happened with zero downtime, without using any kind of database or application fail-over.

Migrations were designed to stop at useful points if a human needed to update data before continuing.

What went poorly

I put site-specific data into some early migrations. Major mistake! I had to fix-up early migrations in order to cleanly build databases for new sites.

Modifying existing views is sometimes difficult. Changing the output columns or column-types of a view means you must DROP it first, then CREATE it over again. Otherwise, you can get away with CREATE OR REPLACE. In either case, you must pull up the definition from the last migration (the view's definition in a database dump is annotated to show type information and scope-resolution, but this is too clumsy to edit as a primary source).

If a view must be DROP'd for an update, every view that references it must also be DROP'd then CREATE'd again. Because of this, a migration which changes a single view can get very large by pulling in every dependent view.

In this project, editing views was different from editing the program. The source files for a program are a snapshot of the entire program—to change the program, change the snapshot, then reload the program. Editing the views, on the other hand, required applying changes through a diff (the migration). The snapshot could not be edited in place. Diffs naturally have a different language than snapshots, as they express how to modify something that already exists while maintaining integrity, whereas snapshots express the structure that will exist, ex nihilo.

I also lacked any sort of rigor in defining views near the beginning of the project. Some views unnecessarily included all of the underlying tables' fields. When the structure of those tables changed, the views had to change as well.

Queries in Python are painful. They're all strings, so they don't wrap like code, and can't bind directly with variables.

Paths to improvement

A migration should be named with only a number. Comments at the beginning of the file can provide explanatory text. The migration scripts would be augmented to parse these leading comments and display them to the user, when requested.

Split schema definition into two parts, tentatively called structure and world. Structure would be expressed in migration files, and include operations to alter domains, tables, table constraints, or to update existing data to accommodate changing structure or constraints. It would also include any data that is not site-specific (site-specific data being banned from migrations). World would be a snapshot description of everything else—views, functions, etc.

To perform a migration, the migration scripts would drop everything in world, apply any pending migration scripts, then reload everything in world. Consideration would have to be given to reloading world in the event that a migration fails. Dropping world would also be disadvantageous to migrations that would benefit from defined views in order to update existing data.

The system should allow world to be split into multiple files, for some modicum of modularity. The system would have an option to discover a working load-order and save that order to a file, so that the modules may be reloaded quickly and without trial-and-error, for production migrations.

Views should be classified by type. I have discovered three basic types:

An extension view, which extends a table with some extra, easy-to-calculate or easy-to-join fields. Such views seldom have aggregations, and joins are usually one-to-one. Extension views should include every field of the base table. Code will usually query an extension view instead of its underlying table, to get the extension fields in addition to the base fields. Extension views are named after the underlying table, with _ext appended.

A processing view. These views may perform significant aggregation and joining over several tables, to provide some well-defined, narrowly-scoped, output. Processing views include only the applicable primary keys (or candidate keys) from the underlying tables, plus their own contributions. Processing views are named according to their purpose.

A summary view. Summary views are based on a single table or single extension view, but join information from related processing views. They include most or all fields from all data sources. Summary views are named with an appended _summary.

This scheme maximally decouples processing views, being the most complex, from changes in underlying tables. Extension and summary views have simple definitions, often using * in their select lists, and so are not so annoying to update to match participating tables.

I would also consider using SQLAlchemy's table-definition features. However, it can't be the definitive source from which the database is built (because you can't express migrations there; it's a snapshot language, not a diff language). This would allow better code-formatting and comprehension by IDEs of table and field names and such. This would violate the DRY principle (don't repeat yourself), though.

Style-checkers check code for certain statically-determinable
properties. These properties may affect dynamic behavior (e.g. all
references are definitely initialized before use), or they may be
purely textual (e.g. code is indented in a particular way).

Teams use style-checkers to ensure some typical dynamic and textual
standard in the code base. The idea is that common errors can be
prevented, and that a common textual style will aid readability,
especially between code from different authors.1

In reality, style-checking is far from ideal, especially regarding
textual consistency. Some section of code may have notably
worse readability if wrapped to the project’s standard of 80
columns. Some object allocation may not have a matching
deällocation because it is live for the program’s entire run.

No matter the reason, exceptions to the rules always come up, and
this is where style-checking becomes really difficult: how do you
handle the exceptions? If you do nothing, exceptions will accumulate
until the style-checker’s report is overwhelming. At best, the report
will be ignored; at worst, it will become demoralizing to developers.
In such a situation, style-checking may be worse than no
style-checking at all.

To keep the report positively motivating, you must make a change
somewhere:

You could alter the code to make it compliant with the style
rules. This may directly compromise either the design integrity or
actual readability of the code.

You could alter the style rules so that the case is no longer
exceptional. But the new rule will probably be more complicated, have
more edge cases, and still have other (and new) exceptions. Besides,
which is better: writing rules, or writing code?

You could mark the exception in the code with some kind of
meta-comment, so that the style-checker ignores it on future runs;
but, this adds clutter to the code and runs counter to the goal of
readability.

You could mark the exception somewhere else. This keeps the code
clean, and the developers in control, but this database would have to
be intelligent enough to keep up with code motion as the project
develops. The database should also be version-controlled with the
source that it annotates.

This last approach would yield a “perfect” solution: it would only
ever mark newly-introduced problems in the code. A “conservative,”
approximate solution (one that marks every new problem, and might mark
a few old) is possible if we make the following assumptions:

Code is under version-control;

The style-checker is run before or as part of the process of committing code to the main branch; and,

Genuine problems, when reported, will actually be addressed.

Given these constraints, we can see that a style-checker could
report on errors associated only with code features that are new or
changed in the current version, relative to the latest main-line
version. If this is the case, it would assume that any style
exception in the main-line is there intentionally, and should not be
reported on again (it would have been reported when it was previously
committed)—it would not need to track exceptions in a separate
database. I call this “differential style-checking.”

The code features that the style-checker tracks could be either
lines of text, or nodes of parse-trees. The latter might be more
accurate, but the former might be easier to implement and good enough
in most cases.

Here is a proof-of-concept. It works on only one Python file at a time, and the
output is messy, but it demonstrates the essential principles. pylint version
0.19.0, as packaged in Ubuntu
10.04, is the style-checker, with which I have no experience prior to
writing this.

We see that, according to pylint’s default checks, two new
(potential) problems were introduced: the new method doesn’t have a
docstring, and the new method could be a function outside of a
class.2
Running pylint normally on postgres_backend.py revision
10114 produces 47 warnings. (The entire Trac packages yields 10065.)

During most of college, I had an awesome sleeping habit. My alarm
was set for 7:30 am every day, without exception. And, every day,
without exception, I would wake up at 7:28 am, two minutes ahead of
the alarm.

It didn’t matter when I went to bad the night before. 10:00 pm,
2:00 am, it didn’t matter; though, if I had stayed up late, I would
get sleepy sooner and go to bed earlier next time.

But near the end of college, that magic broke. I started
struggling to maintain a decent wake schedule, waking up closer to
10:00 am, even with a normal day-time job.

Although I still don’t have the amazing abilities I had in college,
I was able to get myself back to a good wake schedule. Here’s how I
do it.

Reinforce your alarm response

On a day that you have some extra time, set your normal alarm for a
few minutes from now. Get in bed, close your eyes, and relax. As
soon as your alarm goes off, jump out of bed. Yes, really: jump.
Don’t just roll out—be active. Repeat this several times.

This is especially useful if you find yourself sleeping through
your alarm, and not remembering that it even went off.

Align your eating schedule

Your sleeping rhythm is affected by several inputs, one of those
being digestion.

Eat breakfast soon after waking. I usually eat something within 10
minutes of waking, although an hour or so might be just as good.
Don’t put off breakfast! It tells your body that this is a time to be
awake.

Don’t eat right before sleeping. If you must eat something, keep
it small and light. Fat and protein extend your digestion, and will
make you sleep in. Avoid simple carbohydrates as well, as they will
keep you awake.

Fast to make drastic adjustments

If your sleep schedule is way off of where you want it to be, skip
dinner. The lack of digestion lets your body sleep more easily, and
you’ll be more likely to eat a larger breakfast in the morning,
kicking your digestion into full gear.

If you constantly struggle to keep your sleep schedule set, you
might consider fasting (dinner) on a regular basis.

Get plenty of sunshine

Sun shining on your skin creates vitamin D and other chemicals that
are essential to effective use of other nutrients, good mood, and even
your sleep cycle.

If you work inside most of the day, take regular breaks to go
outside. Walk around the building at the least, or find some nearby
landscaping that you like and lay down in the sun.

If you really can’t get natural sunlight, many people recommend
full-spectrum lamps as fairly good substitutes.

Have something to look forward to

If you go to sleep with nothing but worries, you’re much more
likely to want to sleep in. The day has nothing positive, so why not
put it off, right?

So, about 30-60 minutes before you go to bed, think of one or two
very positive things for the coming day. Then, make sure to visualize
those as vividly as possible as you fall asleep. When your alarm goes
off in the morning, you’ll wake up excited, even if you’re not
consciously thinking about those things.

Don’t try to come up with something positive once you’re already in
bed—you’re likely to fall asleep before you do, or to not have enough
visualization time for them to sink in well enough. Take the time to
prepare for tomorrow!

You might be in genuinely stressful or negative situations right
now that are near impossible to avoid thinking about. That’s okay;
acknowledge it, but always keep it in terms of some form of
improvement, and make something to look forward to. Your life is your
own, to make better, or not, as you wish.

Be consistent

Set your alarm, get out of bed on time, eat early, enjoy the sun,
and think positive, every single day, even if it’s a weekend, holiday,
or vacation. Good, well-chosen habits are a key to success.