Recently, while working on the RockClimbing.com codebase, I encountered several situations where data needed to be pre-calculated/cached because on-the-fly calculations were too slow.

For example, RockClimbing.com has over 100,000 climbing routes scattered across 35,000 locations. The locations are essentially a tree structure, ranging in specificity from general area to a specific wall: “North America” > “California” > “Yosemite” > “El Capitan” > “North Face”.

As you browse different areas, it’d be nice to see the total number of climbing routes within each area. For example, if you’re viewing North America, we’d like to display the total number of climbing routes for each state.

In other words, for each node, find all child nodes and then sum the routes attached to those child nodes.

There’s a number of ways to map a tree structure to SQL tables. For simplicity, I store the locations as an adjacency list. Adjacency lists make it easy to insert and reorder nodes, although traversing the tree (especially downwards) can be slow because you have to walk through all the nodes in order to identify the children. An index helps, but it’s still not fast, especially when you need to traverse multiple levels deep.

This worked fine with my normal test dataset of 100 locations, but became too slow once I increased the test dataset to 50,000 locations.

Switching to a
while loop would improve things slightly since Python doesn’t support tail recursion. But the bigger problem is that accessing
location.children is expensive. My location model is configured with lazy='dynamic' which forces SQLAlchemy to issue a new query for every location. A better solution is pre-fetching all the locations and then iterating through them in memory, but that’s still a lot of data being unnecessarily transferred between the database and the app.
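The pre-fetch-and-iterate approach can be sketched with plain Python structures. This is a toy sketch, not the real models: the tuples stand in for rows fetched in two bulk queries.

```python
from collections import defaultdict

# Pretend these rows were fetched in two bulk queries:
# location rows as (location_id, parent_id), route rows as (route_id, location_id)
locations = [(1, None), (2, 1), (3, 1), (4, 2)]
routes = [(10, 2), (11, 2), (12, 4), (13, 3)]

def recursive_route_count(root_id, locations, routes):
    """Count routes attached to root_id and all of its descendants."""
    children = defaultdict(list)
    for loc_id, parent_id in locations:
        children[parent_id].append(loc_id)
    routes_per_loc = defaultdict(int)
    for _route_id, loc_id in routes:
        routes_per_loc[loc_id] += 1

    # iterative walk with an explicit stack, so no recursion-depth issues
    total, stack = 0, [root_id]
    while stack:
        node = stack.pop()
        total += routes_per_loc[node]
        stack.extend(children[node])
    return total

print(recursive_route_count(1, locations, routes))  # 4: every route falls under node 1
```

The downside, as noted above, is that all the location and route rows still have to travel from the database to the app just to produce one number.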

Another attempt using a Recursive CTE:

Instead of doing all this data transfer, if I can write a database query that traverses the tree, then I could handle most of the filtering work in the database and only return the results. Plus the database C code will be much faster than interpreted Python code.

Traditionally, traversing adjacency lists was impossible with normal SQL and required writing custom database functions. Thankfully, we’re using PostgreSQL which supports Recursive CTEs (a quick introduction) that make it a good deal easier:


-- Find the recursive child locations under the current node
WITH RECURSIVE children(id, parent_id) AS (
    -- base case
    SELECT location.id AS id, location.parent_id AS parent_id
    FROM location
    WHERE location.id = 37  -- matching on id includes the current object;
                            -- matching on parent_id includes only children
    -- combine with recursive part
    UNION ALL
    SELECT location.id, location.parent_id
    FROM location, children
    WHERE children.id = location.parent_id
)
-- (to inspect the raw tree, replace the SELECT below with: TABLE children;)
-- Now calculate the count of routes attached to these locations
SELECT count(route.id) AS count_recursive_child_items_for_single_loc
FROM route, children
WHERE route.location_id = children.id;

I tested this using a test dataset of 35,000 locations and 100,000 routes. For the most expensive queries near the top of the tree, it took about 200ms-250ms of database time to walk the tree and then calculate the cumulative route count. That’s a huge speedup over doing all the work in Python, but still not fast enough to dynamically generate on every page load, especially since some pages display the route counts for multiple locations, so they’d need to run the query multiple times.
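SQLite also supports WITH RECURSIVE, so the shape of this query can be experimented with even without a PostgreSQL instance. The schema and IDs below are toy assumptions, not the real RockClimbing.com tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (id INTEGER PRIMARY KEY, parent_id INTEGER);
    CREATE TABLE route (id INTEGER PRIMARY KEY, location_id INTEGER);
    INSERT INTO location VALUES (1, NULL), (2, 1), (3, 2), (4, 1);
    INSERT INTO route VALUES (10, 2), (11, 3), (12, 3), (13, 4);
""")

# Same structure as the PostgreSQL query: walk down from one node,
# then count the routes attached to the collected locations.
count = conn.execute("""
    WITH RECURSIVE children(id) AS (
        SELECT id FROM location WHERE id = ?      -- base case: the current node
        UNION ALL
        SELECT location.id FROM location, children
        WHERE location.parent_id = children.id    -- recursive case: walk down
    )
    SELECT count(route.id) FROM route, children
    WHERE route.location_id = children.id
""", (2,)).fetchone()[0]

print(count)  # 3: route 10 at location 2, routes 11 and 12 at location 3
```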

Pre-calculating the result and storing it:

The nice thing about this problem is we don’t require the results to always be 100% up to date. Only a handful of new routes are added every day, and nothing bad happens if someone browsing the website sees a count that is a few hours out of date.

So a better solution is to pre-calculate the result and cache it somewhere.

I decided against adding a
count_recursive_routes column to the
location table because it felt hacky–I’d rather separate the calculated/cached data from the original data. I expect the number of things I want to pre-calculate to grow over time, and I don’t want to keep stacking extra columns on the table. Using an in-memory datastore like Redis is another common solution, but so far we haven’t needed it, and I was hesitant to add the extra complexity to our stack just for this.

Instead, I decided to use a PostgreSQL Materialized View. If you’re not familiar with materialized views, my previous blog post provides a good overview.

Generalizing the Recursive CTE across the entire table:

The recursive CTE solution above calculates the recursive route count for a single location, but for our materialized view, we need to generalize the query so it returns route counts for all locations. On the surface, that sounds easy, but it turned out to be much harder. I finally turned to StackOverflow where Erwin Brandstetter suggested solving it using a PostgreSQL
ARRAY:


WITH RECURSIVE tree AS (
    SELECT id, parent_id, ARRAY[id] AS path
    FROM location
    WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.parent_id, path || c.id
    FROM tree t
    JOIN location c ON c.parent_id = t.id
)
, tree_ct AS (
    SELECT t.id, t.path, COALESCE(i.item_ct, 0) AS item_ct
    FROM tree t
    LEFT JOIN (
        SELECT location_id AS id, count(*) AS item_ct
        FROM route
        GROUP BY 1
    ) i USING (id)
)
SELECT t.id
     , t.item_ct AS count_direct_child_items
     , sum(t1.item_ct) AS count_recursive_child_items
FROM tree_ct t
LEFT JOIN tree_ct t1 ON t1.path[1:array_upper(t.path, 1)] = t.path
GROUP BY t.id, t.item_ct;

This query took approximately 18 minutes to calculate results for the entire test dataset of 35,000 locations and 100,000 routes. This is plenty fast for something that only needs to run once a day as a background job.
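The output of the generalized query (a direct count plus a recursive count per location) can be mirrored in plain Python, which is handy for sanity-checking the SQL against a small dataset. Again, the data below is a toy assumption:

```python
from collections import defaultdict

locations = [(1, None), (2, 1), (3, 2), (4, 1)]  # (id, parent_id)
route_locations = [2, 3, 3, 4]                   # location_id of each route

children = defaultdict(list)
for loc_id, parent_id in locations:
    children[parent_id].append(loc_id)
direct = defaultdict(int)
for loc_id in route_locations:
    direct[loc_id] += 1

def recursive_count(node):
    # direct routes on the node, plus everything under its children
    return direct[node] + sum(recursive_count(c) for c in children[node])

# (count_direct_child_items, count_recursive_child_items) per location
counts = {loc_id: (direct[loc_id], recursive_count(loc_id))
          for loc_id, _ in locations}
print(counts)  # {1: (0, 4), 2: (1, 3), 3: (2, 2), 4: (1, 1)}
```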

Converting the recursive CTE turned out to be more straightforward than I expected. I incrementally built the various components, then combined them into a single query. As mentioned previously, if you’re using Flask-SQLAlchemy, the query needs to be built using the SQLAlchemy core
select() function, rather than the more typical
session.query().

Anytime you’re translating a complicated SQL query to SQLAlchemy, it’s much easier if you print the individual components to make sure you’re assembling the query correctly. In the below code, I left in a few of these print statements that I used for debugging. You’ll notice that sometimes to see the actual output you need to tell SQLAlchemy to compile the statement using the
postgresql.dialect().

The final print debug statement prints a SQL query that exactly matches the definition of the materialized view (excluding the
CREATE MATERIALIZED VIEW portion). Now we just need to pass this to the create_mat_view() function.

The more I use SQLAlchemy, the more impressed I am with it (and Mike Bayer, the primary developer behind it). Not many ORMs support creating materialized views using a Recursive CTE, the PostgreSQL-specific
ARRAY datatype, and more.

If you’re not familiar with SQL views, they’re basically a temporary table whose contents are the output of another SQL query. Normally the table generated by the view only lasts for the duration of a single containing query, but a materialized view stores (materializes) the results so they can be used in later queries.
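The distinction shows up directly in PostgreSQL’s DDL. The view and column names below are illustrative, not from the real schema:

```sql
-- computed once and stored:
CREATE MATERIALIZED VIEW location_route_count_mv AS
SELECT location_id, count(*) AS route_count
FROM route
GROUP BY location_id;

-- later queries read the stored rows just like a table:
SELECT route_count FROM location_route_count_mv WHERE location_id = 37;

-- recompute on demand:
REFRESH MATERIALIZED VIEW location_route_count_mv;
```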

Materialized views offer a number of benefits:

Fast and simple retrieval. Retrieving data from a materialized view is just like writing queries against a normal table:

Doesn’t add complexity to your infrastructure because it’s built into your database and requires no additional services.

Drawbacks of materialized views:

Any data you want to store in a materialized view has to be the result of another SQL query–you can’t generate the data in your app and then store it in the materialized view.

Like any cache, data can go stale. You need to remember to manually issue a
REFRESH command to your database. Either your app can update the materialized view after it inserts/updates the source data, or if that causes too many updates, just run a cron job that refreshes the materialized view on a regular schedule. You can also use a database trigger, although you’ll need to create a custom
plpgsql function.

You can’t update a single row–only the entire view.

Overall, materialized views in PostgreSQL are a very convenient way to add caching to many applications.

Ways to teach SQLAlchemy about a materialized view:

In order to write ORM queries against my materialized views, I needed to tell SQLAlchemy about them. Querying a materialized view is very similar to querying a normal SQL table, so the easiest way to do this is to manually create the materialized view in your database. Then, within your app, create a normal SQLAlchemy Table mapping that corresponds to the columns in your materialized view.

While this makes it possible to query the materialized view, it also means that you’ll hit problems when creating/dropping all your tables because SQLAlchemy sees the materialized view as a normal SQL table. Creating/dropping all tables is uncommon in production, but it’s a very common thing to do in a test suite or your local dev instance. Additionally, using a traditional SQLAlchemy table mapping makes it difficult to refresh the materialized view from within your app. I already manage all my database tables and metadata with SQLAlchemy, and wanted the convenience of fully managing my materialized views as well.

Adding custom DDL commands to SQLAlchemy for materialized views:

SQLAlchemy doesn’t have a built-in command for creating views, let alone materialized views. But it is possible to subclass
DDLElement and create your own custom DDL commands. After several iterations (and help from Mike Bayer), here’s what I ended up with:

t = db.Table(name, _mt)  # the actual mat view class is bound to db.metadata
for c in selectable.c:
    t.append_column(db.Column(c.name, c.type, primary_key=c.primary_key))
db.event.listen(
    metadata, 'after_create',
    CreateMaterializedView(name, selectable)
)
db.event.listen(
    metadata, 'before_drop',
    db.DDL('DROP MATERIALIZED VIEW IF EXISTS ' + name)
)
return t

Security Note: Using string interpolation to assemble SQL commands always makes me nervous about accidentally creating a SQL injection vulnerability. None of these commands include user-input, so we’re fine. However, be careful not to accidentally re-use this code somewhere else where you are accepting user input.

Creating a materialized view using SQLAlchemy:

This lets us create a materialized view by passing a
SELECT statement generated with SQLAlchemy core to our custom
create_mat_view() function.

Here’s an example of how you’d call it–this calculates the total number of reviews and average review rating for each GearItem:


class GearItemMV(db.Model):
    __table__ = create_mat_view("gear_item_mv",
        db.select(
            [GearItem.id.label('id'),
             db.func.count(GearReview.id).label('review_count'),
             db.func.avg(GearReview.rating).label('review_rating')]
        ).select_from(db.join(GearItem, GearReview, isouter=True)
        ).group_by(GearItem.id))

The above code will emit the following DDL statement to the database:


CREATE MATERIALIZED VIEW gear_item_mv AS
SELECT
    gear_item.id AS id,
    count(gear_review.id) AS review_count,
    avg(gear_review.rating) AS review_rating
FROM
    gear_item
LEFT OUTER JOIN
    gear_review
    ON gear_item.id = gear_review.gear_item_id
GROUP BY
    gear_item.id

Sidenote: I’m using the factory pattern to create my Flask app, so I can only pass a
db.select() query to
create_mat_view(). Trying to use the normal
db.session.query() from the SQLAlchemy ORM won’t work because it creates a circular dependency. The Flask-SQLAlchemy
sessionmaker() requires an app context, but the
db object (and table mappings) are processed by Python before the app is instantiated. So when SQLAlchemy tries to setup the table mappings, calling
db.session() calls the Flask-SQLAlchemy
sessionmaker() which throws an exception because there’s no app context. Using
db.select() from SQLAlchemy Core avoids this issue because it doesn’t create a session. Within the selectable itself, it’s fine to use ORM constructs such as object references like
GearReview. This works because SQLAlchemy doesn’t have to create a session in order to resolve what a
GearReview is.

Querying a materialized view using SQLAlchemy:

Since our recipe handles both creating the materialized view and mapping it to an ORM object, querying is very simple:

data = db.session.query(GearItemMV).get(row_id)

Alternatively, if you’re using Flask-SQLAlchemy:
GearItemMV.query.get(row_id)

Additionally, this example has a 1:1 mapping between the
GearItem objects and the pre-calculated data about them stored in
GearItemMV, which makes for convenient queries like this:
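A query of that shape joins the original table to the materialized view on the shared primary key. This is a sketch; the exact column list is an assumption:

```sql
SELECT gear_item.*, gear_item_mv.review_count, gear_item_mv.review_rating
FROM gear_item
JOIN gear_item_mv ON gear_item_mv.id = gear_item.id
WHERE gear_item_mv.review_count > 10;
```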

While convenient, this results in brittle code because we have direct queries against the materialized view littered throughout our codebase. A better way is to map the materialized view data as attributes on the original objects, decoupling the attribute call from the underlying implementation of how the attribute is calculated/cached. This avoids having to change a bunch of controller/view code if we switch caching mechanisms.

Mapping the cached result as an attribute on the original object:

SQLAlchemy supports several ways to map the materialized view output onto the original object.

Because column_properties are evaluated at import time, this has to be defined after the GearItemMV definition; it cannot be declared inline with the GearItem model definition.

There were three main problems with this approach:

Under the covers,
column_property() uses correlated subqueries. Correlated subqueries are slow because the query planner can’t optimize them–it effectively has to run a complete subquery for every row returned by the parent query. So unlike traditional set-based SQL where retrieving additional rows is
O(log n), the work required to retrieve
column_properties is
O(n*m) where
n is the number of objects and
m is the number of
column_properties on the objects.
For example, let’s say my GearItem class has two column properties
review_count and
review_rating, and I want to retrieve thirty objects from the database. Effectively the database is processing the outer query to return thirty items, plus an additional query for each correlated subquery for each item. You won’t notice this in the SQL query logs because only one query is sent to the database, but if you actually run
EXPLAIN on the query, you’ll see that effectively the one query ballooned into doing the work of 61 queries!

SQLAlchemy can’t take advantage of the pseudo-caching ability of the session since this query doesn’t use
session.query.get(). Normally, once an object has been loaded within a session, querying it again using
db.session.query(object).get(pk_id) will retrieve the existing object from the SQLAlchemy session without re-querying the database. This only happens using the special
get() function which first checks the session’s identity map for the primary key.

Column properties are calculated as soon as an object is loaded. So whenever I load a GearItem, even if I never access the review data, I’m still triggering the expensive correlated subquery.

After a little more research, I switched to mapping the attributes onto the parent object using a SQLAlchemy
@hybrid_property. Like a normal python property, it’s only evaluated on access, plus it provides additional benefits.

Here was my initial attempt:


# within GearItem model definition
@hybrid_property
def review_count(self):
    # if the GearItem object doesn't have an ID, it was just created and
    # hasn't been flushed to the database yet, so it won't have a
    # corresponding entry in the materialized view
    if self.id:
        gi_mv = GearItemMV.query.get(self.id)
        # GearItemMV definition uses a LEFT JOIN, so if self.id exists, yet
        # GearItemMV(self.id) doesn't exist, then we know the materialized
        # view needs refreshing
        if gi_mv is not None:
            return gi_mv.review_count

It works fine for querying individual items. It also caches the GearItemMV object in the session, so if I retrieve
review_count and then later retrieve
review_rating it will only emit one database call.

However, it doesn’t work for querying at the class level (e.g.,
db.session.query(GearItem).filter(GearItem.review_count < 4)). The SQLAlchemy docs suggest handling this using hybrid_property.expression(), but my attempts felt needlessly overcomplicated and inelegant.

After some more thought, I realized that due to the 1:1 relationship between GearItem rows and GearItemMV rows, I could let the ORM handle the mapping as if it were a traditional foreign-key based relationship. This matched how I was mentally thinking about the
original_table:
materialized_view relationship, so it should make it easier to write queries.

The only problem is that materialized views can’t have actual foreign keys in the database.

It turns out that SQLAlchemy lets us define table-to-table relationships that effectively function as foreign-key relationships, but only within SQLAlchemy. They are never actually emitted to the database as foreign key constraints:


# within GearItem model definition
mat_view = db.relationship('GearItemMV', backref='original',
    uselist=False,  # makes it a one-to-one relationship
    primaryjoin='GearItem.id==GearItemMV.id',
    foreign_keys='GearItemMV.id')

This results in a much cleaner
@hybrid_property:


# within GearItem model definition
@hybrid_property
def count_gear_items(self):
    if self.mat_view is not None:  # if None, mat_view needs refreshing
        return self.mat_view.count_gear_items

Refreshing a materialized view using SQLAlchemy:

Refreshing a materialized view is straightforward:

REFRESH MATERIALIZED VIEW mat_view_name

Unfortunately, this locks the materialized view until the refresh completes. PostgreSQL 9.4 added
CONCURRENTLY to let reads continue at the expense of a longer refresh time:

REFRESH MATERIALIZED VIEW CONCURRENTLY mat_view_name

My refresh function:


def refresh_mat_view(name, concurrently):
    # since session.execute() bypasses autoflush, must manually flush in order
    # to include newly-created/modified objects in the refresh
    db.session.flush()
    _con = 'CONCURRENTLY ' if concurrently else ''
    db.session.execute('REFRESH MATERIALIZED VIEW ' + _con + name)

For convenience, I turned this into a class method on my materialized views:


class MaterializedView(db.Model):
    __abstract__ = True

    @classmethod
    def refresh(cls, concurrently=True):
        '''Refreshes the current materialized view'''
        refresh_mat_view(cls.__table__.fullname, concurrently)

All my materialized views inherit from this MaterializedView instead of the normal declarative base class. So now I can call
GearItemMV.refresh() to refresh that specific view.

Similarly, I also added a function for refreshing all views:


def refresh_all_mat_views(concurrently=True):
    '''Refreshes all materialized views. Currently, views are refreshed in
    non-deterministic order, so view definitions can't depend on each other.'''
    for mat_view in MaterializedView.__subclasses__():
        mat_view.refresh(concurrently)

You’ll need to tweak this if your materialized views depend on one another, or if your database includes normal (non-materialized) views. Alternatively, you can refresh all your materialized views using a custom plpgsql function.

If you’re following this blog post step-by-step and try these refresh functions, PostgreSQL will complain that refreshing concurrently requires a unique index.
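A plain unique index on the view’s id column satisfies PostgreSQL. The index name here is an arbitrary example:

```sql
CREATE UNIQUE INDEX ix_gear_item_mv_id ON gear_item_mv (id);
```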

Indexing a materialized view using SQLAlchemy:

Initially I tried creating an index using the normal SQLAlchemy command:

Index(GearItemMV.id).create()

However, that threw an exception because SQLAlchemy tries to create the index as part of the
db.metadata creation process, but our materialized view recipe doesn’t actually create the materialized view until after the
db.metadata creation finishes. So we need to tell SQLAlchemy to delay creating indexes on materialized views until after the materialized view is created:


# this is located within the create_mat_view() function
# see full example code at end of blog post for details
@db.event.listens_for(metadata, 'after_create')
def create_indexes(target, connection, **kw):
    for idx in t.indexes:
        idx.create(connection)

Note that rather than creating a custom event to trigger when the materialized view is created, I instead hooked into the
after_create event for
db.metadata. This is the same event that my create materialized view command hooks onto, so theoretically the create index could happen before the materialized view is created. However, I asked on the SQLAlchemy mailing list, and Mike Bayer said the events are executed in deterministic order based on when they were registered as event listeners. Because the create index call is registered after the create materialized view call, everything should work fine.

A few other notes regarding indexes for materialized views:

IF NOT EXISTS can’t be included in the
CREATE INDEX command for materialized views.

Creating an index uses a SHARE lock, so it doesn’t lock out reads, only writes. Since materialized views are by definition read-only, it’s pointless to create an index using
CONCURRENTLY (which uses a SHARE UPDATE EXCLUSIVE lock that allows writes).

Ansible is a great tool with a lot of flexibility. It’s generally the easiest configuration management tool for new users to start with due to the batteries-included philosophy, straightforward DSL, and daemonless push model.

However, as your infrastructure goals become more complex, the flexibility means it’s less obvious how things should be structured.

For example, it’s very common to reuse the same role across multiple projects, and I’ve talked with many people who handle this by copy/pasting the role in each project. Anytime they make a change to the role, they have to remember to manually update all the projects that have copies of that role, which is tedious and error-prone.

There is a better way.™ A couple of little-known Ansible features can be combined to easily share a single role across multiple projects without duplicating code.

To do this, I stick all my shared roles in a single master folder that gets shared across all my projects. This avoids the tediousness of manually copy/pasting and updating multiple copies of the same role. If you want more granularity, this technique also supports organizing groups of roles into dedicated folders–perhaps one for roles used in work projects and one for personal projects.

Then I modify each project’s
ansible.cfg to tell Ansible to look for roles in that master folder in addition to the local project folder.

Sample
ansible.cfg:


[defaults]

roles_path = ~/Code/ansible_roles

Ansible first searches the local project for a role, then searches the
roles_path. You can specify multiple paths by separating them with colons.

By default,
ansible-galaxy install username.rolename will install the role to the
roles_path configured in
ansible.cfg, so that’s pretty much all you need to do.

Occasionally I want to install the role into the specific project and not the master folder–for example, to avoid version conflicts when two roles have dependencies that require different versions of the same role. In that case, you can use the
-p ROLES_PATH or
--roles-path=ROLES_PATH option:
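For instance, installing a role into the project’s local roles/ folder (the role name is just an example):

```
ansible-galaxy install -p ./roles jeffwidman.elasticsearch
```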

Alternatively, in your project’s
requirements.yml, you can manually specify where you want a role to be installed:


# from galaxy

# from galaxy
- src: jeffwidman.elasticsearch

# from private github repo, installing to a relative path
- src: https://github.com/jeffwidman/private_ansible_role
  path: vagrant/roles/

If you want to customize things further, there’s currently some discussion about Ansible 2.0 adding support for multiple
ansible.cfg files which would let you easily set
roles_path at varying levels of specificity. Ansible will read
ANSIBLE_CONFIG,
ansible.cfg in the current working directory,
.ansible.cfg in the home directory or
/etc/ansible/ansible.cfg, whichever it finds first.

It’s been a few years since I switched from Bash to ZSH on my personal laptop, and I’ve never regretted it. Out of the box, ZSH isn’t that great, but after customizing it using Prezto it’s an amazing timesaver in a lot of little ways.

Oh-My-ZSH is the most well-known ZSH customization framework, but it’s gotten pretty bloated over the years with 200+ plugins, 140+ themes, etc. With that many plugins, some of them are bound to conflict from time to time, and keeping everything straight can be a bit of a mess. It’s also not uncommon for OMZ users who enable a lot of plugins to report shell startup times > 3 seconds, which is annoying.

Prezto started as a fork of Oh-My-ZSH focused on improving performance, and now is an entirely separate project. The Prezto maintainer is a bit gruff and much more inclined to say ‘no’ than ‘yes’ to pull requests, but I see this as a benefit because it minimizes bloat. 95% of the modules that I wanted were already built-in, and it was easy enough to fork the project to add a custom module for the Atom text editor.

Prezto is well-documented, as each module/plugin has its own Readme. Plus there are plenty of blog posts around discussing how to get started with Prezto. Personally, I found it best to just read the Readme for each module, decide which ones I wanted, and enable them. I use Gnu Stow to symlink the
.zshrc,
.zlogin, and
.zpreztorc to my
~/.dotfiles/zsh/ folder and manage them as part of my dotfiles repo.

I prefer the Sorin theme, as it strikes a good balance of displaying enough information to be useful without showing so much that it’s distracting. For example, rather than showing the full path, the prompt only displays the first character of each parent folder. This saves space and focuses the eye on the name of the current folder, while not leaving you wondering whether you’re in
first_app/requirements/ or
second_app/requirements/. Behind the scenes, there are some nice touches such as retrieving Git information asynchronously in the background so it doesn’t slow down the display of the command prompt.

However, the one thing I perpetually missed in the Sorin theme was the active
virtualenv – I was always typing
which python. So this afternoon I finally sat down and figured out how to customize the theme to display it.

The only tricky part: if you want the virtualenv name on the right-hand side, add
$python_info[virtualenv] to RPROMPT on line 83, not line 143. I’m not an expert on ZSH scripting, but as best I can tell, this is because Sorin’s theme retrieves the git info asynchronously in the background, and then overwrites the default
RPROMPT on line 143 with the output from line 83 when the git info returns.

Someone on the Flask mailing list asked how to serve multiple Flask apps via uWSGI + Nginx. Anytime you’re working with uWSGI there are multiple ways to do things, but here’s how I do things for RockClimbing.com. I spent several days reading the uWSGI docs + various blog posts around the net, so this should be reasonably correct.

This example shows how to serve multiple Flask apps where each app has its own domain name. If you want to mount multiple Flask apps under a single domain name, see this example in the Flask docs (the pull request hadn’t been merged as of the time of this writing).

In general, use the uWSGI Emperor, even if you’re only running a single Flask app. The Emperor is a master process that watches over the app(s) to make sure everything is running correctly. Each child app is called a ‘vassal’. If for some reason an app/vassal crashes, then the Emperor will reload it. A number of older blog posts recommend managing your uWSGI apps with supervisord… use the Emperor instead because it offers extra benefits like automatically reloading the app/vassal if you make changes to the vassal’s uWSGI config file.

Create a very basic uWSGI Emperor config called
/etc/uwsgi/emperor.ini:


[uwsgi]
emperor = /etc/uwsgi/vassals

If you want to pass a particular option to all vassals, you can specify the option in the emperor.ini file using the vassal-set parameter.

Create a simple vassal config file in
/etc/uwsgi/vassals/app_name.ini. By default, the config file name will be used for the vassal process name. I manage my vassals using Ansible, so this config has several Jinja2 variables that look like
{{variable}}. Just manually replace those with what you need.

You can bind the app to either a TCP socket or a Unix socket. Just make sure Nginx is passing requests to the same socket that the vassal is listening on.

For security, specify a non-root Linux user/group for the vassal to run under. Typically you’ll run each vassal as a separate user/group, and then run the Emperor as root so it can start each app process and then drop privileges before actually serving requests.

Since we’re walking through how to run multiple Flask apps, you’ll want to run each under a separate virtualenv to avoid package conflicts. For example, if one Flask app requires sqlalchemy 0.8 and another requires sqlalchemy 0.9, they’ll need to be in separate virtualenvs. uWSGI makes it easy to specify which virtualenv to run the vassal under by passing the
virtualenv parameter.

For Flask, typically the callable is
app, and the
module is the filename where app is defined. You’ll also need to tell uWSGI to cd to the path of the Flask app before trying to import
module.

Lastly, if you do any googling about how to scale Flask + uWSGI, you’ll hit a blog post by David Cramer where he found it better to run multiple uWSGI instances and have Nginx handle the load balancing. The thundering herd problem that David experienced is better solved by setting
thunder-lock=true in your vassal config (or set it globally for all vassals in your
emperor.ini config). It’s better to let uWSGI handle the load balancing rather than Nginx because Nginx doesn’t know which uWSGI processes are busy and just round-robins through them when it sends requests. If instead you let uWSGI handle the load balancing, it will intelligently pass requests to the processes that are free.

There’s also a number of options commented out–those are simply reminders to myself that those options exist, but I don’t currently use them.


[uwsgi]
socket = 127.0.0.1:3031
# for unix sockets, see `chmod-socket` and `chown-socket`
stats = 127.0.0.1:9191
# use with uwsgitop:
# http://uwsgi-docs.readthedocs.org/en/latest/StatsServer.html
logto = /var/log/uwsgi/%n.log

uid = {{api_rc_flask_linux_user}}
gid = {{api_rc_flask_linux_user}}

binary-path = /usr/bin/uwsgi
virtualenv = {{api_rc_flask_virtualenv}}
chdir = {{api_rc_flask_path}}
module = manage
callable = app

processes = 2
threads = 4
# to dynamically scale workers see:
# http://uwsgi-docs.readthedocs.org/en/latest/Cheaper.html
enable-threads = true
thunder-lock = true

# http://uwsgi-docs.readthedocs.org/en/latest/OffloadSubsystem.html
offload-threads = 1  # one per CPU is a basic start

# harakiri = 30

# respawn processes after serving 5000 requests (avoid memory leaks)
max-requests = 5000

# clear environment on exit
# vacuum = true

# checks python modules every N seconds for changes and autoreloads
# useful in development, avoid in production
# {% if py_autoreload is defined %}
# py-autoreload = 2
# {% endif %}

# limit-as = 512 # ?? for monitoring memory?

# From the uWSGI mailing list:
# 30-40 MB per worker is pretty normal.
#
# To gain memory you can move to multithreading:
#
# master = true
# processes = 2
# threads = 2
# thread-stacksize = 512
#
# should be good enough. Monitoring memory is a good thing,
# use --reload-on-rss 80 to avoid your app growing up

At this point, test that the Emperor starts and correctly loads the vassals by running
uwsgi --emperor /etc/uwsgi/emperor.ini. In production, I use
systemd to manage the Emperor–the uWSGI docs have an excellent example systemd config file.
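In case it helps, here's a minimal sketch of such a unit (the file name and paths are assumptions; the version in the uWSGI docs is more complete):

```ini
# /etc/systemd/system/emperor.uwsgi.service (name and location assumed)
[Unit]
Description=uWSGI Emperor
After=network.target

[Service]
ExecStart=/usr/bin/uwsgi --emperor /etc/uwsgi/emperor.ini
Restart=always
KillSignal=SIGQUIT
Type=notify
NotifyAccess=all

[Install]
WantedBy=multi-user.target
```

Then `systemctl enable emperor.uwsgi` and `systemctl start emperor.uwsgi` as usual.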

Next you need to configure Nginx to pass requests to the proper socket. Here’s an extremely simple Nginx config showing the server and location blocks. I run something a bit more complex in production, but this is easier to understand:
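Something along these lines (the server name is a placeholder; the socket matches the uWSGI vassal config above):

```nginx
server {
    listen 80;
    server_name example.com;

    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:3031;
    }
}
```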

Back when I first started managing my own servers, I had no idea how to tune MySQL. I would read various forum posts where people talked about the necessity of ‘proper tuning’, but when I looked at the official docs, I struggled to differentiate between what really needed tuning versus what were arcane options that only mattered in edge cases.

So here’s the blog post I wish someone else had written.

It’s a longish read, but that’s primarily because I explained why a particular change is suggested… the actual changes are very straightforward and quick. You won’t be a MySQL guru afterwards, but it’ll be good enough for 90% of website admins. And from here you’ll know enough that the official docs will start to make sense.

This is aimed at admins running websites with < 5M pageviews a month off a single server–generally blogs or forums. For example, I run a server for a friend that has a handful of forums getting ~3M pageviews a month on an 8GB RAM, 6-core Linode using these tuning recommendations, and the average pageview generally spends < 10ms in MySQL query time.

First, most of these settings will be set in a global
my.cnf file somewhere on your server. Typically
/etc/my.cnf or
/etc/mysql/my.cnf.

Tuning MySQL starts with understanding the table engine. Basically this is how MySQL manages the data underneath the covers, and is typically invisible to the app developer or SQL analyst.

In the early days, MySQL shipped with a table engine called MyISAM, which is blazingly fast for single-user workloads and allowed MySQL to win many database benchmarks that only tested single-user performance. Unfortunately, MyISAM has two huge flaws:

First, it implemented table-level locking for all queries, meaning that if one user had a query running against a particular MyISAM table, no other queries could access that table until the first one finished. This doesn’t matter if you just have a single user running queries, but it causes problems as soon as there are more than a handful of concurrent website visitors.

Second, and more important, MyISAM wasn’t fully ACID compliant. It could (and did) cause data corruption in the event of a random power failure. Most admins never hit this problem, but those that did were royally screwed.

A few years later, Innobase developed an alternative table engine called InnoDB. InnoDB was fully ACID compliant, which fixed the data corruption issue. That alone makes it my preferred table engine.

Additionally, InnoDB switched most queries to row-level locking, meaning that only the specific row that a query was looking at would be locked. This allowed multiple queries to run simultaneously against the same table, at the expense of slightly slower throughput on any single query. For websites, InnoDB is almost always a better choice than MyISAM.

Then there are some built-in tables that MySQL uses to manage itself–these are MyISAM by default and should not be changed. Only change MyISAM tables in your application’s database. Additionally, there are some special MySQL table types like ‘memory’ that should generally not be swapped over to InnoDB. If an application creates a table using one of these alternative table engines, there’s probably a very good reason for it.

Unfortunately, InnoDB didn’t support full-text indices until MySQL 5.6 (MariaDB 10). Many Linux distributions still ship with MySQL 5.5 as the default, so any tables that include full-text indices can’t be migrated to InnoDB until MySQL is upgraded.

Sidenote for Xenforo admins: This is why the
xf_search_index table defaults to MyISAM–it uses a full-text index. If you swap it to InnoDB, be sure to also disable the ‘Enable Delayed Insert SQL Queries’ option under ACP > Performance. Theoretically the type of table engine shouldn’t matter to the client application, but I know of at least one person who started getting weird search results after switching
xf_search_index to InnoDB. Personally I haven’t experienced any problems, and I actually experienced significantly faster updates to the Xenforo search index under InnoDB. That said, my personal recommendation is if your Xenforo forum is big enough that you’re worried about tuning MySQL, you should switch to the official Xenforo Elastic Search plugin. Then since the
xf_search_index isn’t used anymore,
TRUNCATE it to save space. For the
xf_sessions and
xf_sessions_admin tables, InnoDB is fine but even better is setting up session caching using memcache.

You can set the default table type for new MySQL tables using the
default_storage_engine parameter. I strongly recommend setting it to InnoDB. Additionally, it’s easy to swap existing tables from MyISAM over to InnoDB using the SQL command
ALTER TABLE. You will need to stop the web application from accessing the table while it’s being converted–typically I just stop the webserver, as even the largest tables for most blogs/forums generally don’t take more than 20 minutes to swap if you’ve got fast SSDs.
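A sketch of both steps (`default_storage_engine = InnoDB` goes in my.cnf; ‘myapp’ and `some_table` are placeholders): first list any application tables still on MyISAM, then convert each one:

```sql
-- list application tables still on MyISAM ('myapp' is a placeholder)
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'myapp' AND engine = 'MyISAM';

-- convert one table (run while the web app is stopped)
ALTER TABLE some_table ENGINE = InnoDB;
```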

When you alter to InnoDB, make sure to specify the
row_format. Options are
compressed,
dynamic, or the default
compact.
dynamic is preferred,
compact is the default only due to legacy reasons. If you are bottlenecked on RAM/disk, but not CPU, then use
compressed for tables where compression saves a measurable amount of space (ie, lots of text columns)… For Xenforo admins, the
xf_posts table is a good candidate.
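For example, something like this (`some_table` is a placeholder; assumes a MySQL version where the newer row formats are available):

```sql
-- dynamic is the sensible choice for most tables
ALTER TABLE some_table ENGINE = InnoDB ROW_FORMAT = DYNAMIC;

-- compressed trades CPU for space on text-heavy tables
ALTER TABLE xf_posts ENGINE = InnoDB ROW_FORMAT = COMPRESSED;
```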

In order to use the
dynamic row format, you need to change the
innodb_file_format from Antelope to Barracuda.
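The relevant my.cnf lines (Barracuda is the file format that enables the newer row formats; innodb_file_per_table is also required for them):

```ini
[mysqld]
innodb_file_format = Barracuda
innodb_file_per_table = 1
```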

If possible, try to completely migrate all MyISAM tables in your application databases over to InnoDB. That way, you only need to tune InnoDB cache settings and don’t have to worry about MyISAM cache settings.

However, if you still have tables in your application database that for one reason or another are required to stay on MyISAM, then try to set
key_buffer_size to slightly larger than your largest MyISAM table. If all your application tables are InnoDB, then drop
key_buffer_size down to 64M. Don’t drop it all the way to 0 because the internal MySQL tables will use it since they’re still MyISAM (and shouldn’t be changed).
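In my.cnf, for an all-InnoDB application database, that’s simply:

```ini
# only the internal MyISAM tables still use this
key_buffer_size = 64M
```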

For InnoDB, the most important variable to change is
innodb_buffer_pool_size. Basically, this is how much RAM MySQL can allocate to the InnoDB cache. For safety, the default is insanely low to make sure that a default MySQL install won’t kill your dev server. However, in production you want this number as high as possible without running out of RAM. Ideally, it’s big enough to fit all your InnoDB tables in RAM. Often that isn’t possible, especially on a VPS where typically the webserver, the db, and Linux’s filesystem cache are all competing for limited RAM. In that case, try to set the InnoDB buffer pool to at least several hundred megabytes.
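For example, on the 8GB box mentioned earlier, a reasonable starting point (an assumption to tune, not a rule) might be:

```ini
# leave headroom for the webserver and the OS filesystem cache
innodb_buffer_pool_size = 4G
```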

Storytime: A few months after I bought RockClimbing.com, we migrated to a new host and the site promptly fell over every 10 minutes. I’d rented a pretty beefy dedicated box, and the stats showed we had plenty of unused RAM/CPU capacity. I’d never run my own server before, so it took some digging to realize that the underlying issue was the InnoDB buffer pool was still at 64M. As soon as I increased it to a couple of gigs, the problems disappeared.

For the typical forum or blog, a handful of posts get the majority of the traffic. As long as those posts can fit in the InnoDB buffer pool, you’ll typically be fine. For example, I know of a forum with ~30M posts, with total InnoDB table size of ~20gb, but a buffer pool of only ~1GB. Site works just fine because vast majority of traffic hits the most recent 100K posts.

If you’ve got a InnoDB buffer pool larger than 1 GB, then you’ll also need to adjust the
innodb_buffer_pool_instances. Adding more buffer pools adds a little more concurrency, although it’s a tradeoff between having many small pools versus a few larger pools. The rule of thumb is a minimum of 1 GB per buffer pool instance. However, MariaDB (explained shortly) defaults to an insane 4 or 8 buffer pool instances once you bump over 1 GB
innodb_buffer_pool_size, so be sure to manually set it lower. The benefits of more buffer pools drop off pretty quickly–even on rockclimbing.com (my largest site) where the buffer pool is 12GB, I only have 4 buffer pools.
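Using the rockclimbing.com numbers above:

```ini
innodb_buffer_pool_size = 12G
innodb_buffer_pool_instances = 4
```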

Lastly, if your version of MySQL/MariaDB supports it, enable
innodb_buffer_pool_dump_at_shutdown and
innodb_buffer_pool_load_at_startup. Basically this pre-warms your cache by saving a list of the cached pages to disk whenever you shut down MySQL and reading them back in whenever you start MySQL. That way when you reboot your server, MySQL isn’t super slow for the first few pageviews.
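In my.cnf (these options are available from MySQL 5.6 / MariaDB 10 on, under the names below):

```ini
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
```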

Once you swap everything over to InnoDB and then tune the InnoDB Buffer pool, you’re pretty much good for 98% of websites.

What is MariaDB? Another history lesson: MySQL started as open source, then was acquired by Oracle. While Oracle can’t legally make the database closed source, they’ve been trying to close off as much of it as possible. In response, the original creator of MySQL started a replacement project called MariaDB. It adds a number of nice features, including a small speed increase.

It’s binary compatible with MySQL, so it shouldn’t cause any issues with your app. It’s also easy to install, just use the pre-packaged RPMs they provide. I’ve been using it on CentOS 7 as a drop-in replacement for MySQL with zero issues.

You’ll want to switch to MariaDB 10, as that’s the equivalent of MySQL 5.6 and adds the full text indices to InnoDB. MariaDB 10 also adds some nifty features like a regex-based search-and-replace. Very handy for updating urls after migrating an old vBulletin forum over to Xenforo.

One MySQL tuning tip that you’ll find in many blog posts is to increase the size of the MySQL Query Cache. However, the benefits are hotly debated; many people say it actually slows things down, because in many situations you lose more time checking the cache and finding it doesn’t have what you want than if you had just bypassed the cache altogether. I just disable it most of the time.
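Disabling it looks like:

```ini
query_cache_type = 0
query_cache_size = 0
```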

Keep yourself out of MySQL encoding hell! I’ve been there, it’s not fun. Set the default character set in the MySQL server. UTF-8 is probably what you want, but beware: MySQL’s ‘utf8’ is actually pseudo-UTF-8, limited to three bytes per character; the real UTF-8 is utf8mb4. WordPress switched to utf8mb4 in WP 4.2; Xenforo is still on utf8. You’ll also want to set the default collation, likely to
utf8mb4_unicode_ci.
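In my.cnf:

```ini
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
```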

Sidenote: I hate how MySQL has sloppy engineering. For example, MyISAM not actually being ACID compliant and losing data. UTF-8 not actually being UTF-8. Drives me nuts. If I have the choice, I much prefer PostgreSQL. They may be slower on feature releases, but at least they get it right when they do release something.

Lastly, a security tip. If you’re running MySQL on the same box as your webserver, then make sure MySQL only listens on
localhost and rejects all other connections. For remote access, just SSH in, then access MySQL. Most GUIs, including MySQL Workbench (my favorite) support this type of connection.
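The my.cnf side is one line; an SSH tunnel then gives you remote access (host and user below are placeholders):

```ini
[mysqld]
bind-address = 127.0.0.1
```

From your own machine, `ssh -L 3306:127.0.0.1:3306 you@yourserver` forwards the port so a local GUI can connect to localhost:3306; MySQL Workbench can also do this natively via its ‘Standard TCP/IP over SSH’ connection type.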

Here’s the rather random backstory about how I found myself in Croatia preparing to speak on “business networking.”

Eighteen months ago, Ryan Stephens graciously asked me to do a guest post on his blog about my approach to networking. I don’t pretend that I’m an amazing networker who can solve all your problems. Generally I’m upfront with people that, “I can’t solve your problems, but I might know someone who can help you.” 😉

A number of people tweeted the post, and I received some attention for it. However, I was completely shocked when, a few weeks later, I received an e-mail from Nenad Maljkovic, saying that he saw a tweet linking to my blog post, and liked it… and would I be interested in speaking in Zagreb at TEDx?

Not every day you get invited half way around the world to speak based on a random guest post on a friend’s blog.

I used to rely heavily on a virtual assistant, and got a lot of questions about the process. This blog post tries to answer them. These days I spend most of my time writing code and very little time managing emails/meetings, so when my last VA graduated college and moved onto a real job, I didn’t look for a new one.

It all works–when my virtual assistant Katie went on vacation for the entire summer, it took thirty minutes of my time to train someone else–fifteen minutes to explain things the new VA didn’t understand from my templates, and fifteen minutes to give her an e-mail on my Google Apps account. And she’d never worked as an assistant before!

To be clear, a virtual assistant isn’t the same as an outsourced Executive Admin. A friend of mine is the EA for the CTO of a public company, and when I asked about her job, she said a good EA does far more than simply schedule meetings:

It is part of my job to sit in on the CTO’s staff meetings, and to know the direction and goals of our company. I am the CTO’s partner and it is my job to make sure that the meetings I allow onto his calendar fit into the bigger picture of what we’re doing.

What I’ve learned from having a virtual assistant:

How to be very explicit with my instructions

I know better what should be outsourced. It’s almost always faster for me if it’s a one-time thing. But if it’s a repetitive task, it’s probably worth teaching her.

I get a heckuva lot more done–she not only removes time, she removes annoyance–that mental friction that comes from having to do tasks that I downright hate (like scheduling meetings.)

She not only takes care of things for me, she does them better and faster than I ever could. Face it–just as you’re uniquely talented at some things, you’re uniquely flawed in others.

First–what tasks are you going to have them do? How are you going to communicate with them?

I recommend before you start communicating with an actual VA that you take some time and roughly draft out your guidelines for how to manage your calendar, meetings, travel details, and contact and account information. For examples, see the end of this blog post.

Second–who to hire?

For a personal assistant, trust is PARAMOUNT. Katie has full access to my Paypal account, credit cards, calendar, e-mail, etc. She could REALLY mess up my life…

Reliable college students are often a great fit because you only need to pay them more than the minimum wages they’d be paid for washing dishes… $8-$15 an hour depending on whether they’re an independent contractor or part-time employee.

(Despite the naysayers in the comments below, the two college students I’ve hired have said they learned more by working for me than any other boss they’d ever had… plus they valued the flexibility and part-time nature of the job.)

Here are the 10 documents I use. (I just stick them on my personal wiki under a single folder titled “Reference: Assistants”.)

General Information About Being My Assistant–Start Here [Self-explanatory]

How to Add Events to My Calendar [Covers my five calendar categories and reminders]

How to Schedule My Meetings [Addresses the people side of scheduling meetings]

How to Add Someone to Jeff’s Contacts [Self-explanatory]

How to Process My E-mail–Ignore For Now [For now, I find it simplest to handle my own e-mail]

– You not only handle my administrative tasks, you do so better than I ever could have.
– I tell you what I want, and you figure out how to make it happen.
– You handle my calendar, schedule meetings, and occasionally other tasks.
– You do not prevent people from contacting me, but instead force them to clarify rather than “chit-chat”.
– We both teach each other to collaborate better–suggest technology, interpersonal skills, marketing advice, etc.
– Never make it look like I’m soooo important that I need an assistant. (You’re freeing me to focus on what I do best.)

Task information:

– Deadlines are important. ‘Nuf said. (Let me know if a deadline is unrealistic.)
– Unless specified, assume the time zone is Pacific.
– When you first start working, I NEED confirmations that you received the task (”On it–will be done at Xpm” is enough).
– When completed tasks don’t require sending me anything, all I need is an e-mail that says “XXtask done”

– Never masquerade as Jeff.
– If someone wants to talk with me, that’s fine–see the page “How-to Schedule My Meetings”
– Be honest, be tactful, and be yourself.

Decisions:

– You are empowered to make decisions under $50. (Please notify me what you did.)

Billing:

Paypal is preferred. Let me know if you want something different.

E-mail Signature:

(Note: I’m always open to suggestions. I currently list my Google Voice #, up to you whether to change to your personal #. You are welcome to include your personal website and twitter handle as a way of advertising your services.)

Do not create all-day appointments on my day-to-day calendar–either block out the specific time I’m busy (eg, 8am-6:30pm if evening free), or put in followup calendar for reminders. (Otherwise it screws up free/busy information that I share with others).

If duration unspecified–use your best judgment…

Reminders–I never use popups. E-mail preferred. Text message reminders to important meetings. Look over my calendar defaults to understand my preferences. Use your best judgment…

If I’m meeting someone, include the location and their phone number in the title. eg, “Meet Tim (123-123-1234) @ Location”

Do not use “Meet X” for community events–“meet” is a hot button for me meaning I need to be there.

If it’s a physical location that I don’t regularly visit, please put the physical address in the location so I can quickly Google map it. (If e-mail says “Jason’s house” just put that in the location–I’ll know where to go.)

If I e-mail you a link to an event, please put the link in the notes–often these events go under community, and I’ll attend if my schedule’s free (and want to check out the link ahead of time).

Currently, these are my calendars:

Day-to-Day

For day-to-day stuff that I need to attend.

This is the only calendar I share with my family/key friends, so if I’m busy, it needs to be on here. If I may/may not attend, it’ll probably go under community events.

Default alerts: e-mail day before and sms before any scheduled calls and meetings.

Community Events

Things I may or may not attend. Generally not important, but I want to know about them. Basic settings: E-mail alerts 7 days before (I’ll make a decision then)

Followup

This is my tickler calendar to remind me of stuff I need to follow up on at a later date, or decisions I’m postponing. Never put actual events I’m attending here. Always phrase appointments with verbs (ie, “decide on…”, “call X about Y”, “schedule…”).

Default reminders: e-mail 3 days in advance and a text message the day of. (Goal: get my e-mail inbox under control, so I don’t have to get texted about stuff)

Birthday-Anniversary

Where I track family/key friend’s birthdays & anniversaries. These should be all-day events in Google Calendar. Default reminders: 7 day e-mail (I’ll decide whether to get a gift) and e-mail that day (so I can call/e-mail them).

Mentors

This is where I stick recurring monthly appointments to call specific mentors in my life. Reminders: e-mail the day of. Try to schedule these appointments at least two days apart.

How to Schedule My Meetings [Addresses the people side of scheduling meetings]

Rule #1–use your best judgment, even in spite of these rules…

(I trust you, and I’ll let you know if I disagree.)

Rhythm:

I start my working day between 8 and 9am

I generally quit work between 6pm-1am

I work 6-10 hours per day

I go to bed around 10:30pm–although not uncommon to pull an allnighter.

I work best in three to five hour chunks–thus I prefer to batch a bunch of meetings/calls together.

As a strong extrovert, I’m normally quite energetic after a face-to-face breakfast/lunch meeting, and like to have the next few hours free to work off that energy

I prefer to keep my evenings (after 5:30pm) unscheduled

My life is full, but my schedule is currently very flexible (I like to attend lunch-time frisbee twice a week)

Optimum schedule: Lunch T/R, breakfast any weekday

You have full access to my calendar, so schedule wherever works best–we’ll refine over time

(I work hard to keep my calendar updated all the time. If something needs to be rescheduled, I’ll let you know.)

Responding to Meeting Requests:

Currently, most meeting requests from others are people reaching out and looking to connect via phone…

I love talking with them, but want to make sure they’re serious about the call.

Ask which topics they would like to discuss, and put in the calendar description notes.

Here’s a sample e-mail I’ve used before:

Name,

This is YourFirstName, Jeff’s virtual assistant–I manage his calendar and schedule meetings.

Jeff said he appreciated your reaching out to him–he would love to chat with you.
He normally finds a 20-30 minute phone call most efficient.

However, Jeff doesn’t want to waste your time, so he asked that you e-mail me 3 potential discussion topics/questions.

Please send me your phone number, and two good days/times to call you this week (include time zone). I’ll set an appointment on Jeff’s calendar and get back to you.

By the way, I checked out your blog–very nice! [totally optional–only if true]

Thanks!

YourFirstName

Scheduling meeting requests I’ve confirmed:

If they respond, email Jeff the confirmed meeting time and location–BCC’ing often works well.

If they ignore you, send a 2nd e-mail three days later and ask if they saw it–often, they just forgot to respond. Ask me if you don’t hear back after two e-mails.

When you communicate with anyone, be clear you’re merely facilitating a meeting–you’re not trying to make Jeff seem busy/impressive. Never be pushy.

Here’s a great sample e-mail:

Options:

1. How about <meeting place> at <time>? (You initiate the meeting place.)

All travel related emails–like itineraries, hotel rooms, conference registration numbers, etc–should be labeled with the “Travel Details” label–that label maps to a folder on Jeff’s iPod and phone so that Jeff can easily pull them up on his phone when he needs them.