Using Unrestricted Languages

Packt Publishing

Are untrusted languages inferior to trusted ones?

No, on the contrary. These languages are untrusted in the same way that a sharp knife is untrusted: it should not be given to very small children, at least not without adult supervision. They have extra powers that ordinary SQL, the trusted languages (such as PL/pgSQL), and trusted variants of the same language (PL/Perl versus PL/Perlu) don't have.

You can use the untrusted languages to read and write directly on the server's disks, to open sockets, and to make Internet queries to the outside world. You can even send arbitrary signals to any process running on the database host. Generally, you can do anything the native language of the PL can do.

However, you probably should not trust arbitrary database users with the right to define functions in these languages. Always think twice before granting privileges on an untrusted language to a user or group.

Can you use the untrusted (*u) languages for important functions? Absolutely. Sometimes, an untrusted language may be the only way to accomplish a task from inside the server. Performing simple queries and computations should do nothing harmful to your database, and neither should connecting to the external world to send e-mails, fetch web pages, or make SOAP requests. These may cause delays and even queries that get stuck, but such problems can usually be dealt with by setting an upper limit on how long a query can run, using an appropriate statement timeout value. Setting a reasonable statement timeout by default is good practice anyway.

So, if you don't deliberately do risky things, the probability of harming the database is no bigger than with a "trusted" (also known as "restricted") variant of the language. However, if you give the language to someone who starts changing bytes on the production database "to see what happens", you will probably get what you asked for.

Will untrusted languages corrupt the database?

The power to corrupt the database is definitely there, since the functions run as the system user of the database server with full access to the filesystem. So, if you blindly start writing into the data files and deleting important logs, it is very likely that your database will be corrupted.

Additional types of denial-of-service attacks are also possible, such as using up all memory or opening all IP ports, but there are ways to overload the database using plain SQL as well, so that part is not much different from trusted database access with the ability to run arbitrary queries.

So yes, you can corrupt the database, but please don't do it on a production server. If you do, you will be sorry.

Why untrusted?

PostgreSQL's ability to use an untrusted language is a powerful way to perform some nontraditional things from database functions. Creating these functions in a PL is an order of magnitude smaller task than writing an extension function in C. For example, a function to look up a hostname for an IP address is only a few lines in PL/Pythonu:
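A minimal sketch of such a lookup, using Python's standard socket module (the function name is illustrative):

CREATE FUNCTION gethostname(hostip text)
RETURNS text
AS $$
import socket
# gethostbyaddr() does a reverse DNS lookup and returns
# (hostname, aliaslist, ipaddrlist); we want the primary hostname
return socket.gethostbyaddr(hostip)[0]
$$ LANGUAGE plpythonu;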

Creating the same function in the most untrusted language, C, involves writing tens of lines of boilerplate code, worrying about memory leaks, and all the other problems that come with writing code in a low-level language. I recommend prototyping in some PL language if possible, and in an untrusted language if the function needs something that the trusted languages do not offer.

Why PL/Python?

All of these tasks could be done equally well using PL/Perlu or PL/Tclu; I chose PL/Pythonu mainly because Python is the language I am most comfortable with. This also translates to having written some PL/Python code, which I plan to discuss and share with you in this article.

Quick introduction to PL/Python

PL/pgSQL is a language unique to PostgreSQL and was designed to add blocks of computation and SQL inside the database. While it has grown in its breadth of functionality, it still lacks the completeness of syntax of a full programming language. PL/Python allows your database functions to be written in Python, with all the depth and maturity of writing Python code outside the database.

A minimal PL/Python function
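Here is a minimal sketch of such a function, whose parts are discussed piece by piece below:

CREATE FUNCTION hello(name text)
RETURNS text
AS $$
return 'hello %s !' % name
$$ LANGUAGE plpythonu;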

Here, we see that creating the function starts by defining it as any other PostgreSQL function with a RETURNS definition of a text field:

CREATE FUNCTION hello(name text)
RETURNS text

The difference from what we have seen before is that the LANGUAGE part specifies plpythonu (the language ID for the PL/Pythonu language):

$$ LANGUAGE plpythonu;

Inside the function body, it is very much a normal Python function: it returns a value obtained by formatting the name passed as an argument into the string 'hello %s !', using the standard Python formatting operator %:
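return 'hello %s !' % name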

Data type conversions

The first and last things happening when a PL function is called by PostgreSQL are converting argument values between the PostgreSQL and PL types. The PostgreSQL types need to be converted to the PL types on entering the function, and then the return value needs to be converted back into the PostgreSQL type.

Except for PL/pgSQL, which uses PostgreSQL's own native types in computations, the PLs are based on existing languages with their own understanding of what types (integer, string, date, ...) are, how they should behave, and how they are represented internally. They are mostly similar to PostgreSQL's understanding, but quite often not exactly the same. PL/Python converts PostgreSQL values to Python values in the natural way: Boolean becomes bool, the integer types become int (bigint becomes long in Python 2), the floating-point types become float, bytea becomes str in Python 2 or bytes in Python 3, most other types (including the character string types) arrive as str, and SQL NULL becomes None.

When the PostgreSQL return type is Boolean, the return value will be evaluated for truth according to the Python rules: 0, False, an empty sequence (including the empty string ''), or an empty dictionary becomes PostgreSQL false; anything else, notably the string 'f', becomes true.

When the PostgreSQL return type is bytea (the PostgreSQL byte array), the return value will be converted to a string (Python 2) or bytes (Python 3) using the respective Python built-ins, and the result converted to bytea. This conversion is an exact copy, with no encoding or other transformations applied.

For all other PostgreSQL return types, the returned Python value is converted to a string using Python's built-in str, and the result is passed to the input function of the PostgreSQL data type.

Strings in Python 2 are required to be in the PostgreSQL server encoding when they are passed to PostgreSQL. Strings that are not valid in the current server encoding will raise an error; but not all encoding mismatches can be detected, so garbage data can still result when this is not done correctly. Unicode strings are converted to the correct encoding automatically, so it can be safer and more convenient to use those. In Python 3, all strings are Unicode strings.

One notable exception is that the check for None is done before any other conversion: even for Booleans, None is always converted to NULL and never to the Boolean value false.

Writing simple functions in PL/Python

Writing functions in PL/Python is not much different in principle from writing functions in PL/pgSQL. You still have the exact same syntax around the function body in $$, and the argument names, types, and RETURNS clause all mean the same thing regardless of the exact PL language used.

A simple function
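As an illustration, here is a sketch of a simple function (the function itself is an assumption, chosen to show the type conversions):

CREATE FUNCTION add_ints(a int, b int)
RETURNS int
AS $$
# a and b arrive as Python ints; the returned value is
# converted back to a PostgreSQL int
return a + b
$$ LANGUAGE plpythonu;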

What you see here is that the PL/Python arguments are passed to the Python code after converting them to appropriate types, and the result is passed back and converted to the appropriate PostgreSQL type for the return value.

Functions returning a record

To return a record from a Python function, you can use:

A sequence or list of values in the same order as the fields in the return record

A dictionary with keys matching the fields in the return record

A class or type instance with attributes matching the fields in the return record
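For example, here is a sketch of a function returning its result as a dictionary (the userinfo() function and its pg_user query are assumptions, consistent with the SQL injection warning below; the sequence and class variants would differ only in the return statement):

CREATE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
u = plpy.execute("select usename, usesysid, usesuper from pg_user where usename = '%s'" % username)[0]
# a dictionary with keys matching the fields of the return record
return {'username': u['usename'], 'user_id': u['usesysid'], 'is_superuser': u['usesuper']}
$$ LANGUAGE plpythonu;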

Notice the [0] at the end of u = plpy.execute(...)[0]. It is there to extract the first row of the result, as even for one-row results plpy.execute still returns a list of rows.

Danger of SQL injection!

As we have neither prepared the query with plpy.prepare() and passed the username as a parameter to the following execute() call, nor used the plpy.quote_literal() function (both techniques are discussed later) to safely quote the username before merging it into the query, we are open to a security flaw known as SQL injection. So, make sure that you only let trusted users call this function or supply the username argument.

Calling the function defined via any of these three CREATE commands will look exactly the same:
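For example, with the userinfo() sketch above:

SELECT * FROM userinfo('postgres');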

It usually does not make sense to declare a class inside a function just to return a record value. This possibility is included mostly for cases where you already have a suitable class with a set of attributes matching the ones the function returns.

Table functions

When returning a set from a PL/Python function, you have three options:

Return a list or any other sequence of return type

Return an iterator or generator

yield the return values from a loop

Here, we have three ways to generate all even numbers up to the argument value using these different styles.
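First, returning a list (the function names in these sketches are assumptions):

CREATE FUNCTION even_numbers_from_list(up_to int)
RETURNS SETOF int
AS $$
return range(0, up_to, 2)
$$ LANGUAGE plpythonu;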

The list here is returned by the built-in Python function range, which, called as range(0, up_to, 2), returns all the even numbers from zero up to (but not including) the argument. This gets returned as a table of integers, one integer per row, from the PostgreSQL function. If the RETURNS clause of the function definition said int[] instead of SETOF int, the same function would return the even integers as a single PostgreSQL array.

The next function returns a similar result using a generator and returning both the even number and the odd one following it. Also, notice the different PostgreSQL syntax RETURNS TABLE(...) used this time for defining the return set:
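A sketch of this variant:

CREATE FUNCTION even_numbers_from_generator(up_to int)
RETURNS TABLE (even int, odd int)
AS $$
return ((i, i+1) for i in xrange(0, up_to, 2))
$$ LANGUAGE plpythonu;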

The generator is constructed using a generator expression (x for x in <seq>). Finally, the third function is defined as a generator using the explicit yield syntax, and yet another PostgreSQL syntax is used for returning SETOF RECORD, with the record structure defined this time by OUT parameters:

CREATE FUNCTION even_numbers_with_yield(up_to int,
                                        OUT even int,
                                        OUT odd int)
RETURNS SETOF RECORD
AS $$
for i in xrange(0, up_to, 2):
    yield i, i+1
$$ LANGUAGE plpythonu;

The important part here is that you can use any of the preceding ways to define a PL/Python set returning function, and they all work the same. Also, you are free to return a mixture of different types for each row of the set:
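For example (an illustrative sketch, not from the original text), each row of the set can be produced as a different Python structure:

CREATE FUNCTION mixed_rows()
RETURNS TABLE (name text, value int)
AS $$
yield ('tuple', 1)                    # a sequence
yield {'name': 'dict', 'value': 2}    # a dictionary
yield ['list', 3]                     # a list is a sequence too
$$ LANGUAGE plpythonu;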

As you can see, the data-returning part of PL/Pythonu is much more flexible than returning data from a function written in PL/pgSQL.

Running queries in the database

If you have ever accessed a database in Python, you know that most database adapters conform to a somewhat loose standard called Python Database API Specification v2.0 or DBAPI 2 for short.

The first thing you need to know about database access in PL/Python is that in-database queries do not follow this API.

Running simple queries

Instead of using the standard API, all database access goes through just two functions: plpy.execute() for running a query, and plpy.prepare() for turning query text into a query plan, or prepared query.

The simplest way to do a query is with:

res = plpy.execute(<query text>, [<row count>])

This takes a textual query and an optional row count, and returns a result object, which emulates a list of dictionaries, one dictionary per row.

As an example, if you want to access a field 'name' of the third row of the result, you use:

res[2]['name']

The index is 2 and not 3 because Python lists are indexed starting from 0, so the first row is res[0], the second row res[1], and so on.

Using prepared queries

In an ideal world this would be all that is needed, but plpy.execute(query, cnt) has two shortcomings:

It does not support parameters

The plan for the query is not saved, requiring the query text to be parsed and run through the optimizer at each invocation

We will show a way to properly construct a query string later, but for most use cases simple parameter passing is enough. So, the execute(query, [maxrows]) call becomes a pair of statements:
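A sketch of this pair (the pg_user query and the username are illustrative):

plan = plpy.prepare("select usesuper from pg_user where usename = $1", ["text"])
res = plpy.execute(plan, ["alice"])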

The first statement prepares the query, which parses the query string into a query tree, optimizes this tree to produce the best query plan available, and returns a prepared query object. The second statement uses the prepared plan to query for a specific user's superuser status.

The prepared plan can be used multiple times, so you could go on to check whether user bob is a superuser as well:

res = plpy.execute(plan, ["bob"])
print res[0]["usesuper"]

Caching prepared queries

Preparing the query can be quite an expensive step, especially for more complex queries where the optimizer has to choose from a rather large set of possible plans; so, it makes sense to re-use results of this step if possible.

The current implementation of PL/Python does not automatically cache query plans (prepared queries), but you can do it easily yourself.
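A common pattern is to stash the prepared plan in the function's SD dictionary on first use; a minimal sketch:

if 'usesuper_plan' not in SD:
    SD['usesuper_plan'] = plpy.prepare(
        "select usesuper from pg_user where usename = $1", ["text"])
res = plpy.execute(SD['usesuper_plan'], ["bob"])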

The values in SD[] and GD[] only live inside a single database session, so it only makes sense to do the caching in case you have long-lived connections.

Writing trigger functions in PL/Python

As with other PLs, PL/Pythonu can be used to write trigger functions. The declaration of a trigger function is different from an ordinary function by the return type RETURNS TRIGGER. So, a simple trigger function that just notifies the caller that it is indeed called looks like this:
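A minimal sketch (the function name is an assumption):

CREATE FUNCTION notify_on_call()
RETURNS TRIGGER
AS $$
plpy.notice('I was called!')
$$ LANGUAGE plpythonu;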

Of course, the preceding trigger function is quite useless, as is any trigger that does not know when and on what data change it was called. All the data needed by a trigger when it is called is passed in via the trigger dictionary, called TD. In TD, you have the following values:

TD["event"]
The event the trigger function is called for, as one of the strings INSERT, UPDATE, DELETE, or TRUNCATE.

TD["when"]
One of BEFORE, AFTER, or INSTEAD OF.

TD["level"]
ROW or STATEMENT.

TD["old"]
The before-command image of the row. For row-level UPDATE and DELETE triggers, this contains a dictionary with the values of the triggering row before the changes were made by the command. It is None in other cases.

TD["new"]
The after-command image of the row. For row-level INSERT and UPDATE triggers, this contains a dictionary with the values of the triggering row after the changes were made by the command. It is None in other cases. If you are in a BEFORE or INSTEAD OF trigger, you can make changes to this dictionary and then signal PostgreSQL to use the changed tuple by returning the string MODIFY from the trigger function.

TD["name"]
The trigger name from the CREATE TRIGGER command.

TD["table_name"]
The name of the table on which the trigger occurred.

TD["table_schema"]
The schema of the table on which the trigger occurred.

TD["relid"]
The object identifier (OID) of the table on which the trigger occurred.

TD["args"]
If the CREATE TRIGGER command included arguments, they are available from TD["args"][0] to TD["args"][n-1].

In addition to doing anything you can do in ordinary PL/Python functions, such as modifying data in tables, writing to files and sockets, and sending e-mails, you can also affect the behavior of the triggering command.

If TD["when"] is BEFORE or INSTEAD OF and TD["level"] == "ROW", you can return "SKIP" to abort the event. Returning None or "OK" indicates that the row is unmodified and it is OK to continue. Returning None is also the default behavior for Python if the function does a bare return or runs to the end without a return statement, so you don't need to do anything special for this case.

In case you have modified values in TD["new"] and you want PostgreSQL to continue with the new values, you can return "MODIFY" to indicate that you've modified the new row. This can be done if TD["event"] is INSERT or UPDATE; otherwise, the return value is ignored.

Exploring the inputs of a trigger

The following trigger function is useful when developing triggers, so that you can easily see what the trigger function is really getting when called:
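A sketch of such a function (the name explore_trigger is an assumption):

CREATE FUNCTION explore_trigger()
RETURNS TRIGGER
AS $$
import pprint
# format the whole trigger dictionary and send it to the client
plpy.notice(pprint.pformat(TD))
$$ LANGUAGE plpythonu;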

This function formats all the data passed to the trigger in TD using pprint.pformat, and then sends it to the client as a standard NOTICE using plpy.notice. To test this out, we create a simple table and then put an AFTER ... FOR EACH ROW ... trigger using this function on that table:
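For example (the table and trigger names are assumptions; note the NULL among the trigger arguments, which matters below):

CREATE TABLE test (
    id serial PRIMARY KEY,
    data text
);

CREATE TRIGGER test_explore
AFTER INSERT OR UPDATE OR DELETE ON test
FOR EACH ROW
EXECUTE PROCEDURE explore_trigger('one', 2, NULL);

Any INSERT into test now raises a NOTICE showing the full TD dictionary.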

Most of this is expected and corresponds well to the TD dictionary values given in the previous table. What may be a little unexpected is the fact that the arguments given in the CREATE TRIGGER statement are all converted to strings, even the NULL. When developing your own triggers, either in PL/Python or any other language, it may be useful to put this trigger on the table as well, to check that the inputs to the trigger are as expected. For example, it is easy to see that if you omit the FOR EACH ROW part, TD['old'] and TD['new'] will both be None, as the trigger definition defaults to FOR EACH STATEMENT.

A log trigger

Now, we can put this knowledge to work and write a trigger that logs changes to the table either to a file or to a special log-collector process over UDP. Logging to a file is the simplest way to permanently record changes even from transactions that end up being rolled back; if they were logged to a log table, the ROLLBACK command would also remove the log records. This may be a crucial audit requirement for your business.

Of course, this also has a downside: you will be logging changes that may not be permanent due to the transaction being rolled back. But this is the price you have to pay for not losing the log records.
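A sketch of such a trigger function follows; the function name matches the CREATE TRIGGER statements below, but the exact JSON fields and the environment query are assumptions:

CREATE FUNCTION log_trigger()
RETURNS TRIGGER
AS $$
import json
import socket

args = tuple(TD["args"])    # a tuple, so it can be used as a dictionary key
if args not in SD:
    # first call for this log destination: create and cache a log function
    if args[0] == 'udp':
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        def logfunc(msg, sock=sock, host=args[1], port=int(args[2])):
            sock.sendto(msg, (host, port))
    elif args[0] == 'file':
        f = open(args[1], 'a+')
        def logfunc(msg, f=f):
            f.write(msg + '\n')
            f.flush()    # write through immediately, don't wait for the buffer
    else:
        plpy.error('unknown log destination type: %s' % args[0])
    SD[args] = logfunc
    # plan for fetching the other data we want to log
    SD['env_plan'] = plpy.prepare(
        "select current_user as username, current_timestamp as ts")

logfunc = SD[args]
env_info_row = plpy.execute(SD['env_plan'])[0]
log_msg = json.dumps({
    'event': TD['event'],
    'table': TD['table_name'],
    'old': TD['old'],
    'new': TD['new'],
    'env': env_info_row,
})
logfunc(log_msg)
$$ LANGUAGE plpythonu;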

First, this trigger checks if it already has a logger function defined and cached in the function's local dictionary SD[]. As the same trigger may be used with many different log destinations, the log function is stored under a key constructed as a Python tuple from the trigger function arguments in the CREATE TRIGGER statement. We cannot use the TD["args"] list directly as a key, as Python dictionary keys have to be immutable, which a list is not, but a tuple is.

If the key is not present, meaning this is the first call to this particular trigger, we have to create an appropriate log function and store it. To do this, we examine the first argument for the log destination type.

For the udp log type, we create a UDP socket for writing. Then, we define a function, passing in this socket and also the other two trigger arguments as default arguments for the function. This is the most convenient way to create a closure, and to bundle a function with some data values in Python.

For the file type, we just open this file in the append mode (a+) and also create a log function. The log function writes a message to this file and flushes the write, so the data is written to the file immediately and not some time later when the write buffer fills up. The log function created in either of these cases is stored in SD[tuple(TD["args"])].

At this point, we also prepare and save a query plan for getting other data we want to log and save this in SD['env_plan']. Now that we are done with the one-time preparations, we can proceed with the actual logging part, which is really very simple.

Next, we retrieve the logging function (logfunc = SD[args]) and get the row of the other logged data:

env_info_row = plpy.execute(SD['env_plan'])[0]

Finally, we convert all the logged data into one JSON object (log_msg = json.dumps({...})) and then use the logging function to send it to the log, logfunc(log_msg).

And that's it.

Next, let's test how it works by adding another trigger to the test table we created earlier:

CREATE TRIGGER test_audit_trigger
AFTER INSERT OR UPDATE OR DELETE ON test
FOR EACH ROW
EXECUTE PROCEDURE log_trigger('file', '/tmp/test.json.log');

Any changes to the table done via INSERT, UPDATE, or DELETE are logged into /tmp/test.json.log. This file is initially owned by the user running the server, usually postgres; so to look at it, you need to either be that user or root, or you have to change the permissions on the file to allow reading.

If you want to test the UDP logging part, you just have to define another trigger with different arguments:

CREATE TRIGGER test_audit_trigger_udp
AFTER INSERT OR UPDATE OR DELETE ON test
FOR EACH ROW
EXECUTE PROCEDURE log_trigger('udp', 'localhost', 9999);

Constructing queries

PL/Python does a good job of managing values passed to prepared query plans, but a standard PostgreSQL query plan can take an argument in a very limited number of places. Sometimes, you may want to construct whole queries, not just pass values to predefined queries. For example, you can't have an argument for a table name or a field name.

So, how would you proceed if you want to construct a query from the function's arguments and be sure that everything is quoted properly and no SQL injection would be possible? PL/Python provides three functions to help you with proper quoting of identifiers and data just for this purpose.

The function plpy.quote_ident(name) is meant for quoting identifiers, that is, anything that names a database object or its attribute, like a table, a view, a field, or a function. It surrounds the name with double quotes and takes care of properly escaping anything inside the string that would break the quoting:
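For example, in an anonymous DO block (explained below):

DO LANGUAGE plpythonu $$
plpy.notice(plpy.quote_ident('5" \\"'))
$$;

This prints NOTICE: "5"" \""", with the embedded double quotes doubled.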

And yes, 5" \" is a legal table or field name in PostgreSQL; you just have to always quote it if you use it in any statement.

The DO syntax creates an anonymous block inside your database session. It is a very handy way to run some procedural language code without needing to create a function.

The other two functions are for quoting literal values. The function, plpy.quote_literal(litvalue), is for quoting strings and plpy.quote_nullable(value_or_none) is for quoting a value, which may be None. Both of these functions quote strings in a similar way, by enclosing them in single quotes (str becomes 'str') and doubling any single quotes or backslashes:
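A quick illustration:

DO LANGUAGE plpythonu $$
plpy.notice(plpy.quote_literal("it's"))     # prints 'it''s'
plpy.notice(plpy.quote_nullable(None))      # prints NULL
$$;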

The only difference between these two is that plpy.quote_nullable() can also take the value None, which will be rendered as the string NULL without any surrounding quotes. The argument to both of these has to be a string or a Unicode string. If you want them to work with a value of any Python type, wrapping the value in str(value) usually works well.

Handling exceptions

With any bit of code, you need to make sure you handle errors when they occur, and your PL/Python functions are no exception.

Before Version 9.1 of PostgreSQL, any error in an SQL query caused the surrounding transaction to be rolled back.

You can manually use savepoints to control the boundaries of the rolled-back block, at least as far back as Version 8.4 of PostgreSQL. This reduces the amount of the transaction that is rolled back:

When the SAVEPOINT foo; command is executed in PL/Python, an SQL error will not cause a full ROLLBACK; but the equivalent of ROLLBACK TO SAVEPOINT foo;, so only the effects of commands between the SAVEPOINT and the error are rolled back:
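A test function along these lines (a sketch of the pre-9.1 behavior; the table ttable and the deliberately failing insert are assumptions):

CREATE FUNCTION test_savepoint()
RETURNS void
AS $$
plpy.execute("insert into ttable values (1)")
plpy.execute("SAVEPOINT foo;")
plpy.execute("insert into ttable values (2)")
try:
    plpy.execute("insert into ttable values ('not a number')")
except:
    pass    # the error already rolled back to the savepoint
plpy.execute("insert into ttable values (3)")
$$ LANGUAGE plpythonu;

After calling this function on an initially empty ttable, only the first and third rows remain: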

hannu=# select * from ttable ;
id
----
1
3
(2 rows)

In Version 9.1, there are two important changes in how PostgreSQL exceptions are handled. If no savepoint or subtransaction is used, each invocation of plpy.prepare() and plpy.execute() is run in its own subtransaction, so that an error will only roll back this subtransaction and not the whole current transaction. Since using a separate subtransaction for each database interaction involves extra costs, and you may want to control the subtransaction boundaries anyway, a new Python context manager, plpy.subtransaction(), is provided.
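Used with Python's with statement, it groups several database operations into one subtransaction (a minimal sketch, reusing the ttable example):

try:
    with plpy.subtransaction():
        plpy.execute("insert into ttable values (4)")
        plpy.execute("insert into ttable values ('oops')")
except plpy.SPIError:
    # neither insert is kept: the whole subtransaction was rolled back
    pass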

Atomicity in Python

While the subtransactions manage data changes in the PostgreSQL database, the variables on the Python side of the fence live their own separate lives. Python does not provide even single-statement-level atomicity, as demonstrated by the following:
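For example, in a plain Python interpreter (an illustrative demonstration, not specific to PL/Python):

>>> x = [1]
>>> x[0], x[1] = 2, 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range
>>> x
[2]

The assignment to x[0] succeeded before the assignment to x[1] failed.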

As you can see, even a single multi-assignment statement can be executed only halfway through. This means that you have to be prepared to fully manage your Python data yourself. The function plpy.subtransaction() won't help you in any way with managing Python variables.

Debugging PL/Python

First, let's start by stating that there is no debugger support when running functions in PL/Python; so it is a good idea to develop and debug a PL/Python function as a pure Python function as much as possible and only do the final integration in PL/Python. To help with this, you can have a similar environment in your Python development environment using the plpy module.

Just put the module in your path and do import plpy before you try running your prospective PL/Pythonu functions in an ordinary interpreter. If you use any of the plpy.execute(...) or plpy.prepare() functions, you also need to set up a database connection before using these by calling plpy.connect(<connectstring>).

Using plpy.notice() for tracking the function's progress

The debugging technology I use most often in any language is printing out intermediate values as the function progresses. If the printout rolls past too fast, you can slow it down by sleeping a second or two after each print.

If you try to use print in a PL/Python function, you will discover that nothing is printed. In fact, there is no single logical place to print to when running a pluggable language inside a PostgreSQL server.

The closest thing to print in PL/Python is the function plpy.notice(), which sends a PostgreSQL NOTICE to the client and also to the server log if log_min_messages is set to value notice or smaller.
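For example, a factorial function instrumented with plpy.notice() might look like this (a sketch; the original fact(x) listing is not reproduced here):

CREATE FUNCTION fact(x int)
RETURNS bigint
AS $$
global x
f = 1
while x > 1:
    plpy.notice('f:%d, x:%d' % (f, x))
    f = f * x
    x = x - 1
return f
$$ LANGUAGE plpythonu;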

If you compare the fact(x) function in Python and PL/Python, you will notice an extra line at the beginning of the PL/Python function:

global x

This is needed to overcome an implementation detail that often surprises PL/Pythonu developers: the function arguments are not function arguments in the Python sense, and neither are they locals. They are passed in as variables in the function's global scope.

Using assert

Similar to ordinary Python programming, you can also use Python's assert statement to catch conditions which should not happen:
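For example (an illustrative sketch):

CREATE FUNCTION check_positive(i int)
RETURNS int
AS $$
assert i is not None, 'i must not be NULL'
assert i > 0, 'i must be positive'
return i
$$ LANGUAGE plpythonu;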

You will get a message about AssertionError together with the location of the failing line number.

Redirecting sys.stdout and sys.stderr

If all the code you need to debug is your own, the preceding two techniques will cover most of your needs. However, what do you do in cases where you use some third party libraries which print out debug information to sys.stdout and/or sys.stderr?

Well, in this case you can replace Python's sys.stdout and sys.stderr with your own pseudo file object that stores everything written there for later retrieval. Here is a pair of functions: the first captures sys.stdout, or stops capturing when called with the argument do_capture set to false, and the second returns everything captured:
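A sketch of such a pair (the names and the cStringIO approach are assumptions):

CREATE FUNCTION capture_stdout(do_capture bool)
RETURNS void
AS $$
import sys
if do_capture:
    import cStringIO
    sys.stdout = cStringIO.StringIO()    # everything printed goes here now
else:
    sys.stdout = sys.__stdout__          # restore the real stdout
$$ LANGUAGE plpythonu;

CREATE FUNCTION get_captured_stdout()
RETURNS text
AS $$
import sys
# only meaningful while capturing is active
return sys.stdout.getvalue()
$$ LANGUAGE plpythonu;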

Thinking out of the "SQL database server" box

We'll wrap up the article on PL/Python with a couple of sample PL/Pythonu functions for doing some things you would not usually consider doing inside the database function or trigger.

Generating thumbnails when saving images

Our first example uses Python's powerful Python Imaging Library (PIL) module to generate thumbnails of uploaded photos. For ease of interfacing with various client libraries, this program takes the incoming image data as a base-64 encoded string:
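A sketch of such a function (the thumbnail size and JPEG output format are assumptions):

CREATE FUNCTION thumbnail(image64 text)
RETURNS text
AS $$
import base64
import cStringIO
import Image    # the PIL image module

size = (64, 64)
infile = cStringIO.StringIO(base64.b64decode(image64))
outfile = cStringIO.StringIO()

im = Image.open(infile)
im.thumbnail(size, Image.ANTIALIAS)
im.save(outfile, 'JPEG')

# hand the thumbnail back, also base-64 encoded
return base64.b64encode(outfile.getvalue())
$$ LANGUAGE plpythonu;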

The Python code is more or less a straight rewrite from the PIL tutorial, except that the files to read the image from and write the thumbnail image to, are replaced with Python's standard file-like StringIO objects. For all this to work, you need to have PIL installed on your database server host.

In Debian/Ubuntu, this can be done by running sudo apt-get install python-imaging. On most modern Linux distributions, an alternative is to use Python's own package distribution system by running sudo easy_install PIL.

Sending an e-mail

The next sample is a function for sending e-mails from inside a database function:
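A sketch of such a function (the argument list and message format are assumptions, consistent with the description below):

CREATE FUNCTION send_email(
    sender text,
    recipients text,    -- a comma-separated list of addresses
    subject text,
    message text,
    smtp_server text)
RETURNS void
AS $$
import smtplib

msg = ("From: %s\r\nTo: %s\r\nSubject: %s\r\n\r\n%s"
       % (sender, recipients, subject, message))
recipients_list = [r.strip() for r in recipients.split(',')]

server = smtplib.SMTP(smtp_server)
server.sendmail(sender, recipients_list, msg)
server.quit()
$$ LANGUAGE plpythonu;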

This function formats a message (msg = ...), converts the comma-separated To: addresses into a list of e-mail addresses (recipients_list = [r.strip()...]), connects to an SMTP server, and then passes the message to the SMTP server for delivery.

To use this function in a production system, it would probably require a bit more checking on the formats and some extra error handling in case something goes wrong. You can read more about Python's smtplib at http://docs.python.org/library/smtplib.html.

Summary

In this article, we saw that it is relatively easy to do things way beyond what a simple SQL database server is supposed to support, thanks to PostgreSQL's pluggable language support.

In fact, you can do almost anything in the PostgreSQL server you could do in any other application server. Hopefully, this article just scratched the surface on some of the ideas of what you can do inside a PostgreSQL server.
