Subscribe to Loriowar

Misprint

How to work with MS Access under Unix

Introduction

Many companies use MS Access even until now.
So, from time to time, someone want to connect to an ancient database. Moreover
they want to extract data from there and to make this within standard
Unix-based server. Of course, this data needed for an web application. And
therefore someone want to use data from MS Access together with data from
modern DB.

So, I want to describe several approach to creature called MS Access.
Therefore, initial task is follows: to establish connect from Ruby on Rails
application or from PostgreSQL (using FDW) to MS Access
database and be able to extract data, preferable, in real time.

Here
I'll try to aggregate all information about this problem and try to
describe unobvious cases. Hope, this will save time for someone.

Conversion into CSV

In the beginning I describe a simple working solution. This works for Ubuntu
14.04 ( and must works for most other Unix system) and not required any special
knowledge, skills or any other magic.

Oddities of isql

Tool called isql is very squeamish and unfriendly. There is lots of
limitation on syntax. For example, if you end a command with semicolon you get
an error. Moreover there is no way to write multiline queries, no hints or tips
with error localization, no normal documentation in case of working with
nonstandard table/column names and so on. Therefore there is an simple shell
called pyodbc-cli. This
can help to evade some inconvenience of isql and concentrate on
querying, not syntax.

Exotic encoding of tables/columns

There is rumors about using 'Changset' argument in case of usage
uncommon code page. Here is example of /etc/odbc.ini this this
options:

For me this argument do nothing. In isql console I can work with
both: Unicode and CP1251mdb files regardless of
'Changset' option. However, in iusql I obtain lots of question
marks (i.e. something like this: ��������) again regardless of
'Changset' option.

Alternative for isql

Alternative variant of isql
console is a mdb-sql from mdbtools package.
For this tool no needs to configure /etc/odbcinst.ini and
/etc/odbc.ini. Console run for certain mdb file through following
command:

mdb-sql /opt/db/MS_Access.mdb

All other description can be found in man page. Single exception: I
can't execute query over DB with Russian table names using mdb-sql.
For standard DB with unicode all works fine.

Ruby/Rails approach

For now (mid 2016), last release of MS Access were in September 22, 2015. But any works with adapters for
ActiveRecord ended near 2008 year. So, lets take a look on available
variants.

I have no idea how to get work this
adapter under Rails 3+. Because there is no appropriate documentation about
installation and configuration of gem and related ODBC-driver. If you
have enough time and skill you can submerge into this gem and try to get them
work with ActiveRecord 3+. In this case - good luck!

More details about syntax and available command of ruby-odbc gem can be
found in ruby-odbc/test
folder on GitHub.

Mdb gem

This gem provide a DSL for working
with mdb files. And this interface looks nice. But gem is just a wrapper for previously described
mdbtools. I.e. under the hood
there is same convert operation from mdb into cvs and further
processing of csv within RAM.

Alternative variant of adapter

There is a commercial Access
ODBC Driver. But there is less information about this. In optimistic case
this adapter can help with advanced queries over Access. I.e. maybe, this
provide full list of statements (LIMIT, AS, GROUP etc) instead of reduced one
in libmdbodbc.so. But this is only a guess. I don't know what
happens in fact. Moreover, there is no any information and/or user opinion
about this adapter. The single advantage is a 14 days trial after registration.
This provide time and ability to understand that is the adapter in reality.

PostgreSQL approach

For PotgreSQL exist an extension OGR. This extension is a part of
GDAL. Which is a translator library for
raster and vector geospatial data formats. I.e. this is huge library and it
provide ability to work with lots of formats. But we interested only in one:
MDB.

Installation

In the beginning, this PG extensions depend from several packages. Here it
is:

Step four: time to create FDW. In org_fwd exists two ways (or formats
in OGR terminology) for connect to MS Access. First way: using standard ODBC.
I.e. using database described in /etc/odbc.ini within specified adapter.
Details about this approach can be found here. Second way: use MDB format from
ORG, provided direct connection to mdb file using Jackcess. Details about MDB format
located here. Below I describe
both ways. One more thing: ORG is extremely powerful tools; an ability to work
with MS Access is a small piece of a huge amount of formats; full list of
supported formats presents on GDAL official site.

ORG ODBC format

This approach use system ODBC and works in similar way as previously
described isql and ruby-odbc, but within database. All available
options for initialize FDW presents on GDAL ODBC driver page. Below I show
only simple example of usage.

So, here an example of usage in format of small piece of pgSQL with commands
output:

$ ogrinfo -al 'ODBC:testdb'
geometry_columns is not a table in this database
Got no result for 'SELECT f_table_name, f_geometry_column, geometry_type FROM geometry_columns' command
INFO: Open of `ODBC:testdb'
using driver `ODBC' successful.

Based on this message, all works fine, but database (i.e. in mdb
file) doesn't contain Geo-format and ORG stuck on finding
geometry_columns table. I can't find how to evade this. But
someone wrote what this approach fine works under Windows. So, if you know
how to get it work with any nongeo mdb, please, say about this (for
example in comments).

Another question is how PG and FDW works with nonlatin (Russian in my case)
names of tables and columns from mdb. From one side: Postgres is
indifferent to table/column names. Wrap them in double quotes and use anything
including special characters. But I don't know is this applied to FDW.

ORG MDB format

This approach based on pure Java library called Jackcess. Due to Java this approach has
no relations with system ODBC and has own driver and connection mechanism.

I spent 3 days and ~20 recompilation of package to reach a working variant.
So, I want to say about some limitation on below description:

this method can help only with mdb files without encryption (i.e.
without password);

due to custom package compilation all specified version, paths and
dependencies can change based on environment (OS type and version, installed
packages, environment variables etc).

Based on source code of GDAL, it
support openjdk-7-jdk too. But I can't build worked package with 7
version.

Then required libgdal-dev package.

sudo apt-get install libgdal-dev

Remember version of installed package. This version related with main GDAL
package. For me this is 1.10.1
version.

Important
moment: MDB support appears in 1.9.0 version and later.

Finally, remove gdal-bin:

sudo apt-get remove gdal-bin

Step two: download jackcess-1.2.2.jar, commons-lang-2.4.jar and
commons-logging-1.1.1.jar and put the 3 JARs in the lib/ext. In my case this is
/usr/lib/jvm/java-6-openjdk-amd64/jre/lib/ext. Required versions of JAR
dependencies can be found in this
utility. For me, all works with any latest version of commons-logging
(1.*), with any other minor version of commons-lang (2.*) and jackcess (1.*).
Errors appear only in case of new major version of jaccess (2.1.4).

Step three: download and configure GDAL.

git clone git@github.com:OSGeo/gdal.gitcd gdal/gdal/git checkout 1.10

Here needs to move to branch with same version as version of installed
libgdal-dev. Otherwise lib and binary become incompatible.

Second variant can help in case if multiple installed version of Java. For
example, in case of installed openjdk-6-jdk and
openjdk-7-jdk.

In the end of configure output must appear 'yes' in front of MDB
support.

Step four: find cup of tea and run compilation.

sudo checkinstall

Go through wizard. Specify 'description' and 'version' for
package and wait ~10min.

As result must appear package with size ~300Mb. If you want to reduce size,
you can manually configure package, remove all redundant parts and make it more
light. But this actions is beyond of scope of this article and hence will be
skipped.

Step five: if you obtain any error during compilation - solve them using
Google and own knowledge.

So, thats all. In the end of this section I want to say about encrypted
mdb files.

If ogrinfo say something like this:

Exception in thread "main" com.healthmarketscience.jackcess.UnsupportedCodecException: Decoding not supported.
Please choose a CodecProvider which supports reading the current database encoding.
at com.healthmarketscience.jackcess.DefaultCodecProvider$UnsupportedHandler.decodePage(DefaultCodecProvider.java:115)

most likely your mdb file is encrypted. In this case you can look at
Jaccess FAQ and
think about contribution in OGR Access driver. As I understand, there is an
project called Jackcess
Encrypt. This project provide CryptCodecProvider, which implements the
Jackess CodecProvider interface and supports some forms of Access database
encryption. But implementation of Access driver from GDAL doesn't use
Jackcess Encrypt and hence doesn't support encrypted files. So, you can
improve GDAL in this direction.

Other FDWs

List of other available FDWs can be found on official PG-wiki.
There you can find ZhengYang/odbc_fdw
with last changes in 2011 and note "Does not compile with PostgreSQL
>= 9.2!". And
CartoDB/odbc_fdw with active development process and support Postgres 9.5+.

Conclusion

Working with Access is painful. In double size in case of Unix environment.
So, first advice: extract data from MS Access into any modern database and you
will get rid of many problems. If no way to drop out MS Access - use Windows
for dealing with it. Microsoft provide drivers and tools for dealing with own
products. Otherwise you have two ways: use transition of data through CSV files
or direct access to MS Access. First way is simple and works "out of the
box". Second way is quite difficult, require a lot of time and has set of
limitations, pitfalls and other unpleasant things. So, select wisely.