DAWG 0.5.1

DAWG

String data in a DAWG (Directed Acyclic Word Graph) may take
200x less memory than in a standard Python dict or list, while
raw lookup speed is comparable. For some operations a DAWG may even
be faster than the built-in dict. It also provides fast
advanced methods such as prefix search.

BytesDAWG and RecordDAWG implementation details

Data is encoded to base64 because the dawgdic C++ library doesn’t allow
zero bytes in keys (it uses null-terminated strings), and zero bytes are
very likely in binary data.
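
As a rough illustration of the point above, the sketch below uses Python's standard base64 module to show that an encoded payload contains no zero bytes even when the raw payload does. The library bundles a customized libb64, so its exact output characters may differ from the standard module; the sample payload is arbitrary.

```python
import base64
import struct

# A binary payload that contains zero bytes (three unsigned 16-bit integers).
payload = struct.pack(">3H", 3, 0, 256)
has_zero_before = b"\x00" in payload

# base64 output consists of printable ASCII characters only, so it never
# contains a zero byte and can be appended to a null-terminated key.
encoded = base64.b64encode(payload)
has_zero_after = b"\x00" in encoded

# Decoding restores the original payload exactly.
decoded = base64.b64decode(encoded)
```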

In DAWG versions prior to 0.5, <separator> (the byte that marks the end
of the text part of a key) was the chr(255) byte.
It was chosen because keys are stored as UTF-8 encoded strings and
chr(255) is guaranteed not to appear in valid UTF-8, so the end of
the text part of the key is unambiguous.
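
The UTF-8 property mentioned above can be checked with a short sketch. The sample strings are arbitrary and only chosen to cover 1-, 2-, 3- and 4-byte sequences; chr(255) here means the raw byte 0xFF.

```python
# Sample text covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["foo", "f\u00f6", "\u65e5\u672c\u8a9e", "\U0001F600", "\U0010FFFF"]

# None of the encoded samples contains the byte 0xFF.
has_ff = [0xFF in s.encode("utf-8") for s in samples]

# Conversely, a byte string containing 0xFF cannot be decoded as UTF-8.
try:
    b"foo\xff".decode("utf-8")
    ff_is_valid = True
except UnicodeDecodeError:
    ff_is_valid = False
```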

But chr(255) proved problematic because it changes the order
of the keys. DAWG naturally returns keys in lexicographical order,
but if chr(255) is appended to the text part of each key, the
visible order changes. Imagine a 'foo' key and a 'foobar' key, each
with some payload. The 'foo' key would sort after the 'foobar' key:
the values compared would be 'foo<sep>' and 'foobar<sep>',
and ord(<sep>)==255 is greater than ord(<any other character>).

So now the default <separator> is chr(1). This is the lowest allowed
character (chr(0) is not supported by dawgdic), so it preserves the
alphabetical order.
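
The effect of the two separators on ordering can be reproduced with plain byte strings. This is a minimal sketch: stored_keys is a hypothetical helper written for this illustration, not part of the library, and the payload is left out because only the text part and the separator decide the comparison here.

```python
def stored_keys(keys, separator):
    """Return keys with the separator appended, sorted byte by byte.

    Hypothetical helper for illustration only; the payload is omitted.
    """
    return sorted(key.encode("utf-8") + separator for key in keys)


keys = ["foobar", "foo"]

# With the 0xFF separator, 'foobar' sorts before 'foo'.
old_order = stored_keys(keys, b"\xff")

# With the 0x01 separator, 'foo' sorts before 'foobar',
# which matches plain alphabetical order.
new_order = stored_keys(keys, b"\x01")
```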

It is not strictly correct to use chr(1) as a separator because chr(1)
is a valid UTF-8 character. But I think in practice this won’t be an issue:
such a control character is very unlikely in text keys, and binary keys
are not supported anyway because dawgdic doesn’t support keys containing
chr(0).
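
To make the ambiguity concrete, here is a small sketch of what can go wrong when a key itself contains chr(1). split_stored is a hypothetical helper that splits at the first separator; it is an assumption made for illustration and not the library's actual parsing code.

```python
SEPARATOR = b"\x01"


def split_stored(stored):
    """Split stored bytes into (text part, payload) at the first separator.

    Hypothetical helper for illustration only.
    """
    text, _, payload = stored.partition(SEPARATOR)
    return text, payload


# chr(1) is valid UTF-8: it encodes to the single byte 0x01.
encoded_sep = "\x01".encode("utf-8")

# A normal key splits as expected.
ok = split_stored("foo".encode("utf-8") + SEPARATOR + b"cGF5")

# A key that itself contains chr(1) is split at the wrong position.
bad = split_stored("fo\x01o".encode("utf-8") + SEPARATOR + b"cGF5")
```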

If you can’t guarantee that chr(1) is not a part of your keys, if
lexicographical order is not important to you, or if you need to read
a BytesDAWG/RecordDAWG created by DAWG < 0.5, then pass the
payload_separator argument to the constructor:

>>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')

The storage scheme has one more implication: values of BytesDAWG
and RecordDAWG are also sorted lexicographically.

For RecordDAWG there is a gotcha: in order to have a meaningful
ordering of numeric values, store them in big-endian format.
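
The reason big-endian matters can be seen by sorting packed values byte by byte. This sketch uses only Python's struct module and assumes that payloads are compared as raw packed bytes; the values and the "H" (unsigned 16-bit) format are arbitrary choices for illustration.

```python
import struct

values = [1, 2, 256]

# Little-endian layout: the low byte comes first, so byte-wise order
# no longer matches numeric order.
little = sorted(values, key=lambda v: struct.pack("<H", v))

# Big-endian layout: the high byte comes first, so byte-wise order
# matches numeric order for unsigned integers.
big = sorted(values, key=lambda v: struct.pack(">H", v))
```

With the little-endian format 256 sorts first because its packed form starts with a zero byte; with the big-endian format the numeric order is preserved.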

If you find a bug in the C++ part, please report it to the original
bug tracker.

How is source code organized

There are 4 folders in the repository:

bench - benchmarks & benchmark data;

lib - the original unmodified dawgdic C++ library and
a customized version of the libb64 library. They are bundled
for easier distribution; if something has to be fixed in these
libraries, consider fixing it in the original repositories;

Authors & Contributors

License

Wrapper code is licensed under MIT License.
Bundled dawgdic C++ library is licensed under BSD license.
libb64 is Public Domain.

0.5.1 (2012-10-11)

better error reporting while building DAWGs;

__contains__ is fixed for keys with zero bytes;

dawg.Error exception class;

building of BytesDAWG and RecordDAWG fails instead of
producing incorrect results if some of the keys contain unsupported characters.

0.5 (2012-10-08)

The storage scheme of BytesDAWG and RecordDAWG is changed in
this release in order to provide the alphabetical ordering of items.

This is a backwards-incompatible release. In order to read BytesDAWG or
RecordDAWG created with previous versions of DAWG use payload_separator
constructor argument:

>>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')

0.4.1 (2012-10-01)

Segfaults with empty DAWGs are fixed by updating dawgdic to latest svn.

0.4 (2012-09-26)

iterkeys, iteritems and iterprefixes methods
(thanks Dan Blanchard).

0.3.2 (2012-09-24)

prefixes method for finding all prefixes of a given key.

0.3.1 (2012-09-20)

bundled dawgdic C++ library is updated to the latest version.

0.3 (2012-09-13)

similar_keys, similar_items and similar_item_values methods
for more permissive lookups (they may be useful e.g. for umlaut handling);

load method returns self;

Python 3.3 support.

0.2 (2012-09-08)

Greatly improved memory usage for DAWGs loaded with load method.

There is currently a bug somewhere in the wrapper, so DAWGs loaded with
the read() method or unpickled DAWGs use 3x-4x more memory than DAWGs
loaded with the load() method. load() is fixed in this release but
the other methods are not.