Usage

Help text

$ ./ml2json --help
ml2json sourcedir-or-file(s) [--json-to targetfile]

sourcedir should be a directory with *.mbox files directly in it
(but see --recurse). If a file is given instead, it is assumed to
be an mbox file. Paths to maildirs (with cur and new
subdirectories) or ezmlm archive directories are auto-detected and
treated accordingly.

All options and actions except for --deidentify, --open-message-id
and --open-identify (and --help and --config) can alternatively be
configured from a config file, see --config. Command line options
have precedence. Sourcepath(s) can be configured as a 'sourcepaths'
config key holding an array. Some options can *only* be configured
in the config file.

Sourcepaths can be files ending in .gz or .bz2, in which case they
are uncompressed on the fly. (Note that this leads to different
identify values, though.)

The --json-to, --html-to and --source-to actions can be combined in
the same run.

Actions:
  --json-to file
      where the JSON output should go (use '-' for stdout).
  --html-to dir
      create html files in the specified directory
      (currently meant for debugging only)
  --html-index file
      create html file with thread 'index'; --html-to should
      be given, too, so that the link targets (currently
      assumed to be in the same directory) are available
  --source-to dir
      create files <dir>/<md5hash>/<number>.txt (where
      <md5hash>/<number> is the same string that
      --deidentify takes) and link them in the html files
      produced with --html-to
  --open-message-id message-id
      run $BROWSER on the given message that was written
      before using --html-to. Needs the --html-to option again
      to know where the files are, but doesn't regenerate
      files nor does it run json generation.
  --open-identify identifystring
      run $BROWSER on the given message that was written
      before using --html-to. Needs the --html-to option again
      to know where the files are, but doesn't regenerate
      files nor does it run json generation.
  --deidentify string
      print message identified by the given string, which is
      output into the JSON by the json_identify method (as
      'identify' field by default) or printed along with
      WARN and ERROR messages.
      Only works if the generated files are still available.
      If --attachment-basedir was given before, it has to
      be specified for --deidentify as well.
  --show-mbox-path md5hexstring[/number]
      show path of original mbox file for a given mbox
      identifier (as they are used in identify strings).
  --cleanup  delete currently used temp directory at /tmp/ml2json*

Options:
  --config myconfig.pl
      load given file containing perl code, which must end
      with a perl hashref; see default configuration in
      './default_config.pl'
      for the options that can be set that way. Several
      --config options can be given, each subsequent one
      overrides options loaded from the previous ones (with
      regards to the keys of the top level hash ref). The
      previously loaded config can be accessed from
      $main::config. Keys in the config file use
      underscore to separate words, not '-'.
  --verbose  show NOTEs in addition to WARNings on stderr
  --recurse  recurse into subdirectories of sourcedir-or-file
  --mbox-glob globstring
      glob to use to find files in directories that contain
      mboxes. Defaults to '*.mbox'.
  --attachment-basedir path
      use path instead of a random subdirectory below
      '/tmp/ml2json' for the output; the output currently
      contains serialized objects as well, but those don't
      hurt, do they?
      If path is absolute, then the url field for
      attachments is output as file:// URI, if it is
      relative, it is output as a relative URI (i.e. no
      file:// prefix).
  --max-thread-duration duration[1]
      When encountering emails with no or no known
      in-reply-to and references headers, group them into
      the same thread according to their subject line as
      long as the time span between the first mail of that
      subject and the last one doesn't exceed the given
      duration. Pass '0' to disable. Default: '1 month'.
  --max-date-deviation duration[1]
      When mbox separators (lines starting with 'From ')
      contain time stamps, and those deviate more than
      <duration> from the Date header contained in the mail
      (or if there is no Date header in the mail), use the
      mbox time stamp instead. Default: off. Note: mbox
      time stamps do not necessarily represent the time
      when an email was received; they could also be from
      when the mails were copied around.
  --filter-max-age duration[1]
      Only output messages which have a date/time more
      recent than the current time minus the given duration.
      When using this option, the in-reply-to fields, or the
      links in the generated pages with --html-to, can point
      to emails that are known but not written to the json
      stream or html-to directory.
  --jobs n
      use n instead of the default 2[2] jobs in parallel

[1] duration can be anything that Time::Duration::Parse supports,
    like '1 day' or '1d and 5h'; bare numbers are interpreted as
    seconds.
[2] default number of jobs is derived from the number of cores on
    the machine the program is running on.

(Option names can be shortened when given as command arguments (not
when given in config files) as long as they are unambiguous and
you accept the risk of future ambiguity.)
(Christian Jaeger <ch@christianjaeger.ch>)
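
Because the --json-to, --html-to and --source-to actions can be combined, one run can produce every output at once. The following is a minimal sketch, not part of ml2json itself: the function name export_all, the ML2JSON override variable and the file names below the output directory are made up for illustration; only the flags are taken from the help text above.

```shell
# Hypothetical wrapper: a single run that writes the JSON file, the HTML
# pages, the thread index and the message sources.
# ML2JSON (an assumption of this sketch) overrides the program path.
export_all() {
    src=$1   # directory with *.mbox files, a maildir, or an ezmlm archive
    out=$2   # output directory; whether ml2json creates missing
             # subdirectories is not verified here
    "${ML2JSON:-./ml2json}" "$src" \
        --json-to "$out/archive.json" \
        --html-to "$out/html" \
        --html-index "$out/html/index.html" \
        --source-to "$out/source"
}
```

A call could look like `export_all ~/lists/mylist ./out`. The index file is placed inside the --html-to directory because the help text says the link targets are assumed to be in the same directory.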

Instructions

collect all mbox files or Maildir/ezmlm directories your archive is comprised of into one or several directories (or directory tree, if you use the --recurse option), possibly using symlinks.

decide upon a base directory where all the unpacked attachments (as well as serialized state--for details see phases) should be stored; if you don't care about accessing the attachments, some directory under /tmp will be chosen. ml2json will create a symlink at ~/.ml2json-tmp which points to that generated directory, so that subsequent runs of ml2json will find it again and can omit part of the work that was already done. If you do care about the generated attachments, specify the --attachment-basedir option.
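
To check which temporary directory a previous run left behind, the ~/.ml2json-tmp symlink can be inspected directly. A small sketch; the function name ml2json_tmpdir and its error message are hypothetical, only the symlink location comes from the text above.

```shell
# Print the directory that ~/.ml2json-tmp points to, or return non-zero
# if no previous run has left the symlink behind.
ml2json_tmpdir() {
    link="$HOME/.ml2json-tmp"
    if [ -L "$link" ]; then
        readlink "$link"
    else
        echo "no ml2json temp directory recorded" >&2
        return 1
    fi
}
```

This only reads the symlink; it does not check whether the target directory still exists.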

it's possible to customize what fields are output in the JSON by using a config file. You can also put most command-line parameters into a config file, so if you want to run a particular conversion repeatedly and consistently, consider doing that. The file default_config.pl has the defaults; this file is not meant to be edited. Instead, create a new file that shadows the config keys that you want to set, then pass the path to it to the --config option. An example can be found in ml2json-list-generate/config.pl which is used to generate the list archive. Read about --config in ./ml2json --help.
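
As an illustration of such a shadowing config file, the sketch below writes one from the shell. The key names are assumptions derived from the rule that config keys are the option names with '-' replaced by '_' (plus the documented 'sourcepaths' key); the values and the function name write_example_config are made up. default_config.pl remains the authoritative reference for names and value types.

```shell
# Write a small config file that shadows a few defaults.
# Key names are ASSUMED from the option names; check default_config.pl.
write_example_config() {
    cat > "$1" <<'EOF'
# Perl code; the last expression must be a hashref.
+{
    sourcepaths => ['archive/'],
    mbox_glob => '*',
    json_to => 'archive.json',
    max_thread_duration => '2 weeks',
}
EOF
}
```

After `write_example_config myconfig.pl`, the file would be passed as `./ml2json --config myconfig.pl`; options given on the command line still take precedence over it.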

run ./ml2json sourcedir --json-to targetfile, perhaps with additional options of your choice (in particular, you need the --mbox-glob option if the files are not named according to the default mbox glob pattern, *.mbox; use * to make it look at all files).
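
When passing * as the glob, it should be quoted so that ml2json receives it unexpanded. A minimal sketch with a hypothetical wrapper name (convert_tree) and a hypothetical ML2JSON override variable; the flags themselves are from the help text.

```shell
# Convert every file below a directory tree, whatever its name.
# The glob is quoted so that ml2json, not the shell, expands it.
convert_tree() {
    "${ML2JSON:-./ml2json}" "$1" --recurse --mbox-glob '*' --json-to "$2"
}
```

For example, `convert_tree ~/lists/mylist mylist.json`. Without the quotes, the shell would replace * with the names of the files in the current directory before ml2json sees it.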

the temp / attachments dir is structured as follows:

$attachment_basedir/$hash_of_mailbox_path/$i/<files>

$hash_of_mailbox_path is the mailbox_path_hash config setting applied to the path to the mailbox; this is to shorten the path to something that won't ever conflict. This does not necessarily hide the original path: if the default hash config setting isn't changed, the original path can be determined by using MD5 hash crackers, and if the $attachment_basedir/$hash_of_mailbox_path/__meta file is still present, the path can be read from it.

If the cache_dir option is set, then no __meta files will be created within $attachment_basedir (making it possible to use it cleanly for serving in public html archives).

If you want to know which mailbox path a particular hash originated from, use the ml2json --show-mbox-path option.

$i is a string indicating the position of the email in the mailbox, or $o-$p in case of ezmlm archive dirs, where $o is the ezmlm subdir and $p the message file name within the subdir (both being non-negative integers, $p possibly with a leading zero).

You can run ml2json --deidentify "$hash_of_mailbox_path/$i" to make it print the original message string (as it was cut out of the mbox file, or copied from a Maildir file).
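
Given the layout above, the strings accepted by --deidentify can be read off the directory tree. The following sketch assumes that every second-level subdirectory below the base directory corresponds to one message; serialized state stored elsewhere in the tree could break that assumption. The function name list_identify_strings is hypothetical.

```shell
# List "$hash_of_mailbox_path/$i" strings found below an attachment base
# directory, one per line. Plain files such as __meta are skipped because
# the glob ends in a slash and therefore matches directories only.
list_identify_strings() {
    base=${1%/}
    for dir in "$base"/*/*/; do
        [ -d "$dir" ] || continue    # glob matched nothing
        rel=${dir#"$base"/}          # strip the base directory prefix
        printf '%s\n' "${rel%/}"     # strip the trailing slash
    done
}
```

Each printed line can then be passed to `ml2json --deidentify`, together with --attachment-basedir if that option was used for the original run.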

optionally, to clean up the generated temporary / attachments files, run ml2json with the --cleanup option; if you gave the --attachment-basedir option before, it has to be given again, otherwise ml2json will just look at ~/.ml2json-tmp (or do nothing if not present).
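
The rule about repeating --attachment-basedir can be captured in a small helper. This is only a sketch: the function name cleanup_ml2json and the ML2JSON override variable are hypothetical, and the order of the two flags is an arbitrary choice.

```shell
# Run the cleanup action, passing --attachment-basedir again when a base
# directory was used for the original conversion.
cleanup_ml2json() {
    if [ -n "${1:-}" ]; then
        "${ML2JSON:-./ml2json}" --attachment-basedir "$1" --cleanup
    else
        "${ML2JSON:-./ml2json}" --cleanup
    fi
}
```

Calling `cleanup_ml2json` without an argument relies on ~/.ml2json-tmp; `cleanup_ml2json /path/to/attachments` targets an explicitly chosen base directory.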