Miller is a command-line tool that combines features from sed, awk, cut, join, and sort. It works best on name-indexed CSV input, allowing easy cutting, sorting, and filtering by column name. It pretty-prints tabs/columns, converts between formats, can be used in shell pipes like the simpler tools, adds some SQL-like querying features, and is similar in spirit to jq, the JSON query tool.

Recent Releases

5.3.0 (07 Jan 2018 22:45) minor feature:
Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment string X. Comments are only supported at the start of a data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.
.
The count-similar verb lets you compute cluster sizes by cluster labels.
.
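A rough Python sketch of what count-similar computes (the field names here are illustrative, and this is not Miller's implementation): each record is annotated with the size of the cluster sharing its label.

```python
from collections import Counter

def count_similar(records, group_by):
    """Annotate each record with the size of its cluster, keyed by group_by."""
    sizes = Counter(r[group_by] for r in records)
    return [{**r, "count": sizes[r[group_by]]} for r in records]

records = [{"label": "a"}, {"label": "b"}, {"label": "a"}]
count_similar(records, "label")
```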
While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.
.
There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y = bitcount($x)' produces x=0xf0000206,y=7.
.
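The bit-counting here is ordinary popcount, i.e. the number of 1-bits in the integer's binary representation. A minimal Python sketch (not Miller's implementation) reproduces the example above:

```python
def bitcount(n):
    """Count the 1-bits in the binary representation of n (popcount)."""
    return bin(n).count("1")

bitcount(0xf0000206)  # 7, matching the example above
```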
mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.
.
The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.
.
As noted here, while Miller has its own DSL there will always be things better expressed in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python. SQL-input examples and SQL-output examples contain detailed information on the interplay between Miller and SQL.
.
A question was raised about suppressing numeric conversion. This resulted in a new FAQ entry, How do I suppress numeric conversion?, as well as longer-term follow-on work which will make numeric conversion happen on a just-in-time basis.
.
To my surprise, csvlite format options weren't listed in mlr --help or the manpage. This has been fixed.
.
Documentation for auxiliary commands has been expanded, including with

5.2.0 (14 Jun 2017 06:25) minor feature:
The stats1 verb now lets you use regular expressions to specify which field names to compute statistics on, and/or which to group by. Full details are here.
.
The min and max DSL functions, and the min/max/percentile aggregators for the stats1 and merge-fields verbs, now support numeric as well as string field values. (For mixed string/numeric fields, numbers compare before strings.) This means in particular that order statistics -- min, max, and non-interpolated percentiles -- as well as mode, antimode, and count are now possible on string-only (or mixed) fields. (Of course, any operations requiring arithmetic on values, such as computing sums, averages, or interpolated percentiles, yield an error on string-valued input.).
.
There is a new DSL function mapexcept which returns a copy of the argument with specified key(s), if any, unset. The motivating use-case is to split records to multiple filenames depending on a particular field value, which is omitted from the output: mlr --from f.dat put 'tee > "/tmp/data-".$a, mapexcept($*, "a")'. Likewise, mapselect returns a copy of the argument with only specified key(s), if any, set. This resolves #137.
.
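The semantics of the two functions are easy to state in terms of plain dictionaries; a Python sketch (an analogy, not Miller's implementation):

```python
def mapexcept(m, *keys):
    """Copy of m with the given keys, if present, removed."""
    return {k: v for k, v in m.items() if k not in keys}

def mapselect(m, *keys):
    """Copy of m with only the given keys, if present, kept."""
    return {k: v for k, v in m.items() if k in keys}
```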
A new -u option for count-distinct allows unlashed counts for multiple field names. For example, with -f a,b and without -u, count-distinct computes counts for distinct pairs of a and b field values. With -f a,b and with -u, it computes counts for distinct a field values and counts for distinct b field values separately.
.
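The lashed/unlashed distinction can be sketched in Python with Counter (field names illustrative, not Miller's implementation): without -u, counting is over value tuples; with -u, each field gets its own independent counter.

```python
from collections import Counter

records = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 1, "b": 2}]

# Without -u: counts of distinct (a, b) value pairs.
lashed = Counter((r["a"], r["b"]) for r in records)

# With -u: separate counts of distinct values per field.
unlashed = {f: Counter(r[f] for r in records) for f in ("a", "b")}
```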
If you build from source, you can now do ./configure without first doing autoreconf -fiv. This resolves #131.
.
The UTF-8 BOM sequence 0xef 0xbb 0xbf is now automatically ignored from the start of CSV files. (The same is already done for JSON files.) This resolves #138.
.
For put and filter with -S, program literals such as the 6 in $x = 6 were being parsed as strings. This is not sensible, since the -S option for put and filter is intended to suppress numeric conversion of record data, not p

5.1.0 (15 Apr 2017 03:16) minor feature:
JSON arrays: as described here, Miller, being a tabular data processor, isn't well-positioned to handle arbitrary JSON. (See jq for that.) But as of 5.1.0, arrays are converted to maps with integer keys, which are then at least processable using Miller. Details are here. The short of it is that you now have three options for the main mlr executable:
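The conversion itself is simple to picture; a Python sketch (assuming 1-based integer keys, and not Miller's implementation):

```python
def array_to_map(arr):
    """Convert a JSON array to a map keyed by integer position (1-based here)."""
    return {i: v for i, v in enumerate(arr, start=1)}

array_to_map(["x", "y"])  # {1: "x", 2: "y"}
```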
The new mlr fraction verb makes possible in a few keystrokes what was only possible before using two-pass DSL logic: here you can turn numerical values down a column into their fractional/percentage contribution to column totals, optionally grouped by other key columns.
.
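The single-pass idea behind fraction can be sketched in Python (field and output names here are illustrative, not Miller's implementation): sum the column, then emit each value's share of the total.

```python
def fraction(records, field):
    """Add a field giving each value's fraction of the column total."""
    total = sum(r[field] for r in records)
    return [{**r, field + "_fraction": r[field] / total} for r in records]

fraction([{"x": 1}, {"x": 3}], "x")
```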
The DSL functions strptime and strftime now handle fractional seconds. For parsing, use the %S format as always; for formatting, there are now %1S through %9S which allow you to configure a specified number of decimal places. The return value from strptime is now floating-point, not integer, which is a minor backward incompatibility not worth labeling this release as 6.0.0. (You can work around this using int(strptime(...)).) The DSL functions gmt2sec and sec2gmt, which are keystroke-savers for strptime and strftime, are similarly modified, as is the sec2gmt verb. This resolves #125.
.
A few nearly-standalone programs -- which do not have anything to do with record streams -- are packaged within Miller. (For example, the hex-dump, unhex, and show-line-endings commands.) These are described here.
.
The stats1 and merge-fields verbs now support an antimode aggregator, in addition to the existing mode aggregator.
.
The join verb now by default does not require sorted input, which is the more common use case. (Memory-parsimonious joins which require sorted input, while no longer the default, are available using -s.) This is another minor backward incompatibility not worth making a 6.0.0 over. This resolves #134.
.
mlr nest has a keystroke-saving --evar option for a common use case, namely, exploding a field by value across records.
.
The DSL referenc

5.0.0 (01 Mar 2017 19:05) minor feature:
Line endings (CRLF vs. LF, Windows-style vs. Unix-style) are now autodetected. For example, files (including CSV) with LF input will lead to LF output unless you specify otherwise.
There is now an in-place mode using mlr -I.
You can now define your own functions and subroutines: e.g. func f(x, y) { return x**2 + y**2 }.
New local variables are completely analogous to out-of-stream variables: sum retains its value for the duration of the expression it's defined in; @sum retains its value across all records in the record stream.
Local variables, function parameters, and function return types may be defined untyped or typed as in x = 1 or int x = 1, respectively. There are also expression-inline type-assertions available. Type-checking is up to you: omit it if you want flexibility with heterogeneous data; use it if you want to help catch misspellings in your DSL code or unexpected irregularities in your input data.
There are now four kinds of maps. Out-of-stream variables have always been scalars, maps, or multi-level maps: @a=1, @b[1]=2, @c[1][2]=3. The same is now true for local variables, which are new to 5.0.0. Stream records have always been single-level maps; $* is a map. And as of 5.0.0 there are now map literals, e.g. {"a":1, "b":2}, which can be defined using JSON-like syntax (with either string or integer keys) and which can be nested arbitrarily deeply.
You can loop over maps -- $*, out-of-stream variables, local variables, map literals, and map-valued function return values -- using for (k, v in ...) or the new for (k in ...) (discussed next). All flavors of map may also be used in emit and dump statements.
User-defined functions and subroutines may take map-valued arguments, and may return map values.
Some built-in functions now accept map-valued input: typeof, length, depth, leafcount, haskey. There are built-in functions producing map-valued output: mapsum and mapdiff. There are now string-to-map and map-to-string functions: splitnv, splitkv, splitnvx, splitkv

4.4.0 (13 Aug 2016 03:15) minor feature:
Mlr step -a shift allows you to place the previous record's values alongside the current record's values: http://johnkerl.org/miller/doc/reference.html#step.
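A Python sketch of the shift step (field names illustrative, not Miller's implementation): carry the previous record's value forward into a new field, with an empty value on the first record.

```python
def step_shift(records, field):
    """Add field_shift holding the previous record's value (empty for the first)."""
    prev = ""
    out = []
    for r in records:
        out.append({**r, field + "_shift": prev})
        prev = r[field]
    return out

step_shift([{"x": 1}, {"x": 2}], "x")
```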
Mlr head, when used without the group-by flag (-g), stops after the specified number of records has been output. For example, even with a multi-gigabyte data file, mlr head -n 10 hugefile.dat will complete quickly after producing the first ten records from the file.
The sec2gmtdate verb, and sec2gmtdate function for filter/put, is new: please see http://johnkerl.org/miller/doc/reference.html#sec2gmtdate and http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put.
Sec2gmt and sec2gmtdate both leave non-numbers as-is, rather than formatting them as (error). This is particularly relevant for formatting nullable epoch-seconds columns in SQL-table output: if a column value is NULL then after sec2gmt or sec2gmtdate it will still be NULL.
The dot operator has been universalized to work with any data type and produce a string. For example, if the field $n has integers, then instead of typing mlr put '$name = "value:" . string($n)' you can now simply do mlr put '$name = "value:" . $n'. This is particularly timely for creating filenames for redirected print/dump/tee/emit output.
The online documents now have a copy of the Miller manpage: http://johnkerl.org/miller/doc/manpage.html.
Inside filter/put, $x=="" was distinct from isempty($x). This was nonsensical; now both are the same.

4.3.0 (04 Jul 2016 06:45) minor feature:
Interpolated percentiles are now available using mlr stats1 -i or mlr merge-fields -i. Non-interpolated percentiles are the default. The former resemble R's type=7 quantiles and the latter resemble R's type=1 quantiles. See also http://johnkerl.org/miller/doc/reference.html#stats1 and http://johnkerl.org/miller/doc/reference.html#merge-fields.
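The two conventions can be sketched in Python (a simplified rendering of the R definitions, not Miller's implementation): type=1 takes an order statistic directly via the inverse empirical CDF, while type=7 interpolates linearly between adjacent order statistics.

```python
import math

def percentile_type1(xs, p):
    """Non-interpolated percentile (R type=1): inverse of the empirical CDF."""
    xs = sorted(xs)
    j = max(1, math.ceil(len(xs) * p))
    return xs[j - 1]

def percentile_type7(xs, p):
    """Interpolated percentile (R type=7): linear interpolation between order stats."""
    xs = sorted(xs)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])
```

For example, the median of [1, 2, 3, 4] is 2 without interpolation and 2.5 with it.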
Markdown-tabular output format is now available using --omd: please see http://johnkerl.org/miller/doc/file-formats.html#Markdown_tabular and #106.
For files using CSV input as well as CSV output, there is now a --quote-original option which outputs fields with quotes if they had them on input. The was-quoted flag isn't tracked on derived fields, e.g. if fields a and b were quoted on input, then in mlr put '$c = $a . $b' the c field won't be quoted on output. As such, this option is most useful with mlr cut, mlr filter, etc. The use-case from the original feature request #77 (comment) is in trimming down a huge CSV file in order to facilitate subsequent in-memory processing using spreadsheet software.
The cookbook at http://johnkerl.org/miller/doc/cookbook.html has been extended significantly.
You can now set a MLR_CSV_DEFAULT_RS=lf environment variable if you're tired of always putting --rs lf arguments for your CSV files: http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.
The printn and eprintn commands for mlr put are identical to print and eprint except they don't print final newlines.
It is now an error if boundvars in the same for-loop expression have duplicate names, e.g. for (a, a in $*) ... results in the error message mlr: duplicate for-loop boundvars "a" and "a".
The strptime function would announce an internal coding error on malformed format strings; now, it correctly points out the user-level error.
Percentiles in merge-fields were not working. This has been fixed; also, the missing unit-test cases which would have caught this sooner have been filled in.
Miller's CSV output-quoting was non-RFC-compliant: double-q

4.1.0 (12 Jun 2016 11:45) minor feature:
For-loops over key-value pairs in stream records and out-of-stream variables.
Loops using while and do while
.
Break and continue in for, while, and do while loops.
If-elif-else statements.
Nestability of all the above, as well as of existing pattern-action blocks.
Computable field names using square brackets, e.g. $[$a . $b] = $a . $b
.
Type-predicate functions: isnumeric, isint, isfloat, isbool, isstring
.
Commenting using pound signs.
The new print and eprint allow formatting of arbitrary expressions to stdout/stderr, respectively.
In addition to the existing dump which formats all out-of-stream variables to stdout as JSON, the new edump does the same to stderr.
Semicolon is no longer required after closing curly brace.
Emit @* and unset @* are new synonyms for emit all and unset all.
.
Unset now exists.
Mlr -n is synonymous with mlr --from /dev/null, which is useful in dataless contexts wherein all your put statements are contained within begin/end blocks.
In 4.0.0, mlr put -v '@a[1][2] = $b; $new = @a[1][2]' mydata.tbl would crash with a memory-management error.
http://johnkerl.org/miller/doc/reference.html#If-statements_for_put.
http://johnkerl.org/miller/doc/reference.html#While_and_do-while_loops_for_put.
http://johnkerl.org/miller/doc/reference.html#For-loops_for_put.
http://johnkerl.org/miller/doc/reference.html#Field_names_for_filter.
http://johnkerl.org/miller/doc/reference.html#Field_names_for_put.
http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put.
http://johnkerl.org/miller/doc/reference.html#Semicolons,_newlines,_and_curly_braces_for_put.
http://johnkerl.org/miller/doc/cookbook.html.

3.5.0 (05 Apr 2016 03:17) minor feature:
Mlr nest is a companion to mlr reshape which was introduced in Miller 3.4.0: it allows unpacking key-value pairs which are nested within field values, and repacking them. Please see http://johnkerl.org/miller/doc/reference.html#nest.
Mlr shuffle is a simple output-record permutor: http://johnkerl.org/miller/doc/reference.html#shuffle.
Mlr repeat can be used as a data-generator, to expand a few input records (or even a single one) into arbitrarily many. This is particularly useful in conjunction with pseudorandom-number generators. As well, it can be used to reconstruct individual samples from data which have been count-aggregated, so that statistics such as mode, percentiles, etc. may be computed on them. Please see http://johnkerl.org/miller/doc/reference.html#repeat.
Mlr put and mlr filter now accept a -f filename option, so that the DSL expression may be placed within a file instead of being typed out on the command line when desired. Please see http://johnkerl.org/miller/doc/reference.html#put and http://johnkerl.org/miller/doc/reference.html#filter.
Put/filter DSL string literals now may include \t, \", etc.: e.g. mlr put '$out = $left . "\t" . $right'.
There is now a typeof function for the put/filter DSLs: mlr put '$xtype = typeof($x)'. This is occasionally useful for debugging type-conversion questions.
You may now do mlr --nr-progress-mod 1000000 ... to get something printed to stderr every 1000000th input record, and so on. For long-running aggregations on large input file(s), this can provide reassurance that processing is indeed proceeding apace.
Mlr cat -n had a bug wherein it counted zero-up while its documentation claimed it counted one-up. Now it counts one-up as documented.

3.4.0 (15 Feb 2016 03:15) minor feature:
JSON is now a supported format for input and output. Miller handles tabular data, and JSON supports arbitrarily deeply nested data structures, so if you want general JSON processing you should use jq. But if you have tabular data represented in JSON then Miller can now handle that for you. Please see the reference page and the FAQ.
Reshape is a standard data-processing idiom, now available in Miller: http://johnkerl.org/miller/doc/reference.html#reshape.
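The long-to-wide direction of reshape can be sketched in Python (field names here are hypothetical, and this is not Miller's implementation): records sharing the same grouping key are collapsed into one record, with the key/value pair from each long record becoming a field.

```python
def reshape_long_to_wide(records, key_field, value_field, other):
    """Pivot long-format records to wide format, grouping on the `other` field."""
    wide = {}
    for r in records:
        row = wide.setdefault(r[other], {other: r[other]})
        row[r[key_field]] = r[value_field]
    return list(wide.values())

long = [
    {"id": 1, "key": "x", "value": 10},
    {"id": 1, "key": "y", "value": 20},
]
reshape_long_to_wide(long, "key", "value", "id")
# [{"id": 1, "x": 10, "y": 20}]
```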
Incidentally (not part of this release, but new since the last release) Miller is now available in FreeBSD's package manager: https://www.freshports.org/textproc/miller/. A full list of distributions containing Miller may be found here.
Miller is not yet available from within Fedora/CentOS, but as a step toward this goal, an SRPM is included in this release (see file-list below).
Regex captures 0 through 9: http://johnkerl.org/miller/doc/reference.html#Regex_captures.
Ternary operator in expression right-hand sides: e.g. mlr put '$y = $x > 0.5 ? 0 : 1'.
Boolean literals true and false.
Final semicolon is now allowed: e.g. mlr put ' x=1; y=2;'.
Environment variables are now accessible, where environment-variable names may be string literals or arbitrary expressions: mlr put '$home = ENV["HOME"]' or mlr put '$value = ENV[$name]'.
While records are still string-to-string maps for input and output, and between then statements, types are preserved between multiple statements within a put. Example: mlr put '$y = string($x); $z = $y . $y' works as expected, without requiring mlr put '$y = string($x); $z = string($y) . string($y)' as before.
Mixed-format join, e.g. CSV file joined with DKVP file, was incorrectly computing default separators (IRS, IFS, IPS). This resulted in records not being joined together.
Segmentation violation on non-standard-input read of files with size an exact multiple of page size and not ending in IRS, e.g. newline. (This is less of a corner case than it sounds: for example, leave a long-running pro

3.3.2 (12 Jan 2016 06:05) minor feature:
Bootstrap sampling in mlr bootstrap: http://johnkerl.org/miller/doc/reference.html#bootstrap. Compare to reservoir sampling in mlr sample: http://johnkerl.org/miller/doc/reference.html#sample.
Exponentially weighted moving averages in mlr step -a ewma: principally useful for smoothing of noisy time series, e.g. finely sampled system-resource utilization to give one of many possible examples. Please see http://johnkerl.org/miller/doc/reference.html#step.
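The EWMA recurrence is s[0] = x[0] and s[t] = alpha*x[t] + (1-alpha)*s[t-1]; a minimal Python sketch (not Miller's implementation):

```python
def ewma(xs, alpha):
    """Exponentially weighted moving average of the sequence xs."""
    out = []
    s = None
    for x in xs:
        # First sample seeds the average; later samples blend in with weight alpha.
        s = x if s is None else alpha * x + (1 - alpha) * s
        out.append(s)
    return out

ewma([0, 1, 1], 0.5)  # [0, 0.5, 0.75]
```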
"Horizontal" univariate statistics in mlr merge-fields, compared to mlr stats which is "vertical". Also allows collapsing multiple fields into one, such as in_bytes and out_bytes data fields summing to bytes_sum. This can also be done easily using mlr put. However, mlr merge-fields allows aggregation of more than just a pair of field names, and supports pattern-matching on field names. Please see http://johnkerl.org/miller/doc/reference.html#merge-fields for more information.
isnull and isnotnull functions for mlr filter and mlr put.
stats1, stats2, merge-fields, step, and top correctly handle not only missing fields (in the row-heterogeneous-data case) but also null-valued fields.
Minor memory-management improvements.

3.2.2 (30 Dec 2015 03:16) minor feature:
RFC-CSV read performance is dramatically improved and is now on par with other formats; read performance for all formats is slightly improved as well.
Variable names can now be escaped using curly braces if there are special characters in the input-data field names. Example: mlr put '${bytes.total} = ${bytes.in} + ${bytes.out}'. See also #77 where this was requested.
Compressed I/O is now supported, using built-in compatibility with local system tools: http://johnkerl.org/miller/doc/reference.html#Compression. See also #77 where this was requested.
mlr uniq is now streaming (bounded memory use, usable in tail -f contexts) when possible: i.e. when -n and -c are not specified.
Thorough valgrind-driven testing has been used to tighten memory usage. This is mostly an invisible internal improvement, although it has a slight across-the-board performance improvement as well as allowing Miller to handle even larger files in limited-memory contexts.

3.1.0 (06 Dec 2015 03:15) minor feature:
Portability fixes (affecting the RFC-CSV reader) for the Debian packaging request: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800074. The latter greatly increases the number of platforms on which Miller has been validated.
Mlr decimate: http://johnkerl.org/miller/doc/reference.html#decimate.
Integer-preservation feature for mlr top and mlr stats1 with percentiles: If inputs are integers then corresponding outputs will be so as well (unless -F, which forces all-float output).
Mlr histogram now has a --auto option for autocomputing lower and upper limits: http://johnkerl.org/miller/doc/reference.html#histogram.
Mlr uniq and mlr count-distinct now have a -n flag to show only the counts of distinct values, rather than listing all distinct values: http://johnkerl.org/miller/doc/reference.html#uniq http://johnkerl.org/miller/doc/reference.html#count-distinct.
The strlen function correctly handles UTF-8 string data.

3.0.1 (01 Dec 2015 13:25) minor feature:
Miller has always supported scientific notation in field values, e.g. x=1e6. However, it had never supported scientific notation in DSL literals, e.g. mlr put '$y = $x + 1e6'. This release fixes that.
Additionally, mlr bar now has a --auto flag which holds all records in memory and computes limits from the data, so you don't have to compute them separately and pass them in via --lo and --hi.

2.3.2 (28 Oct 2015 05:25) minor feature:
Mlr stats1 and stats2 now support a -s option whereby means, linear regressions, etc. evolve record-by-record as new records appear over time. This is particularly useful in tail -f contexts. See also http://johnkerl.org/miller/doc/reference.html#stats1 and http://johnkerl.org/miller/doc/reference.html#stats2.
Mlr filter now supports a -x flag to negate the sense of the filter: instead of hand-editing logic expressions, e.g. from mlr filter '$x > 10 && $x < 20' to mlr filter '$x <= 10 || $x >= 20'. See also http://johnkerl.org/miller/doc/reference.html#filter.
In the event a CSV file lacks header lines, you can use mlr --implicit-csv-header to supply a positional header 1,2,3,.... You can also convert those to desired text using mlr label. See also http://johnkerl.org/miller/doc/reference.html#label.
Heterogeneity support is improved for sort, stats1, stats2, step, head, tail, top, sample, uniq, and count-distinct. See also #79.
Mlr stats2 now has a logistic-regression feature, but I recommend treating it as experimental until some numerical-stability issues involving my naïve Newton-Raphson solver are worked out -- namely, it doesn't converge in all cases.

2.0.0 (28 Aug 2015 03:15) minor feature:
--csv will still be RFC-compliant by default, but RS/FS will be programmable: you'll be able to handle TSV or what have you, with double-quote support.
RS/FS/PS for all formats will be able to be multi-character, e.g. you'll be able to use CRLF for DKVP format which will resolve #19.
Read performance for CSV will be optimized.
Double-quoting will be supported in DKVP as well as in CSV.