Fold is like the builtin function reduce except that it works in
parallel. Fold takes two binary operator functions: one to reduce each
partition of our dataset and another to combine results between
partitions.
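For example, a minimal sketch of summing a small bag (the input built
with from_sequence here is purely illustrative):

>>> import dask.bag as db
>>> b = db.from_sequence(range(5))
>>> # the first operator folds elements within each partition; the
>>> # second merges the per-partition results
>>> b.fold(lambda acc, x: acc + x, lambda a, b: a + b, initial=0).compute()
10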

The key function determines how to group the elements in your bag.
In the common case where your bag holds dictionaries, the key
function typically pulls out one of the values from each dictionary.

>>> def key(x):
...     return x['name']

This case is so common that it is special cased, and if you provide a
key that is not a callable function then dask.bag will turn it into one
automatically. The following are equivalent:

>>> b.foldby(lambda x: x['name'], ...)
>>> b.foldby('name', ...)

Binops

It can be tricky to construct the right binary operators to perform
analytic queries. The foldby method accepts two binary operators,
binop and combine. A binary operator's two inputs and its output
must all have the same type.

Binop takes a running total and a new element and produces a new total:

>>> def binop(total, x):
...     return total + x['amount']

Combine takes two totals and combines them:

>>> def combine(total1, total2):
...     return total1 + total2

Each of these binary operators may have a default first value for
total, used before any other value is seen. For addition binary
operators like the ones above, this is often 0, or more generally the
identity element of your operation.
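Putting the key, binop, and combine pieces together, a sketch of a
per-name sum of amounts (the small input bag is illustrative) could
look like:

>>> import dask.bag as db
>>> b = db.from_sequence([{'name': 'Alice', 'amount': 100},
...                       {'name': 'Bob', 'amount': 200},
...                       {'name': 'Alice', 'amount': 300}])
>>> # key='name' groups elements, binop folds within a partition,
>>> # combine merges totals for the same key across partitions
>>> sorted(b.foldby('name', binop, 0, combine, 0).compute())
[('Alice', 400), ('Bob', 300)]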

split_every

Group partitions into groups of this size while performing reduction.
Defaults to 8.
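For example, a sketch reusing the binop and combine functions defined
above:

>>> b.foldby('name', binop, 0, combine, 0, split_every=4)  # merge 4 partitions at a time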

Elements are only taken from the first npartitions, with a
default of 1. If there are fewer than k elements in the first
npartitions a warning will be raised and any found elements
returned. Pass -1 to use all partitions.

compute : bool, optional

Whether to compute the result, default is True.

warn : bool, optional

Whether to warn if the number of elements returned is less than
requested, default is True.
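A short sketch of how these options interact (the bag b is assumed to
already exist):

>>> b.take(3)                               # three elements from the first partition
>>> b.take(3, npartitions=2)                # search the first two partitions
>>> b.take(3, npartitions=-1, warn=False)   # search every partition, silence the warning
>>> b.take(3, compute=False)                # return a lazy result instead of a tuple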

Index will not be particularly meaningful. Use reindex afterwards
if necessary.

Parameters:

meta : pd.DataFrame, dict, iterable, optional

An empty pd.DataFrame that matches the dtypes and column names
of the output. This metadata is necessary for many algorithms in
dask dataframe to work. For ease of use, some alternative inputs
are also available. Instead of a DataFrame, a dict of
{name:dtype} or iterable of (name,dtype) can be provided.
If not provided or a list, a single element from the first
partition will be computed, triggering a potentially expensive call
to compute. This may lead to unexpected results, so providing
meta is recommended. For more information, see
dask.dataframe.utils.make_meta.

columns : sequence, optional

Column names to use. If the passed data do not have names
associated with them, this argument provides names for the columns.
Otherwise this argument indicates the order of the columns in the
result (any names not found in the data will become all-NA
columns). Note that if meta is provided, column names will be
taken from there and this parameter is invalid.
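As a sketch, supplying meta as a {name: dtype} mapping (the field
names here are illustrative) avoids the extra compute of a sample
element:

>>> import dask.bag as db
>>> b = db.from_sequence([{'name': 'Alice', 'balance': 100},
...                       {'name': 'Bob', 'balance': 200}])
>>> # meta describes the output columns and dtypes up front
>>> df = b.to_dataframe(meta={'name': str, 'balance': int})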

Write dask Bag to disk, one filename per partition, one line per element.

Paths: This will create one file for each partition in your bag. You
can specify the filenames in a variety of ways.

Use a globstring

>>> b.to_textfiles('/path/to/data/*.json.gz')

The * will be replaced by the increasing sequence 0, 1, …

/path/to/data/0.json.gz
/path/to/data/1.json.gz

Use a globstring and a name_function= keyword argument. The
name_function function should expect an integer and produce a string.
Strings produced by name_function must preserve the order of their
respective partition indices.
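For example, a sketch of a name_function that zero-pads the partition
index (the padding scheme is just an illustration):

>>> def name(i):
...     return 'part-%04d' % i
>>> b.to_textfiles('/path/to/data/*.json.gz', name_function=name)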

Compression: Filenames with extensions corresponding to known
compression algorithms (gz, bz2) will be compressed accordingly.

Bag Contents: The bag calling to_textfiles must be a bag of
text strings. For example, a bag of dictionaries could be written to
JSON text files by mapping json.dumps onto the bag first, and
then calling to_textfiles:
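For example, assuming a bag of dictionaries named b (an illustrative
name):

>>> import json
>>> b.map(json.dumps).to_textfiles('/path/to/data/*.json')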

Absolute or relative filepath(s). Prefix with a protocol like s3://
to read from alternative filesystems. To read from multiple files you
can pass a globstring or a list of paths, with the caveat that they
must all have the same protocol.
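A few sketches of accepted path forms (the bucket and file names are
illustrative):

>>> import dask.bag as db
>>> b = db.read_text('myfile.txt')                # a single local file
>>> b = db.read_text('myfiles.*.txt.gz')          # a globstring; known compressions are handled
>>> b = db.read_text(['a.txt', 'b.txt'])          # an explicit list of paths
>>> b = db.read_text('s3://bucket/data/*.json')   # an alternative filesystem via a protocol prefix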

NOTE: corresponding partitions should have the same length; if they do not,
the “extra” elements from the longer partition(s) will be dropped. If you
have this case, chances are that what you really need is a data alignment
mechanism like pandas’s, and not a missing value filler like zip_longest.
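As a sketch, zipping two bags with matching partitioning (the inputs
are illustrative):

>>> import dask.bag as db
>>> evens = db.from_sequence(range(0, 10, 2), npartitions=2)
>>> odds = db.from_sequence(range(1, 10, 2), npartitions=2)
>>> db.zip(evens, odds).compute()
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]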