Dask.Bag implements operations like map, filter, fold, and
groupby on collections of Python objects. It does this in parallel and with a
small memory footprint using Python iterators. It is similar to a parallel
version of PyToolz or a Pythonic version of the PySpark RDD.
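A minimal sketch of these operations, assuming Dask is installed (the toy data and lambdas here are illustrative, not part of any Dask API):

```python
import dask.bag as db

# Build a bag from an in-memory sequence, split across two partitions.
b = db.from_sequence(range(10), npartitions=2)

# Chain filter, map, and a reduction; nothing runs until .compute().
result = (b.filter(lambda x: x % 2 == 0)   # keep even numbers
           .map(lambda x: x ** 2)          # square each one
           .sum()                          # reduce across partitions
           .compute())                     # 0 + 4 + 16 + 36 + 64 = 120
```

Each step builds a lazy task graph; `.compute()` triggers the parallel execution.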

By default, dask.bag uses dask.multiprocessing for computation. As a
benefit, Dask bypasses the GIL and uses multiple cores on pure Python objects.
As a drawback, Dask.bag doesn’t perform well on computations that include a
great deal of inter-worker communication. For common operations this is rarely
an issue, as most Dask.bag workflows are embarrassingly parallel or result in
reductions with little data moving between workers.

Some operations, like groupby, require substantial inter-worker
communication. On a single machine, Dask uses partd to perform efficient,
parallel, spill-to-disk shuffles. When working on a cluster, Dask uses a
task-based shuffle.
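A small sketch of a shuffle-backed groupby. The grouping key and the `shuffle='tasks'` / `scheduler='synchronous'` choices are illustrative assumptions made to keep the example self-contained, not defaults Dask requires:

```python
import dask.bag as db

b = db.from_sequence(range(8), npartitions=2)

# groupby must move every element to the partition owning its key,
# which is why it triggers a shuffle.
groups = dict(
    b.groupby(lambda x: x % 2, shuffle='tasks')
     .compute(scheduler='synchronous')
)
# groups maps each key to a list of its elements, e.g. 0 -> the evens.
```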

These shuffle operations are expensive and better handled by projects like
dask.dataframe. It is best to use dask.bag to clean and process data,
then transform it into an array or dataframe before embarking on the more
complex operations that require shuffle steps.

Bag is the mathematical name for an unordered collection allowing repeats. It
is a friendly synonym for multiset. A bag, or multiset, is a generalization of
a set that, unlike a set, allows multiple instances of its elements.

list: ordered collection with repeats, [1, 2, 3, 2]

set: unordered collection without repeats, {1, 2, 3}

bag: unordered collection with repeats, {1, 2, 2, 3}

So a bag is like a list, but it doesn’t guarantee an ordering among elements.
There can be repeated elements, but you can’t ask for the ith element.
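These semantics show up directly in the Bag API. A small sketch (the input sequence is arbitrary, chosen only to contain a repeat):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 3, 2], npartitions=2)

# Repeats are preserved and countable...
counts = dict(b.frequencies().compute())   # {1: 1, 2: 2, 3: 1}

# ...and can be collapsed away, but with no guaranteed order,
# so we sort before inspecting the result.
unique = sorted(b.distinct().compute())    # [1, 2, 3]
```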