Why would you prefer KeyBag over Hash in this case, considering that it’s a bit more code? Well, it does a better job of saying what you mean, if what you want is a positive Int-valued Hash. It actually enforces this as well:

That’s Niecza’s error; Rakudo’s is less clear, but the important point is you get an error; Perl 6 detects that you’ve violated your contract and complains.

And KeyBag has a couple more tricks up its sleeve. First, four lines to initialize your KeyBag isn’t terribly verbose, but Perl 6 has no trouble getting it down to one line:

my %words := KeyBag.new(slurp.comb(/\w+/).map(*.lc));

KeyBag.new does its best to turn whatever it is given into the contents of a KeyBag. Given a List, each of the elements is added to the KeyBag, with the exact same result of our earlier block of code.

If you don’t need to modify the bag after its creation, then you can use Bag instead of KeyBag. The difference is Bag is immutable; if %words is a Bag, then %words{$word}++ is illegal. If immutability is okay for your application, then you can make the code even more compact:

my %words := bag slurp.comb(/\w+/).map(*.lc);

bag is a helper sub that just calls Bag.new on whatever you give it. (I’m not sure why there is no equivalent keybag sub.)

Bag and KeyBag have a couple more tricks up their sleeve. They have their own versions of .roll and .pick which weigh their results according to the given values:

Passed two filenames, this makes a Bag from the words in the first file, a Set from the words in the second file, uses the set difference operator (-) to compute the set of words which are only in the first file, sorts those words by their frequency of appearance, and then prints out the top ten.

This is the perfect point to introduce Set. As you might guess from the above, it works much like Bag. Where Bag is a Hash from Any to positive Int, Set is a Hash from Any to Bool::True. Set is immutable, and there is also a mutable KeySet.

Between Set and Bag we have a very rich collection of operators:

Operation

Unicode

“Texas”

Result Type

is an element of

∈

(elem)

Bool

is not an element of

∉

!(elem)

Bool

contains

∋

(cont)

Bool

does not contain

∌

!(cont)

Bool

union

∪

(|)

Set or Bag

intersection

∩

(&)

Set or Bag

set difference

(-)

Set

set symmetric difference

(^)

Set

subset

⊆

(<=)

Bool

not a subset

⊈

!(<=)

Bool

proper subset

⊂

(<)

Bool

not a proper subset

⊄

!(<)

Bool

superset

⊇

(>=)

Bool

not a superset

⊉

!(>=)

Bool

proper superset

⊃

(>)

Bool

not a proper superset

⊅

!(>)

Bool

bag multiplication

⊍

(.)

Bag

bag addition

⊎

(+)

Bag

Most of these are self-explanatory. Operators that return Set promote their arguments to Set before doing the operation. Operators that return Bag promote their arguments to Bag before doing the operation. Operators that return Set or Bag promote their arguments to Bag if at least one of them is a Bag or KeyBag, and to Set otherwise; in either case they return the type promoted to.

Please note that while the set operators have been in Niecza for some time, they were only added to Rakudo yesterday, and only in the Texas variations.

A bit of a word may be needed for the different varieties of unions and intersections of Bag. The normal union operator takes the max of the quantities in either bag. The intersection operator takes the min of the quantities in either bag. Bag addition adds the quantities from either bag. Bag multiplication multiplies the quantities from either bag. (There is some question if the last operation is actually useful for anything — if you know of a use for it, please let us know!)

I’ve placed my full set of examples for this article and several data files to play with on Github. All the sample files should work on the latest very latest Rakudo from Github; I think all but most-common-unique.pl and bag-union-demo.pl should work with the latest proper Rakudo releases. Meanwhile those two scripts will work on Niecza, and with any luck I’ll have the bug stopping the rest of the scripts from working there fixed in the next few hours.

A quick example of getting the 10 most common words in Hamlet which are not found in Much Ado About Nothing:

Like this:

LikeLoading...

Related

This entry was posted on December 13, 2012 at 3:50 pm and is filed under 2012. You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.