# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
=head1 NAME
Lucy::Docs::Tutorial::Analysis - How to choose and use Analyzers.
=head1 DESCRIPTION
Try swapping out the PolyAnalyzer in our Schema for a RegexTokenizer:
my $tokenizer = Lucy::Analysis::RegexTokenizer->new;
my $type = Lucy::Plan::FullTextType->new(
analyzer => $tokenizer,
);
Search for C, C, and C before and after making the
change and re-indexing.
Under PolyAnalyzer, the results are identical for all three searches, but
under RegexTokenizer, searches are case-sensitive, and the result sets for
C and C are distinct.
=head2 PolyAnalyzer
What's happening is that PolyAnalyzer is performing more aggressive processing
than RegexTokenizer. In addition to tokenizing, it's also converting all text to
lower case so that searches are case-insensitive, and using a "stemming"
algorithm to reduce related words to a common stem (C, in this case).
PolyAnalyzer is actually multiple Analyzers wrapped up in a single package.
In this case, it's three-in-one, since specifying a PolyAnalyzer with
C<< language => 'en' >> is equivalent to this snippet:
my $case_folder = Lucy::Analysis::CaseFolder->new;
my $tokenizer = Lucy::Analysis::RegexTokenizer->new;
my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'en' );
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
analyzers => [ $case_folder, $tokenizer, $stemmer ],
);
You can add or subtract Analyzers from there if you like. Try adding a fourth
Analyzer, a SnowballStopFilter for suppressing "stopwords" like "the", "if",
and "maybe".
my $stopfilter = Lucy::Analysis::SnowballStopFilter->new(
language => 'en',
);
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
analyzers => [ $case_folder, $tokenizer, $stopfilter, $stemmer ],
);
Also, try removing the SnowballStemmer.
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
analyzers => [ $case_folder, $tokenizer ],
);
The original choice of a stock English PolyAnalyzer probably still yields the
best results for this document collection, but you get the idea: sometimes you
want a different Analyzer.
=head2 When the best Analyzer is no Analyzer
Sometimes you don't want an Analyzer at all. That was true for our "url"
field because we didn't need it to be searchable, but it's also true for
certain types of searchable fields. For instance, "category" fields are often
set up to match exactly or not at all, as are fields like "last_name" (because
you may not want to conflate results for "Humphrey" and "Humphries").
To specify that there should be no analysis performed at all, use StringType:
my $type = Lucy::Plan::StringType->new;
$schema->spec_field( name => 'category', type => $type );
=head2 Highlighting up next
In our next tutorial chapter, L,
we'll add highlighted excerpts from the "content" field to our search results.