At our library, perhaps owing to a larger than average number of biblios, a full re-index takes an unacceptable amount of time to complete (> 24h). We also had an issue with indexing becoming increasingly slower as new mappings were added. After some profiling with NYTProf it became clear that most of this overhead is in Catmandu::Store::ElasticSearch and Catmandu::MARC. After giving it some thought, the simplest way to resolve this seemed to be to replace these libraries with Koha-specific code, since the functionality they provide is not that hard to re-implement in a more efficient manner. Due to the complexity of Catmandu, optimizing these libraries would most likely be more challenging (and some parts cannot be optimized at all because of limitations in the architecture of Catmandu/Fix). The main benefits include:
1) Increased indexing performance (about twice as fast, six times as fast comparing only time spent in update_index()), due to more efficient JSON conversion and fewer Elasticsearch requests.
2) With Catmandu, indexing speed decreases as more mappings are added; with the alternative algorithm, indexing time remains more or less constant no matter how many mappings you add.
3) Negligible indexing start-up time. For example, we have an issue with the book drop machine, where each return takes a couple of seconds because of the Catmandu start-up overhead.
4) More transparent code and less complexity compared with Catmandu.
With this patch the largest bottleneck is instead MARC::Record::as_xml_record. Using binary MARC (ISO2709) as the serialization format would probably be a lot faster, but I still chose MARCXML because of the binary format's record length limitation (which could be exceeded for biblios with many items). I will probably look into faster MARCXML serialization options in the future to address this.
I also attach profiling results with and without the patch applied.
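The two optimizations claimed above, precompiling the mappings into a per-tag lookup table (so indexing time stays roughly constant as mappings grow) and batching documents into bulk requests (fewer Elasticsearch round trips), can be illustrated with a simplified sketch. This is illustrative Python, not Koha's actual Perl code; the rule structure and function names are invented for the example:

```python
import json

def compile_rules(mappings):
    """Turn a list of mappings into a tag -> [(subfields, target)] table, once."""
    rules = {}
    for m in mappings:
        tag, subfields = m["marc_field"][:3], m["marc_field"][3:]
        rules.setdefault(tag, []).append((set(subfields), m["target"]))
    return rules

def record_to_doc(record, rules):
    """Build an index document from (tag, subfield, value) triples in one pass.

    Each field only consults the rules for its own tag, so per-record cost does
    not grow with the total number of mappings defined.
    """
    doc = {}
    for tag, code, value in record:
        for subfields, target in rules.get(tag, []):
            if not subfields or code in subfields:
                doc.setdefault(target, []).append(value)
    return doc

def bulk_body(docs):
    """Serialize many documents into one _bulk request body (action line + doc line)."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

rules = compile_rules([
    {"marc_field": "100abc", "target": "author"},
    {"marc_field": "245ab", "target": "title"},
])
doc = record_to_doc([("100", "a", "Tolkien, J. R. R."),
                     ("245", "a", "The Hobbit")], rules)
# doc == {"author": ["Tolkien, J. R. R."], "title": ["The Hobbit"]}
```

The key point is that the rules are compiled once per indexing run, then each record is a single dictionary lookup per field, which is why the cost stays flat as mappings are added.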

Created attachment 70201: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).

Created attachment 70202: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Created attachment 70230: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Created attachment 70325: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

It breaks my authority indexing (UNIMARC):
substr outside of string at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 563.
Use of uninitialized value $enc in string eq at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 565.
Use of uninitialized value $enc in string eq at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 565.
Use of uninitialized value $enc in string eq at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 567.
Use of uninitialized value $enc in concatenation (.) or string at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 570.
Unsupported UNIMARC character encoding [] for XML output for unimarc; 100$a -> 20180313 frey50 at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 570.

Ok, I have not tried it with UNIMARC records myself, but from the error messages it appears the Perl MARC library expects subfield 100$a to contain encoding information, and that it's empty. Right now I'm not really sure if this is a MARC::File::XML issue, a data issue, or a bug in the patch. It could perhaps be worked around by adding an option for which serialization format to use (binary MARC in addition to MARCXML). I opted for MARCXML mainly because of the length limitation; we have some records with so many items that they will not fit into binary MARC.

Created attachment 73153: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Authorities is still patchy with ES for Unimarc.
Nevertheless, 100$a is not an empty field in my test sample. The way it is filled follows the common rules of most French university libraries (SUDOC: http://documentation.abes.fr/sudoc/formats/unmb/zones/100.htm), which for this part... are not compliant with the UNIMARC standard... It has 24 characters where UNIMARC says it should have 35...

I like this approach quite a bit. A couple of comments:
1.) Incomplete fields should be handled more gracefully. I'm getting a set of these warnings:
substr outside of string at /home/ere/kohacommunity/Koha/SearchEngine/Elasticsearch.pm line 356.
This is with the test records from https://github.com/joubu/koha-misc4dev/tree/master/data/sql/marc21/1611/after_17196
2.) es_id is missing from the indexed records compared to the Catmandu indexing code. Is this intentional?
3.) I think this, when done, should just replace the Catmandu-based indexing code. Since the ES support itself is somewhat experimental, it would make sense to switch once and for all.
4.) Make sure to document the dependency on Search::Elasticsearch. Would it be possible to use the v6 module? It says it supports ES 5 too. I had to downgrade it on my system for the patch to work.
5.) Booleans should be indexed as true/false. I'm seeing deprecation notices for suppress and onloan (this is wrong in the old code too, but could as well be done correctly here).
6.) The Catmandu version seems to create way more __sort fields, but perhaps it's a bug in the Catmandu version?
7.) The patch doesn't apply cleanly, so there could be also something I screwed up while fixing it manually.

Thanks
> 1.) Incomplete fields should be handled more gracefully...
I agree. I have ignored this since it only results in a warning, but it could probably be handled in a better way. Some ranges in the mappings YAML are off by one (mostly the ff*-fields, I think), which might aggravate the issue.
> 2.) es_id is missing from the indexed records compared to the Catmandu indexing code. Is this intentional?
I think this is something Catmandu adds, and I just left it out since it doesn't seem to be needed for anything.
> 3.) I think this, when done, should just replace the Catmandu-based indexing code. Since the ES support itself is somewhat experimental, it would make sense to switch once and for all.
That would be wonderful. I would actually love to get rid of all Catmandu dependencies altogether. Indexing is probably the heaviest part; the rest should be quite trivial to replace with Search::Elasticsearch (which Catmandu uses internally).
> 4.) Make sure to document the dependency on Search::Elasticsearch. Would it be possible to use the v6 module? It says it supports ES 5 too. I had to downgrade it on my system for the patch to work.
I did not document it since Catmandu depends on it, but it will need to be done if Catmandu is no longer a dependency. To use v6 you just have to change client => "5_0::Direct" to client => "6_0::Direct". If I'm not mistaken, I think we are running ES 6 with the "5_0::Direct" line (and it works, but is probably not optimal). The reason I went for 5.0 is that it was the version Koha was using at the time of the initial version of the patch (I think).
> 5.) Booleans should be indexed as true/false. I'm seeing deprecation notices for suppress and onloan (this is wrong in the old code too, but could as well be done correctly here).
I agree. This is a problem in Koha master as well, I think (but there is a bugzilla issue that takes care of it). It would be easy to fix, so I will make sure to do so.
> 6.) The Catmandu version seems to create way more __sort fields, but perhaps it's a bug in the Catmandu version?
Yes, I think this is a bug in Koha master, but I will look into it.
> 7.) The patch doesn't apply cleanly, so there could be also something I screwed up while fixing it manually.
I think you got it right; I have run into most of the above myself, though it is a bit strange that you had to downgrade to 5.0 to get it to work. I can verify tomorrow that it really is 6.x we are running.

About the version, I was wrong, it seems we are running 5.4. I would assume it would be as easy as changing the version string to get it working with 6 (but I'm not sure how 6 handles _all fields and other deprecated things used by Koha). We run a quite heavily patched version of Koha where those issues have been fixed.

Created attachment 75109: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Created attachment 75271: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Fixed the offset issue (which I should have done sooner, as it was a very minor fix).
Search::Elasticsearch had already been added to the Perl dependencies elsewhere and was included with the rebase.
es_id is not needed as far as I can tell.
Fixed the boolean deprecation warning and introduced a "value_callback" mapping option, which can later be made more dynamic, for example by letting type implementations define callbacks instead of hard-coding them in get_marc_mapping_rules.
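The "value_callback" option described above could work along these lines. This is a hypothetical Python sketch; the actual rule structure and names in the patch may differ:

```python
# Sketch of a "value_callback" mapping option: a per-field hook that transforms
# the raw MARC value before it is put into the Elasticsearch document, e.g.
# turning "0"/"1" flags into real JSON booleans to avoid deprecation notices.

def to_boolean(value):
    """Convert a MARC '0'/'1' flag into a real boolean."""
    return value == "1"

# Hypothetical rule table: target field -> options, with an optional callback.
rules = {
    "onloan": {"value_callback": to_boolean},
    "suppress": {"value_callback": to_boolean},
    "title": {},  # no callback: value passes through unchanged
}

def apply_rule(target, raw_value):
    """Apply the field's value_callback if one is defined."""
    cb = rules[target].get("value_callback")
    return cb(raw_value) if cb else raw_value
```

The point of routing this through the mapping rules (rather than hard-coding it per field) is that later a type implementation could register its own callback, as the comment suggests.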

Created attachment 75272: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

@Ere Maijala: you were also correct about the default sort setting, which is enabled when true or undefined (which I think is incorrect, but it's better to open a separate issue about that and have this patch comply with current behavior).

(In reply to Nicolas Legrand from comment #11)
> It breaks my authority indexing (Unimarc) :
>
> substr outside of string at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line
> 563.
> Use of uninitialized value $enc in string eq at
> /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 565.
> Use of uninitialized value $enc in string eq at
> /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 565.
> Use of uninitialized value $enc in string eq at
> /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 567.
> Use of uninitialized value $enc in concatenation (.) or string at
> /home/koha/perl5/lib/perl5/MARC/File/XML.pm line 570.
> Unsupported UNIMARC character encoding [] for XML output for unimarc; 100$a
> -> 20180313 frey50 at /home/koha/perl5/lib/perl5/MARC/File/XML.pm line
> 570.
I had forgotten about this issue; I will try to look into it. It would be easier if I had some test data that reproduces this error, so if you have a MARC file it would be helpful to attach it or mail it to glasklas@gmail.com.

Created attachment 75353: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>

Ok, nice! Are you referring to adding Search::Elasticsearch in PerlDependencies? It seems it has already been added, so we don't need to handle that dependency. If it is deemed the best option I can refactor the patch to remove the preference and use this as the only indexing option; that would not take long. It would be nice to get some more feedback on how to handle this.

Yes, I didn't know it was already added. That's good then.
I believe that at this point it doesn't make sense to keep two different indexing methods when it's obvious that the old one has performance issues in this use and the new one doesn't have any obvious downsides. Especially so since the Elasticsearch support hasn't been stable yet. I have some general indexing improvements that I'd be much happier to base on the new mechanism rather than the Catmandu one. That's just me, though, and I'm not sure how something like this is to be decided.

Yes, you are probably right. Initially I provided it as a preference since I was more uncertain whether it was stable, and did not want to cause regressions while the current indexing was working. But now I'm a lot more confident that at least all serious issues have been ironed out, and we have been running this in production with ~2 million biblios since the beginning of April. As you say, since Elasticsearch support is not stable yet, it's probably a good idea to replace the indexing logic with this so that possible issues can be caught by more people (the possible UNIMARC issue, for example).

Can you send something to the koha-dev mailing list and add this as a topic for the next Developers meeting David?
In a situation like this we should check for differing opinions or see if there are concerns others might have, and then we can vote at the meeting to move ahead.
I think this is the way to go; one less dependency makes it easier to maintain.

No problem, I can do that. What should that mail contain? In the first comment I described the main motivation behind the patch; should I write more about the technical details of how the code works? Also, even with this patch Catmandu is still used for searching, but I think that part will be easy to replace with a Search::Elasticsearch client so the Catmandu dependency can be dropped completely. Since this patch in its current state seems pretty stable, it would perhaps be better to do that in a separate issue later if this one gets merged.

Indeed, let's leave the other removal to a new bug and not disturb things here.
The email should just highlight the proposed change, the performance boost, and future maintainability. You can point back to this bug for more details, just give an overview that lets other developers know the situation

(In reply to David Gustafsson from comment #27)
[...]
> I had forgotten about this issue, I will try to look into it, it would be
> easier if I had some test data which produced this error. So if you have a
> marc file would be helpful to attach or mail it to glasklas@gmail.com.
Hey David,
my colleague Séverine Queune retested it, it worked well this time and it took 30% less time than usual!
We're not well versed in how it should work, but so far, having something that works and works faster seems pretty convenient :). Thanks!

Ok, good! I wrote in the first post that indexing is twice as fast. The reason it was faster in my benchmark is that we have added more mappings and have about 40% more than the Koha defaults. Catmandu gets slower the more mappings you add, but with the patch the time stays pretty constant, so the performance difference becomes more apparent as mappings are added. I'm also a little bit dissatisfied with the slow serialization performance of MARCXML, so I will probably add an option for using base64-encoded binary MARC as the serialization format instead. This should increase performance significantly, as serialization is now probably the largest performance hog. I'm also working on fixing the tests and will probably be able to complete that today.

Created attachment 75428: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Created attachment 75455: Bug 19893 - Alternative optimized indexing for Elasticsearch
Add alternative optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index by running the rebuild_elastic_search.pl
with the -d flag: `koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Enable ExperimentalElasticSearchIndexing system preference
(found under Global System preferences -> Administration -> Search Engine).
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library

Added a preference to choose the serialization format for records in the index. Base64-encoded binary MARC is now also provided as a more performant but possibly less robust option (it will not support very large records).

I'd leave the alternative serialization out for now and add a new bug about it. It might make sense to e.g. use the binary format as long as the record is small enough but switch to alternative format for large records.

Ok! MARCXML is still the default format, so previous behaviour is maintained if the preference is left as is. The performance difference was also not as big as hoped on the whole, so the largest benefit might be a smaller index. I could remove it, though I am quite confident it does not introduce any new bugs. Also, I accidentally removed your sign-off, but since some things have changed since then it's perhaps best to sign off again in that case. What I should have done is add the new changes as new commits, but I wanted a clean diff to make it easier to review as a single unit; in this case, new commits would probably have been better.

I'll review as soon as possible.
Regarding the format of the full record, there are some other options that might be worth considering:
- Store as JSON making it possible to do advanced MARC queries
- Don't store the full record but get it from the database. There are obvious downsides but also upsides to this.

David,
$serialization_format needs a default value in the code, I think (but see below). Otherwise, unless you go and save the ElasticsearchMARCSerializationFormat pref once, the search results show up empty and you'll get a lot of "uninitialized value" warnings from the following line:
if ($serialization_format eq 'base64ISO2709') {
I also believe that "$record->as_usmarc()" will carp and set the record length to 99999 if the record ends up being too long. If you made reading the index autodetect between ISO2709 and MARCXML, you could use MARCXML as the fallback format when ->as_usmarc() carps. So the ISO2709 format would be used whenever possible and MARCXML would only be used for large records. And you could get rid of the ElasticsearchMARCSerializationFormat pref. This would be my preferred method at the moment, since ISO2709 saves time and space but large records need to be supported.
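The hybrid scheme proposed here (ISO2709 when the record fits, MARCXML otherwise, with autodetection on read) could be sketched like this. This is an illustrative Python sketch, not the patch's Perl code; the real implementation would use MARC::Record's as_usmarc/as_xml_record, and 99999 is the maximum value of the five-digit record length field in the ISO2709 leader:

```python
import base64

ISO2709_MAX_LEN = 99999  # the leader's record length field is only 5 digits

def serialize_record(iso2709_bytes, marcxml_str):
    """Store base64 ISO2709 when it fits; fall back to MARCXML for big records."""
    if len(iso2709_bytes) <= ISO2709_MAX_LEN:
        return base64.b64encode(iso2709_bytes).decode("ascii")
    return marcxml_str

def deserialize_record(stored):
    """Autodetect the format on read: MARCXML starts with '<', and '<' never
    appears in the base64 alphabet, so detection is unambiguous."""
    if stored.lstrip().startswith("<"):
        return ("marcxml", stored)
    return ("iso2709", base64.b64decode(stored))
```

With this in place the pref becomes unnecessary, since the compact format is always used when possible and only oversized records pay the MARCXML cost.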

Thanks for the review and the good suggestions. I agree the hybrid method seems best; we could also save a marc_format field so the format does not have to be detected.
I have also considered the two options you mentioned: storing the whole MARC record as JSON, or skipping storing it altogether. The latter could possibly be an option (for those wanting to keep the index size small). The former would be really nice and I would like to see it implemented, so it could be the subject of a future bug. The Catmandu JSON serialization format is pretty useless for querying and analysis, but it would probably not take that much work to write a direct MARC-to-JSON conversion that allowed queries like 100.a:"Tolkien, J. R. R.". I would also like to use Kibana for analysis, and for that to be useful you would have to have the MARC data available as structured JSON.

(In reply to David Gustafsson from comment #44)
> Ok! I think these were the versions:
>
> libcatmandu-marc-perl 0.215-1
> libcatmandu-perl 1.0304-2~kohadev1
> libcatmandu-store-elasticsearch-perl 0.0507-1~kohadev1
The versions are rather old. So far nobody asked about newer versions or complained about their performance. The changelog for Catmandu lists performance improvements in later versions. I have packaged newer versions in a new 'experimental' repository. I hope it has all dependencies.
deb http://debian.koha-community.org/koha experimental main
I don't have time to run tests or try the patches at the moment, but maybe you could run your benchmarks with the newer versions to see if it makes a difference?

(In reply to Nick Clemens from comment #34)
> Can you send something to the koha-dev mailing list and add this as a topic
> for the next Developers meeting David?
>
> In a situation like this we should check for differing opinions or see if
> there are concerns others might have, and then we can vote at the meeting to
> move ahead.
>
> I think this the way to go, one less dependency makes it easier to maintain
I am not necessarily a fan of dropping dependencies. I think it is part of Perl culture to use other modules instead of reinventing everything. Regarding Catmandu, the developers are very dedicated and we will profit from future developments without investing Koha developer time, while a switch to our own solution might help with performance now but might not be touched again for a long time.

(In reply to Mirko Tietgen from comment #56)
> I am not necessarily a fan of dropping dependencies. I think it is part of
> Perl to use other modules instead of reinventing everything. Regarding
> Catmandu, the developers are very dedicated and we will profit from future
> developments without investing Koha developer time, while a change to our
> own solution might help with performance now but might not be touched again
> for a long time.
Ok, I had some issues upgrading the packages, so I included the latest versions in a custom lib directory instead, using the "use lib" pragma. These are the timings I got:
Old versions (catmandu 1.0304, catmandu-marc 0.215, catmandu-store-elasticsearch 0.0507):
real 2m2.277s
user 1m8.052s
sys 0m3.084s
New versions (catmandu 1.09, catmandu-marc 0.09, catmandu-store-elasticsearch 0.0507):
real 1m56.413s
user 1m5.020s
sys 0m2.848s
Patch:
real 0m44.444s
user 0m21.784s
sys 0m2.400s

And here are the timings with the latest versions (catmandu 1.09, catmandu-marc 1.231, catmandu-store-elasticsearch 0.0507):
real 1m59.575s
user 1m6.496s
sys 0m3.252s
Also a correction: catmandu-marc 0.09 in the previous reply should of course be 1.09.

(In reply to Nicolas Legrand from comment #37)
> (In reply to David Gustafsson from comment #27)
> [...]
> > I had forgotten about this issue, I will try to look into it, it would be
> > easier if I had some test data which produced this error. So if you have a
> > marc file would be helpful to attach or mail it to glasklas@gmail.com.
>
> Hey David,
>
> my colleague Séverine Queune retested it, it worked well this time and it
> took 30% less time than usual!
>
> We're not well verse on how it should work, but so far, having something
> that works and works faster seems pretty convenient :). Thanks!
Are you sure it only took 30% less time? The latest timings point more in the direction of a 2.5x speedup.

We had an issue with our new Koha build (where some mappings had been changed) causing crashes when saving biblios etc., since the previous version of this patch tried to update the server mappings when indexing new items. I have now changed this to a more robust and easier to debug handling of mapping synchronization by introducing a persistent index status. I'm aware that I'm abusing Context::preference for this purpose; if there are any suggestions for better persistent variable storage they are welcome.

Created attachment 76612: Bug 19893 - Alternative optimized indexing for Elasticsearch
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>

Created attachment 78091: Bug 19893 - Alternative optimized indexing for Elasticsearch
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>

From my understanding it looks like the Catmandu version is wrong, since concatenation of subfields (if that is what it does) will screw up tokenization. I see that subfields are concatenated in bug 20244, but I don't understand the motivation behind this. What I can gather from the comments is: "The join option is needed in case there are fields without punctuation. Otherwise Catmandu will join the fields together without any space." Is this something Catmandu does even with the patch? It looks good to me from the examples above.
Regarding the input I'm not 100% sure, but I think the second variant (Koha-specific) is preferable.

Yes, I'm aware of that. The problem as I see it is that subfields don't have any defined order in which the information becomes meaningful as a sentence. So proximity matching could result in false positives in this case, if I'm not missing something.

Yes, some subfields are like that, but in general the information in the different subfields cannot be automatically concatenated in a meaningful way. So my point is that it is more chance than anything else that 245a and 245b can be joined. For other fields and subfields it will probably not be the case. I think that if one wants to do this it would be better to invent some special form of mapping syntax to be used in mappings.yaml, where you could define that, for example, 245a and 245b (and other special cases) should be joined together. Something like this:
- facet: ''
  marc_field: '245a+245b'
  marc_type: marc21
  sort: ~
  suggestible: ''
But that would be a little bit challenging to implement with regards to how the mappings are currently processed.
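Parsing such a hypothetical '245a+245b' syntax would itself be straightforward; the sketch below (Python, with invented helper names) shows the idea, building the joined value in mapping order:

```python
def parse_joined_mapping(marc_field):
    """Split a '245a+245b'-style mapping into (tag, subfield) parts to join."""
    return [(piece[:3], piece[3:]) for piece in marc_field.split("+")]

def extract_joined(record, marc_field, sep=" "):
    """Concatenate the named subfields with a separator, in the order the
    mapping lists them (not the order they appear in the record)."""
    values = []
    for tag, code in parse_joined_mapping(marc_field):
        for rtag, rcode, value in record:
            if rtag == tag and rcode == code:
                values.append(value)
    return sep.join(values)

record = [("245", "a", "The Hobbit"), ("245", "b", "or There and Back Again")]
# extract_joined(record, "245a+245b") == "The Hobbit or There and Back Again"
```

As noted, the harder part is not the parsing but fitting this into how the mappings are currently processed, since a joined value cuts across the one-mapping-one-field assumption.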

Apologies:
The:
    "author__suggestion" : {
        "input" : [
            "Hanna Johansen",
            "JohansenHanna"
        ]
    },
is the Catmandu one.
@David:
Concatenation with a space is important. In UNIMARC, authors are stored in 7xx tags where $a is the last name and $b is the first name. We want the data indexed as "First Last" or "Last First". And this is useful for facets: we definitely don't want one facet with the last name and another with the first.
Bug 20389 talks about this and suggests using an indexing plugin like we did with the Solr implementation in Koha. That may be a solution.

(In reply to Ere Maijala from comment #91)
> If the fields don't belong together, you can always define them alone as is
> currently done in many cases. It's just that you should also be able to say
> that you want 245ab as a whole.
Sure, I can buy that. But that is not the way the mappings are currently defined in most places. For example, Personal-name is mapped to "100abcdefghjklmnopqrstvxyz" and would have to be split up into one mapping for each subfield, like 100a, 100b, 100c, 100d, etc., or all the information will be joined into a string that does not make much sense. This will make search behavior harder to predict, potentially cause unexpected exact/proximity matches, and thus be more difficult to optimize.

@Alex Arnaud: I agree concatenation of subfields could be useful, but I argue that how this is done has to be defined in the mappings. If we concatenate all the subfields, many existing fields where concatenation is not desirable will be concatenated as a result.

(In reply to David Gustafsson from comment #95)
> (In reply to Ere Maijala from comment #91)
> > If the fields don't belong together, you can always define them alone as is
> > currently done in many cases. It's just that you should also be able to say
> > that you want 245ab as a whole.
>
> Sure, I can buy that. But that is not the way the mappings are currently
> defined in most places. For example Personal-name is mapped to
> "100abcdefghjklmnopqrstvxyz" and would have to be split up into one mapping
> for each subfield, like 100a, 100b, 100c, 100d etc, or all the information
> will be join into a string that does not make much sense. This will make
> search behavior harder to predict, potentially cause unexpected
> exact/proximity matches and thus more difficult to optimize.
I just realized that splitting up the fields would not make any difference with the Koha-specific implementation; it makes no distinction between subfield mappings defined together in one mapping or separately. I would guess it would be much more practical to introduce concatenation as an exception, with special syntax (as only a few mappings appear to have use for it), and keep the default behavior as it is.

(In reply to David Gustafsson from comment #97)
> I just realized that splitting up the fields would not make any difference
> with the Koha-specific implementation, it makes no distinction if subfield
> mappings are defined together in one mapping or separately. I would guess
True, but in my version in bug 20244 it does make a difference.
As for the indexing of 100 etc., I can't see the problem with indexing the whole field as an entity. Since everything in it is in a single field, it's bound to be something that belongs together. If, however, that turns out to be a problem, it would be way better to be able to define e.g. that 100abcde is one field and the rest is another.

(In reply to David Gustafsson from comment #96)
> @Alex Arnaud I agree concaternation of subfields could be useful, but I ague
> that how this is done has to be defined in the mappings. If we concatenate
> all the subfields, many existing fields where concatenation is not desirable
> will be concaternated as a result.
I think the mapping system was designed for that, i.e.:
author: 700ba
# we get 700$a and 700$b concatenated
And when concatenation is not wanted:
author: 700b
author: 700a
# We get 700$a and 700$b in 2 different array elements

(In reply to Alex Arnaud from comment #99)
> (In reply to David Gustafsson from comment #96)
> > @Alex Arnaud I agree concaternation of subfields could be useful, but I ague
> > that how this is done has to be defined in the mappings. If we concatenate
> > all the subfields, many existing fields where concatenation is not desirable
> > will be concaternated as a result.
>
> I think the mapping system was designed for that. I.e
>
> author: 700ba
> # we get 700$a and 700$b concatenated
>
> And when concatenation is not wanted:
>
> author: 700b
> author: 700a
> # We get 700$a and 700$b in 2 different array elements
In that case there are MANY mappings that are defined in the wrong way.

(In reply to David Gustafsson from comment #102)
[...]
> > I think the mapping system was designed for that. I.e
> >
> > author: 700ba
> > # we get 700$a and 700$b concatenated
> >
> > And when concatenation is not wanted:
> >
> > author: 700b
> > author: 700a
> > # We get 700$a and 700$b in 2 different array elements
>
> In that case there are MANY mappings that are defined in the wrong way.
True, but we recently learned the hard way that we need that kind of concatenation in the 7xx zone. We have cases of "authorfirstname authorlastname" that don't work and should. Rewriting mappings is not expensive compared to this.

(In reply to David Gustafsson from comment #102)
> (In reply to Alex Arnaud from comment #99)
> > (In reply to David Gustafsson from comment #96)
> > > @Alex Arnaud I agree concaternation of subfields could be useful, but I ague
> > > that how this is done has to be defined in the mappings. If we concatenate
> > > all the subfields, many existing fields where concatenation is not desirable
> > > will be concaternated as a result.
> >
> > I think the mapping system was designed for that. I.e
> >
> > author: 700ba
> > # we get 700$a and 700$b concatenated
> >
> > And when concatenation is not wanted:
> >
> > author: 700b
> > author: 700a
> > # We get 700$a and 700$b in 2 different array elements
>
> In that case there are MANY mappings that are defined in the wrong way.
Why? This is the behavior we had before. In my opinion, the point here is not to discuss how we should format data for Elasticsearch but to make the indexing process faster. So the first step is to have at least the same features/data with Koha-specific code as we had with the Catmandu libraries and, once we are done, check whether we actually save time.

(In reply to Nicolas Legrand from comment #103)
> (In reply to David Gustafsson from comment #102)
> [...]
> > > I think the mapping system was designed for that. I.e
> > >
> > > author: 700ba
> > > # we get 700$a and 700$b concatenated
> > >
> > > And when concatenation is not wanted:
> > >
> > > author: 700b
> > > author: 700a
> > > # We get 700$a and 700$b in 2 different array elements
> >
> > In that case there are MANY mappings that are defined in the wrong way.
>
> True, but we recently learned the hard way we need those kind of
> concatenation in the zone 7xx. We have cases of "authorfirstname
> authorlastname" that don't work and should. Rewriting mappings is not
> expansive compare to this.
Hello Nicolas,
Any examples of what doesn't work?

> Why? This is the behavior we had before. In my opinion, the point here is
> not to discuss
> how we should format data for elasticsearch but make indexing process
> faster. So, the first
> step is to have at least the same feature/data with koha-specific code we
> had with Catmendu libraries and, once
> we are done, check if we already have time saving.
I don't think the two can be separated. If the indexing method doesn't allow something that's clearly needed, any optimization of the indexing code needs to take it into account. Alex's comment #83 also implies that concatenation did work with Catmandu, and even if it didn't, it would have been easy to tweak the rules to fix it.
Your version of indexing would be faster than Catmandu even if subfields in a single field were processed in rule order. But if all this becomes too complicated, I think it would be better to put any optimization on the back burner and work on the current code instead to fix its issues. E.g. I can easily resurrect the previous patch in bug 20244 for several improvements.

(In reply to Alex Arnaud from comment #105)
> (In reply to Nicolas Legrand from comment #103)
> > (In reply to David Gustafsson from comment #102)
> > [...]
> > > > I think the mapping system was designed for that. I.e
> > > >
> > > > author: 700ba
> > > > # we get 700$a and 700$b concatenated
> > > >
> > > > And when concatenation is not wanted:
> > > >
> > > > author: 700b
> > > > author: 700a
> > > > # We get 700$a and 700$b in 2 different array elements
> > >
> > > In that case there are MANY mappings that are defined in the wrong way.
> >
> > True, but we recently learned the hard way we need those kind of
> > concatenation in the zone 7xx. We have cases of "authorfirstname
> > authorlastname" that don't work and should. Rewriting mappings is not
> > expansive compare to this.
>
> Hello Nicolas,
>
> Any exemples of what doesn't work ?
Hi Alex,
On the Bulac catalog, you can search for "Ariane Eissen".
You'll get 2 results, but we hold 3 books she wrote.
In the third record, "Ariane" is preceded by "sous la direction d'" in 200$f, so with the apostrophe the name of the author is treated as "d'Ariane".
This is quite an obscure problem to us, because the 7xx zone is correctly indexed and authorities are present, so we should have been able to find this third record using both "Ariane" and "d'Ariane".
I just tested with the form you mentioned in comment 94 and it works: I got 3 results searching for "eissenariane".
Could that example help you?

(In reply to Ere Maijala from comment #106)
> > Why? This is the behavior we had before. In my opinion, the point here is
> > not to discuss
> > how we should format data for elasticsearch but make indexing process
> > faster. So, the first
> > step is to have at least the same feature/data with koha-specific code we
> > had with Catmendu libraries and, once
> > we are done, check if we already have time saving.
>
> I don't think the two can be separate. If the indexing method doesn't allow
> something that's clearly needed, any optimization of the indexing code needs
> to take it into account. Alex's comment #83 also implies that concatenation
> did work with Catmandu, and even if it didn't, it would have been easy to
> tweak the rules to fix it.
>
> Your version of indexing would be faster than Catmandu even if subfields in
> a single field would be processed in rule order. But if all this becomes too
> complicated, I think it would be better to put any optimization on the back
> burner and work on the current code instead to fix its issues. E.g. I can
> easily resurrect the previous patch in bug 20244 for several improvements.
I took a look at it and it seems to be quite easily fixed with the optimized indexing. I think you join all the subfields in your patch, but I think I can post a patch by tomorrow (or the next day) where only some subfields (defined in the mappings) can be joined, without much increased complexity or impact on performance.

And I would not say that concatenation currently works if the result is "JohansenHanna", "MaquetYves-Marie" or "BehndKathi", for example (missing space). Probably easy to correct, but it could be improved further with the optimized indexing as mentioned above.

Subfields to be joined (with a space) can be defined within parentheses in "marc_field" like:
facet: ''
marc_field: 245(ab)cdef
marc_type: marc21
sort: 1
suggestible: '1'
An alternative, perhaps more flexible, way of defining them could instead be:
facet: ''
marc_field: 245cdef
subfields_join:
  - subfields: ab
    separator: ''
marc_type: marc21
sort: 1
suggestible: '1'
Perhaps sometimes you don't want a space as the separator, if the subfields are already formatted to be directly concatenated, for example.
Tests should also be added; I can probably do that tomorrow. I also had an error involving a missing normalizer "my_normalizer". I removed it from field_config.yaml and it worked. I don't know if this is a bug and, in that case, whether it has been reported. If you want to try out this patch, that line probably needs to be removed.
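For what it's worth, the proposed parenthesis syntax looks simple enough to parse with a regular expression. Here is a rough sketch of the idea, in Python for brevity rather than Koha's Perl; the function name and the returned structure are invented for illustration only:

```python
import re

def parse_marc_field(marc_field):
    """Split a mapping such as '245(ab)cdef' into a tag plus subfield
    groups: a parenthesised run is joined into one value when indexed,
    while each bare subfield becomes a separate value."""
    m = re.match(r'^(\d{3})((?:\([a-z0-9]+\)|[a-z0-9])*)$', marc_field)
    if not m:
        raise ValueError("invalid marc_field: " + marc_field)
    tag, rest = m.groups()
    groups = []
    # findall yields one tuple per alternation match: (joined, single)
    for joined, single in re.findall(r'\(([a-z0-9]+)\)|([a-z0-9])', rest):
        if joined:
            groups.append({'subfields': list(joined), 'join': True})
        else:
            groups.append({'subfields': [single], 'join': False})
    return tag, groups
```

So '245(ab)cdef' yields tag '245' with one joined group for a and b, and c, d, e and f as individual values.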

Is there a real use case for not adding the space? Obviously it's not needed if it's there already, but that can be done automatically.
I'm asking because conditional logic here gets a bit more complicated when you take optional subfields into consideration. What I mean is that the rule
marc_field: 245(ab)cdef
is clear when both a and b exist, but what if there's only acd?

(In reply to Ere Maijala from comment #113)
> Is there a real use case for not adding the space? Obviously it's not needed
> if it's there already, but that can be done automatically.
>
> I'm asking because conditional logic here gets a bit more complicated when
> you take optional subfields into consideration. What I mean is that the rule
>
> marc_field: 245(ab)cdef
>
> is clear when both a and b exist, but what if there's only acd?
Perhaps not, but it would not be a major thing to add, so if people have use for it, it could be added at low cost. If subfields do not exist they will be skipped, so if there is only acd, (ab) will become just a, with c and d added as separate values.
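The skip-if-missing behavior described above can be sketched like this (Python for illustration only; the names are hypothetical, and `groups` is assumed to be the parsed form of a rule such as 245(ab)cdef, one entry per group):

```python
def extract_values(field_subfields, groups, separator=' '):
    """field_subfields: list of (code, value) pairs from one MARC field.
    Grouped subfields are joined with the separator; missing subfields
    are simply skipped, so a group (ab) with only $a present yields $a
    alone, and absent single subfields produce no value at all."""
    by_code = {}
    for code, value in field_subfields:
        by_code.setdefault(code, []).append(value)
    values = []
    for group in groups:
        present = [v for code in group['subfields'] for v in by_code.get(code, [])]
        if not present:
            continue  # nothing from this group exists in the field
        if group['join']:
            values.append(separator.join(present))
        else:
            values.extend(present)
    return values
```

With the rule 245(ab)cdef and a field containing only $a, $c and $d, this gives the $a value on its own plus $c and $d as separate values, as described above.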

(In reply to Alex Arnaud from comment #105)
> (In reply to Nicolas Legrand from comment #103)
> > (In reply to David Gustafsson from comment #102)
> > [...]
> > > > I think the mapping system was designed for that. I.e
> > > >
> > > > author: 700ba
> > > > # we get 700$a and 700$b concatenated
> > > >
> > > > And when concatenation is not wanted:
> > > >
> > > > author: 700b
> > > > author: 700a
> > > > # We get 700$a and 700$b in 2 different array elements
> > >
> > > In that case there are MANY mappings that are defined in the wrong way.
> >
> > True, but we recently learned the hard way we need those kind of
> > concatenation in the zone 7xx. We have cases of "authorfirstname
> > authorlastname" that don't work and should. Rewriting mappings is not
> > expansive compare to this.
>
> Hello Nicolas,
>
> Any exemples of what doesn't work ?
Hey Alex
https://koha.bulac.fr/cgi-bin/koha/opac-MARCdetail.pl?biblionumber=914403
Try to find this book by typing the name of the co-author, "Ariane Eissen".
Her name appears fully in field 200, but the lexer doesn't separate words on apostrophes, so she's indexed as "d'Ariane Eissen". That leaves us with the 702 to find her.

(In reply to Ere Maijala from comment #116)
> The apostrophe is a separate issue and should be handled by modifying the
> analysis chain config in Elasticsearch. I haven't tested it yet, but ES has
> an elision token filter for just this purpose:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-
> elision-tokenfilter.html. Anyway, I'd suggest we discuss and make any
> changes in a separate bug.
Absolutely, I only talked about it to give some context. The important part for me is that the 7xx fields should make it possible to find an author by typing first name + surname, and they don't yet.
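For reference, the elision token filter Ere mentions is configured in the index settings roughly as below. This is only a sketch based on the Elasticsearch documentation: the filter and analyzer names are invented here, and the article list would need to match the catalog's languages.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d"]
        }
      },
      "analyzer": {
        "default_with_elision": {
          "tokenizer": "standard",
          "filter": ["lowercase", "french_elision"]
        }
      }
    }
  }
}
```

With such an analyzer, "d'Ariane" would be tokenized as "ariane", so the apostrophe case would match. As Ere says, though, this belongs in a separate bug.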

Maybe I'm missing something, but this is still not quite working as I was expecting. With "245ab" the subfields are still being split into multiple fields, and I have to use "245(ab)" to put them in a single field.

Yes, that is the way it is meant to work. Since there are only a few cases where fields need to be joined, it is better to make those the exception and keep splitting into multiple values as the default behavior.

Created attachment 78690[details][review]Bug 19893 - Alternative optimized indexing for Elasticsearch
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again; it should be about twice as fast (for a
couple of thousand biblios; with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>

Quick (naive?) questions:
- Why do you loop on MARC fields instead of the mappings?
- What are "joined fields", and do we really need to index everything?
- Can you submit an example (record in input will output this document)?

Answers as far as I can tell:
(In reply to Jonathan Druart from comment #138)
> Quick (naive?) questions:
> - Why do you loop on MARC fields instead of the mappings?
Looping on the mappings is slower than looping once over the MARC fields, and when the number of rules increases, the difference becomes even more pronounced.
> - What are "joined fields", do we really to index everything?
There are examples in the comments, but the simple case is 245 where you'd want to index subfields a, b, n and p as a complete string to enable proper sorting, ranking, proximity.

Forgot to mention that joining is also important for faceting.
There are some examples of what you get in t/Koha/SearchEngine/Elasticsearch.t. It obviously depends on the rules. But here's a short example of the 245 field.
With joining:
Input:
  245 $aMain title / $bSubtitle
Rule:
  title = 245(ab)
Result:
  title: Main title / Subtitle
Without joining:
Input:
  245 $aMain title / $bSubtitle
Rule:
  title = 245ab
Result:
  title: Main title
  title: Subtitle

(In reply to Ere Maijala from comment #130)
> Oh, I see the light now. This is useful also for indexing all subfields
> without having to list all of them. Looks good to me then. We should just
> make sure to document how the rules work.
Yes, exactly. I was going to make this point but you made it for me :) Besides being cumbersome, it would also make mappings.yaml quite huge and difficult to read, and the mappings slower to fetch from the database. Also, the mappings are already defined in a way that requires minimal changes with this approach. If joining were the default, they would have to be largely rewritten.

(In reply to Jonathan Druart from comment #138)
> Quick (naive?) questions:
> - Why do you loop on MARC fields instead of the mappings?
> - What are "joined fields", do we really to index everything?
> - Can you submit an example (record in input will output this document)?
About the first one: this is actually one of the tricks employed to make this much faster than Catmandu. If you look at the "field" subroutine, it loops through all the record's fields on each call. So if we were to loop through the fields in mappings.yaml instead, and call "field" for each one (like Catmandu does), it would be quite expensive. If we instead group all mappings by MARC field in a hash and loop through the record's fields just once, we get a huge performance gain, since hash key lookup is constant time. If we add more mappings, only the parsing of the mappings and the creation of the rules hash will take longer, but that only has to be done once per batch, so it has almost no impact on performance.
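The difference between the two loop orders can be sketched as follows (Python for illustration only; the data shapes are simplified stand-ins for the real mappings and MARC fields):

```python
def build_rules_index(mappings):
    """Group mapping rules by MARC tag once per batch so that record
    processing can look rules up in constant time instead of scanning
    every mapping for every field."""
    rules = {}
    for tag, target in mappings:
        rules.setdefault(tag, []).append(target)
    return rules

def index_record(record_fields, rules):
    """record_fields: list of (tag, value) pairs. A single pass over
    the record; each field's rules are found with one hash lookup, so
    adding more mappings does not slow this loop down."""
    document = {}
    for tag, value in record_fields:
        for target in rules.get(tag, []):
            document.setdefault(target, []).append(value)
    return document
```

Only build_rules_index grows with the number of mappings, and it runs once per batch; the per-record work stays proportional to the number of fields in the record.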

(In reply to David Gustafsson from comment #142)
> (In reply to Ere Maijala from comment #130)
> > Oh, I see the light now. This is useful also for indexing all subfields
> > without having to list all of them. Looks good to me then. We should just
> > make sure to document how the rules work.
>
> Yes, exactly. I was going to make this point but you made it for me :)
> Besides being cumbersome it would also make the mappings.yaml quite huge and
> difficult to read, and mappings slower to fetch from database. Also, the
> mappings are already defined in a way that you need to make minimal changes
> with this approach. If joing was the default it would have to be largely
> rewritten.
Indeed, this is useful for configuring mappings faster. I agree.
A syntax like 245(ab) seems good and user-friendly to me. Agreed too.
But I'm wondering about updates. There are currently some libraries in production with Elasticsearch. Will they need to change the rules themselves, or will we provide an update script that changes the rules automatically?

If bug 19707 were to be accepted into Koha, new mappings could also be added that way, even if automatic syncing comes with its own problems (old mappings are not purged, since they can't be differentiated from those added by the end user).

(In reply to David Gustafsson from comment #149)
> I will also try to have a look whether the structure of auto-suggest
> "input"-fields is correct or not.
I don't know if it works or not, and I don't know how to test it (is this used anywhere in Koha?), but the structure is still different from the Catmandu results.

First QA things here:
Pod is missing for these subs:
get_elasticsearch
marc_records_to_documents
get_marc_mapping_rules
bulk_index
create_index
index_exists
update_mappings
index_status
index_status_ok
index_status_recreate_required
index_status_reindex_required
You removed the previously created ElasticsearchMARCSerializationFormat syspref, but it is still used in Koha/SearchEngine/Elasticsearch.pm
===================================
Questions:
I'm wondering what the "index status" feature is for. The index_status sub is always called with a parameter (to set it) but never without one.
So it seems the status is never read, right? Also, the status is stored in a system preference and can only be set via the API (not by a librarian via the UI). Can we really say this should be a system preference?
I still have concerns about update/migration. As you changed the way we write mappings, could you provide a mapping between the old and new syntax? This would be useful for creating a migration script (something I'll surely do for some libraries using ES in production). It would also be good to provide at least short documentation on how to write mappings, since we use our own syntax instead of the already documented Catmandu Fix language.

Forget the point about the status feature (except for the syspref); that was a misreading on my part.
One more thing, however: I'd like to have unit tests on the code that decides how to set marc_format, if possible with a normal record, a large record and an invalid record.

I'm hoping to get back to QAing this soon. Have you had a chance to take a look at the feedback to date yet, David?
It would be great to get this moving again and try to push it before the 18.11 feature slush.

Ok! I will prioritize this and try to get it done at the beginning of next week.
Ere Maijala: Thanks for the offer! I think it will not take me too long, so if Monday/Tuesday is not too late I think I can manage.

(In reply to David Gustafsson from comment #160)
> Created attachment 80971[details][review] [review]
> Bug 19893: Add pods, remove syspref, add tests for serialization format
>
> Add missing pods, remove obsolete syspref and add test for serialization
> format for records exceeding max record size
Thanks for the patch! :) A couple of comments (I have to go now, continuing the review hopefully later):
1. There is a typo "databse" and a missing "to": "Using mappings stored in databse convert" -> "Using mappings stored in database to convert"
2. It appears there are many different changes, but they have been put into one commit. It would be easier to review if the changes were in their own commits with their own commit messages.

Regarding the other changes, to make review easier, they were mainly:
Renaming marc_records_to_documents to _marc_records_to_documents to make it "private".
Splitting each index_status_<status> method into is_index_status_<status> (to check whether the index has this status) and set_index_status_<status> (to set this status for the index), since the previous method definition was not very intuitive (call without an argument to get, call with "1" to set).

(In reply to David Gustafsson from comment #165)
> Regarding the other changes, to make review easier, they where mainly:
>
> Rename marc_records_to_documents to _marc_records_to_documents to make
> "private".
>
> Split all the index_status_<status> methods into is_index_status_<status>
> (to check whether index has this status), and set_index_status_<status> (to
> set this status for index), since the previous method definition was not
> very intuitive (call without argument to get, call with "1" to set).
Thanks for the description, but the problem is I cannot easily connect these descriptions to the pieces of code. It would be really awesome if you could split the patch into as many patches as required for each to do one thing. The tests at least should be ok to split into their own patch?
However, I keep testing your patch in its current state and will update you if there is something.

Comment on attachment 80971[details][review]Bug 19893: Add pods, remove syspref, add tests for serialization format
Review of attachment 80971[details][review]:
-----------------------------------------------------------------
::: Koha/SearchEngine/Elasticsearch/Indexer.pm
@@ +307,4 @@
> if ( !$self->store ) {
> my $params = $self->get_elasticsearch_params();
> $self->store(
> + Catmandu::Store::Elasticsearch->new(
This line is changed and it somehow introduces an unnoticeable character that is different from the one being imported, so deletion of biblio records doesn't work. This change needs to be reverted.

(In reply to Joonas Kylmälä from comment #168)
> Comment on attachment 80971[details][review] [review]
> Bug 19893: Add pods, remove syspref, add tests for serialization format
>
> Review of attachment 80971[details][review] [review]:
> -----------------------------------------------------------------
>
> ::: Koha/SearchEngine/Elasticsearch/Indexer.pm
> @@ +307,4 @@
> > if ( !$self->store ) {
> > my $params = $self->get_elasticsearch_params();
> > $self->store(
> > + Catmandu::Store::Elasticsearch->new(
>
> This line is changed and it somehow creates an unnoticable character that is
> different from the one being imported so deletion of biblio records doesn't
> work. This change of the line needs to be reverted.
Good catch. That was because I replaced some occurrences of "ElasticSearch" with "Elasticsearch" in comments (since that is how it is spelled), and accidentally search-and-replaced that one too. Fixed and attached a new patch.

Comment on attachment 78690[details][review]Bug 19893 - Alternative optimized indexing for Elasticsearch
Review of attachment 78690[details][review]:
-----------------------------------------------------------------
::: Koha/SearchEngine/Elasticsearch.pm
@@ +309,5 @@
> + my $serialization_format = C4::Context->preference('ElasticsearchMARCSerializationFormat');
> +
> + my @record_documents;
> +
> + sub _process_mappings {
This sub could be moved out of the marc_records_to_documents sub so that it could then also be documented.
@@ +397,5 @@
>
> my $marcflavour = lc C4::Context->preference('marcflavour');
> my @rules;
>
> + sub _field_mappings {
Also, I think this sub could benefit from documentation and from not being inside get_marc_mapping_rules, as there is plenty going on in get_marc_mapping_rules already; moving it out would make the code easier to follow.
I tried to make a follow-up patch with documentation, but it's going pretty slowly since I now have to gather what each variable actually contains.
$mappings: a tuple (the name of the field in the Elasticsearch index, "options" or rules to modify the field data)
$record_document: the document to put into the Elasticsearch index
...

I agree it might benefit from documentation. It should not (or even cannot) be moved outside, though, since I think there are some closures within. It is also a specific, private helper method and thus belongs inside the method.

(In reply to David Gustafsson from comment #173)
> I agree It might benefit from documentation, it should (or even can) not be
> moved outside though, since I think there are some closures within. It is
> also a specific and private helper-method, thus belongs inside the method.
I couldn't see any closures in the two subroutines I mentioned (and with "closure" I mean they don't refer to any variables in the outer subroutine). The search functionality and indexing seemed to work fine also when they are in the top level (I tested). Anyways, I put this topic on the Koha developer meeting list and maybe we will get a better consensus on how to do this.

(In reply to Joonas Kylmälä from comment #174)
> (In reply to David Gustafsson from comment #173)
> > I agree It might benefit from documentation, it should (or even can) not be
> > moved outside though, since I think there are some closures within. It is
> > also a specific and private helper-method, thus belongs inside the method.
>
> I couldn't see any closures in the two subroutines I mentioned (and with
> "closure" I mean they don't refer to any variables in the outer subroutine).
> The search functionality and indexing seemed to work fine also when they are
> in the top level (I tested). Anyways, I put this topic on the Koha developer
> meeting list and maybe we will get a better consensus on how to do this.
The general consensus at the meeting was that it seemed it could be a better function on its own, that the nesting was confusing in this case, and that it didn't seem to be a traditional closure. The subroutine should be moved to the top level.

I can have a look at this and move the subs out. I confused them with another case where closures were used; there are no closures here, so they also work outside. I don't really agree this is an improvement, since the subs are not supposed to be used outside of the subroutines in which they are defined, but it's no big deal. I don't think Koha::FieldMappings is the right place for them, though, since this is an internal data structure used by _process_mappings that has no purpose outside of that subroutine.

(In reply to Jonathan Druart from comment #176)
> Or moved to Koha::FieldMappings.
> Tests must be provided in either case.
Koha::FieldMappings is not really related to the MARC-to-Elasticsearch mappings we are talking about here, since Koha::FieldMappings doesn't have rules/mappings for whether a field or combination of fields is a boolean, sortable, and so on. The Elasticsearch.pm module is already object oriented, and the functions we are now talking about are just helper functions that don't necessarily need to be object oriented. So I'm all in for the current solution David is suggesting (but a little refactoring is needed).

(In reply to Joonas Kylmälä from comment #181)
> Hi,
>
> could you still split the changes to their own commits?
I was considering your previous comment:
(In reply to Joonas Kylmälä from comment #166)
> (In reply to David Gustafsson from comment #164)
> > Created attachment 80983[details][review] [review] [review]
> > Bug 19893: Fixed some typos
>
> Please squash this to the previous patch as a new commit created by it is
> not necessary
> (https://wiki.koha-community.org/wiki/
> Development_IRC_meeting_27_September_2017)
which I interpreted as meaning the patch was supposed to be squashed in. As I no longer have the specific git commits in my local repo, I would have to rebuild them from the different iterations of patches in this issue, and I would rather not do that due to time constraints and the risk of screwing things up. I also think there might be a benefit to reviewing the patch as a whole, since you don't have to account for errors made in earlier commits that might have been corrected in a later one.

If one big commit is fine for everybody else, then I can pass it through this time. If you come across the situation again where you have to change commits or split them up, you can do git reset --soft HEAD~1, write new commit messages, and at the end check with git diff that the code is still identical to what you started with.
I will do the actual code review, hopefully by the end of this week.
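For reference, the commit-splitting workflow Joonas describes looks roughly like this; a sketch against a throwaway scratch repository, with arbitrary file names:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -qm "initial" --allow-empty

# One big commit containing two logical changes
echo "change one" > a.txt
echo "change two" > b.txt
git add a.txt b.txt
git commit -qm "one big commit"
before=$(git rev-parse "HEAD^{tree}")

# Undo the commit but keep the changes staged
git reset -q --soft HEAD~1
# Unstage the part that belongs in the second commit
git reset -q HEAD -- b.txt
git commit -qm "first logical change"
git add b.txt
git commit -qm "second logical change"

# Verify the final tree is identical to the original
after=$(git rev-parse "HEAD^{tree}")
[ "$before" = "$after" ] && echo "tree identical"
```

Comparing the tree hashes at the end is the "check with git diff" step: it proves the split changed history but not the code.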

Comment on attachment 78690[details][review]Bug 19893 - Alternative optimized indexing for Elasticsearch
Review of attachment 78690[details][review]:
-----------------------------------------------------------------
::: Koha/SearchEngine/Elasticsearch.pm
@@ +333,5 @@
> + }
> + push @{$record_document->{$target}}, $_data;
> + }
> + }
> + foreach my $record (@{$records}) {
Instead of stacking multiple for loops here, we could move this block into its own function, maybe called marc_record_to_document. That should make the code easier to understand, as the loop would be named and the code wouldn't run off the screen horizontally, right?

Comment on attachment 81284[details][review]Bug 19893: Add pods, remove syspref, add tests for serialization format
Review of attachment 81284[details][review]:
-----------------------------------------------------------------
::: Koha/SearchEngine/Elasticsearch.pm
@@ +484,5 @@
>
> +Get mappings, an internal data structure later used by L<_process_mappings($mappings, $data, $record_document)>
> +to process MARC target data, for a MARC mapping.
> +
> +The returned C<$mappings> is to to be confused with mappings provided by C<_foreach_mapping>, rather this
Should be:
The returned C<$mappings> is not to be confused with mappings provided by C<_foreach_mapping>, rather this
(notice also the extra whitespace)
@@ +515,5 @@
> +Elasticsearch document target field type.
> +
> +=item C<$range>
> +
> +An optinal range as a string on the format "<START>-<END>" or "<START>",
Should be:
An optional range as a string in the format "<START>-<END>" or "<START>",
@@ +517,5 @@
> +=item C<$range>
> +
> +An optinal range as a string on the format "<START>-<END>" or "<START>",
> +where "<START>" and "<END>" are integers specifying a range that will be used
> +for extracting a substing from MARC data as Elasticsearch field target value.
substing -> substring
@@ +522,5 @@
> +
> +The first character position is "1", and the range is inclusive,
> +so "1-3" means the first three characters of MARC data.
> +
> +If only "<START>" is provided only one character as position "<START>" will
as position -> at position

Thanks for the patch!
Instead of croak or die, exceptions should be used, e.g. Koha::Exceptions::Exception->throw(""). Please do a git diff origin/master..HEAD to see all the dies and croaks (assuming HEAD contains your patches). The many for loops I mentioned are in the subroutine marc_records_to_documents in Koha/SearchEngine/Elasticsearch.pm – though take this as an optional thing to fix, as I know this bug is blocking quite a few other things.
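One way to scan that diff for the remaining dies and croaks, sketched here against a hypothetical diff fragment (in a real checkout you would pipe `git diff origin/master..HEAD` into the same filter instead of the sample file):

```shell
# Filter a unified diff for *added* lines containing die or croak.
cat <<'EOF' > sample.diff
+    die "Invalid MARC serialization format";
+    my $docs = $self->marc_records_to_documents($records);
+    croak "Unsupported type" unless $type;
-    die "this removed line should not match";
EOF
grep -cE '^\+.*\b(die|croak)\b' sample.diff   # prints 2
```

The `^\+` anchor restricts the match to lines the patches add, so pre-existing dies elsewhere in the diff context are not counted.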
Continuing the code review... About installer/data/mysql/atomicupdate/bug_19893_elasticsearch_index_status_sysprefs.sql – a) is it a good idea to have these as sysprefs (even though the user cannot see them)? b) if it is, then the sysprefs should also be in installer/data/mysql/sysprefs.sql, because otherwise the sysprefs will be missing from new Koha installations.
In the file Koha/SearchEngine/Elasticsearch.pm: s/string on the format/string in the format/
Then we also need a test plan for this code. Like what steps need to be taken to index authorities and biblios and what should be the expected result.
Unit tests (https://wiki.koha-community.org/wiki/Coding_Guidelines#PERL17:_Unit_tests_are_required_.28updated_Apr_26.2C_2017.29): at least decode_record_from_result is missing one
If all the above mentioned things get fixed then I'm probably ready to sign-off but I still need to test this code actually before that.

(In reply to Joonas Kylmälä from comment #197)
> Then we also need a test plan for this code. Like what steps need to be
> taken to index authorities and biblios and what should be the expected
> result.
There was one already in the first commit, so forget this.

(In reply to Joonas Kylmälä from comment #197)
> Thanks for the patch!
>
> Instead of croak or die the exceptions need to be used, e.g.
> Koha::Exceptions::Exception->throw(""). Please do a git diff
> origin/master..HEAD to see all the dies and croaks (assuming HEAD contains
> your patches). And the many for loops I mentioned are in the subroutine
> marc_records_to_documents Koha/SearchEngine/Elasticsearch.pm – though take
> this as an optional thing to fix as I know this bug is blocking quite many
> other things.
>
> Continuing still code review... About
> installer/data/mysql/atomicupdate/
> bug_19893_elasticsearch_index_status_sysprefs.sql – a) is it a good idea to
> have these as a syspref (even though the user cannot see them)? b) if it's a
> good idea then the sysprefs should also be in
> installer/data/mysql/sysprefs.sql because otherwise in a new Koha
> installation the syspref will be missing.
>
> In the file Koha/SearchEngine/Elasticsearch.pm: s/string on the
> format/string in the format/
>
> Then we also need a test plan for this code. Like what steps need to be
> taken to index authorities and biblios and what should be the expected
> result.
>
> Unit tests
> (https://wiki.koha-community.org/wiki/Coding_Guidelines#PERL17:
> _Unit_tests_are_required_.28updated_Apr_26.2C_2017.29): at least
> decode_record_from_result is missing one
>
> If all the above mentioned things get fixed then I'm probably ready to
> sign-off but I still need to test this code actually before that.
For some reason I read it as "die" should be replaced with "croak", so I have not replaced them with exceptions.
I think it's a bit of an anti-pattern to put a block of code into a function that is only invoked in one location, so I would prefer to leave the code as it is.
About the syspref, sadly there is no persistent variable store in Koha that I know of. In other places where this is needed (like Koha version number), it is stored in the syspref-table.
Fixed the grammatical error.
I don't know how much time I have to spend on unit tests. I think there is close to 100% code coverage through the other tests, but I recently checked this with cover and there are some minor execution paths that are never reached, which I could fix. If I were to write unit tests for all the new helper methods added, it could take some time.

(In reply to David Gustafsson from comment #207)
> Created new commit, if should be squashed I can do that if requested.
Seems separate enough from the rest of the non-signed-off commits to justify its own commit. Will you also fix the title "Bug 19893 - Alternative optimized indexing for Elasticsearch", so I can do the final testing and sign off the patches if all goes well?

Created attachment 81957[details][review]Bug 19893: Alternative optimized indexing for Elasticsearch
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>

Created attachment 82143[details][review]Bug 19893: Alternative optimized indexing for Elasticsearch
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>

Thanks for your hard work David and Joonas! I've gone through the changes and tested once more, and I believe we're ready for QA. I did a very minor rebase of mappings.tt, caused by bug 18316, where just another message had been added.

Created attachment 82256[details][review]Bug 19893: Alternative optimized indexing for Elasticsearch
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again, it should be about twice as fast (for a
couple of thousand biblios, with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>

Fantastic effort everyone.. thank you all for the perseverance and continual refinement of these patches and responses to feedback!
I have added the Sponsored-by lines to each of the followup patches and added two very very minor QA followups.
Passing QA, again, well done and thank you to everyone involved!

(In reply to Martin Renvoize from comment #243)
> Fantastic effort everyone.. thank you all for the perseverance and continual
> refinement of these patches and responses to feedback!
>
> I have added the Sponsored-by lines to each of the followup patches and
> added two very very minor QA followups.
>
> Passing QA, again, well done and thank you to everyone involved!
Great! I replied regarding the missing documentation for concatenation in bug 21331
(https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=21331#c7). I have not yet had time to look into it, as we may need to create a new section for mappings.yaml in the online manual. But I agree this should be addressed.

After discussion with several people, I am still considering this one and bug 20244, as ES is not widely used and needs these improvements to be more usable.
I tested what I can see, but can I get a detailed test plan to ensure we are covered here?

Nick, I'm strongly supporting including these, and I'd like to nominate also bug 19365 for consideration.
Briefly put, my test plan has been this:
1. Index records using the master version (and time the process)
2. Index records to another index after applying the patches here (and time the process)
3. Check that the new method is indeed faster after taking into account any caching-related differences.
4. Search for the same records in both indexes directly from Elasticsearch and compare the records to verify that the fields have been correctly indexed.
5. Change the title field rule from 245a to 245(abnp) and reindex.
6. Verify that the title field in the index now contains correctly concatenated subfields.
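For steps 4–6, a hedged sketch of the comparison: in a live setup one would fetch the same record from both indexes straight from Elasticsearch (e.g. `curl http://localhost:9200/<index>/_doc/<id>`; the exact endpoint path and index names depend on the Elasticsearch version and instance configuration) and diff the stored documents. The files below are hypothetical stand-ins for the `_source` of the same biblio before and after changing the title mapping from 245a to 245(abnp):

```shell
# Hypothetical `_source` documents for the same biblio from the old and the
# reindexed index; a non-empty diff flags an indexing difference to inspect.
printf '{"title": ["Some title"]}\n' > old_index.json
printf '{"title": ["Some title : subtitle"]}\n' > new_index.json
diff old_index.json new_index.json || echo "title field changed, as expected after the mapping change"
```

In this case the difference is the expected result of step 5; any *unexpected* field differences between the two indexes would indicate an indexing bug.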