Description

Currently it is possible in Lucene to store data as stored fields or as payloads.
Stored fields provide good performance if you want to load all fields for one
document, because this is a sequential I/O operation.

If, however, you want to load the data from one field for a large number of
documents, then stored fields perform quite badly, because lots of I/O seeks
might have to be performed.

A better way to do this is to use payloads. By creating a "special" posting list
that has one posting with a payload for each document, you can "simulate" a
column-stride field. The performance is significantly better compared to stored
fields, but still not optimal. The reason is that for each document the freq
value, which in this particular case is always 1, has to be decoded, and one
position value, which is always 0, has to be loaded.
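As a rough illustrative cost model of the two points above (this is not Lucene code; all names are invented), reading one field for n scattered documents from stored fields costs on the order of one seek per document, while the payload workaround is contiguous but still decodes a constant freq and position per document:

```java
// Illustrative cost model only (not Lucene code; all names invented).
public class CsfMotivationSketch {

    // Stored fields ("row" layout): each document's fields sit together on
    // disk, so reading ONE field from n scattered documents costs roughly
    // one seek per document.
    public static int seeksWithStoredFields(int numDocs) {
        return numDocs;
    }

    // A payload posting list is contiguous (one seek total), but still
    // decodes a freq (always 1) and a position (always 0) per document:
    // two decode steps per doc that carry no information.
    public static int wastedDecodesWithPayloads(int numDocs) {
        return 2 * numDocs;
    }

    // A true column-stride field: one seek, and no freq/position decoding.
    public static int wastedDecodesWithCsf(int numDocs) {
        return 0;
    }
}
```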

As a solution we want to add real column-stride fields to Lucene. A possible
format for the new data structure could look like this (CSD stands for column-
stride data; once we decide on a final name for this feature we can change
this):

We distinguish here between the fixed-length and the variable-length cases. To
allow flexibility, Lucene could automatically pick the "right" data structure.
This could work as follows: when the DocumentsWriter writes a segment, it checks
whether all values of a field have the same length. If yes, it stores them as a
FixedLengthList; if not, then as a VariableLengthList. When the SegmentMerger
merges two or more segments, it checks whether all segments have a
FixedLengthList with the same length for a column-stride field. If not, it
writes a VariableLengthList to the new segment.
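The writer-side and merge-side decisions described above can be sketched in plain Java (class and method names here are illustrative, not Lucene's actual API):

```java
import java.util.List;

// Illustrative sketch of the layout decision described above; names invented.
public class CsfLayoutSketch {

    public enum Layout { FIXED_LENGTH_LIST, VARIABLE_LENGTH_LIST }

    // DocumentsWriter-side check: if every value of the field has the same
    // length, a FixedLengthList suffices; otherwise fall back to variable.
    public static Layout pickLayout(List<byte[]> fieldValues) {
        int firstLen = fieldValues.get(0).length;
        for (byte[] value : fieldValues) {
            if (value.length != firstLen) {
                return Layout.VARIABLE_LENGTH_LIST;
            }
        }
        return Layout.FIXED_LENGTH_LIST;
    }

    // SegmentMerger-side check: the merged segment can stay fixed-length only
    // if every input segment is fixed-length with the same value length.
    public static Layout pickMergedLayout(List<Layout> layouts, List<Integer> lengths) {
        int firstLen = lengths.get(0);
        for (int i = 0; i < layouts.size(); i++) {
            if (layouts.get(i) != Layout.FIXED_LENGTH_LIST || lengths.get(i) != firstLen) {
                return Layout.VARIABLE_LENGTH_LIST;
            }
        }
        return Layout.FIXED_LENGTH_LIST;
    }
}
```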

Once this feature is implemented, we should think about making the column-
stride fields updateable, similar to the norms. This will be a very powerful
feature that can for example be used for low-latency tagging of documents.

Activity

How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?

Doug Cutting
added a comment - 24/Mar/08 20:28

> How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?

I think this would be very similar, except maybe:

This proposal would allow for optional non-sparse, fixed-length storage (ie, don't include the docID since all docs have a payload, and the payload is always the same length). EG norms are like this.

[From the thread linked above] would allow for binary storage of field values. EG for int fields you would store the 4 bytes per value, and populating the cache would be much faster than the FieldCache now (which must re-parse Strings -> ints, and must walk the terms to "reconstruct" the forward index).

This proposal may allow for updating these values, like we can do with norms today. Maybe this can only work if the field is non-sparse and perhaps if you've loaded it into the FieldCache?

This would tie into LUCENE-831, so that you could load these fields entirely in RAM, incrementally update them from a reopen, etc.

Michael McCandless
added a comment - 26/Mar/08 20:31

So there are a number of features these fields would have that differ from other fields:

no freq

no positions

non-sparse representation

binary values (is this different from payloads?)

updateable

My question is whether it is best to bundle these together as a new kind of field, or to add these as optional features of ordinary fields, or some combination. There are certain bundles that may work well together: e.g., a dense array of fixed-size, updateable binary values w/o freqs or positions. And not all combinations may be sensible or easy to implement. But most of these would also be useful à la carte too, e.g., no-freqs, no-positions, and (perhaps) updateable.

BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.

Doug Cutting
added a comment - 26/Mar/08 21:51


Michael McCandless
added a comment - 26/Mar/08 22:44

Sorry, you're right: the payload is the binary data.

> So there are a number of features these fields would have that differ from other fields:

Maybe add "stored in its own file" or some such to that list. Ie to efficiently update field X I would think you want it stored in its own file. We would then fully write a new generation of that file whenever it had changes.

I agree it would be great to implement this as "flexible indexing", such that these are simply à-la-carte options on how the field is indexed, rather than make a new specialized kind of field that just does one of these "combinations". But I haven't wrapped my brain around what all this will entail... it's a biggie!

> BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.

I like that!

While working on this I got kind of stuck on the ingestion side. I
think we need to make FieldInfos and Fields more flexible. If I
introduce a column-stride field, then I also need to add something to
FieldInfos. So I could add another bit indicating a column-stride
field. But the other bits don't make much sense in combination with
this.

Eventually we need more flexibility to utilize the flexible indexing
chain anyway. We need to store which codec to use for a field. Then we could
also just make a new codec for column-stride fields and maybe then we
do not have to introduce a new Field API.

So I think we should try for 3.0 to overhaul the
document/field/fieldinfos APIs. I have some ideas which I started
hacking during a long flight. I'll try to summarize the ideas/goals
I'd have for such a new API and send it to java-dev.

Michael Busch
added a comment - 24/Mar/09 19:13


Michael Busch
added a comment - 04/Apr/09 04:34

For the search side we need an API similar to TermDocs and Payloads; let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have next(), skipTo(), doc(), etc. methods.

However, the way TermPositions#getPayloads() currently works is that it always forces you to copy the bytes from the underlying IndexInput into the payload byte[] array. Since we usually use a BufferedIndexInput, this is then an arraycopy from BufferedIndexInput's buffer array into the byte array.

I think to improve this we could allow users to call methods like readVInt() directly on the CSFA. So I was thinking about adding DataInput and DataOutput as superclasses of IndexInput and IndexOutput. DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc.

So then CSFA would extend DataInput or alternatively have a getDataInput() method. The danger here compared to the current payloads API would be that the user might read too few or too many bytes of a CSF, which would result in undefined and possibly hard to debug behavior. But we could offer e.g.:

static ColumnStrideFieldsAccessor getAccessor(ColumnStrideFieldsAccessor in, Mode mode) {
  if (mode == Mode.Fast) {
    return in;
  } else if (mode == Mode.Safe) {
    return new SafeAccessor(in);
  }
  throw new IllegalArgumentException("unknown mode: " + mode);
}

The SafeAccessor would count for you the number of bytes read and throw exceptions if you don't consume the number of bytes you should consume. This is of course overhead, but users could use the SafeAccessor until they're confident that everything works fine in their system, and then switch to the fast accessor for better performance.

If there are no objections I will open a separate JIRA issue for the DataInput/Output patch.
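The Safe-vs-Fast accessor split described above could look roughly like this as a self-contained sketch (all class names are hypothetical, the Mode enum is simplified to a boolean, and this is not the actual Lucene API):

```java
// Hypothetical sketch of the Safe vs Fast accessor idea; none of these
// classes exist in Lucene.
public class AccessorSketch {

    // Minimal stand-in for the proposed ColumnStrideFieldAccessor read API.
    public interface Accessor {
        int readVInt();
        void startValue(int expectedBytes); // called before reading one doc's value
        void endValue();                    // called after; safe mode verifies here
    }

    // Fast mode: no bookkeeping at all.
    public static class FastAccessor implements Accessor {
        private final byte[] data;
        protected int pos;
        public FastAccessor(byte[] data) { this.data = data; }
        public int readVInt() {            // standard VInt decoding
            int b = data[pos++], value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = data[pos++];
                value |= (b & 0x7F) << shift;
            }
            return value;
        }
        public void startValue(int expectedBytes) {}
        public void endValue() {}
    }

    // Safe mode: counts consumed bytes and throws if the caller read too few
    // or too many bytes for the current value.
    public static class SafeAccessor extends FastAccessor {
        private int start, expected;
        public SafeAccessor(byte[] data) { super(data); }
        public void startValue(int expectedBytes) { start = pos; expected = expectedBytes; }
        public void endValue() {
            int consumed = pos - start;
            if (consumed != expected) {
                throw new IllegalStateException("read " + consumed + " bytes, expected " + expected);
            }
        }
    }

    public static Accessor getAccessor(byte[] data, boolean safe) {
        return safe ? new SafeAccessor(data) : new FastAccessor(data);
    }
}
```

The point of the wrapper is exactly what the comment proposes: the checking lives in a separate class, so fast mode pays no per-read overhead at all.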


Michael McCandless
added a comment - 04/Apr/09 13:13

> Eventually we need more flexibility to utilize the flexible indexing chain anyway. We need to store which codec to use for a field. Then we could also just make a new codec for column-stride fields and maybe then we do not have to introduce a new Field API.

By creating a custom indexing chain you could actually write CSF today.

But the lack of extensibility of Field needs to be addressed: you need some way to store something arbitrary & opaque into a field such that your indexing chain could pick it up and act.

And FieldInfos also needs a "store this opaque thing for me" API.

One of the big changes in LUCENE-1458 is to strongly separate different fields on the read APIs. EG there is a separate FieldsEnum from TermsEnum, meaning you first seek to the field you want, then request a TermsEnum from that, which can iterate through the terms only for that field. It's the codec's job to return the right TermsEnum for a given field.

Not to delay 2.9 further, but... I also wonder, if Lucene had NumericField (say), how it would simplify things here. EG, today, if I have a field "weight" that is a float, I'm going to have to set something to tell the CSF (man, the similarity of that to CFS is going to cause problems!) writer to cast-it-and-save-it-as-float-array to disk; I'm going to have to tell the TrieRangeUtil to do the same, etc. It'd be much better if that field stored a float (not a String), and if it defaulted "naturally" to using these two special indexers...

> DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc.

What is the fastest way in Java to slurp in a bunch of bytes as an int[], short[], float[], etc.? Seems that we need to answer that first and then work out how to fix our store APIs to enable that. (Maybe it's IntBuffer wrapping ByteBuffer, instead of an int[]?)

> The danger here compared to the current payloads API would be that the user might read too few or too many bytes of a CSF, which would result in undefined and possibly hard to debug behavior.

I think it's better to have good performance with an added risk of danger than forced handholding always.

> The SafeAccessor would count for you the number of bytes read and throw exceptions if you don't consume the number of bytes you should consume.

I generally prefer liberal use of asserts to trip bugs like this, instead of explicit strongly divorced code paths / classes / modes etc., containing real if statements at production runtime.
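The IntBuffer-wrapping-ByteBuffer idea mentioned above can be sketched with plain java.nio (the class and method names here are invented for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Sketch of slurping raw bytes (e.g. a fixed-width 4-bytes-per-doc file)
// into int values without parsing Strings: wrap once, then bulk-transfer.
public class SlurpSketch {

    public static int[] bytesToInts(byte[] raw) {
        IntBuffer ints = ByteBuffer.wrap(raw)
                                   .order(ByteOrder.BIG_ENDIAN) // match the on-disk byte order
                                   .asIntBuffer();
        int[] values = new int[ints.remaining()];
        ints.get(values); // one bulk transfer, no per-value object churn
        return values;
    }
}
```

Whether the final array copy is even needed, or whether callers should work against the IntBuffer view directly, is exactly the open question the comment raises.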

Michael McCandless
added a comment - 08/Apr/09 00:16

One interesting idea, from Earwin's comment here:
https://issues.apache.org/jira/browse/LUCENE-1584?focusedCommentId=12696583&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12696583
is for column-stride fields to be a possible source for IndexReader.document() (or whatever it becomes). Ie, when possible, that method should maybe pull from CSFs for values.

> is for column-stride fields to be a possible source for IndexReader.document() (or whatever it becomes). Ie, when possible, that method should maybe pull from CSFs for values.

Hmm, I'm not sure about this. The Lucene store is optimized for the case when ALL stored fields for a document need to be loaded. E.g. to render search results you usually need to load multiple fields. Then only one disk seek is needed to get to the right document. Loading the fields is then just one sequential I/O.

If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc using stored fields, but will need 5 seeks with CSFs.

My general rule of thumb is to use payloads/CSFs for data you need during hit collection (e.g. a UID field, or the norms, which are essentially a CSF with fixed size), but ordinary stored fields to load data for rendering search results.

Michael Busch
added a comment - 08/Apr/09 04:48

> If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc using stored fields, but will need 5 seeks with CSFs.

If the CSFs have been loaded into the FieldCache then it's fast. If they haven't, I agree (one seek per doc).

Anyway this would just be "sugar" (a single 'retrieve doc' API that'd pull from the FieldCache & from stored fields, and merge). EG if I store body text as a stored field, but then have a bunch of small "metadata" fields (price, weight, manufacturer, etc.) stored in the FieldCache (loaded via CSF), it'd be convenient to call document() somewhere and get all fields back.

Michael McCandless
added a comment - 08/Apr/09 11:18


Earwin Burrfoot
added a comment - 08/Apr/09 11:58

I can share my design for doc loading, if anybody needs it:

public interface FieldCache {
  DocLoader loader(FieldInfo<?>... fields);
  ....
}

public interface DocLoader {
  void load(Doc doc);
  <T> T value(FieldInfo<T> field);
}

Doc is my analog of ScoreDoc; for these purposes it is completely identical.

FieldInfos are constants defined like UserFields.EMAIL; they hold the type for the field, its name, indexing method, whether it is cached or not, and the way it is cached. Two synthetic fields exist - LUCENE_ID and SCORE; they allow using the same API for anything field-related.

Workflow looks like this:

// I create a loader. Fields are checked against the cache; for those that aren't cached I create a FieldSelector
loader = searcher.fieldCache().loader(concat(payloadFields, ID, DOCUMENT_TYPE, sortBy.field));

// Then for each document I'm going to send in response to a search request, I select this document.
// An indexReader.document(fieldSelector) happens here if there are any uncached fields.
loader.load(doc);

// Then I extract the values I need. Cached ones arrive from the cache; uncached ones are decoded from the Document retrieved in the previous step.
hit = new Hit(loader.value(ID), loader.value(DOCUMENT_TYPE), loader.value(sortBy.field)) // etc, etc

Having a single API to retrieve values regardless of the way they are stored/cached is very handy. Loading a mix of stored/column-stride (if I correctly understand what they are) fields is pointless; you're more likely to lose performance than to gain it. Loading a mix of cached/uncached fields is a massive win, and it becomes even more massive if all required fields happen to be cached.
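A toy, self-contained version of this loader pattern could look like the following (backed by an in-memory map; the interface names mirror the comment above but the whole implementation is invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the DocLoader pattern: load() selects one document, value()
// then reads fields uniformly regardless of where they would really live
// (cache vs stored fields) in a full implementation.
public class DocLoaderSketch {

    public static class FieldInfo<T> {
        final String name;
        public FieldInfo(String name) { this.name = name; }
    }

    public static class Doc {
        final int id;
        public Doc(int id) { this.id = id; }
    }

    // Stand-in "cache": docId -> (field name -> value).
    private final Map<Integer, Map<String, Object>> cache = new HashMap<>();
    private Map<String, Object> current = Map.of();

    public void put(int docId, String field, Object value) {
        cache.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
    }

    public void load(Doc doc) {
        current = cache.getOrDefault(doc.id, Map.of());
    }

    @SuppressWarnings("unchecked")
    public <T> T value(FieldInfo<T> field) {
        return (T) current.get(field.name);
    }
}
```

The typed FieldInfo<T> constants are what make the single retrieval API work: the caller never needs to know, per field, whether the value came from a cache or was decoded from a stored Document.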

FWIW, I think priority for document fetching should be to optimize for search
clusters where you have separate index servers and document/excerpt servers.
Under such a system, it's clear that you'd want to use the standard stored
fields.

To my mind, column stride fields are more of a search tool – an extension of
the field caches and the "flexible indexing" concept – and I don't think the
design should be encumbered or deployment slowed so that they can perform
double duty.

However, I think column-stride fields may be more of an imperative for Lucy
than Lucene. In Lucy, our sort caches will be either column-stride or
variable width and will be written out at index-time – mmap a
one-value-per-doc column-stride file, and voila: instant sort cache.

> Loading a mix of cached/uncached fields is massive win

I don't quite understand why that would be the case. Presuming a cold OS
cache, the big cost is the .fdt file disk seek. Once you're there, how much
of a difference is it to read the field off disk as opposed to reading it from
the cache?

> it becomes even more massive if all required fields happen to be cached.

I can see that. However, the gain won't be all that significant for systems
where the index fits into RAM, or when the persistent storage device is an
SSD. And of course a different caching strategy altogether (popular document
caching) is best for dedicated doc servers.

Marvin Humphrey
added a comment - 08/Apr/09 17:07


Michael McCandless
added a comment - 09/Apr/09 11:24

> To my mind, column stride fields are more of a search tool - an extension of the field caches and the "flexible indexing" concept

I don't think of CSFs as an extension to FieldCache; I think of them as an alternate (much more efficient than uninversion) underlying store that FieldCache can use to retrieve values. Ie the way you'll access a CSF in Lucene will be through the [new in LUCENE-831] FieldCache API.

> However, I think column-stride fields may be more of an imperative for Lucy than Lucene. In Lucy, our sort caches will be either column-stride or variable width and will be written out at index-time - mmap a one-value-per-doc column-stride file, and voila: instant sort cache.

I think column vs row stride, and fixed vs variable width storage, are orthogonal issues? EG column-stride fields (in Lucene) could very well use variable-width storage (eg for storing String values).

> I can see that. However, the gain won't be all that significant for systems where the index fits into RAM, or when the persistent storage device is an SSD.

SSDs, as fast as they are, are still orders of magnitude slower than RAM (assuming, of course, the OS hasn't swapped your RAM out to your SSD).