Activity

It replaces the Schema.parse() methods with a more flexible Schema.Parser API. If folks think this new API is reasonable, then we should perhaps switch all of the calls to Schema.parse() to the new API.

Doug Cutting
added a comment - 09/Aug/11 21:50 Here's a patch that fixes this.
It replaces the Schema.parse() methods with a more flexible Schema.Parser API. If folks think this new API is reasonable, then we should perhaps switch all of the calls to Schema.parse() to the new API.
This includes the test case you provided.

Bill Graham
added a comment - 10/Aug/11 04:52 Thanks Doug. I've verified that the idl tool now generates a protocol file. I'm unable to parse this using the Schema.parse(File file) method though. Is that supposed to work, or am I doing it wrong?
I've also verified that this now works:
Schema.Parser parser = new Schema.Parser();
parser.parse(new File("position.avsc"));
Schema playerSchema = parser.parse(new File("player.avsc"));
On the email list we've been discussion alternate APIs ( http://search-hadoop.com/m/wzbMK11aTL82 ). Something like this:
public Schema Schema.parse(File[] files);
public Schema Schema.parse(File[] files, Map<Name, Schema> context);
I propose morphing these two approaches to something that could be used like this:
Schema.Parser parser = new Schema.Parser();
parser.parse(new File("position.avsc"));
parser.parse(new File("player.avsc"));
Schema schema = parser.getSchemaByName("Player");
// or alternatively you can pass multiple files to the parse method once
parser.parse(mySchemaFiles);
Thoughts?

Doug Cutting
added a comment - 10/Aug/11 16:32 I believe the former will work with the current patch: sequential calls using the same parser will accumulate names in the parser. The latter would be an easy addition to the patch, if desired.
To parse a protocol file use Protocol.parse(File). We should perhaps convert that to a Schema.Parser-style API too.

Note that the 'getSchemaByName' method you use in the first case above does not exist. Instead you can either use the value of parser.parse(). If we did want to add a 'getSchemaByName' method it would need to accept a fully-qualified name, e.g., parser.getSchemaByName("org.apache.avro.examples.Player"). Or we could change parser.getTypes() to instead return Map<String,Schema> so you could call parser.getTypes().get("org.apache.avro.examples.Player").

Doug Cutting
added a comment - 10/Aug/11 16:39 Note that the 'getSchemaByName' method you use in the first case above does not exist. Instead you can either use the value of parser.parse(). If we did want to add a 'getSchemaByName' method it would need to accept a fully-qualified name, e.g., parser.getSchemaByName("org.apache.avro.examples.Player"). Or we could change parser.getTypes() to instead return Map<String,Schema> so you could call parser.getTypes().get("org.apache.avro.examples.Player").

Yes, the former parsing approach will work with the current patch. We'd just need to add the convenience method to do

Schema schema = parser.getSchemaByName("Player");

(Or maybe getType aligns more with the existing API?) This would allow the caller to load a number of files and then fetch the schema(s) they're interested in, without necessarily knowing which file contained it. They'd still need to be sure to parse in reverse-dependent order.

Bill Graham
added a comment - 10/Aug/11 17:15 Yes, the former parsing approach will work with the current patch. We'd just need to add the convenience method to do
Schema schema = parser.getSchemaByName("Player");
(Or maybe getType aligns more with the existing API?) This would allow the caller to load a number of files and then fetch the schema(s) they're interested in, without necessarily knowing which file contained it. They'd still need to be sure to parse in reverse-dependent order.
Thanks for the pointer re Protocol parsing, that worked.

My last comment was sent before seeing yours FYI. Sure we can either expose the Map or a method to get a value from the map. Either way works. Exposing the Map might be be better since it provides more flexibility to the caller.

Bill Graham
added a comment - 10/Aug/11 17:19 My last comment was sent before seeing yours FYI. Sure we can either expose the Map or a method to get a value from the map. Either way works. Exposing the Map might be be better since it provides more flexibility to the caller.

Here's a version where getTypes() returns Map<String,Schema>. This implementation not very efficient since it creates new Map entries each time its called. This could be optimized, either by defining and returning a Map implementation wrapping the Names instance, or by converting Names to be a Map<String,Schema> and returning it directly. But I'm not sure that's worth the effort now, since this doesn't seem a likely performance bottleneck.

Doug Cutting
added a comment - 10/Aug/11 18:31 Here's a version where getTypes() returns Map<String,Schema>. This implementation not very efficient since it creates new Map entries each time its called. This could be optimized, either by defining and returning a Map implementation wrapping the Names instance, or by converting Names to be a Map<String,Schema> and returning it directly. But I'm not sure that's worth the effort now, since this doesn't seem a likely performance bottleneck.

I'm beginning to question the motivation for this a bit. The example player.avsc file is not a well-formed JSON schema, since it's not standalone, but rather depends on another .avsc file. To date, we've only consumed or produced standalone JSON, not fragmentary. JSON is meant to be the low-level schema language, with IDL as the higher level language, better supporting manually maintained schemas.

So, before we commit this, I'd like to understand the use case a bit more. Is there a reason one couldn't define the two schemas in a single .avsc, .avpr file, or multiple .avdl files?

We could I suppose change the Schema parser to, when it encounters an undefined name, look for a file defining it, much like the Java compiler looks for a .java file on the CLASSPATH, but I feel such features should be confined to IDL and that JSON should be primarily used for self-contained schemas. Standalone JSON schemas are what we save with files and exchange in RPC handshakes. Currently the API will not permit one to write a non-standalone Schema so I'm a bit reluctant to permit reading them.

Doug Cutting
added a comment - 10/Aug/11 19:06 I'm beginning to question the motivation for this a bit. The example player.avsc file is not a well-formed JSON schema, since it's not standalone, but rather depends on another .avsc file. To date, we've only consumed or produced standalone JSON, not fragmentary. JSON is meant to be the low-level schema language, with IDL as the higher level language, better supporting manually maintained schemas.
So, before we commit this, I'd like to understand the use case a bit more. Is there a reason one couldn't define the two schemas in a single .avsc, .avpr file, or multiple .avdl files?
We could I suppose change the Schema parser to, when it encounters an undefined name, look for a file defining it, much like the Java compiler looks for a .java file on the CLASSPATH, but I feel such features should be confined to IDL and that JSON should be primarily used for self-contained schemas. Standalone JSON schemas are what we save with files and exchange in RPC handshakes. Currently the API will not permit one to write a non-standalone Schema so I'm a bit reluctant to permit reading them.
Do others have thoughts on this?

The last patch works well, thanks. Attached is patch #4 which also has a parse(File[]) method, if we choose to go this route.

The value of being able to parse multiple JSON files into a single schema is that it allows for a more modular approach when creating and managing schema definitions. Without support for this at the JSON level, users will resort to copy and pasting common schemas into much larger and less manageable schema definitions.

It seems like a defacto best-practice is emerging to concat multiple schemas together into a union as a way to partially get around repeatedly in-lining JSON child schemas. This approach gets the job done, but has manageability problems.

This problem can be solved at the IDL level, but that provides yet another level of abstraction, a new language syntax and a compilation step to complicate what would otherwise be a very simple use case.

Regarding consuming/producing fragmentary JSON, with the proposed approach producing JSON fragments will still not occur, since the in-memory schema is always complete, due to the reverse-dependency ordering that is required at parse time (not unlike parsing a union). Also, parsing a JSON fragment will still fail without parsing it's dependancies first so it's not loosening the contract of how parsing is handled in any way.

Bill Graham
added a comment - 10/Aug/11 20:05 The last patch works well, thanks. Attached is patch #4 which also has a parse(File[]) method, if we choose to go this route.
The value of being able to parse multiple JSON files into a single schema is that it allows for a more modular approach when creating and managing schema definitions. Without support for this at the JSON level, users will resort to copy and pasting common schemas into much larger and less manageable schema definitions.
It seems like a defacto best-practice is emerging to concat multiple schemas together into a union as a way to partially get around repeatedly in-lining JSON child schemas. This approach gets the job done, but has manageability problems.
This problem can be solved at the IDL level, but that provides yet another level of abstraction, a new language syntax and a compilation step to complicate what would otherwise be a very simple use case.
Regarding consuming/producing fragmentary JSON, with the proposed approach producing JSON fragments will still not occur, since the in-memory schema is always complete, due to the reverse-dependency ordering that is required at parse time (not unlike parsing a union). Also, parsing a JSON fragment will still fail without parsing it's dependancies first so it's not loosening the contract of how parsing is handled in any way.
I'd also like to hear others thoughts on this though.

Scott Carey
added a comment - 10/Aug/11 20:54 - edited Two things block me from using AvroIDL:
I started maintaining and using Avro schemas before AvroIDL existed, so it is natural for developers on my project to use the JSON form, not AvroIDL.
AvroIDL only supports protocols. I only use schemas.
The former can be overcome, AvroIDL should be easier to use anyway.
The latter needs work, and I haven't looked at what it would take to extend AvroIDL to work with schemas.

If you're generating specific code, then Foo will have the same name as if you defined it in a .avsc file, so no changes to clients should be required.

The only unnecessary part is the protocol name and the fact that an interface is generated with no methods that you'll never use. We could change that, e.g., by changing the idl parser to accept files of the form:

@namespace("foo.bar")
record Foo {
import idl "Bar.avdl";
...
}

The return type for the IDL parser could then be either a schema or a protocol, and clients would need to do different things depending.

Doug Cutting
added a comment - 10/Aug/11 21:35 There's not much overhead to using IDL for schemas: just use an idl file without any messages:
@namespace( "foo.bar" )
protocol MyProtocol {
import idl "Bar.avdl" ;
record Foo {
...
}
}
If you're generating specific code, then Foo will have the same name as if you defined it in a .avsc file, so no changes to clients should be required.
The only unnecessary part is the protocol name and the fact that an interface is generated with no methods that you'll never use. We could change that, e.g., by changing the idl parser to accept files of the form:
@namespace( "foo.bar" )
record Foo {
import idl "Bar.avdl" ;
...
}
The return type for the IDL parser could then be either a schema or a protocol, and clients would need to do different things depending.

It is possible to generate the Protocol at run time directly from an IDL file. That way the Project can be generated and consumed without the extra step to pre-generate an avpr file. Something like this works:

Idl parser = new Idl(idfFile);
Protocol p = parser.CompilationUnit();

The downside is that now the avro-compiler jar would be needed at run time. I agree though that it would be nice to not have to create and use Protocols if you only need Schemas.

@Doug if all your schema objects are defined in your avsc files, then defining a dummy "record Foo" would be needed, right? Could you instead do something like this to load a collection of Schemas instead?:

Bill Graham
added a comment - 10/Aug/11 22:21 It is possible to generate the Protocol at run time directly from an IDL file. That way the Project can be generated and consumed without the extra step to pre-generate an avpr file. Something like this works:
Idl parser = new Idl(idfFile);
Protocol p = parser.CompilationUnit();
The downside is that now the avro-compiler jar would be needed at run time. I agree though that it would be nice to not have to create and use Protocols if you only need Schemas.
@Doug if all your schema objects are defined in your avsc files, then defining a dummy "record Foo" would be needed, right? Could you instead do something like this to load a collection of Schemas instead?:
@namespace("foo.bar")
schema MySchema {
import schema "Bar.avsc";
...
}
Granted, now there's a dummy MySchema thing, but it's not a record.

We'd like to continue to use avsc files because they're easier to read and author and our developers are already familiar with them. They're also not experimental and changing like the IDL language. So we'd just use IDL as a mechanism to combine fragmented avsc files, like the initial problem statement:

Bill Graham
added a comment - 10/Aug/11 23:19 We'd like to continue to use avsc files because they're easier to read and author and our developers are already familiar with them. They're also not experimental and changing like the IDL language. So we'd just use IDL as a mechanism to combine fragmented avsc files, like the initial problem statement:
@namespace("avro.examples.baseball")
protocol Baseball {
import schema "position.avsc";
import schema "player.avsc";
}

We should perhaps remove the 'experimental' declaration from the IDL documentation. I don't think we should make incompatible changes to the syntax of IDL, so we might as well declare it stable. But that's another issue...

Bill, will you use the parse(File[]) method, or would you instead use an IDL file? It's not yet clear to me that method is so common a pattern that it warrants adding here. If we think it's a common pattern then we should at add some javadoc, otherwise we should remove it. Other than that, I'm willing to commit this.

Doug Cutting
added a comment - 11/Aug/11 20:29 We should perhaps remove the 'experimental' declaration from the IDL documentation. I don't think we should make incompatible changes to the syntax of IDL, so we might as well declare it stable. But that's another issue...
Bill, will you use the parse(File[]) method, or would you instead use an IDL file? It's not yet clear to me that method is so common a pattern that it warrants adding here. If we think it's a common pattern then we should at add some javadoc, otherwise we should remove it. Other than that, I'm willing to commit this.

My slight preference would be to leave parse(File[]) out, since I can imagine many use cases that are slightly different, e.g., 'List<Schema> parse(List<File>)' or 'Schema parse(File[])' that returns the last schema, etc. All of these are just a few lines of code that I think is reasonable to leave to applications. On the other hand, if lots of applications are using the same few lines of code, then it makes sense to capture it in a utility, but I don't yet know what the common idioms are here. Meh.

Doug Cutting
added a comment - 12/Aug/11 16:57 My slight preference would be to leave parse(File[]) out, since I can imagine many use cases that are slightly different, e.g., 'List<Schema> parse(List<File>)' or 'Schema parse(File[])' that returns the last schema, etc. All of these are just a few lines of code that I think is reasonable to leave to applications. On the other hand, if lots of applications are using the same few lines of code, then it makes sense to capture it in a utility, but I don't yet know what the common idioms are here. Meh.

Another reason to omit it it to keep other similar APIs (i.e. Protocol) consistant and concise. No need to re-implement Array-based support everywhere else. Also I'm already thinking about contributing a parse(URL) method. Not having to support each type as both single and array signatures will keep the code from bloating.

Bill Graham
added a comment - 12/Aug/11 18:24 +1 on leaving parse(File[]) out.
Another reason to omit it it to keep other similar APIs (i.e. Protocol) consistant and concise. No need to re-implement Array-based support everywhere else. Also I'm already thinking about contributing a parse(URL) method. Not having to support each type as both single and array signatures will keep the code from bloating.