[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: pigstorageschema_7.patch
Fixed javadoc, moved JsonMetadata to experimental.
Serialize schemas for PigStorage() and other storage types.
---
Key: PIG-760
URL: https://issues.apache.org/jira/browse/PIG-760
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: pigstorageschema-2.patch, pigstorageschema.patch,
pigstorageschema_3.patch, pigstorageschema_4.patch, pigstorageschema_5.patch,
pigstorageschema_7.patch,
TEST-org.apache.pig.piggybank.test.TestPigStorageSchema.txt
I'm finding PigStorage() really convenient for storage and data interchange
because it compresses well and imports cleanly into Excel and other analysis
environments.
However, it is a pain to maintain because the columns are in fixed positions
and I'd like to add columns in some cases.
It would be great if PigStorage() could read a default schema from a .schema
file stored with the data on load, and write a .schema file with the data on
store.
I have tested this out and both Hadoop HDFS and Pig in -exectype local mode
will ignore a file called .schema in a directory of part files.
So, for example, if I have a chain of Pig scripts I execute such as:
A = load 'data-1' using PigStorage() as ( a: int , b: int );
store A into 'data-2' using PigStorage();
B = load 'data-2' using PigStorage();
describe B;
describe B should output something like { a: int, b: int }
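For illustration, the side-file idea above can be sketched with plain Java
I/O. Everything here is hypothetical: the class name, the JSON layout, and
the helper methods are mine, not the format any attached patch actually uses;
only the ".schema" file-name convention comes from the request above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch of a ".schema" side file stored next to part files.
public class SchemaSideFile {
    // Store a schema description next to the part files in outputDir.
    static void storeSchema(Path outputDir, String schemaJson) throws IOException {
        Files.createDirectories(outputDir);
        // File names starting with "." or "_" are skipped by Hadoop's default
        // input formats, which is why a .schema file can sit among part files
        // without being read as data.
        Files.write(outputDir.resolve(".schema"),
                schemaJson.getBytes(StandardCharsets.UTF_8));
    }

    // Load the schema back, or null if the directory has no .schema file.
    static String loadSchema(Path outputDir) throws IOException {
        Path f = outputDir.resolve(".schema");
        return Files.exists(f)
                ? new String(Files.readAllBytes(f), StandardCharsets.UTF_8)
                : null;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("data-2");
        storeSchema(dir, "{\"fields\":[{\"name\":\"a\",\"type\":\"int\"},"
                + "{\"name\":\"b\",\"type\":\"int\"}]}");
        System.out.println(loadSchema(dir));
    }
}
```

With something like this in place, the load side could consult the stored
schema so that describe B reports { a: int, b: int } without an explicit AS
clause.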
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Patch Available (was: Open)

Alan Gates updated PIG-760:
---
Resolution: Fixed
Status: Resolved (was: Patch Available)
patch7 checked in. Thanks Dmitriy for your work on this, including being
willing to make several revisions.

Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: pigstorageschema_5.patch

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Open (was: Patch Available)

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Patch Available (was: Open)
Moved the Load/StoreMetadata, ResourceSchema, and ResourceStats classes to
o.a.p.experimental.
Modified the Pig Latin in the unit test to reference PigStorageSchema by its
package (piggybank).

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Open (was: Patch Available)

Alan Gates updated PIG-760:
---
Attachment: TEST-org.apache.pig.piggybank.test.TestPigStorageSchema.txt
When I run the unit tests in piggybank, the new TestPigStorageSchema fails.
I've attached the output of the test.

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Open (was: Patch Available)

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Open (was: Patch Available)

Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: pigstorageschema_3.patch

Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: (was: pigstorageschema_3.patch)

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Patch Available (was: Open)

Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: pigstorageschema_3.patch

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Open (was: Patch Available)

Dmitriy V. Ryaboy updated PIG-760:
--
Fix Version/s: 0.7.0 (was: 0.6.0)
Status: Patch Available (was: Open)
The updated patch moves PigStorageSchema to the piggybank (I feel it needs
proper handling of complex structures before it can be considered a builtin).
Also updated the various interfaces from the Load/Store redesign to match the
latest spec.

Dmitriy V. Ryaboy updated PIG-760:
--
Status: Open (was: Patch Available)

Dmitriy V. Ryaboy updated PIG-760:
--
Fix Version/s: 0.6.0
Status: Patch Available (was: Open)

Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: pigstorageschema-2.patch
New patch to address the findbugs warnings and make the classes a little
nicer to use.
Made internal fields protected, since having them public *and* having
getters/setters didn't really make sense.
Setters now return this, so that they can be chained.
Array setters make a copy of the passed-in array. Getters return the internal
array, so it's still possible to shoot oneself in the foot (as findbugs points
out), but side-effecting those arrays is the intended use case.
Still flat schemas only; I haven't gotten around to wrestling the Jackson
parser on this one. David -- do you need nested schemas?
Submitting as a patch so that Hudson can have a go. Would appreciate code
comments, especially with regard to the interfaces (and the changes I made to
them) from the Load/Store redesign proposal.
We probably want to hold off on committing this until the new interfaces
settle in a bit.
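A minimal sketch of the setter conventions described above: chained setters,
defensive copies on set, and getters that hand back the internal array. The
class and field names here are hypothetical placeholders, not the actual
classes in the patch.

```java
import java.util.Arrays;

// Illustrative schema-like bean following the conventions in the comment above.
public class FieldSchemaSketch {
    protected String name;
    protected String[] fieldNames = new String[0];

    // Setters return `this` so calls can be chained.
    public FieldSchemaSketch setName(String name) {
        this.name = name;
        return this;
    }

    // Array setters copy the argument, so later mutation of the caller's
    // array cannot silently change the schema.
    public FieldSchemaSketch setFieldNames(String[] names) {
        this.fieldNames = Arrays.copyOf(names, names.length);
        return this;
    }

    // Getters return the internal array directly (the findbugs complaint):
    // mutating the result side-effects the schema, which is intended here.
    public String[] getFieldNames() {
        return fieldNames;
    }

    public static void main(String[] args) {
        String[] cols = {"a", "b"};
        FieldSchemaSketch s = new FieldSchemaSketch()
                .setName("row")
                .setFieldNames(cols);
        cols[0] = "mutated";                      // does not affect the schema
        System.out.println(s.getFieldNames()[0]); // still "a"
    }
}
```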

Dmitriy V. Ryaboy updated PIG-760:
--
Attachment: pigstorageschema.patch
I am attaching a preliminary patch for this issue.
It implements a new Load/StoreFunc, PigStorageSchema, that inherits from
PigStorage and serializes the schema to JSON; currently it works only for flat
schemas (a JSON parser limitation that can probably be overcome with a bit of
elbow grease). It also works only in MR mode, due to limitations of the
StoreFunc interface (in local mode there is no way I am aware of to get the
directory name you are writing to from the StoreFunc -- in MR mode I can get
it from the JobConf).
It also writes the headers as described above, but at the moment does not
provide convenient constructors (like the ones David suggested) to turn the
functionality on and off.
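A toy illustration (Python, with a hypothetical field layout) of why flat
schemas are the easy case: each field serializes to a simple name/type pair,
while a nested field would require recursive handling that the current patch
does not attempt.

```python
import json

def serialize_flat(fields):
    """Serialize a flat schema to JSON; reject nested fields, mirroring
    the flat-schema-only limitation described above. The JSON layout is
    hypothetical, not Pig's actual format."""
    out = []
    for name, ftype in fields:
        if isinstance(ftype, list):  # nested schema, e.g. a tuple of fields
            raise ValueError("nested schemas not supported: %s" % name)
        out.append({"name": name, "type": ftype})
    return json.dumps({"fields": out})

print(serialize_flat([("a", "int"), ("b", "int")]))
```

Supporting nesting would amount to recursing into the field list wherever a
type is itself a schema, at the cost of a more capable parser on the load side.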
Implementation notes:
I chose Jackson for JSON parsing because that's what Avro uses, so once Avro is
used in Pig, we won't have two parsers that do the same thing.
I didn't modify the zip targets in build.xml to package the Jackson libs, so
if you want to use PigStorageSchema, you will need to register
build/ivy/lib/Pig/jackson-mapper-asl-1.0.1.jar and
build/ivy/lib/Pig/jackson-core-asl-1.0.1.jar.
This patch also uses a number of the interfaces (MetadataLoader/Writer,
ResourceStatistics, ResourceSchema) from the Load/Store redesign proposal. I
simply dumped them into org.apache.pig -- we may want to come up with a more
appropriate package.
As expected, implementing this raised a number of issues with the interfaces as
proposed, most notably the need for getters and setters in order to enable Java
tools that work with POJOs to interact with these interfaces.
I indulged in some Class.cast trickery in DataType to avoid large swaths of
copy-and-paste code. Despite what the patch appears to say, the changes to
determineFieldSchema are fairly minimal; I just made it work on Objects and
ResourceFieldSchemas at the same time.

[
https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitriy V. Ryaboy updated PIG-760:
--
Status: Patch Available (was: Open)