Add export/import facilities to the hive system

Details

Description

This is an enhancement request to add export/import features to hive.

With this language extension, the user can export the data of the table - which may be located in different hdfs locations in the case of a partitioned table - as well as the metadata of the table into a specified output location. This output location can then be moved over to a different hadoop/hive instance and imported there.

This should work independently of the source and target metastore dbms used; for instance, between derby and mysql.

For partitioned tables, the ability to export/import a subset of the partitions must be supported.

Edward Capriolo
added a comment - 19/Jan/11 15:08 Can we do this without the following change?
- public Partition(Table tbl, Map<String, String> partSpec, Path location)
+ public Partition(Table tbl, Map<String, String> partSpec, Path location, Map<String, String> partParams)
I would like to see some form of API stability in the metastore. This is very disruptive for programs that work against the metastore, of which I internally have several.

Krishna Kumar
added a comment - 20/Jan/11 10:39 Ok. Will take care of this via a delegating ctor.
A process question: I guess I should wait for more comments from other reviewers before I create another patch, in case others are reviewing the current patch?

Edward Capriolo
added a comment - 20/Jan/11 19:31 I was not implying that we should definitely not add
, Map<String, String> partParams
. What I am asking is: what is the rationale for doing it? I think we should not need to add things to the metastore to export its information, but I might be missing something.

Krishna Kumar
added a comment - 21/Jan/11 04:48 Why export/import needs this change: It is not the export part, but rather the import part, which needs this change. While creating a partition as part of an import, we need to be able to create the partition along with its ancillary data, including partition parameters. But the first part of the existing "create partition" flow (AddPartitionDesc -> DDLTask.addPartition -> Hive.createPartition) did not support partition params specification, while the second part (metastore.api.Partition -> IMetaStoreClient.add_partition -> HiveMetaStore.HMSHandler.add_partition -> ObjectStore.addPartition) does. So I added the ability to pass the partition parameters along in the first part of the flow.
In terms of options for compatible changes, there are two I can see:
1. The solution suggested above. Add an additional ctor so that no existing code breaks.
public Partition(Table tbl, Map<String, String> partSpec, Path location) {
    this(tbl, partSpec, location, null);
}

public Partition(Table tbl, Map<String, String> partSpec, Path location, Map<String, String> partParams) { ... }
2. Have only the current ctor but in Hive.createPartition get the underlying metastore.api.Partition and set the parameters to it before passing it on to the metastoreClient.
Thoughts?
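Option 2 above can be sketched roughly as follows. The classes here are simplified stand-ins, not the actual org.apache.hadoop.hive.metastore.api.Partition or IMetaStoreClient API; the point is only that the parameters are set on the underlying metastore-level object before it is handed to the client, so no public ctor signature has to change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the thrift-level partition object; the real
// class (metastore.api.Partition) already carries a parameters map.
class TPart {
    Map<String, String> parameters = new HashMap<>();
}

class CreatePartitionSketch {
    // Option 2: keep the existing ql.metadata.Partition ctor untouched, and
    // in Hive.createPartition copy the ancillary parameters onto the
    // underlying metastore-level partition just before add_partition.
    static TPart withParams(TPart tpart, Map<String, String> partParams) {
        if (partParams != null) {
            tpart.parameters.putAll(partParams);
        }
        return tpart; // then: metaStoreClient.add_partition(tpart)
    }
}
```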

Edward Capriolo
added a comment - 21/Jan/11 21:40 If Partition currently has only one constructor, I do not think we should add a second constructor - just change it and break the old compatibility. If we do add partParams, we should have DDL statements to view and modify these. For a long time there were many things that were added in create table, such as InputFormat or OutputFormat, and we had no way to view or modify these besides going directly to the database via JDBC and changing them. I am not sure if this is the case here, but I do not think we should be adding fields that are not completely manageable.

Carl Steinbach
added a comment - 21/Jan/11 22:47 @Krishna: I tried applying the patch to trunk and running the tests. Noticed a couple things upfront that need to be fixed:
The patch doesn't apply cleanly with 'patch -p0'. To satisfy this using a Git repo you need to generate the patch using 'git diff --no-prefix ...'
Most of the new exim* tests fail with diffs. Can you please fix this and update the patch?
'hive.test.exim' needs to be added to HiveConf and hive-default.xml. There is already a 'hive.test.mode.*' namespace defined in HiveConf, so you should probably follow this convention and change the name to 'hive.test.mode.exim'. I think an even better solution would be to instead define a new conf property called 'hive.exim.uri.scheme.whitelist' and make it a comma separated list of acceptable URI schemes for import and export.
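The whitelist Carl suggests could look roughly like this. The property name 'hive.exim.uri.scheme.whitelist' comes from the comment above, but the parsing and validation code is an illustrative sketch, not the actual Hive implementation.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: validate an export/import target URI against a comma-separated
// whitelist of acceptable schemes, as suggested for
// 'hive.exim.uri.scheme.whitelist'.
class SchemeWhitelist {
    static boolean isSchemeAllowed(String whitelist, String uriString) {
        // Split the configured value into the set of acceptable schemes.
        Set<String> allowed = new HashSet<>(Arrays.asList(whitelist.split(",")));
        String scheme = URI.create(uriString).getScheme();
        return scheme != null && allowed.contains(scheme);
    }
}
```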

Namit Jain
added a comment - 24/Jan/11 06:02 A couple of issues:
1. Can you add a new test directory - I mean, exporting in /tmp/.. means that there will be problems in concurrent tests on the same machine.
2. Do you want to support errors at import time - I mean, what happens if one of the rows is bad - should I have an option to specify the number of rows to ignore, and dump them somewhere else?

Krishna Kumar
added a comment - 24/Jan/11 08:26 @Edward: Both the existing data model (prettified er diagram attached) and the object model (class org.apache.hadoop.hive.metastore.api.Partition) allow the specification of parameters on a per-partition basis. So I am not adding new fields to either of these models. By proposal 2 above, I will not be adding any ctor parameters to org.apache.hadoop.hive.ql.metadata.Partition as well.
Your point re providing manageability via ddl statements to all aspects of the data/object model is taken. But I am not adding new aspects to either model, so if indeed we need to address current manageability gaps, should they not be addressed via another enhancement request, rather than this one, which aims simply to add export/import facilities?

Krishna Kumar
added a comment - 24/Jan/11 08:31 @Carl:
1. Taken care of in the new patch.
2. Can you post some of the diffs for the failures you get? I had a problem with running the tests on nfs-mounted directories. That had to do with an existing bug in the load functionality, which used to result in a "MetaException: could not delete dir" error while trying to clean up the effects of the previous test. I have created a separate jira HIVE-1924 for this and have attached a patch.
3. I have taken the whitelist approach; the whitelist is now set to "hdfs,pfile".

Krishna Kumar
added a comment - 24/Jan/11 08:52 @Namit:
1. Do you have any ideas re how we can get a unique, temporary directory name for use in the test script files? In code, of course, we can use the getScratchDir methods, but how do we solve this problem in these test scripts?
2. Export/Import, as in the case of Load, operates at file level rather than at record level. So there are no record-level filters available.

Edward Capriolo
added a comment - 24/Jan/11 15:45
But I am not adding new aspects to either model, so if indeed we need to address current manageability gaps, should they not be addressed via another enhancement request, rather than this one, which aims simply to add export/import facilities?
This depends on the lag time between the feature getting added and the enhancement being added. With wishful thinking the two events will be close, but history does not always agree. Management becomes more important as different people begin using the metastore for different purposes. I believe that if we add any feature, support to manage it completely with DDL is mandatory. Currently the metastore is changing for security, import/export, and indexes; if each of these features is only half manageable at time X, this will make releases awkward and half functional.

Namit Jain
added a comment - 24/Jan/11 16:59 @Krishna
1. How is it different from running multiple tests currently? The output files are generated in the build/ql/.... directory under the hive installation. Similarly, you can create a new directory under build. This should definitely be done in this patch itself.
2. Load does not do any checking - nor does it perform any metadata operation. The table already exists, and load is merely a helper function. Since import touches both data and metadata, it makes sense to validate the data (at least optionally) - but that can always be done in a follow-up.
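Namit's suggestion of exporting under the build tree instead of /tmp can be sketched as below. The "ql/test/data/exports" layout and the per-test subdirectory are illustrative assumptions, not the exact paths the patch uses.

```java
import java.io.File;

// Sketch: derive a per-test export directory under the build tree, so
// concurrent test runs on the same machine do not collide in /tmp.
class TestExportDirSketch {
    static File testExportDir(String buildDir, String testName) {
        // Hypothetical layout: <buildDir>/ql/test/data/exports/<testName>
        File dir = new File(new File(buildDir, "ql/test/data/exports"), testName);
        dir.mkdirs(); // each test owns its own directory
        return dir;
    }
}
```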

Namit Jain
added a comment - 07/Feb/11 20:29 Still looking:
1. Move export/import stuff to a new directory/file.
All the changes to HiveUtils should be in EximUtil.
2. DDLTask.java - why are the existing APIs changing?
The default behavior for createPartition() should not change
3. inputs not populated for import, and outputs for export.
This can break some dependent stuff: authorization, concurrency etc.

Krishna Kumar
added a comment - 08/Feb/11 16:36 Thanks, Namit, for the comments.
1. Ok re moving the serialization/deserialization methods to EximUtil, but I did not understand the first part. Are you suggesting moving EximUtil, ImportSemanticAnalyzer and ExportSemanticAnalyzer to a new package? That does not seem warranted; today all parsing/semantic analysis classes are in the o.a.h.h.ql.parse package...
2. You mean Hive.java's API? The existing first createPartition remains as it is; the second createPartition, used in DDLTask, is changing to allow the creation of a partition with all the partition-specific configurations. Since AddPartitionDesc is initialized with nulls/-1 for these extra parameters, the existing behaviour is not altered.
3. Can you expand a little? What are inputs/outputs (classes?, tables?) - if they are part of the existing object model/data model, I think they are exported and imported.

Namit Jain
added a comment - 09/Feb/11 06:40 1. Just moving the serialization/deserialization methods
3. inputs and outputs are present in SemanticAnalyzer -
Look at ReadEntity/WriteEntity - they are populated by the appropriate semantic analyzer and then used later by:
a. concurrency: uses inputs/outputs to lock those objects
b. authorization: uses inputs/outputs for permission checking
c. execution hooks: they can be used for a variety of things; in facebook, we use them for replication
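The flow Namit describes can be sketched as follows. ReadEntity and WriteEntity are real Hive classes, but the minimal versions below are illustrative stand-ins, and the analyzer is a sketch, not the actual ExportSemanticAnalyzer.

```java
import java.util.HashSet;
import java.util.Set;

// Stand-ins for Hive's ReadEntity/WriteEntity, which identify the
// objects a statement reads and writes.
class ReadEntity {
    final String name;
    ReadEntity(String name) { this.name = name; }
}

class WriteEntity {
    final String name;
    WriteEntity(String name) { this.name = name; }
}

class ExportAnalyzerSketch {
    // Populated at compile time; later consumed by concurrency (locking),
    // authorization (permission checks), and execution hooks (e.g. replication).
    final Set<ReadEntity> inputs = new HashSet<>();
    final Set<WriteEntity> outputs = new HashSet<>();

    // For EXPORT: the exported table is read, the target location written.
    void analyzeExport(String tableName, String targetPath) {
        inputs.add(new ReadEntity(tableName));
        outputs.add(new WriteEntity(targetPath));
    }
}
```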

Namit Jain
added a comment - 09/Feb/11 06:54 Reading from:
http://download.oracle.com/docs/cd/B10500_01/server.920/a96652/ch02.htm#1005081
Importing into Existing Tables
This section describes factors to take into account when you import data into existing tables.
Manually Creating Tables Before Importing Data
When you choose to create tables manually before importing data into them from an export file, you should use either the same table definition previously used or a compatible format. For example, although you can increase the width of columns and change their order, you cannot do the following:
Add NOT NULL columns
Change the datatype of a column to an incompatible datatype (LONG to NUMBER, for example)
Change the definition of object types used in a table
Change DEFAULT column values
Note:
When tables are manually created before data is imported, the CREATE TABLE statement in the export dump file will fail because the table already exists. To avoid this failure and continue loading data into the table, set the import parameter IGNORE=y. Otherwise, no data will be loaded into the table because of the table creation error.
Do you want to support this? Seems like a reasonable thing to have - currently, an error is thrown during import if the table already exists.

Krishna Kumar
added a comment - 09/Feb/11 08:03 Importing into existing tables is now supported, but the checks (to see whether the imported table and the target table are compatible) have been kept fairly simple for now. Please see ImportSemanticAnalyzer.checkTable. The schemas (column and partition) of the two should match exactly, except for comments. Since we are just moving files (rather than rewriting records), I think there will be issues if the metadata schema does not match the data serialization exactly (in terms of types, number, etc.).
Re the earlier comment on outputs/inputs, I got what you meant. I will add the table/partition to the inputs in ExportSemanticAnalyzer. But in the case of imports, I see that the tasks themselves add the entity operated upon to the inputs/outputs list. Isn't that too late for authorization/concurrency, even though it may work for replication? Or are both the semantic analyzers and the tasks expected to add them? In the case of a newly created table/partition, the semantic analyzer does not have a handle?
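The "match exactly, except for comments" check described above might look like this. FieldSchema is a stand-in for the metastore class of the same name, and the logic is a sketch, not the actual ImportSemanticAnalyzer.checkTable.

```java
import java.util.List;
import java.util.Objects;

// Stand-in for org.apache.hadoop.hive.metastore.api.FieldSchema.
class FieldSchema {
    final String name, type, comment;
    FieldSchema(String name, String type, String comment) {
        this.name = name; this.type = type; this.comment = comment;
    }
}

class CheckTableSketch {
    // Compare two schemas field by field; names and types must match
    // exactly, while comments are deliberately ignored.
    static boolean schemasMatch(List<FieldSchema> a, List<FieldSchema> b) {
        if (a.size() != b.size()) return false;
        for (int i = 0; i < a.size(); i++) {
            if (!Objects.equals(a.get(i).name, b.get(i).name)
                    || !Objects.equals(a.get(i).type, b.get(i).type)) {
                return false;
            }
        }
        return true;
    }
}
```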

Krishna Kumar
added a comment - 09/Feb/11 17:13 Hmm. LoadSemanticAnalyzer (which knows the table) does not add it to the outputs, but the MoveTask it schedules, does.
Similarly, CREATE-TABLE does not add the entity but the DDLTask it schedules, does. This may be fine only because the entity does not exist at compile time?
ADD-PARTITION adds the table as an input at compile time and the partition itself is added as an output at execution time. Should not the table be an output (at compile time) as well - for authorization/concurrency purposes?
Anyway, where the import operates on existing tables/partitions, I will add them at compile time. If the entity is being created as part of the task, then the task will be adding them to inputs/outputs at runtime. Is this fine?

Namit Jain
added a comment - 20/Feb/11 16:35 Krishna, the code changes look good - I had one concern only.
The functions in EximUtil.java like:
    private static Element createStorageDescriptor(Document doc,
                                                   String location,
                                                   String inputFormatClass,
have an implicit dependency on the metastore schema.
If the schema changes, export/import will break, and it will be difficult to add them.
Do you want to think about it?
I mean, add an API in the metastore thrift to generate this, or something like this?
So that this code is auto-generated and is amenable to new fields.
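Namit's concern is about hand-written serialization of this shape. The element and attribute names below are invented for illustration, but they show how each metastore field must be written out explicitly, which is what creates the implicit dependency on the metastore schema.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch of the hand-written DOM serialization style under discussion.
class StorageDescriptorSketch {
    static Element createStorageDescriptor(Document doc, String location,
                                           String inputFormatClass) {
        Element sd = doc.createElement("storage-descriptor");
        sd.setAttribute("location", location);
        sd.setAttribute("input-format", inputFormatClass);
        // Every new metastore field needs another line like the above, so
        // a schema change silently leaves the export incomplete.
        return sd;
    }

    // Convenience wrapper so the sketch can be exercised standalone.
    static Element demo(String location, String inputFormat) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            return createStorageDescriptor(doc, location, inputFormat);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```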

Krishna Kumar
added a comment - 21/Feb/11 08:07 There are a few reasons why I took this approach:
The decision on compatibility (forward/backward) checks, as in EximUtil.checkCompatibility, needs to be taken consciously. That is, automatically breaking backward compatibility is not an option here, I think.
What needs to be serialized/deserialized also requires a human decision. For instance, even now, authorization details are not transferred by an export/import.
The serialization/deserialization methods are also used by the howl codebase outside of a hive context. It will be good to have this code only loosely coupled to the metastore code.

Krishna Kumar
added a comment - 23/Feb/11 16:43 Thanks Paul.
[Your comments are on a superseded review board submission; I will remind Carl again to take it down. The current reviewboard submission is up at https://reviews.apache.org/r/430/, but nevertheless both your comments are still applicable.]
1. Ok. Will address it.
2. I am not seeing how compatibility checking and selective serialization/deserialization of an object graph will be possible by auto-generated code. Will look into both thrift and datanucleus serialization (that you mentioned) from this aspect, but fine-grained control over this process is required here I think.
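As an illustration of a consciously chosen (rather than auto-generated) compatibility rule, here is a sketch. The "major.minor" version format and the policy - same major required, target minor at least the source minor - are assumptions made for the example, not the actual EximUtil.checkCompatibility rule.

```java
// Sketch: a deliberate, human-decided compatibility policy for the
// export format version, of the kind auto-generated code cannot express.
class CompatibilitySketch {
    static boolean isCompatible(String sourceVersion, String targetVersion) {
        String[] s = sourceVersion.split("\\.");
        String[] t = targetVersion.split("\\.");
        int sMajor = Integer.parseInt(s[0]), sMinor = Integer.parseInt(s[1]);
        int tMajor = Integer.parseInt(t[0]), tMinor = Integer.parseInt(t[1]);
        // Incompatible across major versions; a newer reader accepts an
        // older minor version, but not the other way around.
        return sMajor == tMajor && sMinor <= tMinor;
    }
}
```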

I discarded my reviewboard request after Krishna's original request. Unfortunately this doesn't appear to prevent people from commenting on it, and reviewboard won't let me delete the request outright.

Gelesh
added a comment - 04/Jul/13 08:35 Wish:
Can we have an option to copy the meta information alone?
Use case:
So that, during DistCp, without copying the Hive files (with partition folder and clustered file structure) to a temp location, we can create a _meta file alone.
Then, DistCp the hive files (the partitioned and clustered file structure) as such and re-create the hive table in the new cluster.
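Gelesh's metadata-only export could be sketched like this: write just the metadata file into the target directory and skip the data copy, leaving DistCp to move the partition/bucket files as-is. The "_metadata" file name matches what export writes, but the method, paths, and serialized content below are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: a metadata-only export that emits the _metadata file without
// touching the table's data files.
class MetadataOnlyExportSketch {
    static Path exportMetadataOnly(String targetDir, String serializedTableDef) {
        try {
            Path dir = Paths.get(targetDir);
            Files.createDirectories(dir);
            Path meta = dir.resolve("_metadata");
            Files.write(meta, serializedTableDef.getBytes(StandardCharsets.UTF_8));
            return meta; // data files are intentionally not copied
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```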