I split the changes into multiple commits to ease the review.
Used the example program to test the new API.
I will add unit tests once we converge on the API after review.
Thanks to @anatolishein for collaborating on the API design.

Author: Deepak Majeti <deepak.majeti@hpe.com>

Closes #484 from majetideepak/PARQUET-1372 and squashes the following commits:

Decided to go with the simpler approach and only introduce a convenience API for now. Once this is merged, I'll do some work so that at least primitive arrays are read into a single Array in this path.

Author: Korn, Uwe <Uwe.Korn@blue-yonder.com>

Closes #492 from xhochy/PARQUET-1392 and squashes the following commits:

This is the first part of the bloom filter patch set, which includes a bloom filter utility and some unit tests.
Note that this patch also includes the original MurmurHash3 code from Austin Appleby. That code is not formatted in the parquet-cpp style.

Per comments in PARQUET-1348. I also upgraded clang-format to 6.0 here so pardon the diff noise from that.

To summarize the style points (see the sketch after this list):

* Don't pass `const std::shared_ptr<T>&` if `const T&` will do
* Don't pass `const std::shared_ptr<T>&` if `T*` will do
* I prefer to use static member functions only for alternate constructors
* Out arguments after in arguments
* `lower_with_underscores` for variable names
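
A minimal sketch of what these points look like in practice. The class and function names below are illustrative only and do not come from the codebase:

```cpp
#include <memory>
#include <string>
#include <vector>

// Illustrative only: a reader-like class following the style points above.
class ExampleReader {
 public:
  // Alternate constructor expressed as a static member function.
  static std::unique_ptr<ExampleReader> Open(const std::string& path) {
    return std::unique_ptr<ExampleReader>(new ExampleReader(path));
  }

  // Takes const T& rather than const std::shared_ptr<T>&: the callee only
  // needs to read the value, not share ownership.
  int64_t CountMatches(const std::string& pattern) const;

  // Takes T* rather than const std::shared_ptr<T>&: a mutable borrow is
  // enough. The out argument comes after the in argument.
  void ReadBatch(int64_t batch_size, std::vector<int32_t>* out_values);

 private:
  explicit ExampleReader(const std::string& path) : file_path_(path) {}

  // lower_with_underscores for variable (and member) names.
  std::string file_path_;
};
```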

Author: Wes McKinney <wesm+git@apache.org>

Closes #482 from wesm/PARQUET-1360 and squashes the following commits:

The call with size 0 ends up in Arrow's memory pool, https://github.com/apache/arrow/blob/884474ca5ca1b8da55c0b23eb7cb784c2cd9bdb4/cpp/src/arrow/memory_pool.cc#L50, and the corresponding allocation fails. See the documentation at https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc. This only happens on Windows, as posix_memalign seems to handle size-0 inputs in Unix environments.
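
A minimal sketch of the kind of guard this implies, assuming the Arrow-era signature AllocateBuffer(MemoryPool*, int64_t, std::shared_ptr<Buffer>*). The helper name is hypothetical and not the actual fix:

```cpp
#include <memory>

#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/status.h"

// Hypothetical helper, for illustration only: skip the pool allocation
// entirely when the requested size is zero, so the Windows
// _aligned_malloc(0, ...) path is never reached.
arrow::Status AllocateNonEmpty(arrow::MemoryPool* pool, int64_t size,
                               std::shared_ptr<arrow::Buffer>* out) {
  if (size == 0) {
    // An empty, non-owning buffer avoids touching the allocator at all.
    *out = std::make_shared<arrow::Buffer>(nullptr, 0);
    return arrow::Status::OK();
  }
  return arrow::AllocateBuffer(pool, size, out);
}
```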

Author: Philipp Hoch <p.hoch@celonis.com>

Closes #472 from philhoch/bugfix-cover-empty-dicitionary-size-on-windows and squashes the following commits:

There were a number of helper functions in the unit tests that use ASSERT_ macros. However, the callers of these functions were not checking whether the helper functions had encountered an assertion failure. In those circumstances, the unit test would continue, because the ASSERT_ macros only return from the function in which they are used. This commit adds ASSERT_NO_FATAL_FAILURE to the calls of those helper functions at the scope of the unit test, which causes the unit test to exit. This ensures that we won't encounter segfaults or unexpected behavior due to a unit test continuing execution past an assertion failure inside a helper function.
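
For illustration, a small self-contained example of the pattern (not taken from the parquet-cpp tests):

```cpp
#include <vector>

#include <gtest/gtest.h>

// An ASSERT_ inside a helper only aborts the helper itself on failure,
// not the calling test, so the caller must also check for a fatal failure.
void CheckValues(const std::vector<int>& values) {
  ASSERT_EQ(3u, values.size());  // aborts CheckValues on failure
  ASSERT_EQ(1, values[0]);
}

TEST(ExampleTest, HelperAssertions) {
  std::vector<int> values = {1, 2, 3};
  // Propagates any fatal failure from the helper and exits the test body here,
  // instead of continuing with state the helper failed to validate.
  ASSERT_NO_FATAL_FAILURE(CheckValues(values));
}
```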

Author: Joshua Storck <joshua.storck@twosigma.com>

Closes #458 from joshuastorck/use_assert_no_fatal_failure and squashes the following commits:

fa89f5d [Joshua Storck] There were a number of helper functions in the unit tests that use ASSERT_ macros. However, the callers of these functions were not checking whether the helper functions had encountered an assertion failure. In those circumstances, the unit test would continue, because the ASSERT_ macros only return from the function in which they are used. This commit adds the use of ASSERT_NO_FATAL_FAILURE to the calls of those helper functions at the scope of the unit test, which will cause the unit test to exit. This ensures that we won't encounter segfaults or unexpected behavior due to a unit test continuing execution past an assertion failure inside a helper function.

PARQUET-1274: Prevent segfault that was occurring when writing a nanosecond timestamp with arrow writer properties set to coerce timestamps and support deprecated int96 timestamps.

The bug was due to the fact that the physical type was int64 but the WriteTimestamps function was taking a path that assumed the physical type was int96. This caused memory corruption because it was writing past the end of the array. The bug was fixed by checking that coerce timestamps is disabled when writing int96.
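
A hypothetical illustration of the path selection; the names below do not match the actual parquet-cpp internals and only show the condition that was added:

```cpp
// Hypothetical types and names, for illustration only.
enum class TimestampPhysicalType { kInt64, kInt96 };

// Only take the Int96 path when deprecated int96 support is requested AND
// coercion is disabled; with coercion enabled the physical type is Int64,
// so writing through the Int96 path would run past the end of the array.
TimestampPhysicalType SelectTimestampPath(bool support_deprecated_int96,
                                          bool coerce_timestamps) {
  if (support_deprecated_int96 && !coerce_timestamps) {
    return TimestampPhysicalType::kInt96;
  }
  return TimestampPhysicalType::kInt64;
}
```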

A unit test was added for the regression.

Author: Joshua Storck <joshua.storck@twosigma.com>

Closes #456 from joshuastorck/ARROW_2082 and squashes the following commits:

5fa0a94 [Joshua Storck] Removing 'using ::arrow' in favor of using ::arrow::SomeType
9725ecc [Joshua Storck] Bug fix for ARROW-2082, in which a segfault was being encountered when writing a nanosecond timestamp column with arrow writer properties set to coerce timestamps and support deprecated int96 timestamps. The bug was a segfault due to the fact that the physical type was int64 but the WriteTimestamps function was taking a path that assumed the physical type was int96. The bug was fixed by checking that coerce timestamps is disabled when writing int96. A unit test was added for the regression

The error was reported here: https://issues.apache.org/jira/browse/ARROW-1938.

Because dictionary types are not supported in writing yet, the code converts the dictionary column to the actual values first before writing. However, the existing code was accidentally using zero as the offset and the length of the column as the size. This resulted in writing all of the column values for each chunk of the column that was supposed to be written.

The fix is to pass the offset and size when recursively calling through to WriteColumnChunk with the "flattened" data.
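
A simplified sketch of the shape of the fix; WriteChunk is a hypothetical stand-in for the recursive WriteColumnChunk call:

```cpp
#include <cstdint>
#include <memory>

#include "arrow/array.h"

// Hypothetical stand-in for the recursive call. The point of the fix: forward
// the caller-supplied (offset, size) rather than (0, values->length()), which
// previously caused the entire column to be rewritten for every chunk.
void WriteChunk(const std::shared_ptr<arrow::Array>& flattened_values,
                int64_t offset, int64_t size) {
  std::shared_ptr<arrow::Array> slice = flattened_values->Slice(offset, size);
  // ... hand `slice` to the column writer ...
}
```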

Author: Joshua Storck <joshua.storck@twosigma.com>

Closes #453 from joshuastorck/ARROW_1938 and squashes the following commits:

c2af50f [Joshua Storck] Remove extraneous semicolon in unit test
23f5722 [Joshua Storck] Ran clang-format on arrow-reader-writer-test.cc
314b159 [Joshua Storck] Removing print statements from AssertTableEqual
f0bc71a [Joshua Storck] Fixing bug reported in https://issues.apache.org/jira/browse/ARROW-1938, namely preventing all of the values in a dictionary column from being written to parquet for each chunk created as a result of specifying row_group_size

PARQUET-1225: NaN values may lead to incorrect filtering under certai…

1) `parquet-cpp` does not implement filtering (predicate pushdown). Clients such as Vertica, read the statistics from the metadata and implement their own filtering based on these stats.
Therefore, the read path does not require any changes. We should document that the min/max value can potentially contain NaNs.

2) I made changes to the write path to ignore the NaNs.
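
A minimal sketch of that write-path behavior, assuming double statistics (this is not the actual statistics implementation):

```cpp
#include <cmath>
#include <vector>

// Illustrative only: NaN values are skipped when accumulating min/max, so the
// stored statistics only reflect comparable values. Callers are assumed to
// initialize *min to +infinity and *max to -infinity before the first batch.
void UpdateMinMax(const std::vector<double>& values, double* min, double* max) {
  for (double v : values) {
    if (std::isnan(v)) continue;  // ignore NaNs entirely
    if (v < *min) *min = v;
    if (v > *max) *max = v;
  }
}
```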

Author: Deepak Majeti <deepak.majeti@hpe.com>

Closes #444 from majetideepak/PARQUET-1225 and squashes the following commits:

Per the parquet.thrift spec, for primitive nodes the num_children schema
attribute should remain unset. This is implemented correctly in parquet-mr;
see [1]. However, parquet-cpp currently sets the num_children attribute
to 0 if it is a primitive node. This pull request fixes this issue and
the tests that were relying on this behavior.
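
For illustration, a hedged sketch using the Thrift-generated types; the helper and the header path are assumptions, and only the optional-field handling matters:

```cpp
#include <cstdint>
#include <string>

#include "parquet/parquet_types.h"  // Thrift-generated types; path may differ

// Illustrative helper, not the actual writer code: num_children is an
// optional Thrift field, so it is set only for group nodes and left unset
// (__isset.num_children stays false) for primitive nodes.
parquet::format::SchemaElement MakeSchemaElement(const std::string& name,
                                                 int32_t num_children) {
  parquet::format::SchemaElement element;
  element.__set_name(name);
  if (num_children > 0) {
    element.__set_num_children(num_children);  // group node
  } else {
    element.__set_type(parquet::format::Type::INT32);  // primitive leaf example
  }
  return element;
}
```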

I did quite a bit of refactoring to make this easier / simpler. I think there's some additional work we could do to make the write-path code cleaner, but we should probably wait to do some of that until we implement complete nested data write and read.

I will follow up shortly with some functional tests for ARROW-232 to make sure this works end-to-end on the pyarrow side.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #426 from wesm/PARQUET-1092 and squashes the following commits:

The main change is that you don't have to specify the size of a row group up front when writing it. This is signalled through a "row_count_determined" flag, which is threaded through the relevant classes.

The AppendRowGroup(int64_t num_rows) method should behave identically to before.
The AppendRowGroup() method should fix its row group size after a single column is completed and enforce that all future columns match it. Empty row groups are still not allowed.
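
A usage sketch, under the assumption that the relevant entry point is parquet::ParquetFileWriter::AppendRowGroup (the header path may differ):

```cpp
#include "parquet/file_writer.h"

// Usage sketch only; creating and configuring `writer` is omitted.
void WriteTwoRowGroups(parquet::ParquetFileWriter* writer) {
  // Existing form, behaviour unchanged: the row count is fixed up front.
  parquet::RowGroupWriter* sized_group = writer->AppendRowGroup(1000);
  (void)sized_group;

  // New form: the row count is determined by the first completed column and
  // enforced for every subsequent column in the group.
  parquet::RowGroupWriter* unsized_group = writer->AppendRowGroup();
  (void)unsized_group;
}
```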

(You'll definitely want to squash these commits. Apologies for the poor git usage.)

We did not have very consistent logic around reading values from leaf nodes versus reading semantic records where the repetition level is greater than zero. This introduces a reader class that reads from column chunks until it identifies the end of records. It also reads values (with spaces, if required by the schema) into internal buffers. This permitted a substantial refactoring and simplification of the code in parquet::arrow where we were handling the interpretation of batch reads as records manually.
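
A conceptual sketch of the record-boundary rule the reader relies on (not the actual RecordReader API): a new semantic record begins wherever the repetition level returns to zero, which is how the reader knows where a batch of values can safely stop.

```cpp
#include <cstdint>
#include <vector>

// Count complete records in a run of repetition levels. Conceptual only.
int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  int64_t records = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) ++records;  // rep level 0 marks the start of a new record
  }
  return records;
}
```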

As a follow-up patch, we should be able to take a collection of record readers from the same "tree" in a nested type, reassemble the intermediate Arrow structure, and deal with any redundant structure information in repetition and definition levels. This should allow a unification of our nested data read code path so that we can read arbitrary nested structures.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #398 from wesm/PARQUET-1100 and squashes the following commits: