Semi-structured data is data that does not conform to the standards of traditional structured data, but it contains tags or other types of mark-up that identify individual, distinct entities within the data.

Two of the key attributes that distinguish semi-structured data from structured data are the lack of a fixed schema and nested data structures:

Structured data requires a fixed schema that is defined before the data can be loaded and queried in a relational database system. Semi-structured data does not require a prior definition of a schema and can constantly evolve, i.e. new attributes can be added at any time.

In addition, entities within the same class may have different attributes even though they are grouped together, and the order of the attributes is not important.
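As a minimal sketch of these two points, the following Python snippet parses two records of the same class that carry different attributes in a different order. The record contents and field names (`id`, `name`, `loyalty_tier`) are hypothetical and chosen only for illustration.

```python
import json

# Two hypothetical "customer" records: the second has an extra attribute
# and lists its attributes in a different order.
raw_records = [
    '{"id": 1, "name": "Ann"}',
    '{"name": "Bob", "loyalty_tier": "gold", "id": 2}',
]

records = [json.loads(r) for r in raw_records]

# No schema was declared up front; the attribute set is discovered from the data.
all_attributes = sorted({key for rec in records for key in rec})

# Attribute order is not significant: parsed objects compare equal
# regardless of the order in which their keys were written.
same_regardless_of_order = (
    json.loads('{"a": 1, "b": 2}') == json.loads('{"b": 2, "a": 1}')
)
```

A relational table would need a column for `loyalty_tier` to exist before the second record could be loaded; here the attribute simply appears on the records that have it.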

The steps for loading semi-structured data into tables are identical to those for loading structured data into relational tables.

Snowflake loads semi-structured data into a single VARIANT column. Alternatively, using a COPY INTO table statement with data transformation, you can extract selected columns from a staged data file into separate table columns.

Concatenation of JSON documents (which may or may not be line-separated).
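To illustrate what concatenated JSON documents look like to a reader of the file, here is a minimal sketch using Python's standard `json` module. It is not how Snowflake parses input; the helper name `iter_json_documents` and the sample string are assumptions made for this example.

```python
import json


def iter_json_documents(text):
    """Yield each top-level JSON value from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # Skip whitespace (including newlines) between documents.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Three documents: the first two are not separated at all,
# the third is separated by a newline.
sample = '{"a": 1}{"b": 2}\n{"c": [3, 4]}'
documents = list(iter_json_documents(sample))
```

The same loop handles both line-separated and unseparated documents, because it relies on where each document ends rather than on a delimiter.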

Because there is no formal specification, there are significant differences between various implementations. These differences make importing JSON-like data sets impossible if the JSON parser is strict in
its language definition. To make importing JSON data sets as problem-free as possible, Snowflake follows the rule “be liberal in what you accept”. The intent is to accept the widest possible range of JSON
and JSON-like inputs that permit unambiguous interpretation.

This topic describes the syntax for JSON documents accepted by Snowflake.

Contains an array with 3 employee records (objects) and their associated dependent data (children, the children’s names and ages, cities where the employee has lived, and the years that the employee has
lived in those cities):

Avro is an open-source data serialization and RPC framework originally developed for use with Apache Hadoop. It utilizes schemas defined in JSON to produce serialized data in a compact binary format. The
serialized data can be sent to any destination (i.e. application or program) and can be easily deserialized at the destination because the schema is included in the data.

An Avro schema consists of a JSON string, object, or array that defines the type of schema and the data attributes (field names, data types, etc.) for the schema type. The attributes differ depending on
the schema type.
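As a rough illustration, the snippet below defines a small Avro schema of type `record` as a JSON object and reads its attributes with Python's standard `json` module. The record name `Employee` and its fields are hypothetical; no Avro library is used, so this only shows the shape of the schema, not serialization.

```python
import json

# Hypothetical Avro schema of type "record"; names are illustrative only.
schema_json = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(schema_json)

# The schema type determines which other attributes are expected;
# a "record" schema carries a name and a list of fields.
schema_type = schema["type"]
field_names = [field["name"] for field in schema["fields"]]

# A JSON array used as a type denotes a union, here "null or string".
email_type = schema["fields"][2]["type"]
```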

Used to store Hive data, the ORC (Optimized Row Columnar) file format was designed for efficient compression and improved performance for reading, writing, and processing data over earlier Hive file formats. For more information about ORC, see https://orc.apache.org/.

Snowflake reads ORC data into a single VARIANT column. You can query the data in a VARIANT column just as you would JSON data, using similar commands and functions.

Alternatively, you can extract selected columns from a staged ORC file into separate table columns using a CREATE TABLE AS SELECT statement.

ORC is a binary format.

Note

The maximum length of ORC binary and string columns is subject to the Snowflake 16 MB limit for VARCHAR data (compressed).

Parquet is a compressed, efficient columnar data representation designed for projects in the Hadoop ecosystem. The file format supports complex nested data structures and uses Dremel record shredding and assembly algorithms. For more information, see parquet.apache.org/documentation/latest/.

Snowflake reads Parquet data into a single VARIANT column. You can query the data in a VARIANT column just as you would JSON data, using similar commands and functions.

Alternatively, you can extract selected columns from a staged Parquet file into separate table columns using a CREATE TABLE AS SELECT statement.

Parquet is a binary format. It is not possible to provide an example of a Parquet file.

XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents. It was originally based on SGML, another markup language developed for standardizing the structure
and elements that comprise a document.

Since its introduction, XML has grown beyond an initial focus on documents to encompass a wide range of uses, including representation of arbitrary data structures and serving as the base language for
communication protocols. Because of its extensibility, versatility, and usability, it has become one of the most commonly used standards for data interchange on the Web.

An XML document consists primarily of the following constructs:

Tags (identified by angle brackets, < and >)

Elements

Elements typically consist of a “start” tag and matching “end” tag, with the text between the tags constituting the content for the element. An element can also consist of an “empty-element” tag with no
“end” tag. Both “start” and “empty-element” tags may contain attributes, which help define the characteristics or metadata for the element.
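The constructs above can be seen in a small example. The following sketch uses Python's standard `xml.etree.ElementTree` module to parse a hypothetical document (the element and attribute names are invented for illustration) containing one element with start and end tags and one empty-element tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <book> has start and end tags, text content, and
# attributes; <separator/> is an empty-element tag with no end tag.
doc = """
<catalog>
  <book id="bk101" lang="en">XML Basics</book>
  <separator/>
</catalog>
""".strip()

root = ET.fromstring(doc)

book = root.find("book")
book_tag = book.tag            # element name taken from the start tag
book_content = book.text       # text between the start and end tags
book_attributes = book.attrib  # attributes declared on the start tag

separator = root.find("separator")
separator_content = separator.text  # empty element: no content
```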