Hadoop’s Schema-on-Read model does not impose any requirements when loading data into Hadoop.

Data can simply be loaded into HDFS without associating a schema or preprocessing the data. Even so, creating a carefully structured and organized repository of your data provides many benefits. For example, it allows you to enforce access and quota controls that prevent accidental deletion or corruption.

The data model will be highly dependent on the specific use case. For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, including structured fact and dimension tables. Unstructured and semi-structured data, on the other hand, are likely to focus more on directory placement and metadata management.

Make sure your design will work well with the tools you are planning to use. The schema design is highly dependent on the way the data will be queried.

Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Understanding the main use cases and data retrieval requirements will result in a schema that will be easier to maintain and support in the long term as well as improve data processing performance.

Optimize the organization of data with partitioning, bucketing, and denormalization strategies. Keep in mind that storing a large number of small files in Hadoop can lead to excessive memory use on the NameNode, so avoid partitioning or bucketing so finely that the resulting files become small.
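One common way to express partitioning in HDFS is one directory per partition value, using the `key=value` naming convention that Hive uses. The sketch below only builds such a path; the base directory `/data/events`, the choice of date columns, and the function name `partition_path` are assumptions made for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(base_dir, event_date):
    """Build a Hive-style key=value partition directory for one day."""
    return (
        PurePosixPath(base_dir)
        / f"year={event_date.year:04d}"
        / f"month={event_date.month:02d}"
        / f"day={event_date.day:02d}"
    )


# Hypothetical dataset location and event date.
path = partition_path("/data/events", date(2024, 1, 15))
print(path)
```

A query that filters on a single day then only needs to read the files under that one directory. Partitioning by day rather than by hour or minute is one way to keep each directory's files reasonably large.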

A good average bucket size is a few multiples of the HDFS block size. Having an even distribution of data when hashed on the bucketing column is important because it leads to buckets of consistent size. It is also quite common to make the number of buckets a power of two.
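The bucketing guidance can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular engine does it: the 128 MB block size, the target of four blocks per bucket, the use of CRC32 as the hash, and the names `bucket_count` and `bucket_for` are all choices made for the example.

```python
import zlib
from collections import Counter

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MB)


def bucket_count(dataset_bytes, blocks_per_bucket=4):
    """Pick a power-of-two bucket count so that each bucket holds
    roughly `blocks_per_bucket` HDFS blocks (an assumed target)."""
    target_bucket_bytes = BLOCK_SIZE * blocks_per_bucket
    needed = max(1, dataset_bytes // target_bucket_bytes)
    count = 1
    while count < needed:  # round up to the next power of two
        count *= 2
    return count


def bucket_for(key, num_buckets):
    """Assign a bucketing-column value to a bucket by hashing it.
    CRC32 is a stand-in hash, not the one a real engine uses."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# Hypothetical 100 GB table bucketed on a customer ID column.
num_buckets = bucket_count(100 * 1024**3)
sizes = Counter(bucket_for(customer_id, num_buckets) for customer_id in range(100_000))
print(num_buckets, min(sizes.values()), max(sizes.values()))
```

Comparing the smallest and largest bucket is a quick way to check whether the chosen column hashes evenly. A column with a few very frequent values would show a large gap between the two.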

A Hadoop schema often consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process.
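As a rough illustration of that consolidation, the following sketch joins two small dimension tables into one wider dimension. The tables, column names, and the function `denormalize_products` are hypothetical sample data, and a real ETL job would do this in Hive, Spark, or a similar engine rather than in plain Python.

```python
# Two small dimension tables keyed by their IDs (hypothetical sample data).
products = {
    1: {"product_name": "Widget", "category_id": 10},
    2: {"product_name": "Gadget", "category_id": 20},
}
categories = {
    10: {"category_name": "Tools"},
    20: {"category_name": "Toys"},
}


def denormalize_products(products, categories):
    """Join products to categories once, during ETL, so that queries
    read one wider dimension instead of joining at query time."""
    merged = {}
    for product_id, product in products.items():
        category = categories[product["category_id"]]
        merged[product_id] = {**product, **category}
    return merged


product_dim = denormalize_products(products, categories)
print(product_dim[1])
```

The trade-off is duplicated category data in exchange for fewer joins at query time.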

Hadoop-specific file formats include columnar formats such as Parquet and RCFile, serialization formats like Avro, and file-based data structures such as SequenceFiles.

Splittability and compression are the key considerations when storing data in Hadoop. Splittability allows large files to be split for input to MapReduce and other types of jobs, which makes it a fundamental part of parallel processing.
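To see why splittability matters for parallelism, consider this simplified sketch of dividing a file into input splits. It is not Hadoop's actual split calculation; the function name `compute_splits`, the 128 MB split size, and the 300 MB file are assumptions for illustration.

```python
MB = 1024 * 1024


def compute_splits(file_size, split_size, splittable=True):
    """Return (offset, length) pairs covering a file of `file_size` bytes.
    A non-splittable file has to be handed to a single task as one split."""
    if not splittable:
        return [(0, file_size)]
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A hypothetical 300 MB file with a 128 MB split size.
splits = compute_splits(300 * MB, 128 * MB)
single = compute_splits(300 * MB, 128 * MB, splittable=False)
print(len(splits), len(single))
```

Each split can be processed by its own task, so the splittable file can be read by several tasks in parallel while the non-splittable one is limited to a single task.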

SequenceFiles

SequenceFiles store data as binary key-value pairs and can be uncompressed or compressed. SequenceFiles are well supported within the Hadoop ecosystem, but their support outside of the ecosystem is limited, and they are only supported in Java.

Storing a large number of small files in Hadoop can cause a couple of issues, including the excessive NameNode memory use mentioned earlier. A common use case for SequenceFiles is therefore as a container for smaller files.
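The container idea can be illustrated with a toy key-value file in Python, where each key is a file name and each value is that file's contents. This is explicitly not the real SequenceFile on-disk format (which has its own header, sync markers, and compression options); the record layout and the names `pack_files` and `unpack_files` are assumptions for the example.

```python
import io
import struct


def pack_files(files):
    """Write {name: bytes} as length-prefixed key/value records."""
    buf = io.BytesIO()
    for name, data in files.items():
        key = name.encode("utf-8")
        # 8-byte header: key length and value length, big-endian.
        buf.write(struct.pack(">II", len(key), len(data)))
        buf.write(key)
        buf.write(data)
    return buf.getvalue()


def unpack_files(blob):
    """Read the records back into a {name: bytes} dictionary."""
    files = {}
    buf = io.BytesIO(blob)
    while True:
        header = buf.read(8)
        if not header:
            break
        key_len, value_len = struct.unpack(">II", header)
        name = buf.read(key_len).decode("utf-8")
        files[name] = buf.read(value_len)
    return files


# Hypothetical small files packed into one container.
small_files = {"a.log": b"first", "b.log": b"second"}
container = pack_files(small_files)
print(len(container), sorted(unpack_files(container)))
```

Many small files become one larger file, so the NameNode tracks a single entry instead of one per small file.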

Serialization Formats

Serialization is the process of turning data structures into byte streams. Data storage and transmission are the main purposes of serialization. The main serialization format used by Hadoop is Writables. Writables are compact and fast, but limited to Java.

However, other serialization frameworks are gaining traction within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro. Of these, Avro is the best suited to Hadoop, since it was created specifically to address the limitations of Hadoop Writables.

Thrift

Thrift was designed at Facebook as a framework for developing cross-language interfaces to services. Using Thrift allowed Facebook to implement a single interface that can be used with different languages to access different underlying systems.

Thrift does not support internal compression of records, is not splittable, and has no native MapReduce support.

Protocol Buffers

The Protocol Buffers (protobuf) format was developed at Google to facilitate data exchange between services written in different languages. Like Thrift, Protocol Buffers are not splittable, do not support internal compression of records, and have no native MapReduce support.

Avro

Avro is a language-neutral data serialization system designed to address the downside of Hadoop Writables: lack of language portability. Since Avro stores the schema in the header of each file, it’s self-describing, and Avro files can easily be read from a different language than the one used to write the file. Avro is splittable.

Avro stores the data definition in JSON format, making it easy to read and interpret. The data itself is stored in binary format, making it compact and efficient.

Avro has native MapReduce support and supports schema evolution. The schema used to read a file does not need to match the schema used to write the file, which provides great flexibility when requirements change.

Avro supports a number of data types such as Boolean, int, float, and string. It also supports complex types such as array, map, and enum.
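The sketch below shows what an Avro schema looks like as JSON and gives a deliberately simplified picture of schema evolution. It uses only Python's standard library, not the Avro library, and the `resolve` function is a hypothetical stand-in that covers just one of Avro's resolution rules (filling a field that is missing from the written data with the reader's default). The `User` record and its fields are invented for the example.

```python
import json

# Writer schema: what the file was originally written with.
writer_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"}]}
""")

# Reader schema: a later version that adds an optional field with a default.
reader_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": ["null", "string"], "default": null}]}
""")


def resolve(record, reader_schema):
    """Project a decoded record onto the reader schema: keep the fields
    the reader asks for and fill missing ones from their defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# A record as it would have been written with the older writer schema.
old_record = {"id": 1, "name": "Ada"}
writer_fields = [field["name"] for field in writer_schema["fields"]]
new_view = resolve(old_record, reader_schema)
print(writer_fields, new_view)
```

Because the new field carries a default, data written before the field existed can still be read with the newer schema.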

Columnar Formats

Most RDBMSs store data in a row-oriented format. This is efficient when many columns of the record need to be fetched. This option can also be more efficient when you’re writing data, particularly if all columns of the record are available at write time, because the record can be written with a single disk seek.

More recently, a number of databases have introduced columnar data storage, which is well suited for data warehousing and queries that only access a small subset of columns. Columnar data sets also compress more efficiently, because the values within a column share a type and tend to be similar.
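The difference between the two layouts can be shown with plain Python lists. This is a conceptual sketch only: the sample records are hypothetical, and real columnar formats store columns in encoded, compressed chunks on disk rather than in in-memory lists.

```python
# Row-oriented layout: each record keeps all of its columns together.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 30.0},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query that needs only "amount" touches one column in the columnar
# layout, but has to walk every full record in the row layout.
total_from_columns = sum(columns["amount"])
total_from_rows = sum(row["amount"] for row in rows)
print(columns["country"], total_from_columns)
```

Both totals are the same; what differs is how much data has to be read to compute them.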

The RCFile format was developed to provide fast data loading, fast query processing, and efficient processing for MapReduce applications, although in practice it has only seen use as a Hive storage format.

The RCFile format breaks files into row splits, then uses column-oriented storage within each split. It has some deficiencies that prevent optimal query performance and compression. Even so, RCFile is still a fairly common Hive storage format.

ORC

The ORC format was created to address some of the weaknesses of the RCFile format, specifically around storage and query performance efficiency. ORC provides lightweight, always-on compression through type-specific readers and writers. It supports the Hive type model, including new primitives such as decimal as well as complex types, and it is a splittable storage format.

Parquet

The Parquet documentation says: “Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

Parquet shares many of the same design goals as ORC, but is intended to be a general-purpose storage format for Hadoop. The goal is to create a format that’s suitable for different MapReduce interfaces such as Java, Hive, and Pig, and also suitable for other processing engines such as Impala and Spark. Parquet provides the following benefits, many of which it shares with ORC:

• Like ORC, returns only the required data fields, thereby reducing I/O and increasing performance.
• Is designed to support complex nested data structures.
• Allows compression to be specified on a per-column level.
• Can be fully read and written with the Avro and Thrift APIs.
• Stores full metadata at the end of files, so Parquet files are self-documenting.
• Uses efficient and extensible encoding schemes, for example bit-packing and run-length encoding (RLE).
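Run-length encoding, mentioned in the last item, is easy to sketch on a single column. The version below is a generic textbook RLE written for illustration, not Parquet's actual hybrid RLE and bit-packing encoding; the function names and the sample column are assumptions.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of repeated values into a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column with long runs of repeated values.
country = ["US", "US", "US", "DE", "DE", "US"]
encoded = rle_encode(country)
print(encoded)
```

Encodings like this work well on columnar data because storing a column's values together makes runs of repeated values more likely, especially when the data is sorted.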

Conclusion

Having a single interface to all the files in your Hadoop cluster is valuable. When picking a file format, you will want one with a schema because, in the end, most data in Hadoop will be structured or semi-structured.

So if you need a schema, Avro and Parquet are great options. However, we don’t want to have to worry about making an Avro version of the schema and a Parquet version.

Thankfully, this isn’t an issue, because Parquet files can be read and written with the Avro APIs and Avro schemas.

This meets our goal of having one interface to interact with both Avro and Parquet files, while giving us both a block-based option (Avro) and a columnar option (Parquet) for storing our data.
