Cloudera Glossary

Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you
are managing a CDH 5 cluster with Cloudera Manager 6. For more information, see Deprecated Items.

This is a reference list of terms related to Cloudera products and services. Additional information is available from a number of resources.

access control list (ACL)

A list of permissions associated with an object in a computer file system. An ACL specifies which users or processes are allowed to access an object, and what operations can be
performed.
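
As a minimal sketch of the idea, an ACL can be modeled as a mapping from each user or process to the operations it may perform on one object. The principals, operation names, and the is_allowed helper below are hypothetical and are not taken from HDFS or any other Cloudera component.

```python
# Hypothetical ACL for one file: principal -> set of permitted operations.
acl = {
    "alice": {"read", "write"},
    "etl-service": {"read"},
}


def is_allowed(principal: str, operation: str) -> bool:
    """Return True only if the ACL grants the operation to the principal."""
    return operation in acl.get(principal, set())


print(is_allowed("alice", "write"))
print(is_allowed("etl-service", "write"))
print(is_allowed("bob", "read"))
```

A principal that does not appear in the list is denied by default in this sketch.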

Accumulo

A sorted, distributed key-value store based on the Google BigTable design. Apache Accumulo is a NoSQL DBMS that
operates over HDFS, and supports efficient storage and retrieval of structured data, including queries for ranges. Accumulo tables can be used as input and output for MapReduce jobs. Accumulo
includes automatic load-balancing and partitioning, data compression, and fine-grained security labels.

action

In Spark, a function that returns a value to the driver after running a computation on an RDD.

Apache

Apache Incubator

Apache Software Foundation gateway for open-source projects that aim to become Apache projects.
Incubating projects are open source and may or may not become Apache projects.

Apache Software Foundation (ASF)

A non-profit corporation that supports various open-source software products, including Apache Hadoop and related projects on which Cloudera products are based. Apache projects are
developed by teams of collaborators and protected by an ASF license that provides legal protection to volunteers who work on Apache products and protects the Apache brand name.

Apache projects are characterized by a collaborative, consensus-based development process and an open and pragmatic software license. Each project is managed by a self-selected team of
technical experts who are active contributors to the project.

Cloudera employees are major contributors to many Apache projects.

application JAR

A JAR containing a Spark application. In some cases you can use an "uber" JAR containing your application with its dependencies. The JAR should never include Hadoop or Spark libraries,
however, because these are added at run time.

audit, audit event

Log of activity performed in the cluster. Many cluster services write an entry in an audit log for each action performed; these logs are collected by Cloudera Navigator Audit Server and
can be reviewed through the Navigator console. Some ways to use audit logs include measuring activity against specific data assets or by individual users or groups, tracing failed attempts to access
data assets, or identifying who accessed specific data assets and when.

authentication

The function of confirming the identity of a person or software program.

authorization

The function of specifying access rights to resources.

Avro

A serialization system for storing and transmitting data over a network. Apache Avro supports rich data structures, a compact binary encoding, and a container file for sequences of Avro
data (often referred to as Avro data files). Avro is language-independent and several language bindings are available, including Java, C, C++, Python, and Ruby. All components
in CDH that produce or consume files support Avro data files.

Beeswax

A Hue application that enables you to perform queries on Hive. You can create Hive tables, load data, and run and
manage Hive queries.

big data

Data sets in which the input/output velocity, variety of data structure, and volume exceed the capabilities of systems which were designed for smaller data sets to capture, manage, and
process the data within a tolerable elapsed time. Big data sizes are expanding, currently ranging from terabytes to many petabytes in a single data set.

BigTable

A compressed, high-performance, column-oriented database built on Google File System (GFS). The BigTable design was the inspiration for HBase and Accumulo, but the implementation, unlike other Google projects
such as Protocol Buffers, is proprietary.

Bigtop

An Apache project to develop the packaging and interoperability testing of the Apache Hadoop ecosystem projects.

BDR

Backup and Disaster Recovery. CDH has several features that you can use to back up your data for disaster recovery. See snapshots and replication.

user-defined properties

In Cloudera Navigator, business metadata, such as key-value pairs and tags, that is added to extracted entities. You can add and modify user-defined properties before or after entities
are extracted.

CDH

CDH is free, 100% open source, and licensed under the Apache 2.0 license. CDH is supported on many Linux
distributions.

Cloudera Enterprise

Essentials Edition offers an enterprise-ready distribution of CDH together with Cloudera
Manager and other advanced management tools and technical support for core Apache Hadoop.

Data Science and Engineering Edition offers an enterprise-ready distribution of CDH together
with Cloudera Manager and other advanced management tools and technical support for programmatic data preparation and predictive modeling.

Operational Database Edition offers an enterprise-ready distribution of CDH together with
Cloudera Manager and other advanced management tools and technical support for online applications with real-time serving needs.

Data Warehouse Edition offers an enterprise-ready distribution of CDH together with Cloudera
Manager and other advanced management tools and technical support for BI and SQL analytics.

Enterprise Data Hub Edition offers an enterprise-ready distribution of CDH together with
Cloudera Manager and other advanced management tools and technical support for complete use of the platform.

Securing data and simplifying storage and management of encryption keys. Data encryption and key management provide protection against potential threats by malicious actors on the
network or in the datacenter. They are also a requirement for meeting key compliance initiatives and ensuring the integrity of enterprise data.

Cloudera Search

A fully integrated search tool for the Apache Hadoop platform that integrates Apache Solr, including Apache
Lucene, Apache SolrCloud, and Apache Tika, with CDH. Cloudera Search makes searching more scalable, easy to use, and optimized for both near-real-time and batch-oriented indexing.

cluster

A set of computers or racks of computers that contains an HDFS filesystem and runs MapReduce and other processes on that data. A pseudo-distributed cluster is a CDH
installation run on a single machine and useful for demonstrations and individual study.

In Cloudera Manager, a logical entity that contains a set of hosts, a single version of CDH installed on the hosts, and the service and role instances running on the hosts. A host can
belong to only one cluster. Cloudera Manager can manage multiple CDH clusters; however, each cluster can be associated with only a single Cloudera Manager Server or Cloudera Manager HA pair.

cluster manager

An external service for acquiring resources on the cluster: Spark Standalone or YARN.

commit

An operation in Cloudera Search that makes documents searchable.

hard - A commit that starts the autowarm process, closes old searchers, and opens new ones. It may also trigger replication.

compression

A mechanism to reduce the size of a file so that it takes up less disk space for storage and consumes less network bandwidth when transferred. Common compression tools used with Apache
Hadoop include gzip, bzip2, Snappy, and LZO.
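
As a rough illustration of the size reduction, the following sketch compresses a made-up, repetitive payload with Python's standard gzip and bz2 modules. Snappy and LZO require third-party packages and are omitted here. The sample data is an assumption chosen because repetitive text compresses well; real ratios depend on the data.

```python
import bz2
import gzip

# Hypothetical sample payload: repetitive log-style text compresses well.
sample = b"2019-01-01 INFO request served in 12 ms\n" * 1000

gzip_bytes = gzip.compress(sample)
bzip2_bytes = bz2.compress(sample)

print("original:", len(sample), "bytes")
print("gzip:    ", len(gzip_bytes), "bytes")
print("bzip2:   ", len(bzip2_bytes), "bytes")

# Compression is lossless: decompressing restores the exact input.
assert gzip.decompress(gzip_bytes) == sample
assert bz2.decompress(bzip2_bytes) == sample
```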

container

A resource bucket and process space for a task. A container's resources consist of vcores and memory.

connector

Usually refers to software for connecting external systems with Apache Hadoop. Some connectors work with Apache
Sqoop to enable efficient data transfer between an external system and Hadoop. Other connectors translate ODBC driver calls
from business intelligence systems into HiveQL queries.

Crunch

An Apache Java library that can be used to write, test, and run MapReduce pipelines.

business metadata

In Cloudera Navigator, descriptions, key-value pairs, and tags that can be added to entities such as HDFS files, Hive tables, and YARN operations. You can add and modify business
metadata before and after entities are extracted.

Data Encryption Key (DEK)

The encryption/decryption key assigned to a file in an encryption zone. Each file has its own DEK, and these DEKs are never stored persistently unless they are encrypted with the
encryption zone's key.

data science

A discipline that builds on techniques and theories from many fields, including mathematics, statistics, and computer science, with the goal of extracting meaning from data and creating
data products.

DataFu

A collection of Pig user-defined functions (UDFs) that can be used by Apache Pig for statistical analysis. Apache
DataFu was deprecated in CDH 5.9 and is removed from CDH 6.0 Beta 1. For more information, see Removal of the Apache DataFu Pig JAR from CDH
6.

DataNode

datastore

A repository of a set of integrated information objects. Datastores include repositories such as databases and files.

dataset

A collection of records, similar to a relational database table. Records are similar to table rows, but the columns can contain not only strings or numbers, but also nested data
structures such as lists, maps, and other records.

DDL

A category of SQL statements that affect database state rather than table data. Includes all the CREATE, ALTER, and
DROP statements.

deployment

A configuration of Cloudera Manager and all the clusters it manages.

distributed system

A system composed of multiple autonomous computers that communicate through a computer network.

DML

A category of SQL statements that change table data, such as INSERT and LOAD DATA.
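
The difference between DDL and DML statements can be sketched with Python's built-in sqlite3 module. SQLite is used only because it ships with Python; it is not Hive or Impala and does not support statements such as LOAD DATA. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: changes database state (the schema), not table data.
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
conn.execute("ALTER TABLE events ADD COLUMN source TEXT")

# DML: changes the data stored in the table.
conn.execute("INSERT INTO events VALUES (1, 'login', 'web')")
conn.execute("INSERT INTO events VALUES (2, 'logout', 'web')")

rows = conn.execute("SELECT id, name, source FROM events ORDER BY id").fetchall()
print(rows)

# DDL again: DROP removes the table definition along with its data.
conn.execute("DROP TABLE events")
conn.close()
```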

driver

In Apache Spark, a process that represents an application session. The driver is responsible for converting the application to a directed graph of individual steps to execute on the
cluster. There is one driver per application.

dynamic resource pool

In Cloudera Manager, a named configuration of resources and a policy for scheduling the resources among YARN applications or Impala queries running in the pool.

embedded Solr

Provides the ability to execute Solr commands without having a separate servlet container. Use of embedded Solr is generally discouraged, particularly if used because HTTP is assumed to
be too slow. However, in Cloudera Search, particularly if a MapReduce process is adopted, embedded Solr is advisable.

Encrypted Data Encryption Key (EDEK)

An encrypted DEK, which is stored persistently as part of the file's metadata on the NameNode.

encryption

The encoding of information so that only authorized users are permitted to read it.

encryption zone

A directory in HDFS in which every file and subdirectory is encrypted. The files in this directory are transparently encrypted on write and transparently decrypted on read. Each
encryption zone is associated with a key that is specified when the zone is created.

encryption zone key

Key used to encrypt EDEKs. When a new file is created in an encryption zone, the NameNode sends a request to the KMS to generate a new EDEK encrypted with the encryption zone key. When
reading a file from an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK. The client then sends a request to the KMS
to decrypt the EDEK. If successful, the client uses the DEK to decrypt the file contents.
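
The exchange between the DEK, the EDEK, and the encryption zone key can be sketched as follows. This is a toy model only: ToyKMS and toy_cipher are hypothetical names, the XOR stream cipher is not secure and merely stands in for real encryption, and none of this is the Hadoop KMS API. The write-side steps are an assumption of the sketch that mirrors the read path described above.

```python
import hashlib
import os


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (NOT secure); stands in for real encryption.

    Applying it twice with the same key returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical stand-in for the KMS; it alone holds the zone key."""

    def __init__(self) -> None:
        self._zone_key = os.urandom(32)

    def generate_edek(self) -> bytes:
        dek = os.urandom(32)                    # fresh per-file DEK
        return toy_cipher(self._zone_key, dek)  # only the EDEK leaves the KMS

    def decrypt_edek(self, edek: bytes) -> bytes:
        return toy_cipher(self._zone_key, edek)


kms = ToyKMS()

# File creation: a new EDEK is requested from the KMS and kept in the
# file's metadata (on the NameNode in a real cluster).
file_metadata = {"edek": kms.generate_edek()}

# Write (assumed flow): the client has the KMS decrypt the EDEK, then
# encrypts the file contents with the resulting DEK.
dek = kms.decrypt_edek(file_metadata["edek"])
ciphertext = toy_cipher(dek, b"sensitive record")

# Read: the client receives the EDEK, has the KMS decrypt it, and uses
# the DEK to decrypt the file contents.
dek_again = kms.decrypt_edek(file_metadata["edek"])
plaintext = toy_cipher(dek_again, ciphertext)
print(plaintext)
```

In this model the zone key never leaves the KMS object, and only the encrypted form of the DEK is stored with the file.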

Enterprise Data Hub

An enterprise data hub (EDH), built on Apache Hadoop, provides a single central system for the storage and management of all data in the enterprise. An EDH runs the full range of
workloads that enterprises require, including batch processing, interactive SQL, enterprise search, and advanced analytics, together with the integrations to existing systems, robust security, data
management, and data protection.

executor

A process that serves a Spark application. An executor runs multiple tasks over its lifetime, and multiple tasks concurrently. A host may have several Spark executors and there are many
hosts running Spark executors for each application.

expression

A construct that allows certain policy properties to be specified programmatically using Java, instead of string
literals.

extract, load, transform (ELT)

A variation of Extract, Transform, Load (ETL). The process of transferring data from a source to an end
target (a database or data warehouse), and then transforming the data as required.

extract, transform, load (ETL)

A process that involves extracting data from sources, transforming the data to fit operational needs, and loading the data into the end target, typically a database or data
warehouse.

facet

In Cloudera Manager and Cloudera Navigator, an explicit dimension of an entity that enables it to be accessed and filtered in multiple ways. Facets correspond to entity properties.

faceting

Arrangement of query results into categories, usually with counts for each category. You can use these categories to explore and further restrict search results to find the information
you need.
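
A minimal sketch of the idea in plain Python, assuming a small list of query results that is made up for the example. The field names are hypothetical, and a search engine would compute these counts on the server rather than in client code.

```python
from collections import Counter

# Hypothetical query results; the field names are made up for the example.
results = [
    {"title": "Cluster sizing", "type": "guide"},
    {"title": "HDFS quotas", "type": "reference"},
    {"title": "Upgrading CDH", "type": "guide"},
    {"title": "Release notes", "type": "notes"},
]

# Facet on "type": one category per distinct value, with a count for each.
facet_counts = Counter(doc["type"] for doc in results)
print(facet_counts.most_common())

# Selecting a category restricts the results further.
guides = [doc for doc in results if doc["type"] == "guide"]
print([doc["title"] for doc in guides])
```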

fault-tolerant design

A design that enables a system to continue operation, possibly at a reduced level instead of failing completely, when some part of the system fails.

field-level

Level at which encryption and data masking can be applied. When protection is applied at this level, it is generally applied only to specific sensitive fields, such as credit card
numbers, social security numbers, or names, not to all data.

Flume

A distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of text or streaming data from many different sources to a centralized
datastore. Apache Flume is robust and fault tolerant and uses a simple, extensible data model that allows for online analytic application.

filesystem-level

Level at which encryption can be applied to protect some or all files in a volume.

filter query (fq)

A clause that limits returned results in Cloudera Search. For example, “fq=sex:male” limits results to males. Filter queries are cached and reused.
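
As an illustration, the sketch below builds a URL-encoded query string that combines a main query (the Solr q parameter) with the filter query from the example above. The main query value is hypothetical, and the host and collection parts of the URL are omitted because they depend on the deployment.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical main query; the fq value is the example from the text.
params = [
    ("q", "name:smith"),
    ("fq", "sex:male"),
]
query_string = urlencode(params)
print(query_string)

# Decoding the string recovers the original parameters.
decoded = parse_qs(query_string)
print(decoded)
```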

Fuse-DFS

A service that allows HDFS to be mounted on Linux and
accessed using standard filesystem tools.

gateway

A type of role that typically provides client access to specific cluster services. For example, HDFS, Hive, Kafka, MapReduce, Solr, and Spark each have gateway roles to provide access
for their clients to their respective services. Gateway roles do not always have "gateway" in their names, nor are they exclusively for client access. For example, Hue Kerberos Ticket Renewer is a
gateway role that proxies tickets from Kerberos.

The node supporting one or more gateway roles is sometimes referred to as the gateway node or edge node, with the notion of "edge" common
in network or cloud environments. In terms of the Cloudera cluster, the gateway nodes in the cluster receive the appropriate client configuration files when Deploy Client
Configuration is selected from the Actions menu in Cloudera Manager Admin Console.

HA

Hadoop

A free, open source software framework that supports data-intensive distributed applications. The core components of Apache Hadoop are the HDFS and the MapReduce and YARN processing frameworks. The term is also used for an ecosystem of projects related to Hadoop, under the umbrella of infrastructure for distributed
computing and large-scale data processing.

Hadoop Distributed File System (HDFS)

A user space filesystem designed for storing very large files with streaming data access patterns, running on clusters of industry-standard machines. HDFS defines three components:

NameNode - Maintains the namespace tree for HDFS and a mapping of file blocks to DataNodes where the data is stored. A simple HDFS cluster can have only one primary NameNode, supported
by a secondary NameNode that periodically compresses the NameNode edits log file that contains a list of HDFS metadata modifications. This reduces the amount of disk space consumed by the log file on
the NameNode, which also reduces the restart time for the primary NameNode. A high availability cluster contains two
NameNodes: active and standby.

DataNode - Stores data in a Hadoop cluster and is the name of the daemon that manages the data. File data is replicated on multiple DataNodes for reliability and so that localized
computation can be executed near the data.

JournalNode - Maintains a directory to log the modifications to the namespace metadata when using the Quorum-based Storage mechanism for providing high availability. During failover, the standby NameNode ensures that it has applied all of the
edits from the JournalNodes before promoting itself to the active state.

HDFS

heterogeneous storage

high availability (HA)

A system and implementation design to keep a service available at all times in case of failure, without regard to its performance.

Hive

An Apache data warehouse system for Hadoop that facilitates summarization and the analysis of large datasets stored in HDFS
using an SQL-like language called HiveQL.

HiveServer

A server process that supports clients that connect to Hive over a Thrift connection. The name also refers to a
Thrift protocol used by both Impala and Hive.

HiveServer2

A server process that supports clients that connect to Hive over a network connection. These clients can be native command-line editors or applications and tools that use an ODBC or JDBC
driver. The name also refers to a Thrift protocol used by both Impala and Hive.

HiveQL

The name of the SQL dialect used by the Hive component. It uses a syntax that is similar to standard SQL to execute
MapReduce jobs on HDFS. HiveQL does not support all SQL functionality. Transactions and materialized views are not supported, and support for indexes and subqueries is limited. It supports features
that are not part of standard SQL, such as multitable inserts and create table as select.

Internally, a compiler translates a HiveQL statement into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution. Beeswax, which is included in Hue, provides a graphical front end for HiveQL queries.

host

In Cloudera Manager, a physical or virtual machine that runs role instances. A host can belong to only one cluster.

host template

A set of role groups in Cloudera Manager. When a template is applied to a host, a role instance from each role group is created and assigned to that host.

Impala

Official name: Apache Impala. A service that enables real-time querying of data stored in HDFS or HBase. It supports the same metadata and ODBC and JDBC drivers as Apache Hive
and a query language based on the Hive Standard Query Language (HiveQL). To avoid latency, Impala circumvents MapReduce to directly
access data through a specialized distributed query engine that is similar to those found in commercial parallel RDBMS.

Incubator

index

A data structure that improves the speed of data retrieval on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns
of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
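
The trade-off can be sketched in a few lines of Python, assuming a tiny in-memory table. The rows, the age_index structure, and the lookup_age_range helper are hypothetical; real databases typically use B-trees or similar structures rather than a sorted list.

```python
import bisect

# Hypothetical table: (id, age) rows in insertion order.
rows = [(1, 42), (2, 17), (3, 35), (4, 17), (5, 60)]

# Build an index on the "age" column: sorted (age, row position) pairs.
# This costs extra storage and must be updated on every write.
age_index = sorted((age, pos) for pos, (_, age) in enumerate(rows))


def lookup_age_range(low: int, high: int) -> list:
    """Return rows with low <= age <= high, in age order, using the index."""
    start = bisect.bisect_left(age_index, (low, -1))
    end = bisect.bisect_right(age_index, (high, len(rows)))
    return [rows[pos] for _, pos in age_index[start:end]]


print(lookup_age_range(17, 40))
```

Because the index is kept sorted, a range lookup uses binary search instead of scanning every row, and the matching rows come back already ordered by the indexed column.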

job

JobTracker

JournalNode

Kafka

A distributed publish-subscribe messaging system that provides high throughput for publishing and subscribing, as well as replication to prevent data loss. Apache Kafka is frequently
used for log collection and stream processing and often (but not exclusively) used in tandem with Hadoop, Apache Storm, and Spark Streaming.

Kerberos

An authentication protocol in wide use since it was first developed by MIT in 1993 and standardized by the IETF in 2005 (RFC 4120, Kerberos Version 5). Cloudera recommends using Kerberos
to secure clusters, by integrating either MIT Kerberos or Microsoft Active Directory (which uses Kerberos). With Kerberos enabled, user authentication is required. Once users authenticate, other
components of the cluster can also be leveraged (for example, Sentry's role-based access privileges) to provide appropriate, secure access to the cluster. See Cloudera Security for more information.

key material

The portion of the key used during encryption and decryption.

Key Management Server (KMS)

Hadoop service that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider client API.

Kite

A collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with CDH. Just like CDH, Kite is 100% free, open source, and
licensed under the Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

Kudu

A columnar storage manager for the Hadoop platform. Like other Hadoop ecosystem applications, Apache Kudu runs on commodity hardware, is horizontally scalable, and supports highly
available operation.

Fast processing, integration with Hadoop ecosystem components, high availability, and other benefits make it ideal for a variety of applications: reporting applications where new data
must be immediately available for end users; time-series applications that must support queries across large amounts of historic data while simultaneously returning granular queries about an
individual entity; and applications that use predictive models to make real-time decisions.

latency

A measure of time delay experienced in a system.

lineage diagram

In Cloudera Navigator, a directed graph that depicts an entity and its relationship with other entities.

Linux

A Unix-like computer operating system assembled under the model of free, open-source software development and distribution. Linux is a leading operating system on servers, mainframe
computers, supercomputers, and embedded systems such as mobile phones, tablets, network routers, televisions, and video game consoles. The major distributions of enterprise Linux are CentOS, Debian,
RHEL, SLES, and Ubuntu.

LZO

A free, open source compression library. LZO compression provides a good balance between data size and speed of compression. The LZO compression algorithm is the most efficient of the
codecs, using very little CPU. Its compression ratios are not as good as others, but its compression is still significant compared to the uncompressed file sizes. Unlike some other formats, LZO
compressed files are splittable, enabling MapReduce to process splits in parallel.

LZO is published under the GNU General Public License and so is not included in CDH but can be used with CDH components; the Cloudera public Git repository hosts the hadoop-lzo package that provides a version of LZO that can be used with CDH.

machine learning

A field of computer science that explores the construction and study of algorithms that can learn from and make predictions on data. Categories of machine learning problems include:

Mahout

A machine learning library for Hadoop that is scalable to large datasets, thereby simplifying the task of
building intelligent applications.

Apache Mahout was deprecated in CDH 5.5 and has been removed from CDH 6, starting with CDH 6.0 Beta. It is no longer supported.

managed properties

In Cloudera Navigator, metadata in the form of key-value pairs that are added to entities such as HDFS files, Hive tables, and YARN operations. Managed properties are created by an
administrator to be used on specific entity types. They are defined within a namespace and enforce conformance to value constraints (for example, require the value to be a date). You can add and
modify managed properties after entities are extracted. Managed properties differ from user-defined properties in that they are centrally defined and managed; user-defined properties are created
one-at-a-time without any constraints on consistency or content.

MapReduce

A distributed processing framework for processing and generating large data sets and an implementation that runs on large clusters of industry-standard machines.

The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges
all intermediate values associated with the same intermediate key.

A MapReduce job partitions the input data set into independent chunks that are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are
then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.
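
The map, sort, and reduce steps described above can be sketched as a single-process word count in Python. This only illustrates the processing model and is not the Hadoop API; map_fn, reduce_fn, and run_job are hypothetical names, and a real job would run the map and reduce calls in parallel across the cluster.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for each word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_job(lines):
    # Map phase: each input record is processed independently.
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))

    # Shuffle and sort: group intermediate values by key, in key order.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct intermediate key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


result = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(result)
```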

The implementation provides an API for configuring and submitting jobs and job scheduling and management services; a library of search, sort, index, inverted index, and word
co-occurrence algorithms; and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required
inter-machine communication.

MapReduce v1 (MRv1)

The runtime framework on which MapReduce jobs execute. It defines two daemons:

JobTracker - Coordinates running MapReduce jobs and provides resource management and job lifecycle management. In YARN,
those functions are performed by two separate components.

TaskTracker - Runs the tasks that the MapReduce jobs have been split into.

MapReduce v2 (MRv2)

Maven

A software project-management tool. Based on the concept of a project object model, Apache Maven can manage a project's build, reporting, and documentation. CDH artifacts are available
in the Cloudera Maven repository.

metadata

Data about data. Metadata includes attributes that characterize the entities contained in or generated by the cluster. The name and date properties associated with a file created on a
hard-disk drive are examples of metadata. Apache Sentry generates metadata to indicate who can access cluster files and tables. Cloudera Navigator collects system-generated metadata (technical
metadata) for cluster entities and allows you to create additional metadata to associate with these entities.

NameNode

Navigator Key Trustee

A virtual safe-deposit box for managing encryption keys, certificates, and passwords. It provides software-based key and certificate management that supports a variety of robust,
configurable, and easy-to-implement policies governing access to the secure artifacts.

near real-time (NRT)

In Cloudera Search, the ability to search documents very soon after they are added to Solr. With SolrCloud, this is largely automatic and measured in seconds.

Navigator

network-level

Level at which encryption and decryption are applied before and after data is sent across a network. In Hadoop, this includes data sent from client user interfaces as well as
service-to-service communication like remote procedure calls (RPCs). This protection is available on virtually all transmissions within the Hadoop ecosystem using industry-standard protocols such as
TLS/SSL.

ODBC driver

OL

Oozie

A workflow and coordination service for Hadoop that orchestrates data ingest, store, transform, and analysis actions. Apache Oozie supports several types of Hadoop jobs, including
MapReduce, Streaming, Pipes, Pig, Hive, and Sqoop.

Oryx

Provides a simple, real-time, large-scale machine learning and predictive analytics infrastructure. Using
Apache Hadoop, Oryx can continuously build models from a data stream. It also serves queries of those models in real time through an HTTP
REST API, and can update models based on new streaming data.

parcel

A binary distribution format that contains compiled code and meta-information such as a package description, version, and dependencies.

Parquet

An open source, column-oriented binary file format for Hadoop that supports very efficient
compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and allows adding
more encodings as they are invented and implemented. Encoding and compression are separated, allowing Parquet consumers to implement operators that work directly on encoded data without paying a
decompression and decoding penalty, when possible.

partition

A subset of the elements in an RDD. Partitions define the unit of parallelism; Spark processes elements within a partition in sequence and multiple partitions in parallel. When Spark
reads a file from HDFS, it creates a single partition for each input split. This normally yields one partition per HDFS block (although partition boundaries fall on line breaks, not
block boundaries), unless the file is a compressed text file. A compressed text file yields a single partition for the whole file, because compressed text files are not splittable.

peer

A Cloudera Manager instance that manages clusters and is used as the source of data to be replicated. See replication.

petabyte

Pig

A data flow language and parallel execution framework built on top of MapReduce. Internally, a compiler translates Apache Pig statements into a directed acyclic graph of MapReduce jobs,
which are submitted to Hadoop for execution.

policy

In Cloudera Navigator, a set of actions performed when a class of entities is extracted. Use policies to add managed metadata or tags to entities in Navigator's index.

Quorum-based storage

A mechanism for enabling a standby NameNode to keep its state synchronized with the active NameNode, in which both
nodes communicate with a group of daemons called JournalNodes.

rack

In Cloudera Manager, a physical entity that contains a set of physical hosts typically served by the same switch.

RegionServer

In HBase, applications store data into labeled tables, which are partitioned horizontally into regions.
A RegionServer is responsible for managing one or more regions.

relational database management system (RDBMS)

A database management system based on the relational model, in which all data is represented in terms of tuples, grouped into relations. Most implementations of the relational model use
the SQL data definition and query language.

replica

In SolrCloud, a complete copy of a shard. Each replica is identical, so only one replica has to be queried (per shard) for searches.

replication

The ability to copy HDFS directories and files, the Hive metastore and data, and HBase tables to another cluster.

resilient distributed dataset (RDD)

In Spark, a fault-tolerant collection of elements that can be operated on in parallel.

RHEL

Red Hat Enterprise Linux.

role

In Cloudera Manager, a category of functionality within a service. For example, the HDFS service has the following roles: NameNode, SecondaryNameNode, DataNode, and Balancer. Sometimes
referred to as a role type. See also user role.

role group

In Cloudera Manager, a set of configuration properties for a set of role instances.

role instance

In Cloudera Manager, an instance of a role running on a host. It typically maps to a Unix process. For example: "NameNode-h1" and "DataNode-h1".

scheduler

Component of a computing framework, such as YARN, MapReduce, or Spark, that is responsible for determining which jobs run, where and when they run, and which resources are allocated to the
jobs.

schema

Defines the field names and data types for a dataset. Kite relies on an Apache Avro schema definition for all datasets, standardizes data definition by using Avro schemas for both
Parquet and Avro, and supports the standard Avro object models generic and specific.

scheme

In a dataset, defines its storage type and location. You can create datasets in Hive, HDFS, HBase, or as local files. You define dataset schemes using scheme-specific URI patterns.

serialization

The process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted across a network connection.
Deserialization is the reverse process: reconstructing the original data structure or object state from the serialized form, later, in the same or another computer environment. See Avro and Thrift.
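A minimal sketch of the round trip, using Python's standard json module in place of Avro or Thrift. The record and its field names are invented for the example.

```python
import json

# Hypothetical record; the field names and values are illustrative only.
record = {"host": "h1", "role": "DataNode", "port": 50010}

# Serialize: convert the in-memory object to bytes that can be stored or sent.
payload = json.dumps(record).encode("utf-8")

# Deserialize: rebuild an equivalent object from those bytes.
restored = json.loads(payload.decode("utf-8"))

print(restored == record)
```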

service

A Linux command that runs a System V init script in /etc/init.d/ in as predictable an environment as possible, removing most environment variables and
setting the current working directory to /.

A category of managed functionality in Cloudera Manager, which may be distributed or not, running in a cluster. Sometimes referred to as a service type. For example: MapReduce, HDFS,
YARN, Spark, and Accumulo. In traditional environments, multiple services run on one host; in distributed systems, a service runs on many hosts.

service instance

In Cloudera Manager, an instance of a service running on a cluster. For example: "HDFS-1" and "yarn". A service instance spans many role instances.

sharding

In Cloudera Search, splitting a single logical index up into some number of sub-indexes, each of which can be hosted on a separate machine. Solr (and especially SolrCloud) handles
querying each shard and assembling the response into a single, coherent list.
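The idea can be sketched in a few lines of Python. This is a toy model, not Solr's document routing or query code: the CRC-based shard_for function, the shard count, and the substring search are all assumptions chosen to keep the example short.

```python
import zlib

NUM_SHARDS = 3  # assumed shard count for the example


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_index(docs, num_shards=NUM_SHARDS):
    """Distribute (doc_id, text) pairs across the sub-indexes."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs:
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search(shards, term):
    """Query every shard and merge the hits into one sorted list."""
    hits = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items() if term in text)
    return sorted(hits)
```

Each document lands in exactly one shard, and a search consults every shard before returning a single merged result.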

SLES

SUSE Linux Enterprise Server.

Snappy

A compression library. Snappy aims for very high speeds and reasonable compression instead of maximum compression or compatibility with other compression libraries. Snappy is provided in
the Hadoop package along with the other native libraries (such as native gzip compression).

snapshots

Point-in-time backups of HDFS directories or files, or HBase tables.

SolrCloud

ZooKeeper-enabled, fault-tolerant, distributed Solr.

SolrJ

A Java API for interacting with a Solr instance.

Spark

Apache Spark is a general framework for
distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related
projects.

Spark SQL - Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark
programs.

Cloudera supports Spark core, Spark SQL (including DataFrames), Spark Streaming, and MLlib. Cloudera does not currently offer commercial support for GraphX or
SparkR.

SQL

A declarative programming language designed for managing data in relational database management systems. It includes features for creating schema objects such as databases and tables,
and for querying and modifying data. CDH includes SQL support through Impala for high-performance interactive queries, and Hive for long-running batch-oriented jobs.

Sqoop

A tool for efficiently transferring bulk data between Hadoop and external structured datastores, such as relational databases. Apache Sqoop imports the contents of tables into HDFS, Hive, and HBase and generates Java classes that enable users to interpret the table's schema. Sqoop can also extract data from Hadoop storage and export
records from HDFS to external structured datastores such as relational databases and enterprise data warehouses.

There are two versions: Sqoop and Sqoop 2. Sqoop requires client-side installation and configuration. Sqoop 2 is a web-based service with a client command-line interface. In Sqoop 2,
connectors and database drivers are configured on the server.

stage

In Spark, a collection of tasks that all execute the same code, each on a different partition. Each stage contains a sequence of transformations that can be completed without shuffling
the data.
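To illustrate how shuffles delimit stages, the sketch below splits a list of operation names into stages. It is a simplification and not Spark's scheduler: the SHUFFLE_OPS set is an assumed list of operations that typically require a shuffle, and the function simply starts a new stage at each one.

```python
# Assumed set of operations that typically require a shuffle (illustrative only).
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}


def split_into_stages(ops):
    """Group consecutive operations into stages, starting a new stage at each shuffle."""
    stages = [[]]
    for op in ops:
        # A shuffle operation closes the current stage, unless that stage is still empty.
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


print(split_into_stages(["map", "filter", "reduceByKey", "map"]))
```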

static service pool

In Cloudera Manager, a static partitioning of total cluster resources—CPU, memory, and I/O weight—across a set of services.
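As a worked example of the arithmetic, the sketch below divides a cluster's resources among services by percentage. The partition_resources function, the resource names, and the cluster sizes are hypothetical, and I/O weight is left out for brevity. It does not reflect how Cloudera Manager applies the partitioning internally.

```python
def partition_resources(total, percentages):
    """Split each cluster resource across services by percentage (hypothetical helper)."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {
        service: {name: amount * pct / 100 for name, amount in total.items()}
        for service, pct in percentages.items()
    }


# Assumed cluster totals and shares, chosen only for illustration.
cluster = {"cpu_cores": 64, "memory_gb": 256}
pools = partition_resources(cluster, {"HDFS": 25, "YARN": 50, "HBase": 25})

for service, share in pools.items():
    print(service, share)
```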

suppression

In Cloudera Manager, the ability to suppress the display of health test results, configuration warnings, and parameter validation warnings.

task

TaskTracker

technical metadata

In Cloudera Navigator, metadata defined when entities are extracted from a CDH deployment. You cannot modify technical metadata.

terabyte

10^12 bytes, or 1,000 gigabytes.

Thrift

An interface definition language, runtime library, and code-generation engine to build services that can be invoked from many languages. Apache Thrift can be used for serialization and
RPC, but within Hadoop is mainly used for RPC.

transformation

In Spark, a function that creates a new RDD from an existing RDD. Spark uses "lazy evaluation": transformations do not execute on the cluster until an action is invoked. Examples of
actions are collect, which pulls data to the client, and saveAsTextFile, which writes data to a filesystem like HDFS.
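Lazy evaluation can be illustrated without a cluster. The LazyDataset class below is a toy stand-in for an RDD, not the Spark API: its map and filter methods only record the requested step, and nothing runs until collect is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD that defers work until an action is called."""

    def __init__(self, data, pending=()):
        self._data = data
        self._pending = tuple(pending)  # transformations recorded but not yet run

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._data, self._pending + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._data, self._pending + (("filter", pred),))

    def collect(self):
        # Action: apply every recorded step in order and return the results.
        items = iter(self._data)
        for kind, fn in self._pending:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


doubled = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(doubled.collect())
```

Building the doubled dataset does no work. The functions passed to map and filter are called only inside collect.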

TrusteeKeyProvider

KeyTrustee-specific implementation of the Hadoop KeyProvider API, allowing the Hadoop KMS to use the Navigator KeyTrustee server as a key store and enabling key generation on behalf of
clients.

UEK

Unbreakable Enterprise Kernel.

user role

Determines the Cloudera Manager or Cloudera Navigator features visible to the user and the actions the user can perform.

virtual core (vcore)

A CPU with a logical separation between areas of a processor. Virtual cores divide the processing resources of a physical core and work independently of one another.

Whirr

A set of libraries that can be used to run CDH clusters on cloud services such as Amazon Elastic Compute Cloud (Amazon
EC2).

Apache Whirr was deprecated in CDH 5.5 and has been removed from CDH 6, starting with CDH 6.0 Beta. It is no longer supported.

YARN (Yet Another Resource Negotiator)

A general architecture for running distributed applications. YARN specifies the following components:

ResourceManager - A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.

ApplicationMaster - A supervisory task that requests the resources needed for executor tasks. An ApplicationMaster runs on a different NodeManager for each application. The
ApplicationMaster requests containers, which are sized by the resources a task requires to run.

NodeManager - A worker daemon that launches and monitors the ApplicationMaster and task containers.

JobHistory Server - Keeps track of completed applications.

The ApplicationMaster negotiates with the ResourceManager for cluster resources—described in terms of a number of containers, each with a certain memory limit—and then runs
application-specific processes in those containers. The containers are overseen by NodeManagers running on cluster nodes, which ensure that the application does not use more resources than it has
been allocated.

MapReduce v2 (MRv2) is implemented as a YARN application.
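The container negotiation described above can be sketched as a toy allocation loop. This is not YARN's scheduler, which also considers queues, virtual cores, and data locality. The allocate_containers function and the node names are assumptions for illustration, and memory is the only resource modeled.

```python
def allocate_containers(node_free_mb, num_containers, container_mb):
    """Place containers on nodes with enough free memory; return one node name per container."""
    free = dict(node_free_mb)  # copy so the caller's view is not modified
    placements = []
    for _ in range(num_containers):
        # Pick the first node that can still fit a container of the requested size.
        node = next((n for n, mb in free.items() if mb >= container_mb), None)
        if node is None:
            break  # no capacity left; the remaining requests are not granted
        free[node] -= container_mb
        placements.append(node)
    return placements


# Hypothetical cluster: two nodes, with four 2048 MB containers requested.
print(allocate_containers({"node1": 4096, "node2": 2048}, 4, 2048))
```

Only three of the four requested containers fit in this example, which mirrors the way an application cannot receive more resources than the cluster has available.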

ZooKeeper

A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. In a CDH cluster, ZooKeeper coordinates the
activities of high-availability services, including HDFS, Oozie, Hive, Solr, YARN, HBase, and Hue.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.