a

Gradient descent is the most commonly used optimization method in machine learning and deep learning algorithms. It is used to train a machine learning model by iteratively adjusting the model's parameters to minimize a loss function.
Types of Gradient Descent
There are three primary types of gradient descent used in modern machine learning and deep(...)
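The variants of gradient descent differ only in how much data feeds each gradient step; the update rule itself stays the same. A minimal batch-style sketch in plain Python, using an illustrative one-dimensional quadratic loss (the loss function, learning rate, and step count are assumptions for the demo, not from the text):

```python
# Minimal (batch) gradient descent on a one-dimensional quadratic loss.
# Loss: L(w) = (w - 3)^2, whose minimum is at w = 3.
# Gradient: dL/dw = 2 * (w - 3).

def gradient_descent(lr=0.1, steps=100, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)   # gradient of the loss at the current w
        w = w - lr * grad    # step against the gradient direction
    return w

w_final = gradient_descent()   # converges close to the minimum at 3
```

Stochastic and mini-batch variants follow the same loop but estimate `grad` from one example or a small batch instead of the full dataset.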

Alternative data is information gathered from non-traditional sources that others are not yet using. Analysis of alternative data can provide insights beyond those that an industry's regular data sources can offer.
However, what(...)

Apache Kudu is a free and open source columnar storage system developed for the Apache Hadoop ecosystem. It is an engine intended for structured data that supports low-latency, millisecond-scale random access to individual rows together with efficient analytical access patterns. It is a Big Data(...)

Apache Spark is an open source cluster computing framework for fast real-time large-scale data processing. Since its inception in 2009 at UC Berkeley’s AMPLab, Spark has seen major growth. It currently has one of the largest open source communities in big data and it features over 200(...)

An artificial neural network (ANN) is a computing system patterned after the operation of neurons in the human brain.
How artificial neural networks work
Artificial Neural Networks can best be viewed as weighted directed graphs that are commonly organized in layers. These layers feature(...)
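The weighted-directed-graph view can be made concrete with a tiny forward pass, sketched here with NumPy; the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

# Forward pass through a tiny two-layer network: each layer is a weight
# matrix whose entries are the edge weights of the directed graph.
rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input layer: 3 features
W1 = rng.normal(size=(4, 3))      # edges from input to 4 hidden neurons
W2 = rng.normal(size=(1, 4))      # edges from hidden layer to 1 output

hidden = np.tanh(W1 @ x)          # weighted sums, then a nonlinearity
output = W2 @ hidden              # the single output neuron
```

Training would then adjust `W1` and `W2` so that `output` matches known targets.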

Automation bias is an over-reliance on automated aids and decision support systems. As the availability of automated decision aids increases, their addition to critical decision-making contexts such as intensive care units or aircraft cockpits is becoming more common.
It is a human tendency(...)

b

Bayesian Neural Networks (BNNs) extend standard networks with posterior inference in order to control over-fitting.
From a broader perspective, the Bayesian approach uses the statistical methodology so that everything has a probability distribution attached to it, including(...)

Before Hadoop, both storage and compute technology were limited; as a result, the analytics process was long and rigid.
In order to get every new data source ready to be stored it had to go through a lengthy process, usually known as ETL. Once the data was ready, it had to be stored in a(...)

Bioinformatics is a field of study that uses computation to extract knowledge from large collections of biological data.
Bioinformatics refers to the use of IT in biotechnology for storing, retrieving, organizing and analyzing biological data.
An astounding amount of data is being(...)

c

At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
Catalyst is based on functional programming constructs in Scala and designed(...)

Complex event processing (CEP), also known as event, stream, or event stream processing, is the use of technology for querying data before storing it within a database or, in some cases, without it ever being stored.
Complex event processing is an organizational tool that helps to aggregate a(...)

A continuous application is an end-to-end application that reacts to data in real-time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate systems, such as query serving or(...)

In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural networks typically used to recognize patterns present in images, but also used for spatial data analysis, computer vision, natural language processing, signal processing, and various(...)

d

A data analytics platform is an ecosystem of services and technologies for performing analysis on voluminous, complex, and dynamic data, allowing you to retrieve, combine, interact with, explore, and visualize data from the various sources a company might have.
A comprehensive data(...)

A data lake is a central location that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data.
Compared to a hierarchical data warehouse which stores data in files or folders, a data lake uses a different approach; it uses(...)

A data warehouse is a system that pulls together data derived from operational systems and external data sources within an organization for reporting and analysis.
A data warehouse is a central repository of information that provides users with current and historical decision support(...)

Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. The primary(...)

A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. The list of columns and their types is called the schema. A simple analogy would be a spreadsheet with named columns. The fundamental difference is that while a spreadsheet sits on(...)

Datasets are a type-safe version of Spark’s structured API for Java and Scala. This API is not available in Python and R, because those are dynamically typed languages, but it is a powerful tool for writing large applications in Scala and Java.
Recall that DataFrames are a distributed(...)

Deep Learning is a subset of machine learning concerned with large amounts of data and with algorithms that have been inspired by the structure and function of the human brain, which is why deep learning models are often referred to as deep neural networks. It is a part of a broader family of(...)

Dense tensors store values in a contiguous sequential block of memory where all values are represented.
Tensors or multi-dimensional arrays are used in a diverse set of multi-dimensional data analysis applications.
There are a number of software products that can perform tensor(...)
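A quick way to see the "contiguous block" property is with a NumPy array, which stores a dense tensor exactly this way (the shape and values below are arbitrary):

```python
import numpy as np

# A dense 3-way tensor: all 2*3*4 = 24 values are stored in one
# contiguous, sequential block of memory.
t = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

size = t.size                          # every cell is represented: 24
contiguous = t.flags["C_CONTIGUOUS"]   # stored as one sequential block
corner = t[1, 2, 3]                    # direct indexed access: 23.0
```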

DNA sequencing is the process of determining the exact sequence of nucleotides of DNA (deoxyribonucleic acid). Sequencing DNA means determining the order in which the four chemical building blocks - adenine, guanine, cytosine, and thymine, also known as bases - occur within the DNA molecule.
The first methods for(...)

e

Elasticsearch is a NoSQL, distributed database that stores, retrieves, and manages document-oriented and semi-structured data. Furthermore, it is an open source, RESTful search engine built on top of Apache Lucene and released under the terms of the Apache License. It is Java-based, thus(...)

An ETL Pipeline refers to a set of processes extracting data from an input source, transforming the data, and loading it into an output destination such as a database, data mart, or a data warehouse for reporting, analysis, and data synchronization.
The letters stand for Extract,(...)

ETL stands for Extract-Transform-Load, and it refers to the process used to collect data from numerous disparate databases, applications, and systems, transforming the data so that it matches the target system’s required formatting, and loading it into a destination database.
How ETL(...)
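The Extract-Transform-Load flow can be sketched end to end in a few lines of plain Python; every name, the source rows, and the target schema here are hypothetical stand-ins:

```python
# A toy ETL run: extract rows from a CSV-like source, transform them to
# the target schema, and load them into a list standing in for a
# destination table. All names and the schema are made up.

def extract():
    # Pretend this reads raw rows from a source system.
    return ["alice,34", "bob,29"]

def transform(rows):
    # Reshape each row into the target format: dicts with typed fields.
    out = []
    for row in rows:
        name, age = row.split(",")
        out.append({"name": name.title(), "age": int(age)})
    return out

def load(records, destination):
    # Write the transformed records to the destination.
    destination.extend(records)

warehouse = []   # stands in for the destination database table
load(transform(extract()), warehouse)
```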

g

Genomics is an area within genetics that concerns the sequencing and analysis of an organism’s genome. Its main task is to determine the entire sequence of DNA or the composition of the atoms that make up the DNA and the chemical bonds between the DNA atoms.
The field of genomics is(...)

h

Apache Hadoop is an open-source, Java-based, software platform that manages data processing and storage for big data applications. The data is stored on commodity servers that run as clusters. It can provide a quick and reliable analysis of both structured data and unstructured data. It can(...)

A Hadoop cluster is a combination of many computers designed to work together as one system, in order to store and analyze big data (structured, semi-structured and unstructured) in a distributed computing environment. These computer clusters run Hadoop’s open source distributed processing(...)

Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open source projects as well as a complete range of complementary tools. Some of the most well-known tools of Hadoop ecosystem include HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase(...)

In computing, a hash table (hash map) is a data structure that provides virtually direct access to objects based on a key (a unique String or Integer). A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
Here(...)
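A minimal separate-chaining hash table in Python makes the bucket-index mechanism concrete (an illustrative sketch, not a production structure):

```python
# Minimal hash table with separate chaining: hash the key, map it to a
# bucket index, and search only within that bucket.

class HashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # The hash function computes an index into the array of buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pair in bucket:
            if pair[0] == key:        # key already present: overwrite
                pair[1] = value
                return
        bucket.append([key, value])   # new key: chain it in the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("spark", 2009)
```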

Hive provides many built-in functions to help us in the processing and querying of data. Some of the functionalities provided by these functions include string manipulation, date manipulation, type conversion, conditional operators, mathematical functions, and several others.
Types of(...)

Apache Spark is a fast and general cluster computing system for Big Data built around speed, ease of use, and advanced analytics that was originally built in 2009 at UC Berkeley. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general(...)

k

Keras is a high-level library for deep learning, built on top of Theano and TensorFlow. It is written in Python and provides a clean and convenient way to create a range of deep learning models. Keras has become one of the most used high-level neural networks APIs when it comes to developing(...)

l

Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides access to batch-processing and stream-processing methods with a hybrid approach.
Lambda architecture is used to solve the problem of computing arbitrary functions. The lambda architecture(...)

m

Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and models instead of solving the complexities(...)

A managed Spark service lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. By using such automation you will be able to quickly create clusters on-demand, manage them with ease, and turn them off when the task is(...)

n

A neural network is a computing model whose layered structure resembles the networked structure of neurons in the brain. It features interconnected processing elements called neurons that work together to produce an output function.
Neural networks are made of input and output(...)

p

Pandas is an open source, BSD-licensed library written for the Python programming language that provides fast and flexible data structures and data analysis tools. This easy to use data manipulation tool was originally written by Wes McKinney. It is built on the NumPy package and its key(...)
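A small illustration of the data structures and analysis tools pandas provides (the column names and values are made up):

```python
import pandas as pd

# A tiny DataFrame and two everyday pandas operations.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"],
                   "sales": [10, 20, 30]})

total = df["sales"].sum()                      # columnar reduction: 60
by_city = df.groupby("city")["sales"].sum()    # split-apply-combine
```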

Parquet is an open source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed as an efficient, performant, flat columnar storage format for data, in contrast to row-based files like CSV or TSV.
Parquet uses the record shredding and assembly(...)

Predictive analytics is a form of advanced analytics that uses both new and historical data to determine patterns and predict future outcomes and trends.
How does predictive analytics work?
Predictive analytics uses many techniques such as statistical analysis techniques, analytical(...)

PyCharm is an integrated development environment (IDE) used in computer programming, created for the Python programming language. When using PyCharm on Databricks, by default PyCharm creates a Python Virtual Environment, but you can configure it to create a Conda environment or use an existing(...)

Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Apache Spark and Python; it is essentially a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and(...)

r

RDD was the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
5(...)

s

If you are working with Spark, you will come across three APIs: DataFrames, Datasets, and RDDs.
RDD
RDDs, or Resilient Distributed Datasets, are distributed collections of records that are fault tolerant and immutable in nature. They can be operated on in parallel with(...)

Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and(...)

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It(...)

Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources(...)

What is Spark Performance Tuning?
Spark Performance Tuning refers to the process of adjusting settings for the memory, cores, and instances used by the system. This process ensures that Spark performs smoothly and also prevents bottlenecking of resources in(...)

Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little latency. Sparklyr is an effective tool for interfacing with large(...)

SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follows R’s syntax instead of Python. For the most(...)

Python offers a widely used library called NumPy to manipulate multi-dimensional arrays. Understanding the organization and use of this library is a primary requirement for developing the pytensor library.
Sptensor is a class that represents a sparse tensor. A sparse tensor is a dataset in which most of(...)
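Coordinate (COO) storage is one common way a sparse tensor class can represent its data: keep only the non-zero entries. A small illustrative sketch, not the actual pytensor implementation:

```python
import numpy as np

# COO storage for a sparse tensor: record only the non-zero entries as
# (index tuple, value) pairs instead of a dense block of all cells.
coords = [(0, 1, 2), (1, 0, 3)]   # positions of the two non-zeros
values = [5.0, 7.0]

def to_dense(shape, coords, values):
    # Materialize the sparse representation back into a dense tensor.
    dense = np.zeros(shape)
    for idx, v in zip(coords, values):
        dense[idx] = v
    return dense

dense = to_dense((2, 2, 4), coords, values)
```

For a (2, 2, 4) tensor, the dense form stores 16 values while the sparse form stores only the 2 that are non-zero.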

How does Stream Analytics work?
Streaming analytics, also known as event stream processing, is the analysis of huge pools of current and “in-motion” data through the use of continuous queries, called event streams.
These streams are triggered by a specific event that happens as a direct(...)

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a streaming fashion. This can reduce latency and(...)

t

In November of 2015, Google released its open-source framework for machine learning and named it TensorFlow. It supports deep-learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters of GPUs. One of the biggest advantages of TensorFlow is its open-source(...)

Estimators represent a complete model, but are also intuitive enough for less experienced users. The Estimator API provides methods to train the model, to judge the model’s accuracy, and to generate predictions.
TensorFlow provides a programming stack consisting of multiple API layers like in the below(...)

In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In order to “change” a DataFrame you will have to instruct Spark how you would like to modify(...)

Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware. This effort includes the following(...)

u

Unified Artificial Intelligence, or UAI, was announced by Facebook during F8 this year. It brings together two deep learning frameworks that Facebook created and open-sourced: PyTorch, focused on research assuming access to large-scale compute resources, and Caffe2, focused on model(...)

Unified Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Analytics makes it easier for enterprises to build data pipelines across(...)

A unified database also known as an enterprise data warehouse holds all the business information of an organization and makes it accessible all across the company.
Most companies today have their data managed in isolated silos, while different teams of the same organization use various data(...)

w

Apache Spark is an open source analytics engine used for big data workloads. It can handle both batches as well as real-time analytics and data processing workloads. Apache Spark started in 2009 as a research project at the University of California, Berkeley. Researchers were looking for a way(...)