Why MongoDB for Deep Learning?

If you haven't read part 3, it's worth visiting that post to learn more about the key considerations when selecting a database for new deep learning projects. As the following section demonstrates, developers and data scientists can harness MongoDB as a flexible, scalable, and performant distributed database to meet the rigors of AI application development.

Flexible Data Model

MongoDB's document data model makes it easy for developers and data scientists to store and combine data of any structure within the database, without giving up sophisticated validation rules to govern data quality. The schema can be modified dynamically, without the application or database downtime that costly schema changes or redesigns incur in relational database systems.

This data model flexibility is especially valuable to deep learning, which involves constant experimentation to uncover new insights and predictions:

Input datasets can comprise rapidly changing structured and unstructured data ingested from clickstreams, log files, social media and IoT sensor streams, CSV, text, images, video, and more. Many of these datasets do not map well into the rigid row and column formats of relational databases.

A database supporting a wide variety of input datasets, with the ability to seamlessly modify parameters for model training, is therefore essential.
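As a minimal sketch of how differently shaped inputs can share one collection while still being validated, the example below defines two sample documents and a `$jsonSchema` validator that constrains only the fields they have in common. The field names, the collection name, and the `missing_required` helper are assumptions made for illustration; `create_samples_collection` expects a PyMongo `Database` object and a running server, so it is defined here but not called.

```python
# Two training samples with different shapes that can live in one collection.
# All field names and values are illustrative.
clickstream_sample = {
    "source": "clickstream",
    "label": "converted",
    "features": {"pages_viewed": 7, "session_seconds": 312.5},
}
sensor_sample = {
    "source": "iot",
    "label": "anomaly",
    "features": {"readings": [21.4, 21.9, 35.2]},
    "device": {"id": "sensor-17", "firmware": "1.4.2"},
}

# A validator that governs only the fields every sample must share,
# leaving the rest of each document free to vary.
sample_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["source", "label", "features"],
        "properties": {
            "source": {"bsonType": "string"},
            "label": {"bsonType": "string"},
            "features": {"bsonType": "object"},
        },
    }
}


def create_samples_collection(db, name="training_samples"):
    """Create a collection with the validator attached.

    `db` is assumed to be a PyMongo Database connected to a live server.
    """
    return db.create_collection(name, validator=sample_validator)


def missing_required(doc, validator=sample_validator):
    """Hypothetical local helper: list the required fields absent from `doc`.

    It mirrors only the `required` rule; the server enforces the full schema.
    """
    required = validator["$jsonSchema"]["required"]
    return [field for field in required if field not in doc]
```

Both samples satisfy the shared rules even though only one of them carries a `device` sub-document, which is the kind of variation a fixed row-and-column layout would struggle to absorb.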

Rich Programming and Query Model

MongoDB offers both native drivers and certified connectors for developers and data scientists building deep learning models with data from MongoDB. The PyMongo driver is the recommended way to work with MongoDB from Python, implementing an idiomatic API that makes development natural for Python programmers. The community-developed MongoDB Client for R is also available for R programmers.

The MongoDB query language and rich secondary indexes enable developers to build applications that can query and analyze the data in multiple ways. Data can be accessed by single keys, ranges, text search, graph, and geospatial queries through to complex aggregations and MapReduce jobs, returning responses in milliseconds.
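The access patterns listed above can be sketched as plain filter documents of the kind PyMongo passes to `find()`. The field names (`customer_id`, `captured_at`, `location`) and helper names are assumptions for illustration; the text search filter presumes a text index on the collection, and the geospatial filter presumes a `2dsphere` index on `location`.

```python
from datetime import datetime, timezone


def key_filter(customer_id):
    """Single-key lookup."""
    return {"customer_id": customer_id}


def range_filter(start, end):
    """Half-open range on a timestamp field: start <= captured_at < end."""
    return {"captured_at": {"$gte": start, "$lt": end}}


def text_filter(terms):
    """Text search; requires a text index on the collection."""
    return {"$text": {"$search": terms}}


def near_filter(longitude, latitude, max_meters):
    """Documents within `max_meters` of a GeoJSON point.

    GeoJSON orders coordinates as [longitude, latitude].
    """
    return {
        "location": {
            "$near": {
                "$geometry": {
                    "type": "Point",
                    "coordinates": [longitude, latitude],
                },
                "$maxDistance": max_meters,
            }
        }
    }


january = range_filter(
    datetime(2017, 1, 1, tzinfo=timezone.utc),
    datetime(2017, 2, 1, tzinfo=timezone.utc),
)
nearby = near_filter(-73.9857, 40.7484, 5000)
```

With a connected PyMongo collection, each of these would be used as `collection.find(january)` or `collection.find(nearby)`.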

To parallelize data processing across a distributed database cluster, MongoDB provides the aggregation pipeline and MapReduce. The MongoDB aggregation pipeline is modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result using native operations executed within MongoDB. The most basic pipeline stages provide filters that operate like queries, and document transformations that modify the form of the output document. Other pipeline operations provide tools for grouping and sorting documents by specific fields as well as tools for aggregating the contents of arrays, including arrays of documents. In addition, pipeline stages can use operators for tasks such as calculating the average or standard deviation across collections of documents, and manipulating strings. MongoDB also provides native MapReduce operations within the database, using custom JavaScript functions to perform the map and reduce stages.
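A pipeline of the kind described above might look like the following sketch: a filter stage, a grouping stage that computes the average and standard deviation, a sort, and a final reshaping stage. The field names (`source`, `label`, `value`) and the function name are assumptions for illustration; the stage and operator names (`$match`, `$group`, `$avg`, `$stdDevPop`, `$sort`, `$project`) are standard aggregation operators.

```python
def label_stats_pipeline(source):
    """Build a pipeline that summarizes a numeric `value` field per label."""
    return [
        # Filter stage: behaves like a query.
        {"$match": {"source": source}},
        # Grouping stage: one output document per distinct label.
        {
            "$group": {
                "_id": "$label",
                "mean": {"$avg": "$value"},
                "std": {"$stdDevPop": "$value"},
                "n": {"$sum": 1},
            }
        },
        # Largest groups first.
        {"$sort": {"n": -1}},
        # Transformation stage: rename _id to label in the output.
        {"$project": {"_id": 0, "label": "$_id", "mean": 1, "std": 1, "n": 1}},
    ]


pipeline = label_stats_pipeline("iot")
```

With a connected PyMongo collection, the pipeline would run inside the database via `collection.aggregate(pipeline)`, so only the per-label summaries travel back to the client.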

In addition to its native query framework, MongoDB also offers a high-performance connector for Apache Spark. The connector exposes all of Spark's libraries through its Python, R, Scala, and Java APIs. MongoDB data is materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs.

The MongoDB Connector for Apache Spark can take advantage of MongoDB's aggregation pipeline and secondary indexes to extract, filter, and process only the range of data it needs - for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL datastores that do not support either secondary indexes or in-database aggregations. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for data scientists and engineers. To maximize performance across large, distributed data sets, the MongoDB Connector for Apache Spark can co-locate Resilient Distributed Datasets (RDDs) with the source MongoDB node, thereby minimizing data movement across the cluster and reducing latency.

Performance, Scalability & Redundancy

Model training time can be reduced by building the deep learning platform on top of a performant and scalable database layer. MongoDB offers a number of innovations to maximize throughput and minimize latency of deep learning workloads:

WiredTiger is the default storage engine for MongoDB, developed by the architects of Berkeley DB, the most widely deployed embedded data management software in the world. WiredTiger scales on modern, multi-core architectures. Using a variety of programming techniques such as hazard pointers, lock-free algorithms, fast latching and message passing, WiredTiger maximizes computational work per CPU core and clock cycle. To minimize on-disk overhead and I/O, WiredTiger uses compact file formats and storage compression.

For the most latency-sensitive deep learning applications, MongoDB can be configured with the In-Memory storage engine. Based on WiredTiger, this storage engine gives users the benefits of in-memory computing, without trading away the rich query flexibility, real-time analytics, and scalable capacity offered by conventional disk-based databases.

To parallelize model training and scale input datasets beyond a single node, MongoDB uses a technique called sharding, which distributes processing and data across clusters of commodity hardware. MongoDB sharding is fully elastic, automatically rebalancing data across the cluster as the input dataset grows, or as nodes are added and removed.
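As a rough sketch of how a training collection could be sharded, the function below builds the two administrative command documents involved: `enableSharding` for the database and `shardCollection` with a hashed shard key. The database, collection, and field names are hypothetical, and the choice of a hashed key is an assumption made here for even distribution rather than something the text prescribes.

```python
def sharding_commands(database, collection, shard_key_field):
    """Return the admin commands that shard `database.collection`.

    The command name must be the first key of each document; Python dicts
    preserve insertion order, so the literals below are safe to send as-is.
    """
    namespace = f"{database}.{collection}"
    return [
        {"enableSharding": database},
        {"shardCollection": namespace, "key": {shard_key_field: "hashed"}},
    ]


commands = sharding_commands("deep_learning", "training_samples", "sample_id")
```

Against a sharded cluster, each document would be issued through a PyMongo client connected to a mongos router, for example `client.admin.command(commands[0])`.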

Within a MongoDB cluster, data from each shard is automatically distributed to multiple replicas hosted on separate nodes. MongoDB replica sets provide redundancy to recover training data in the event of a failure, reducing the overhead of checkpointing.

Tunable Consistency

MongoDB is strongly consistent by default, enabling deep learning applications to immediately read what has been written to the database, thus avoiding the developer complexity imposed by eventually consistent systems. Strong consistency provides the most accurate results for machine learning algorithms; however, in some scenarios, such as stochastic gradient descent (SGD), it is acceptable to trade consistency for specific performance goals by distributing queries across the secondary members of a MongoDB replica set.
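One way to express this trade-off is through the `readPreference` option of the connection string. The sketch below builds a replica set URI with the standard library only; the host names and replica set name are hypothetical, while `replicaSet` and `readPreference` are standard connection string options.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, read_preference="primary"):
    """Build a MongoDB connection string for a replica set.

    "primary" keeps the default strongly consistent reads, while
    "secondaryPreferred" spreads reads across secondaries when available.
    """
    query = urlencode(
        {"replicaSet": replica_set, "readPreference": read_preference}
    )
    return f"mongodb://{','.join(hosts)}/?{query}"


hosts = ["db1.example.net:27017", "db2.example.net:27017"]
strict_uri = replica_set_uri(hosts, "rs0")
relaxed_uri = replica_set_uri(hosts, "rs0", "secondaryPreferred")
```

A job that must read its own writes would connect with `strict_uri`, while a workload that tolerates slightly stale mini-batches, as in the SGD case above, could pass `relaxed_uri` to `pymongo.MongoClient`.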

MongoDB AI Deployments

Due to the properties discussed above, MongoDB is serving as the database for many AI and deep learning platforms. A selection of users across different applications and industries follows:

IBM Watson: Analytics & Visualization

Watson Analytics is IBM's cloud-hosted service providing smart data discovery to guide data exploration, automate predictive analytics and visualize outputs. Watson Analytics is used across banking, insurance, retail, telecommunications, petroleum, and government applications. MongoDB is used alongside DB2 for managing data storage. MongoDB provides a metadata repository of all source data assets and analytics visualizations, stored in rich JSON document structures, with the scalability to support tens of thousands of concurrent users accessing the service.

x.ai: Personal Assistant

x.ai is an AI-powered personal assistant that schedules meetings for its users. Users connect their calendars to x.ai, and then when it's time to set a meeting via email, they instead delegate the scheduling task to 'Amy Ingram' by cc'ing amy@x.ai. Once she's copied into the email thread, she finds a mutually agreeable time and place and sets up the meeting. MongoDB serves as the system of record for the entire x.ai platform, supporting all services including natural language processing, supervised learning, analytics and email communication. MongoDB's flexible data model has been critical in enabling x.ai to rapidly adapt its training and input data sets, while supporting complex data structures. Learn more by reading the case study.

Auto Trader: Predicting Value

The UK's largest digital car marketplace makes extensive use of machine learning running against data stored in MongoDB. Each car's specifications and details, such as number of previous owners, condition, color, mileage, insurance history, upgrades, and more, are stored in MongoDB. This data is extracted by machine learning algorithms written by Auto Trader's data science team to generate accurate predictions of value, which are then written back to the database. MongoDB was selected due to its flexible data model and distributed design, allowing scalability across a cluster of more than 40 instances. Learn more from coverage in the technology press.

Mintigo: Predictive Sales & Marketing

Founded by former intelligence agency data scientists, Mintigo delivers a predictive marketing engine for companies such as Red Hat. Through sophisticated machine learning algorithms operating against large data sets stored in MongoDB, Mintigo helps marketing and sales organizations better identify leads most likely to convert to customers. Through its engine, Mintigo users average a 4x improvement in overall marketing funnel efficiency. Mintigo runs on AWS, with machine learning algorithms written in Python. MongoDB is used to store multi-TB data sets, and was selected for scalability of streaming data ingest and storage, and schema flexibility. MongoDB's expressive query framework and secondary indexes feed the algorithms with relevant data, without needing to scan every record in the database. Learn more from the case study.

Geo-Location Analysis for Retail

A US-based mobile app developer has built its Intelligence Engine on MongoDB, processing and storing tens of millions of rich geospatial data points on customers and their locations in real time. The Intelligence Engine uses scalable machine learning and multi-dimensional analytic techniques to surface behavioral patterns that allow retailers to predict and target customers with location-based offers through their mobile devices. MongoDB's support for geospatial data structures with sophisticated indexing and querying provides the foundation for the machine learning algorithms. MongoDB's scale-out design with sharding allows the company to scale from 10s to 100s of millions of customer data points.

Natural Language Processing (NLP)

A North American AI developer has built NLP software that is embedded by major consumer electronics brands into smart home and mobile devices. All interactions between the device and user are stored in MongoDB and are then fed back into the learning algorithms. MongoDB was selected for its schema flexibility, which supports rapidly changing data structures.

Bringing Data Science to Talent Acquisition

Working with HR departments in the Fortune 500, this company tackles the resume pile and candidate sourcing problem with data science and workforce intelligence. The company provides real-time analysis and prioritization of applicants by applying AI to thousands of information sources beyond the resume, including public and enterprise data. With predictive analytics generated by its AI algorithms, recruiters can instantly identify the most qualified candidates among active and passive applicants, accelerating the hiring process and reducing costs per hire. MongoDB was selected as the underlying database due to its data model flexibility and scale, coupled with extensive security controls to protect Personally Identifiable Information (PII).

Wrapping Up Part 4

That wraps up our 4-part blog series. Over the course of the series, we've discussed how deep learning and AI have moved well beyond science fiction into the cutting edge of internet and enterprise computing. Access to more computational power in the cloud, advancement of sophisticated algorithms, and the availability of funding are unlocking new possibilities unimaginable just five years ago. But it's the availability of new, rich data sources that is making deep learning real.

To advance the state of the art, developers and data scientists need to carefully select the underlying databases that manage the input, training, and results data. MongoDB is already helping teams realize the potential of AI.