Friday, July 27, 2018

On January 31, 2017, I was invited by Prof. Wei Ding at the Department of Computer Science, University of Massachusetts Boston, and gave 3 talks about my data science projects across different industries. These talks are extremely well received. The following is what Prof. Ding says about my talks.

"It was a fortune to have Jay come to our computer science department to share his experience in solving business problems with predictive analytics on February 28, 2017. What Jay had presented in his 3 talks, each lasting for 1 hour in different topics of data mining, was totally impressive and beyond our wildest expectation. Having built competition-winning predictive models for some of the biggest companies and produced hundreds of millions of dollars’ savings, Jay shared the secret of his success with students and faculty without reservation. His strong presentations were such an inspiration for our computer science students and faculty and his methodology was innovative and powerful, even for very seasoned data scientists among the audience. Jay, thank you so much for your hard work preparing and delivering these presentations!" -Prof. Ding Wei, Department of Computer Science, University of Massachusetts Boston

The audience are particularly amazed by how I come up with solutions using Oracle SQL environment. To share my expertise, I create the online course Oracle SQL for Data Science to show how to perform common data science tasks using Oracle SQL and the benefits for doing that. The tuition for the course is $14.95.

Thursday, June 28, 2018

Deep-Data-Mining is pleased to introduce the new book, Principles of Database
Management – The Practical Guide to Storing, Managing and Analyzing Big and
Small Data, by Lemahieu W., vanden Broucke S., and Baesens B. (ISBN: 9781107186125). The following is the book interview.

Why did you decide to write your book
Principles of Database Management?

This book is the result of having taught an
undergraduate database management class and a postgraduate advanced database
management class for more than ten years at KU Leuven (Belgium).Throughout these years, we have found no good
textbook which covers the material in a comprehensive way without getting
flooded by theoretical detail and losing focus.Hence, after having teamed up together, we decided
to start writing a book ourselves.This work aims to offer a complete and practical guide covering all the
governing principles of database management, including:

A
unique perspective on how lessons learnt from past data management could be
relevant in today’s technology setting (e.g., navigational access and its
perils in Codasyl and XML/OO databases)

A
critical reflection and accompanying risk management considerations when
implementing the technologies considered, based on our own experiences from
participating in data and analytics related projects with industry partners in
a variety of sectors, from banking to retail and from government to the
cultural sector

Offering
a solid balance between theory and practice, including various exercises,
industry examples and case studies originating from a diversified and
complimentary business practice, scientific research and academic teaching
experience

We hear a lot of companies
complaining about bad data quality these days.How can database management contribute to this?

First of all, data quality (DQ) is often
defined as ‘fitness for use,’ which implies the relative nature of the
concept.Data of acceptable quality in
one decision context may be perceived to be of poor quality in another decision
context, even by the same business user.For instance, the extent to which data is required to be complete for
accounting tasks may not be required for analytical sales prediction tasks.Database management can contribute in various
ways to improving data quality.

A good conceptual data model capturing the
business requirements as accurately as possible is the start of
everything.As discussed in the book,
both EER and UML can be used for this purpose.It is important to also list the semantic shortcomings of the conceptual
models developed such that they can be followed up during application
development.

A next important activity concerns metadata
management.Just as raw data, also
metadata is data that needs to be properly modeled, stored and managed.Hence, the concepts of data modeling should
also be applied to metadata in a transparent way.In a DBMS approach, metadata is stored in a
catalog, sometimes also called data dictionary or data repository, which
constitutes the heart of the database system.

Finally, to manage and safeguard data quality,
a data governance culture should be put in place assigning clear roles and
responsibilities. The ultimate aim of data governance is to set up a
company-wide controlled and supported approach towards data quality,
accompanied by data quality management processes. The core idea is to manage
data as an asset rather than a liability, and adopt a proactive attitude
towards data quality problems. To succeed, it should be a key element of a
company’s corporate governance and supported by senior management.

Should all companies start to invest in NoSQL
databases?

The explosion of popularity of NoSQL databases should be put in
perspective considering their limitations.Most NoSQL implementations have yet to prove their true worth in the
field (most are very young and in development).Most implementations sacrifice ACID (atomicity, consistency, isolation
and durability) concerns in favor of being
eventually consistent, and the lack of relational support makes expressing some queries or aggregations particularly difficult,
with map-reduce interfaces being offered as a
possible, but harder to learn and use, alternative.Combined with the
fact that RDBMSs do provide strong support for transactionality,
durability and manageability, quite a few early adopters of NoSQL were
confronted with some sour lessons.It
would be an over-simplification to reduce the choice between RDBMSs and NoSQL
databases to a choice between consistency and integrity on the one hand, and
scalability and flexibility on the other.The market of NoSQL systems is far too diverse for that.Still, this tradeoff will often come into
play when deciding on taking the NoSQL route.We see many NoSQL vendors focusing again on robustness and
durability.We also observe traditional
RDBMS vendors implementing features that let you build schema-free, scalable
data stores inside a traditional RDBMS, capable to store nested,
semi-structured documents, as this seems to remain the true selling point of
most NoSQL databases, especially those in the document store category.Expect the future trend to continue towards
adoption of such “blended systems”, except for use cases that require
specialized, niche database management systems.In these settings, the NoSQL movement has rightly taught users that the
one size fits all mentality of relational systems is no longer applicable and
should be replaced by finding the right tool for the job.For instance, graph databases arise as being
“hyper-relational” databases, which makes relations first class citizens next
to records themselves rather than doing away with them altogether.

Many companies are investing in data lakes
these days.What is the difference with
a data warehouse?

Much more recent than
data warehouses, the data lake concept became known as part of the big data and
analytics trend.Although both data
warehouses and data lakes are essentially data repositories, there are some
clear differences as listed in the table below.

Data
Warehouse

Data
lake

Data

Structured

Often unstructured

Processing

Schema-on-write

Schema-on-read

Storage

Expensive

Low cost

Transformation

Before entering the DW

Before analysis

Agility

Low

High

Security

Mature

Maturing

Users

Decision makers

Data Scientists

A key distinguishing property of a data lake is that it stores raw data
in its native format, which could be structured, unstructured or
semi-structured.This makes data lakes
fit for more exotic and ‘bulk’ data types that we generally do not find in data
warehouses, such as social media feeds, clickstreams, server logs, sensor data,
etc. A data lake collects data emanating
from operational sources ‘as is’, often without knowing upfront which analyses
will be performed on it, or even whether the data will ever be involved in
analysis at all.For this reason, either
no or only very limited transformations (formatting, cleansing, …) are
performed on the data before it enters the data lake.Consequently, when the data is tapped from
the data lake to be analyzed, quite a bit of processing will typically be
required before it is fit for analysis.The data schema definitions are only determined when the data is read
(schema-on-read) instead of when the data is loaded (schema-on-write) as is the
case for a data warehouse.Storage costs
for data lakes are also relatively low because most of the implementations are
open-source solutions that can be easily installed on low-cost commodity
hardware.Since a data warehouse assumes
a predefined structure, it is less agile compared to a data lake which has no
structure.Also, data warehouses have
been around for quite some time already, which automatically implies that their
security facilities are more mature.Finally, in terms of users, a data warehouse is targeted towards
decision makers at middle and top management level, whereas a data lake requires
a data scientist, which is a more specialized profile in terms of data handling
and analysis.

How do Object Oriented DBMSs (OODBMSs) compare
against Relational Databases (RDBMSs)? Why haven’t they been that successful?

OODBMSs offer several advantages.First, they store complex objects and
relationships in a transparent way.The
identity based approach allows for improved performance when performing complex
queries involving multiple interrelated objects, avoiding expensive joins.By using the same data model as the
programming language to develop database applications, the impedance mismatch
problem is no longer an issue.In
addition, developers can be more productive, as they are confronted with only a
single language and data model.

Still, the success of OODBMSs has been limited
to niche applications characterized by complex, nested data structures where an
identity-based, instead of a value-based, method of working pays off.An example is the processing of scientific data sets by CERN in
Switzerland, where data access
follows predictable
patterns.The widespread use and performance of RDBMSs, however, proved hard to
displace: the (ad-hoc) query formulation and optimization procedures of OODBMSs
are often inferior to relational databases, which all adopt SQL as their
primary database language combined with a powerful query optimizer.When compared to RDBMSs, OODBMSs are not that
developed in terms of robustness, security,
scalability and fault-tolerance.They
also don’t provide a transparent implementation of
the three-layer database architecture.More specifically, most OODBMSs provide
no support for defining external database models, such as views in the
relational model.

Why are so many companies still struggling with
database legacy instead of investing in newer technologies?

Many firms still struggle with legacy databases
due to historical implementations and limited IT budgets.Hence, knowing the basic characteristics
thereof is essential to the maintenance of the corresponding database
applications and the potential migration to modern DBMSs.Second, understanding the basics of these
legacy models will contribute to better understanding the semantical richness
of newer database technologies.Finally,
the concept of procedural DML and navigational access originally introduced by
these legacy models has also been adopted by more recent databases such as
OODBMSs.

This book covers a few decades of evolutions in
the database world. Which developments or patterns struck you as the most
remarkable in the long run ?

The perseverance of SQL is really striking. It survived object-oriented databases and XML
databases, and after a while, many of its constructs were even incorporated in
these paradigms. More recently, there is
considerable irony (and beauty) in the fact that one of the main tendencies of
NoSQL development efforts is to mimic the convenient data retrieval and query
facilities of, well, SQL-based databases. Solutions for ‘SQL on top of NoSQL’ are all
over the place. And let’s not forget: SQL still remains a beautiful
language.

Is there any additional material provided
(e.g., for instructors or students)?

Wednesday, January 24, 2018

I published an article today on my Linkedin profile Using Artificial Intelligence to Calculate Characteristic Town Index
In this article, I described a milestone project that we recently accomplished and how it made me instantly famous in the field.The follow is the content of the Linkedin article.
Characteristic town development is a hot topic in China. I am extremely excited to announce that using artificial intelligence we have successfully developed Characteristic Town Index for 3,700 towns in China. Based on the index, we are able to identify the most promising towns objectively and efficiently for development. This is the first time that artificial intelligence is used in characteristic town evaluation. It is a cross-team effort under my leadership as the CTO of Hangzhou Jingli Company.
I published an article, Use Big Data and Artificial Intelligence to Rationally Evaluate the Characteristic Town, on influential "Economic Information Daily" (in Chinese) on January 10, 2018 (Please use Google Translate if you are interested). It quickly becomes one of the most cited articles on the Internet in China in this field. On one website alone, the article has been read nearly 52 thousand times. Websites hosting the article consistently ranked number one by Google and Baidu on keywords "characteristic town assessment artificial intelligence" (in Chinese). After reading the article, a government official of a major city in China says, "Dr. Zhou's article should be read carefully and thoroughly. The points raised are thought-provoking". The article's main point is that data-driven artificial intelligence models have advantages over experience-based expert systems. The following are some highlights:

Expert systems are subjective, rigid and static. It is hard to adjust expert systems based on the discrepancies between their outputs and the realities. The feedback mechanism is lacking.

Data, machine learning models and applications should form closed loops. The data are used to train models. Models' outputs provide decision support for applications. The data are updated and enhanced when models are applied in real world, e.g., finding better target variables. These closed loops allow models to be improved iteratively.