Google created the framework for MapReduce – MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Mahout ML is mostly restricted to set theory. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.

2012 – Apache SPARK

Apache SPARK Introduced to deal with Very Large Data and IN-Memorry Processing. It is an architecture for cluster computing – that increases the computing compared with slow MapReduce by 100 times and also better solves parallelization of the algorithm. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab

GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph databasE. GraphX can be viewed as being the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.

Machine learning – the ability of computers to understand data, manage results and infer insights from uncertain information – is the force behind many recent revolutions in computing. Email spam filters, smartphone personal assistants and self-driving vehicles are all based on research advances in machine learning. Unfortunately, even as the demand for these capabilities is accelerating, every new application requires a Herculean effort. Teams of hard-to-find experts must build expensive, custom tools that are often painfully slow and can perform unpredictably against large, complex data sets.

The Probabilistic Programming for Advancing Machine Learning (PPAML) program aims to address these challenges. Probabilistic programming is a new programming paradigm for managing uncertain information.

Ingine Responded to DARPA’s RFQ with a detailed architecture based on Barry’s innovation in the algorithm that basically solves the above ask to some extent. Importantly it solve Probabilistic Ontology for Knowledge Extraction from Uncertainty and Semantic Reasoning.

In this era of big data, the rates at which these data sets grow continue to accelerate. The ability to manage and analyze the largest data sets is always severely taxed. The most challenging of these data sets are those containing relational or network data. The HIVE challenge is envisioned to be an annual challenge that will advance the state of the art in graph analytics on extremely large data sets. The primary focus of the challenges will be on the expansion and acceleration of graph analytic algorithms through improvements to algorithms and their implementations, and especially importantly, through special purpose hardware such as distributed and grid computers, and GPUs. Potential approaches to accelerate graph analytic algorithms include such methods as massively parallel computation, improvements to memory utilization, more efficient communications, and optimized data processing units.

Despite emergence of Big Data, Machine Learning, Graphing Techniques and Semantic Web. The convergence is still far fleeting. Especially Semantic / Cognitive / Knowledge Extraction techniques are very poorly defined and there does not exists a framework approach to knowledge engineering leading into Machine Learning and automation in Knowledge Extraction, Representation, Learning and Reasoning. This is what Q-UEL and HDN solves at the algorithmic level.

Given the challenge of analyzing against the large data sets both structured (EHR data) and unstructured data; the emerging Healthcare analytics are around below discussed methods d (multivariate regression), e (neural-net) and f (multivariate probabilistic inference); Ingine is unique in the Hyperbolic Dirac Net proposition for probabilistic inference.

The basic premise in engineering The BioIngine.com™ is in acknowledging the fact that in solving knowledge extraction from the large data sets (both structured and unstructured), one is confronted by very large data sets riddled with high-dimensionality and uncertainty.

Generally in solving insights from the large data sets the order in complexity is scaled as follows.

a) Insights around :- “what”

For large data sets, descriptive statistics are adequate to extract a “what” perspective. Descriptive statistics generally delivers statistical summary of the ecosystem and the probabilistic distribution.

c) Bivariate Problem :- “what”

From above link :- In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph G = (V,E), where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Unlike other clustering algorithms this does not require choosing the number of clusters k in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.

From above link. :- Correlation is a bivariate analysis that measures the strengths of association between two variables. In statistics, the value of the correlation coefficient varies between +1 and -1. When the value of the correlation coefficient lies around ± 1, then it is said to be a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker. Usually, in statistics, we measure three types of correlations: Pearson correlation, Kendall rank correlation and Spearman correlation

d) Multivariate Analysis (Complexity increases) :- “what”

§ Multiple regression (considering multiple univariate to analyze the effect of the independent variables on the outcomes)

The above discussed challenges of analyzing multivariate pushes us into techniques such as Neural Net; which is the next level to Multivariate Regression Statistical Approach…. where multiple regression models are feeding into the next level of clusters, again an array of multiple regression models.

The above Neural Net method still remains inadequate in depicting “how” probably the human mind is operates. In discerning the health ecosystem for diagnostic purposes, for which “how”, “why” and “when” interrogatives becomes imperative to arrive at accurate diagnosis and target outcomes effectively. Its learning is “smudged out”. A little more precisely put: it is hard to interrogate a Neural Net because it is far from easy to see what are the weights mixed up in different pooled contributions, or where they come from.

“So we enter Probabilistic Computations which is as such Combinatorial Explosion Problem”.

All the above are still discussing the “what” aspect. When the complexity increases the notion of independent and dependent variables become non-deterministic, since it is difficult to establish given the interactions, potentially including cyclic paths of influence in a network of interactions, amongst the variables. A very simple example in just a simple case is that obesity causes diabetes, but the also converse is true, and we may also suspect that obesity causes type 2 diabetes cause obesity. In such situation what is best as “subject” and what is best as “object” becomes difficult to establish. Existing inference network methods typically assume that the world can be represented by a Directional Acyclic Graph, more like a tree, but the real world is more complex than that that: metabolism, neural pathways, road maps, subway maps, concept maps, are not unidirectional, and they are more interactive, with cyclic routes. Furthermore, discovering the “how” aspect becomes important in the diagnosis of the episodes and to establish correct pathways, while also extracting the severe cases (chronic cases which is a multivariate problem). Indeterminism also creates an ontology that can be probabilistic, not crisp.

Note: From Healthcare Analytics perspective most Accountable Care Organization (ACO) analytics addresses the above based on the PQRS clinical factors, which are all quantitative. Barely useful for advancing the ACO into solving performance driven or value driven outcomes most of which are qualitative.

To conduct HDN Inference, bear in mind that getting all the combinations of factors by data mining is “ combinatorial explosion ” problem, which lies behind the difficulty of Big Data as high dimensional data.

It applies in any kind of data mining, though it is most clearly apparent when mining structured data, a kind of spreadsheet with many columns, each of which are our different dimensions. In considering combinations of demographic and clinical factors, say A, B, C, D, E.., we ideally have to count the number of combinations (A), (A,B) (A, C) …(B, C, E)…and so on. Though sometimes assumptions can be made, you cannot always deduce a combination with many factors from those with fewer, nor vice versa. In the case of the number N of factors A,B,C,D,E,… etc. the answer is that there are 2N-1 possible combinations. So data with 100 columns as factors would imply about

1,000,000,000,000,000,000,000,000,000,000

combinations, each of which we want to observe several times and so count them, to obtain probabilities. To find what we need without knowing what exactly it is in advance, distinguishes unsupervised data mining from statistics in which traditionally we test a hunch, a hypothesis. But worse still, in our spreadsheet the A, B, C, D, E are really to be seen as column headings with say about n possible different values in the columns below them, and so roughly we are speaking of potentially needing to count not just, say, males and females but each of nN different kinds of patient or thing. This results in truly astronomic number of different things, each to observe many time. If merely n=10, then nN is

There is a further implied difficulty, which in a strange way lifts much the above challenge from the shoulders of researchers and of their computers. In most cases of the above, must of the things we are counting contain many of the factors A,B,C,D, E..etc. Such concurrences of so many things is typically rare, so many of the things we would like to count will never be seen at all, and most of the rest will just be seen 1, 2, or 3 times. Indeed, any reasonably rich patient record with lots of data will probably be unique on this planet. However, most approaches are unable to make proper use of that sparse data, since it seems that it would need to be weighted and taken into account in the balance of evidence according to the information it contains, and it is not evident how. The zeta approach tells us how to do that. In short, the real curse of high dimensionality is in practice not that our computers lack sufficient memory to hold all the different probabilities, but that this is also true for the universe: even in principle we do not have all the data to work to determine probabilities by counting with even if we could count and use them. Note that probabilities of things that are never observed are, in the usual interpretation of zeta theory and of Q-UEL, assumed to have probability 1. In a purely multiplicative inference net, multiplying by probability 1 will have no effect. Information I = –log(P) for P = 1 means that information I = 0. Most statements of knowledge are, as philosopher Karl Popper argued, assertions awaiting refutation.

Nonetheless the general approach in the fields of semantics, knowledge representation, and reasoning from it is to gather all the knowledge that can be got into a kind of vast and ever growing encyclopedia.

In The BioIngine.com™ the native data sets have been transformed into Semantic Lake or Knowledge Representation Store (KRS) based on Q-UEL Notational Language such that they are now amenable to HDN based Inferences. Where possible, probabilities are assigned, if not, the default probabilities are again 1.

[healthcare cognitive computing platform]

Conquering Uncertainties Creating Infinite Possibilities

(Possible application :- Achieving Algorithm Driven ACO)

Introduction

The QEXL Approach is a Systems Thinking driven technique that has been designed with the intension of developing “Go To Market”solutions for Healthcare Big Data applications requiring integration between Payor, Provider, Health Management (Hospitals), Pharma etc; where the systemic complexities tethering on the “edge of chaos” pose enormous challenges in achieving interoperability owing to existence of plethora of healthcare system integration standards and management of the unstructured data in addition to structured data ingested from diverse sources. Additionally, The QEXL Approach targets for the creation of Tacit Knowledge Sets by inductive techniques and probabilistic inference from the diverse sets of data characterized by volume, velocity and variability. In fact, The QEXL Approach facilitates algorithmic driven Proactive Public Health Management, while rendering business models achieving Accountable Care Organization most effective.

The QEXL Approachis an integrative multivariate declarative cognitive architecture proposition to develop Probabilistic Ontology driven Big Data applications creating interoperability among Healthcare systems. Where, it is imperative to develop architecture that enable systemic capabilities such as Evidence Based Medicine, Pharmacognomics, biologics etc; while also creating opportunities for studies such as Complex Adaptive System (CAS). Such approach is vital to develop ecosystem as an response to mitigate the Healthcare systemic complexities. Especially CAS studies makes it possible to integrate both macro aspects (such as epidemiology) related to Efficient Heathcare Management Outcomes ; andmicro aspects (such as Evidence Based Medicine and Pharmacogenomics that helps achieve medicine personalization) achieving Efficacy in the Healthcare delivery, to help achieve systemic integrity.In The QEXL Approach QEXL stands for “Quantum Exchange Language”, and Q-UEL is the initial proposed language. The QEXL Consortium embraces Quantal Semantics, Inc; (NC) and Ingine, Inc; (VA), and collaborates with The Dirac Foundation (UK), which has access to Professor Paul Dirac’s unpublished papers. The original consortium grew as a convergence of responses to four stimuli:

The “re-emerging” interest in Artificial Intelligence (AI) as “computational thinking”, e.g. under the American Recovery Act;

The President’s Council of Advisors on Science and Technology December 2010 call for an “XML-like” “Universal Exchange Language” (UEL) for healthcare;

A desire to respond to the emerging Third World Wide Web (Semantic Web) by an initiative based on generalized probability theory – the Thinking Web; and

In the early courses of these efforts, a greater understanding of what Paul Dirac meant in his Nobel Prize dinner speech where he stated that quantum mechanics should be applicable to all aspects of human thought.

The QEXL Approach

The QEXL Approach is developed based on considerable experiences in Expert Systems, linguistic theory, neurocognitive science, quantum mechanics, mathematical and physics-based approaches in Enterprise Architecture, Internet Topology, Filtering Theory, Semantic Web, Knowledge Lifecycle Management, and principles of Cloud Organization and Integration. The idea for well-formed probabilistic programming reasoning language is simple. Importantly, also, the more essential features of it for reasoning and prediction are correspondingly simple such that the programmers are not necessarily humans, but structured and unstructured (text-analytic) “data mining” software robots. We have constructed a research prototype Inference Engine (IE) network (and more generally a program) that “simply” represents a basic Dirac notation and algebra compiler, with the caveat that it extends to Clifford-Dirac algebra; notably a Lorentz rotation of the imaginary number i (such that ii = -1) to the hyperbolic imaginary number h (such that hh = +1) corresponding to Dirac’s s, and gtime or g5) is applied.

[Outside the work of Dr. Barry Robson, this approach has not been tried in the inference and AI fields, with one highly suggestive exception: since the late 1990s it has occasionally been used in the neural network field by T. Nitta and others to solve the XOR problem in a single “neuron” and to reduce the number of “neurons” generally. Also suggestively, in particle physics it may be seen as a generalization of the Wick rotation time → i x time used by Richard Feynman and others to render wave mechanics classical. It retains the mathematical machinery and philosophy of Schrödinger’s wave mechanics but, instead of probability amplitudes as wave amplitudes, it yields classical but complex probability amplitudes encoding two directions of effect: “A acts on B, and B differently on A”. It maps to natural language where words relate to various types of real and imaginary scalar, vector, and matrix quantities. Dirac’s becomes the XML-like semantic triple . ]

In an endeavor of this kind the partitions-of-work are inevitably artificial; it is important that this does not impede the integrity of optimal solutions. The most important aspect in The QEXL Approach is, in essence where architecturally Probabilistic Inference (PI) and Data Architecture for the Inference Engine (IE) is designed to be cooperative; software robots are created while PI and IE interact; and the inference knowledge gained by the PI and IE provide rules for solvers (robots) to self compile and conduct queries etc. This is therefore the grandeur of the scheme: This approach will have facilitated programming by nice compilers so that writing the inference network is easy, but it is not required to write the inference net as input code to compile, with the exception of reusable metarules as Dirac expressions with variables to process other rules by categorical and higher order logic. The robots are designed and programmed to do the remaining coding required to perform as solvers. So the notion of a compiler disappears under the hood. The robots are provided with well-formed instructions as well formed queries. Once inferences are formed, different “what – if” questions can be asked. Given that probability or that being the case, what is the chance of… and so on. It is as if having acquired knowledge, Phenomenon Of Interest (POI) is in a better state to explore what it means. Hyperbolic Dirac Networks (HDNs) are inference networks capable of overcoming the limitations imposed by Bayesian Nets (and statistics) and creating generative models richly expressing the “Phenomenon Of Interest” (POI) by the action of expressions containing binding variables. This may be thought of as an Expert System but analogous to Prolog data and Prolog programs that act upon the data, albeit here a “probabilistic Prolog”. Upfront should be stated the advantages over Bayes Nets as a commonly used inference method, but rather than compete with such methods the approach may be regarded as extending them. Indeed a Bayes Net as a static directed acyclic conditional probability graph is a subset of the Dirac Net as a static or dynamic general bidirectional graph with generalized logic and relationship operators, i.e. empowered by the mathematical machinery of Dirac’s quantum mechanics.

Impact Of The QEXL Approach

Impact of The QEXL Approach creating Probabilistic Ontology based on Clifford-Dirac algebra has immense opportunity in advancing the architecture to tackle large looming problems involving System of Systems; in which vast uncertain information emerge. Generally, as such systems are designed and developed employing Cartesian methods; such systems do not offer viable opportunity to deal with vast uncertain information when ridden with complexity. Especially when the context complexity poses multiple need for ontologies, and such a system inherently defies Cartesian methods. The QEXL Approach develops into an ecosystem response while it overcomes the Cartesian dilemma (link to another example for Cartesian Dilemma) and allows for generative models to emerge richly expressing the POI. The models generatively develops such that the POI behavior abstracted sufficiently lend for the IE and the Solvers to varieties of studies based on evidence and also allows for developing systemic studies pertaining to Complex Adaptive System and Complex Generative Systems afflicted by multiple cognitive challenges. Particularly, The QEXL Approach has potential to address complex challenges such as evidence based medicine (EBM); a mission that DoD’s Military Health System envisions while it modernizes its Electronics Health Record System – Veterans Health Information Systems and Technology Architecture (VistA). Vast potential also exists in addressing Veteran Administration’s (VA) Million Veteran Program (MVP); an effort by VA to consolidate genetic, military exposure, health, and lifestyle information together in one single database. By identifying gene-health connections, the program could consequentially advance disease screening, diagnosis, and prognosis and point the way toward more effective, personalized therapies.

Although The QEXL Approach is currently targeted to the healthcare and pharmaceutical domains where recognition of uncertainty is vital in observations, measurements and predictions, and probabilities underlying a variety of medical metrics, the scope of application is much more general. The QEXL Approach is to create a generic multivariate architecture for complex system characterized by Probabilistic Ontology that employing generative order will model “POI” facilitating creation of “communities of interest” by self-regulation in diverse domains of interest, requiring integrative of disciplines to create complex studies. The metaphor of “Cambrian Explosion” may aptly represent the enormity of the immense possibilities in advancing studies that tackle large systemic concerns riddled with uncertain information and random events that The QEXL Approach can stimulate.

It is interesting to note that in the genesis of evolving various NoSQL solutions based on Hadoop few insights have emerged related to need for designing the components recognizing their cooperative existence.

The Goal of The QEXL Approach: Is all about Contextualization

The goal employing The QEXL Approach is to enable the realization of cognitive multivariate architecture for Probabilistic Ontology, advancing the Probabilistic Ontology based architecture for context specific application; such as Healthcare. Specifically, The QEXL Approach will develop PI that helps in the creation of generative models that depicts the systemic behavior of the POI riddled with vast uncertain information. Generally, uncertainty in the vast information is introduced by the System of Systems complexity that is required to resolve multiples of ontologies, standards etc., these further introduce cognitive challenges. The further goal of The QEXL Approach is to overcome such challenges, by addressing interoperability at all levels, including the ability to communicate data and knowledge in a way that recognizes uncertainty in the world, so that automated PI and decision-making is possible. The aim is semiotic portability, i.e. the management of signs and symbols that deals especially with their function and interactions in both artificially constructed and natural languages. Existing systems for managing semantics and language are mostly systems of symbolic, not quantitative manipulation, with the primary exception of BayesOWL. RQSA, or Robson Quantitative Semantic Algebra by its author Dr. Barry Robson, to distinguish it from other analogous systems, underlies Q-UEL. It is the development of (a) details of particular aspect of Dirac’s notation and algebra that is found to be of practical importance in generalizing and correctly normalizing Bayes Nets according to Bayes Theorem (i.e. controlling coherence, which ironically Bayes Nets usually neglect, as they are unidirectional), (b) merged with the treatment of probabilities and information based on finite data using the Riemann Zeta function that he has employed for many years in bioinformatics and data mining (http://en.wikipedia.org/wiki/GOR_method), and (c) the extension to more flavors of hyperbolic imaginary number to encode intrinsic “dimensions of meaning” under a revised Rojet’s thesaurus system.

The Layers of the Architecture Created by The QEXL Approach

The QEXL Layered View

Layer 1- Contextualization: Planning, Designing driven by Theories

A. Probabilistic Ontology creating Inferencing leading into Evidence Based Medicine