The Knowledge Graph as the Default Data Model for Machine Learning

Tracking #: 465-1445

Authors:

Responsible editor:

Michel Dumontier

Submission Type:

Position Paper

Abstract:

In modern machine learning, raw data is the preferred input for our models. Where a decade ago data scientists were still engineering features, manually picking out the details they thought salient, they now prefer the data in their raw form. As long as we can assume that all relevant and irrelevant information is present in the input data, we can design deep models that build up intermediate representations to sift out relevant features. However, these models are often domain specific and tailored to the task at hand, and therefore unsuited for learning on heterogeneous knowledge: information of different types and from different domains. If we can develop methods that operate on this form of knowledge, we can dispense with a great deal of ad-hoc feature engineering and train deep models end-to-end in many more domains. To accomplish this, we first need a data model capable of expressing heterogeneous knowledge naturally in various domains, in as usable a form as possible, and satisfying as many use cases as possible. In this position paper, we argue that the knowledge graph is a suitable candidate for this data model. This paper describes current research and discusses some of the promises and challenges of this approach.

Date of Decision:

Decision:

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for
second round reviews):

My review has changed slightly after the second reading.

Reasons to accept:

The idea to use knowledge graphs as a data model for machine learning is compelling. It seems to be a very useful approach for certain machine learning problems and especially for integrating heterogeneous knowledge.
The authors describe the benefits of this idea quite well, using clear examples, and these benefits are quite attractive because they avoid several problems with manual feature engineering.

Reasons to reject:

The title might have been too optimistic. The authors do acknowledge that using knowledge graphs as the data model for machine learning cannot work for all domains. They concede, for example, that it would not be practical to represent individual pixels as nodes in the graph (at the end of Section 3.4). I also underestimated the magnitude of the challenge of differently-modelled knowledge (Section 4.4). Failing to recognize that two pieces of information, although modelled differently, may represent the same knowledge can substantially undermine the benefits of using the knowledge graph for learning. These issues lead me to believe that the knowledge graph, as a data model for machine learning, is probably a good addition to the toolkit of a data scientist, but to claim that it should be the default data model is perhaps too strong a statement.

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Comprehensive
Novelty: Clear novelty
Data availability: With exceptions that are admissible according to the data availability guidelines, all used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for
second round reviews):

The authors have adequately addressed this reviewer's comments, except one: the title of the manuscript is still "The Knowledge Graph as the Default Data Model for Machine Learning". In their response, the authors agree that this is not sufficiently precise and should be restated to refer to machine learning with "heterogeneous knowledge". I would like the authors to change the title to more accurately reflect the position they outline in the manuscript.

Reasons to accept:

Good, topical, and somewhat bold position paper suitable for this issue.

Reasons to reject:

The title is too broad and does not accurately reflect the position in the paper.

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for
second round reviews):

The paper introduces the idea of using knowledge graphs as a default model for representing heterogeneous data, which would allow the design of end-to-end machine learning pipelines.

Reasons to accept:

Some of the overselling points of the original version of this paper have been toned down, which is good. The comparison with XML and relational databases is appropriate, and clarifies the main message.

Most of my original points of critique (i.e., OWA, heterogeneous modeling etc.) have been addressed.

Reasons to reject:

Some points still need a more thorough discussion (see further comments).

Further comments:

In order to round off the picture, I would like to see a bit more discussion on when to use a knowledge graph and when not to, maybe as part of the conclusion section. These pieces are scattered (e.g., at the end of 3.4), but for researchers trying to make sense of this paper, it would be good to see a matrix with aspects of the data they wish to analyse (e.g., multiple sources, larger text literals, mixed media, streaming, time-indexed, etc.) and whether a knowledge graph is suitable for those aspects or not.

For some of the problems, I have my doubts that simply using a deep neural net will solve the issues. For example, with data following different modeling paradigms, some data will follow one paradigm while other data will follow the other. It might be hard for a learning machine to identify the correspondence if there is no significant overlap here. Consider the example where a fraction of the dataset uses foaf:based_near, while another uses dbo:location. Without a significant overlap of pairs of instances that use *both* properties simultaneously (also indirectly, by interlinking instances in both datasets), it will be difficult to learn that they refer to the same property. For the sake of correctness, I would expect a more thorough discussion of the limitations w.r.t. the challenges. Here, it might make sense to distinguish what current approaches such as RDF2vec are already capable of doing, what they might be extended to do, and what the hard challenges are for which no straightforward solution exists.
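To make the overlap concern concrete, here is a minimal sketch (plain Python, with entirely hypothetical toy triples): a learner's evidence that two properties express the same relation can be approximated by how many subjects use both. When that overlap is near zero, there is little signal from which any embedding method could infer the correspondence.

```python
# Hypothetical toy data: two small datasets expressing location with
# different properties, with only one subject appearing in both.
triples = [
    ("ex:alice", "foaf:based_near", "ex:Berlin"),
    ("ex:bob",   "foaf:based_near", "ex:Paris"),
    ("ex:eve",   "foaf:based_near", "ex:Lyon"),
    ("ex:carol", "dbo:location",    "ex:Rome"),
    ("ex:dave",  "dbo:location",    "ex:Oslo"),
    ("ex:eve",   "dbo:location",    "ex:Lyon"),  # the only overlapping subject
]

def property_overlap(triples, p1, p2):
    """Jaccard overlap of the subject sets of two properties."""
    s1 = {s for s, p, _ in triples if p == p1}
    s2 = {s for s, p, _ in triples if p == p2}
    return len(s1 & s2) / len(s1 | s2)

# 1 shared subject (ex:eve) out of 5 distinct subjects -> 0.2
print(property_overlap(triples, "foaf:based_near", "dbo:location"))
```

This is of course a crude proxy, not how RDF2vec or similar approaches measure similarity, but it illustrates the reviewer's point: with no shared or interlinked instances, the overlap is 0 and the equivalence of the two properties is not recoverable from the data alone.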

Overall, the revision is well done. With a little bit of discussion added on top, I would like to see it accepted.