Tracking #: 1862-3075

Recent machine learning algorithms have shown considerable success in various computer vision tasks, including semantic segmentation. However, they seldom perform without error. A key aspect of discovering why the algorithm has failed is usually the task of the human who, using domain knowledge and contextual information, can discover systematic shortcomings in either the data or the algorithm. In this paper, we propose a symbolic-based technique, called a semantic referee, which is both able to explain the errors emerging from the machine learning framework and suggest corrections. The semantic referee relies on a spatial reasoning method applied to ontological knowledge in order to retrieve the features of the errors in terms of their spatial relations with their environment. The symbolic explanation of the errors is then reported to the learning algorithm to learn from its mistakes and consequently improve the performance. In this paper, the proposed method of interaction between a neural network classifier and a semantic referee shows how to improve the performance of semantic segmentation for satellite imagery data.

Thank you for this interesting submission. Let me start by summarizing the work to make sure I understand correctly: this research introduces a new deep learning system for image segmentation over satellite data, where a subset of the available shadow and elevation information is provided in additional input channels to a CAE. A “semantic referee” is developed as a system external to the CAE. It uses spatial relations and semantics defined in OntoCity. Relations specific to Stockholm, Sweden (the area surveyed in the satellite photo used in this manuscript) are added and used. For regions that were misclassified by the CAE, spatial relations between the misclassified area and correctly classified ones are identified, and the most commonly occurring relations are pushed to the semantic referee to get an “explanation”. The explanation inevitably leads to adding data to a 4th (shadow) and 5th (elevation) channel of the training input during subsequent training “passes” (are the passes epochs?).

I think it is a very clever idea to have a semantic referee that would be able to provide meaningful feedback to any deep neural network (DNN): essentially, to provide a new set of data and information that can augment the data and improve DNN performance. But there is something unconvincing about this work. For this satellite data, the use of OntoCity is a completely reasonable choice, and although there may be stronger DNN models for satellite image segmentation, the point of the article is to demonstrate the utility of integrating semantics into deep learning; the paper is not seeking state-of-the-art performance.

My three major concerns follow. I am concerned that they may be fatal, but these are the concerns of only one reviewer. :)

1. First, I must push back on the notion that the semantic referee offers an explanation of misclassifications. I do agree that the referee gives us a nice description of *what* has been misclassified (“all the entities that are at least in one geos:rcc8ec relation with the region type oc:Building”), leading to a query that returns oc:shadow. I can’t help but wonder if the referee can just jump in at this point and help the DNN, in the sense that it can tell the user “ignore what the DNN says when it comes to these patches that it has low confidence about, let’s do spatial reasoning instead in this local area, and I can tell you that this segment is a shadow”.

When we think of an explanation as to why the DNN was unable to classify this area correctly… it cannot provide one. We leave it up to the analyst (the authors) to “explain that the classifier is confused due to the similarity between the color of the shadow and the color of the water”. The notion that just finding a “shadow” is an explanation of a misclassification seems to be a weak notion of explanation. And I suppose that is okay, but the manuscript has to be clear that the system is giving an explanation of what the error is, not why or what caused the error; the latter would be much more helpful for DNN debugging and evaluation.

2. I don’t think that the example presented convinces the reader that the semantic referee is truly helpful. I say this because the only features added to the data are a subset of the shadow pixel map of the image and a subset of the elevation data associated with each image. The full shadow map can be inferred using existing algorithms, and the elevation data is provided with the dataset. Finding that classification accuracy is improved by augmenting the data with an elevation channel is not that surprising; in fact, it is expected, and the semantic referee does not tell the user anything about elevation. This feature was added because of the analyst’s insight that shadows are caused by elevation. Moreover, one could simply compute a shadow map, add it as another feature, and expect improved performance. I agree that the semantic referee indicated to the user that shadow information would be relevant, but I also see the argument that shadows are an important feature to add (or perhaps these shadow pixels should be excluded from classification) even without a referee telling you so. In defense of this approach, the referee tells you which subset of the shadow mask should be added, and perhaps that is useful, but it would be interesting to see whether the partial mask yields better performance than simply providing the entire shadow mask as a feature.
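To make the comparison I have in mind concrete, here is a minimal sketch (NumPy; all array names are mine, not the paper's) of the two input variants that should be compared head to head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the inputs discussed above (all names hypothetical).
rgb = rng.random((64, 64, 3))                                   # 3-channel image
full_shadow = (rng.random((64, 64)) > 0.8).astype(np.float32)   # full shadow mask

# The referee only flags shadows near misclassified regions; model that here
# as a spatial subset of the full mask (here: the top half of the image).
referee_flagged = np.zeros((64, 64), dtype=np.float32)
referee_flagged[:32, :] = full_shadow[:32, :]

# Variant A: augment with the *full* shadow mask as a 4th channel.
input_full = np.concatenate([rgb, full_shadow[..., None]], axis=-1)

# Variant B: augment with only the referee-selected subset.
input_partial = np.concatenate([rgb, referee_flagged[..., None]], axis=-1)

assert input_full.shape == input_partial.shape == (64, 64, 4)
```

Training one model on `input_full` and one on `input_partial` would tell us whether the referee's selection actually adds anything beyond a stock shadow detector.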

Another reason I don’t find the semantic reasoner convincing is that, for this one satellite image, shadows were the only interesting concept identified. It would have been much more convincing if a number of concepts were recovered, in particular concepts that would be counterintuitive or surprising to a computer vision analyst using such a system. But to just make a shadow mask… is this really just equivalent to building a pixel map corresponding to misclassified regions with high spatial relation agreement?

3. Lastly, experiments were done on only one satellite image over Stockholm. The concepts included are also specific to this one geographic area, including concepts like “buildings do not intersect with railways”. But perhaps these concepts are not true for all regions. I think this system should be evaluated over satellite data from a variety of regions, especially in the hopes of having the semantic referee find a variety of different concepts and new types of feedback that can be provided to the DNN. Because the evaluation is only done over this one satellite image, it seems just like a DNN system that adds a shadow and an elevation channel; things that could have been suggested by an experienced computer vision analyst but happen here to be suggested by a semantic referee.

This manuscript does represent a complete and carefully implemented piece of research. I do not see anything technically wrong with what the authors attempted to do, nor do I think the concept of a semantic referee is a bad one at all (in fact I love the idea of bringing a reasoning engine into a deep learning system to provide explanation and system feedback). I think my first major concern can be addressed with repositioning, and my 2nd and 3rd concerns could be addressed by running the system on additional datasets, ideally from geospatial regions that vary considerably in shape and structure. If the authors want to claim that providing partial shadow and elevation data is a good thing, then the work should also show that the performance is better under the partial data setting than when simply providing complete shadow and elevation channels to the DNN.

Review #2

Anonymous submitted on 15/May/2018

Suggestion: Major Revision

Review Comment:

(1) originality,

The paper presents original work in image segmentation combining a spatial reasoner (RCC-8 based) with a deep neural network (CAE). The reasoner assists by augmenting the classification data with an additional concept (shadow) coming from domain knowledge. The paper claims to present the first system applied to image segmentation that explains the errors of the classifier using a spatial reasoner. The authors also claim to be the first to close the loop between the reasoner and the classification.

The paper presents an interesting approach to combining spatial reasoners and deep neural networks and applies the system to image segmentation.

(2) significance of the results

My main problem with the paper is seeing the results as significant. The approach is promising, but in the end it is used here for shadow detection. The full power of adding knowledge is therefore not investigated, and the results are more at the level of a proof of concept. The paper needs to improve and provide stronger evidence for the power and performance of the proposed system.

(3) quality of writing

The paper is easy to follow and understand. There are a number of minor issues, but the overall style is good.

(4) General comments

a) Explanation
There is one general issue I keep struggling with while reading the paper: does the system explain the error, or does it describe frequent structures of the domain triggered by a misclassification? There is no guarantee that frequency actually captures the nature of the mistake. In my view, this needs to be controlled for general frequency.

The authors use the word "explanations" a lot and mention the need for natural language expression. However, this is not happening here. The system outputs "shadow" but does not output "this is a shadow because XY". I would refrain from calling this an explanation. I find the term "referee" more fitting. Another term that comes to mind is "data augmentation".

Related to this, there is a lot of discussion about explanation and about explanations in natural language. The authors suggest that their method is unique (or belongs to a limited set of work) because of the post-hoc explanation. I disagree with that claim for two reasons: 1) there is a lot of recent work on explainable systems in exactly that form, and 2) the system presented in this paper does not offer explanations in text format. Yet in the Related Work section this is discussed as a missing point in existing systems. So please be more clear about how the system offers explanations, and/or use more precise terms.

b) Related Work. I think the discussion of existing methods should be expanded. In total, three related works are discussed in detail. For two of them, the claim is that they are limited in terms of symbolic representation (page 3, right upper column). The explanation of these limitations is cursory. How does the reasoning presented in this paper go beyond the existing state of the art?

Authors should discuss the relationship of this approach with explanation-based learning (EBL).

c) Details are missing from Algorithm 1. What is the computational complexity? How do you compute regions from labeled pixels? How is getRegionType implemented?

What does "in the vicinity" mean? (for computing spatial relations between p \in P,r \in R)? Is there some hyperparameter here?

On page 8 you say that all spatial relations for any pair (p, r) are calculated, but then you switch to the singular.

Is Algorithm 1 run on each image or over all images? In other words, is P the set of misclassified regions across all images, or always for a single image?

How do you know that the pair is representative of the error and not just a feature common to all regions, including relations between misclassified regions and between correctly classified ones? Why is frequency an indicator of, or a reason for, misclassification? Couldn't it be that these relations are just generally frequent? Don't you need to control for general frequency?
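To be concrete about what I mean by controlling for general frequency, here is a small sketch (Python; the relation names and counts are invented for illustration): compare a relation's frequency among misclassified regions against its base rate over all regions, and only treat high-lift relations as candidate explanations.

```python
from collections import Counter

# Hypothetical relation counts (invented numbers, for illustration only).
relations_all = Counter({"ec_building": 500, "po_road": 450, "dc_water": 50})
relations_misclassified = Counter({"ec_building": 40, "po_road": 35, "dc_water": 25})

n_all = sum(relations_all.values())          # 1000 relation occurrences overall
n_mis = sum(relations_misclassified.values())  # 100 among misclassified regions

def lift(rel):
    """Ratio of a relation's frequency among misclassified regions
    to its base rate over all regions; lift >> 1 suggests the relation
    is genuinely associated with errors, not just common everywhere."""
    p_mis = relations_misclassified[rel] / n_mis
    p_all = relations_all[rel] / n_all
    return p_mis / p_all

# "dc_water" is rare overall but over-represented among errors -> high lift,
# whereas "ec_building" is frequent everywhere -> lift near 1.
ranked = sorted(relations_all, key=lift, reverse=True)
assert ranked[0] == "dc_water"
```

Ranking by raw frequency instead of lift would have put "ec_building" first, which is exactly the failure mode I am worried about.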

I assume R is the set of all regions (so the set of correctly classified and misclassified regions). Is that correct?

d) Section 4

What is the impact of the choice to use top 20 regions, 5 classes and 100 regions?

What is the accuracy improvement when the reasoner is directly applied for correction in round 1?
What are the dynamics of accuracy over the three rounds?

How stable are these results numerically? You need to do cross-validation and report variances.
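Concretely, I would expect reporting along these lines (a schematic Python sketch; the per-fold accuracies are placeholders, not results from the paper):

```python
import statistics

# Placeholder per-fold accuracies from a hypothetical 5-fold cross-validation.
fold_accuracies = [0.81, 0.79, 0.84, 0.80, 0.82]

mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)

# Report as mean +/- standard deviation so readers can judge stability.
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

A single number per round, as reported now, gives no sense of whether a 1-2% improvement is within run-to-run noise.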

Why do all this reasoning work if, in the end, the output is shadow or not? That seems rather complicated for what is achieved. I get that you are trying to show that your system is more general, but in order to make that point you have to show that the system is more general!

MINOR
page 2) "specific case of on" -> "specific case of"
page 2) "Secondly, our model focuses on the misclassifications and uses ontological knowledge, with concepts and their spatial relations, together with a geometrical processing to explain them" - this sentence is confusing. best to split it up or reduce the number of relative clauses. Also it is not clear how this is different from the first contribution claim.
page 3) "Although in the these"

page 3) ideally Figure 1 is on the same page as Section 3.1

page 5) softmax -> \operatorname{softmax}. There are more such issues where mathematical operators are not typeset correctly. This is just an example
page 5) Section 3.5: there is no need for explaining naming conventions (in RDF or SPARQL) in this journal
page 6) "The spacial relation" -> "The spatial relation"
page 8) "Section4" -> "Section 4"
page 9) "special relation" -> "spatial relation"

Review #3

By Michael Cochez submitted on 25/May/2018

Suggestion: Major Revision

Review Comment:

This review was done jointly with Md. Rezaul Karim

===================================================

The paper presents an interesting idea, namely the use of a reasoner to improve the quality of a classification model, perhaps even an initial step towards merging symbolic and data-driven artificial intelligence. While we like this idea, the current work does not sufficiently evaluate the impact of the reasoning component, and hence we recommend a major revision of this paper.
The presentation style of the paper is of a good level, and the text is readable and clear. Besides, the paper is relevant for this special issue.

There are several major issues we currently see with the paper; they are discussed here. Apart from these, there are some minor language mistakes here and there that can be corrected by the authors while carefully reviewing the final version of the manuscript.

One of the main issues with the paper is the limited amount of experiments performed and the parametrization of the model. Also, the choice of a CAE seems rather conservative; other, newer models could at least have been tried, for example those that rely on attention, or stacked convolutional autoencoders with reconstruction probability. Currently only one network is trained, and it is unclear whether the chosen network is even the best option for this use case. Also, one might expect the use of specific image segmentation techniques (specifically, semantic segmentation techniques), which were neither used nor elaborately discussed in the related work section.

A related issue is that the proposed technique does indeed improve the presented model, but it is unclear whether this is due to the initial model not being the most suitable for the task in the first place. The question is hence whether the proposed technique is still able to improve other (better) models as well.

The use of the proposed OntoCity ontology seems a very reasonable one. One aspect which is, however, completely untested is what the influence of this choice is. It might, for example, be that using a different ontology (e.g., the GeoNames ontology, see http://www.geonames.org/ontology/documentation.html) would give a different performance. Similarly, it is unclear what the precise effect is of the spatial constraints introduced in 3.4.1. They seem logical, but it is not at all shown that they improve anything.

The choice of only using data from one flat city area seems a bit strange to us. One would expect that an experiment is also performed on, for example, a rural area. This way, the robustness of the proposed approach could be demonstrated.

It is unclear to us how the data flows back from the reasoner to the network. We understood that you add extra channels for the features, such as shadow. What is unclear is what kind of input you send on these channels. Also, it is unclear whether there is a way to include all information derived by the reasoner. Besides, it is not explained how these channels are provided with data during the test phase.

In section 3.3, the authors mention that they used certain hyperparameter settings, such as the number of layers, filter size, and optimizer, without providing any justification. How did they choose those hyperparameters? Did they perform a grid/random search with cross-validation, or did they choose them arbitrarily? Apart from this, did they use any advanced techniques while training the network, such as batch normalization, or dropout as a means of regularization?
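To illustrate what we mean by a systematic search (a schematic Python sketch; the grid values and the scoring stub are invented for illustration):

```python
import itertools

# Hypothetical hyperparameter grid (values invented for illustration).
grid = {
    "layers": [3, 5],
    "filter_size": [3, 5],
    "dropout": [0.0, 0.5],
}

def validation_score(config):
    """Stub for training and evaluating one configuration on a validation
    set; replaced here by a toy scoring function so the sketch runs."""
    return config["layers"] - config["dropout"]

# Enumerate the full grid and keep the configuration with the best score.
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
best = max(configs, key=validation_score)
assert best["layers"] == 5 and best["dropout"] == 0.0
```

Even a coarse search like this, reported in the paper, would justify the chosen settings far better than no justification at all.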

We suggest you compare your approach with the following baseline: first train a model to predict elevation from the image. Then, augment the input of the segmentation network with this predicted elevation (using an additional channel, as you did). If you are able to beat this baseline with the proposed approach, it is a much stronger indication that the semantic referee does indeed improve the classification performance.
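A schematic of the baseline we are proposing (Python; every name here is hypothetical, and the stage-1 predictor is stubbed out):

```python
import numpy as np

def predict_elevation(image):
    """Hypothetical stage-1 model: predicts an elevation map from the image.
    Stubbed here with a constant map; in the actual baseline this would be a
    regressor trained on the elevation data shipped with the dataset."""
    return np.zeros(image.shape[:2], dtype=np.float32)

def augment(image):
    """Stage 2: stack the predicted elevation onto the image as an additional
    input channel, mirroring how the paper feeds referee-derived channels."""
    elev = predict_elevation(image)
    return np.concatenate([image, elev[..., None]], axis=-1)

image = np.random.default_rng(1).random((64, 64, 3))
assert augment(image).shape == (64, 64, 4)
```

The segmentation network would then be trained on these 4-channel inputs, with no referee in the loop, as the point of comparison.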

The training times for the models are somewhat ambiguously reported. Round 1 gets 72 hours, while rounds 2-4 get 24 hours each. What is unclear is whether rounds 2-4 continue from round 1's results or not. If they do, then the comparison is rather unfair. What you should do instead is create a model without your additions that gets 72 + (x * 24) hours of training time. So, you should compare your model at round 3 with a model which has received 144 hours of training time.
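Spelled out, the compute budget we have in mind:

```python
base_hours = 72        # round 1
extra_rounds = 3       # rounds 2-4
hours_per_round = 24

# Total compute the augmented model receives across all rounds; a fair
# baseline without the referee's additions should get the same budget.
total_hours = base_hours + extra_rounds * hours_per_round
assert total_hours == 144
```

Without matching total training time, any improvement could simply be the effect of longer training.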

You do mention that you use early stopping, but you do not specify how it was parameterized; in particular, how the validation set was created is not mentioned in section 3.2.

Finally, in your conclusion, you state that the richer the ontology, the more meaningful the explanation from the reasoner. While this seems a valid statement, this is not really the question. Rather, the question should be whether the richer the ontology, the more useful the explanation from the reasoner is for the training of the model.

You do report the confusion matrix in Table 2. This is interesting to see, but you should report the same table for the classification after the reasoner has done its job. When performing (many) more experiments, or when doing experiments with more classes, reporting the RMSE in the paper would be sufficient, but the complete results should still be reported in an appendix or stored in a permanent repository.

The source code for your work is not available for review. Also, you should share the trained model and perhaps some sample satellite images for testing, to make it possible for readers to build upon your work.

Besides these issues, there are still several minor concerns:
Section 1: the authors state, “Machine learning algorithms and semantic web technologies have both been widely used in geographic information systems”. Are there any references (related work, etc.) to support this statement? Then, on the same page, they mention that the Semantic Web could be used for localization. What kind of localization do they mean? Any references?
In section 3.3, the authors mention that the parameters were initialized using the Xavier initializer. However, this can only be used for initializing network weights, not the other network parameters. How are those initialized?
In section 3.6, authors have stated, “there are a number of ways that the output from the reasoner (i.e., the error explanation) can influence a neural network-based classifier, e.g., training set selection, data selection, architecture design, and cost function modification”. Could you argue why you made that particular choice?

In the same section, the hardware configuration text (i.e. experiment setting) can be moved to the beginning of section 4.3.

Style issue: please correct the inter word spacing on page 6. It becomes hard to read the text as the concept names and relations are not breaking properly.

It might well be that we are getting this wrong, but it appears that the numbers in Table 1 should add up to 100, as you take the top 100 points. Where do these higher numbers come from? Actually, we do not get from the text what exactly the meaning of the table is.

We do not really get the reason for introducing oc:intersects. We agree there might be cases where you do not need to know which of the two it is, but having this extra information throughout the system does not seem harmful either.

In some places you mention that 'the spatial reasoner is responsible for explaining the errors'. We think this is an overstatement. To us, it seems that the reasoner currently only augments the knowledge about a pixel; all explanation is then done by humans.

How would the performance be when using all classes, and not just these 5 rather easily separable classes? We would expect that your approach would show its strength much more in that case.