Syndicate

Tracking #: 2005-3218

Authors:

Marjan Alirezaie

Martin Längkvist

Michael Sioutis

Amy Loutfi

Responsible editor:

Guest Editors Semantic Deep Learning 2018

Submission type:

Full Paper

Abstract:

Recent machine learning algorithms have shown considerable success in various computer vision tasks, including semantic segmentation. However, they seldom perform without error. Discovering why an algorithm has failed is usually the task of a human who, using domain knowledge and contextual information, can uncover systematic shortcomings in either the data or the algorithm. In this paper, we propose a symbolic technique, called a semantic referee, which is able to extract qualitative features of the errors emerging from the machine learning framework and suggest corrections. The semantic referee relies on a spatial reasoning method applied to ontological knowledge in order to retrieve the features of the errors in terms of their spatial relations with their environment. The reasoner outputs a semantic augmentation of the errors that is then reported back to the learning algorithm so that it can learn from its mistakes. The proposed interaction between a neural network classifier and a semantic referee is shown to improve the performance of semantic segmentation on satellite imagery data.

Review #1

Thank you for your very comprehensive response to our comments. I really liked the responses and the changes made to the title, the use of a more sophisticated DNN (U-net), the evaluation on more images, and the clarifying explanation of what the system provides in terms of explainability. A couple of lingering questions remain about the wide applicability of this scheme and how it may work with different ontologies, but they do not stop me from recommending that the paper be accepted.

Review #2

Anonymous submitted on 29/Sep/2018

Suggestion: Major Revision

Review Comment:

I would like to thank the authors for their effort to improve the paper. Unfortunately, I still have issues with major aspects of the paper.

(1) originality,

The paper presents original work in image segmentation, combining a spatial reasoner (RCC-8 based) with a deep neural network (CAE). The reasoner assists by augmenting the classification data with three additional feature channels: shadow, elevation, and some notion of inconsistency. The paper claims to present the first system applied to image segmentation that explains the errors of the classifier using a spatial reasoner. The authors also claim to be the first to close the loop between the reasoner and the classification.

(2) significance of the results

My main problem with the paper is acknowledging the results as significant. The approach is promising, but in the end its outcome, in this particular paper and this particular experiment, is shadow detection. The full power of adding knowledge is not investigated, and the results are at the level of a proof of concept. The paper still needs to improve and provide stronger evidence for the power and performance of the proposed system.

Importantly, I would not be able to reproduce the results in this paper, as many crucial details are left out, especially concerning the most important part (the feature augmentation).

I am particularly unsure about the results. One of the feature channels, elevation, could just be added directly (without the reasoner). Whether the results are generated by this feature or by the features (shadow) that do come from the reasoner is not clear to me.

(3) quality of writing

The paper is easy to follow and understand.

(4) General comments

a) Explanation

Comments on the previous version with respect to explanation have been addressed by the authors.

b) Related Work. Discussion of existing methods was expanded and such comments on the previous version addressed.

c) Details are missing from Algorithm 1 (Section 3).

The authors addressed the comments on the previous version of the paper, but there are also some additional issues.

The authors mention that additional constraints were added to OntoCity (Section 3.4.1). What is the impact of those constraints?

How are regions computed? What is the "geometrical process"?

Is your classification an argmax following the final softmax layer? If so, does low certainty refer to the output of the softmax layer? How are probabilities (or whatever the certainties are) aggregated for regions?
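To pin down what we are asking: one plausible aggregation scheme (our assumption, not something stated in the paper — the function name and the mean-over-argmax rule are ours) would average the per-pixel argmax confidences of the softmax output over the region's pixels:

```python
import numpy as np

def region_confidence(softmax_out, region_mask):
    """Mean argmax confidence over a region's pixels (our assumed aggregation).

    softmax_out: (H, W, C) per-pixel class probabilities.
    region_mask: (H, W) boolean mask selecting the region's pixels.
    """
    probs = softmax_out[region_mask]   # (N, C) probabilities of region pixels
    top = probs.max(axis=-1)           # confidence of the argmax class per pixel
    return float(top.mean())

# Toy 2x2 image with 3 classes; the region covers the top row.
sm = np.array([[[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
               [[0.4, 0.4, 0.2], [0.9, 0.05, 0.05]]])
mask = np.array([[True, True], [False, False]])
print(region_confidence(sm, mask))  # mean of 0.7 and 0.6, i.e. ~0.65
```

If the paper uses a different rule (e.g., minimum confidence, or confidence of the majority class), stating it explicitly would resolve this question.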

I have a hard time understanding how the channels are added and encoded (except for elevation). Is shadow added as a binary signal? I do not understand the third channel. What is the definition of "suspicious", and what are the values of the third channel?

There are three channels added to the input, of which only the shadow channel comes from the reasoner, correct? In my understanding, elevation can be added without using the reasoner. If that is the case, you need to show that the actual gain in accuracy comes from the features for which you actually need the reasoner. Otherwise, your results may simply be achieved by adding elevation as a feature.
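To make the requested ablation concrete: the augmentation, as we understand it, amounts to stacking the three extra channels onto the input image. The function name and the per-channel encodings below are our guesses, not the paper's; the ablation we ask for would compare training with all six channels against training with only RGB plus elevation.

```python
import numpy as np

def augment_input(rgb, elevation, shadow, inconsistency):
    """Stack three extra feature channels onto an RGB image (our sketch).

    rgb:           (H, W, 3) image
    elevation:     (H, W) heights -- available without the reasoner
    shadow:        (H, W) mask -- the channel that needs the reasoner
    inconsistency: (H, W) reasoner-flagged regions; encoding unclear to us
    """
    extra = np.stack([elevation, shadow, inconsistency], axis=-1)   # (H, W, 3)
    return np.concatenate([rgb, extra.astype(rgb.dtype)], axis=-1)  # (H, W, 6)

out = augment_input(np.zeros((4, 4, 3)), np.ones((4, 4)),
                    np.zeros((4, 4)), np.zeros((4, 4)))
print(out.shape)  # (4, 4, 6)
```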

d) Section 4

Comments for the previous version are addressed.

How does testing work? In training, you iterate region identification and the addition of the three channels, followed by retraining. How is this done in testing? In principle, in testing you also need to first identify the regions and then augment the pixels, correct? When do you stop iterating in testing?
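For illustration, the test-time loop as we imagine it would look roughly like the sketch below. The `predict` and `reason` functions are dummy placeholders with matching shapes, not the authors' model or reasoner, and the stopping criterion (prediction stability) is our assumption:

```python
import numpy as np

# Dummy stand-ins so the loop is runnable; NOT the authors' model or reasoner.
def predict(x):
    return (x.sum(axis=-1) > 1.0).astype(float)   # fake per-pixel labels

def reason(image, pred):
    return pred[..., None] * 0.1                  # fake extra channel

def iterative_inference(image, max_iters=3):
    """Classify, augment with reasoner output, re-classify; stop when the
    prediction no longer changes or after max_iters iterations."""
    pred = predict(image)
    for _ in range(max_iters):
        augmented = np.concatenate([image, reason(image, pred)], axis=-1)
        new_pred = predict(augmented)
        if np.array_equal(new_pred, pred):
            break
        pred = new_pred
    return pred

out = iterative_inference(np.random.rand(4, 4, 3))
print(out.shape)  # (4, 4)
```

Stating whether test-time inference iterates like this, and what the stopping criterion is, would answer the question above.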

Minor comments

It seems you are using segment, region, and area interchangeably. I find that confusing and would personally prefer a single term throughout.

There are some minor issues with long sentences. Maybe try and cut up sentences with too many adjuncts.

There are also a few issues with determiners (imho).

Algorithm 1: maybe use \operatorname to avoid typesetting errors
Algorithm 1: there is an issue with tabs and layout

Review #3

By Michael Cochez submitted on 03/Oct/2018

Suggestion: Minor Revision

Review Comment:

This review was done jointly with Md. Rezaul Karim

===================================================

To us, it seems the authors have improved the paper a lot; we do appreciate the effort. For example, some steps are much clearer now (e.g., Semantic Augmentation of Errors), and the evaluation has been carried out on satellite images of another city, giving a more robust indication of the generalizability of the approach.

Our earlier suggestion to perform a comparison to a baseline was that the network also be fed altitude information (besides only the image). To be clear, we do not want to create a network able to find altitude information; if you have a ground truth for that, there is no need to train a network to find it.
The point to show here is that your approach can do more than just use altitude information, i.e., that your approach is not a very complicated way to 'just' find the altitude of a pixel and feed that back to the network.

Regarding the proprietary dataset used, it would be excellent if you could repeat the setup with at least one publicly available image. Otherwise, no one can build on your research, as no one is able to compare with or re-evaluate the results fairly. One source of such imagery could be the U.S. Geological Survey ( https://www.usgs.gov/ ).

In the updated paper, a few issues are still apparent. One aspect in particular remains unclear to us: how the reasoning feedback works in the neural network setting. Could you clarify this in even more detail in the paper?

For the new training, we understand that median frequency class weighting is used to combat the high class imbalance in the dataset. However, this approach has limitations. For example, if two image segments look very different at pixel level, it won't cause dominant labels to be weighted less: if dominant labels are present in roughly equal amounts and the mean frequency is not too different from the median, the weights will be roughly equal to 1. Training after a natural frequency balancing technique (e.g., a data-level resampling technique such as random under-/oversampling or cluster-based oversampling) could therefore lead to better classification accuracy. Maybe the authors could compare both techniques?
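For reference, median frequency balancing as we understand it assigns each class the weight median(freq) / freq_c. The sketch below (our own, not the authors' code) shows how a dominant class is down-weighted:

```python
import numpy as np

def median_frequency_weights(label_map, num_classes):
    """Median frequency balancing: weight_c = median(freq) / freq_c."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes).astype(float)
    freqs = counts / counts.sum()
    freqs[freqs == 0] = np.nan        # ignore classes absent from the data
    weights = np.nanmedian(freqs) / freqs
    return np.nan_to_num(weights)     # absent classes get weight 0

# One dominant class (12 px) and two rare classes (2 px each).
labels = np.array([0] * 12 + [1] * 2 + [2] * 2)
w = median_frequency_weights(labels, 3)
print(w)  # dominant class down-weighted to ~0.167, rare classes weighted 1.0
```

Note the limitation in action: if instead all classes occurred in similar proportions, every frequency would sit near the median and all weights would stay close to 1, leaving the imbalance unaddressed.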

In their response, the authors mention that they have made their code available. However, it is still not linked from the paper, and hence we could not review it properly. We are certain this omission will be taken care of in the next iteration of the work.

The authors should also carefully review the reference list of the paper. They have clearly not done that for this version (see, e.g., reference 27).