IBM Watson™ Discovery Service Ideas

Remove or Strip HTML tags during ingestion and/or

Affects both ingestion and document conversion / segmentation.

This would make the ingested content easier to use, since it becomes consumable in a more basic, plain-text format. The original customer use case was not only to remove / strip HTML but also to segment the content based on HTML header level. So this content:

======

<h2>

My content for first document.

<p>

Is this and I really am not sure if or how to handle <b>stylistic markup</b>

</p>

</h2>

<h2>

And here is my second document.

</h2>

======

Once ingested, this content would result in two JSON documents.
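As a rough illustration of the requested behavior (not the Discovery service's actual implementation), the Python sketch below uses only the standard library's `html.parser`. The names `HeadingSegmenter` and `segment_html` and the `"text"` field are hypothetical, chosen for this example. It starts a new segment at every opening `<h2>` and drops all tags, so embedded stylistic markup such as `<b>` is simply flattened into plain text.

```python
import json
from html.parser import HTMLParser


class HeadingSegmenter(HTMLParser):
    """Collects plain text, opening a new segment at each tag in split_tags.

    Hypothetical helper for illustration; not part of any Watson SDK.
    """

    def __init__(self, split_tags=("h2",)):
        super().__init__(convert_charrefs=True)
        self.split_tags = set(split_tags)
        self.segments = []  # one list of raw text fragments per segment

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag in self.split_tags:
            self.segments.append([])

    def handle_data(self, data):
        # Text found before the first heading gets its own leading segment.
        if not self.segments:
            self.segments.append([])
        self.segments[-1].append(data)


def segment_html(html, split_tags=("h2",)):
    """Return one {"text": ...} dict per heading-delimited segment."""
    parser = HeadingSegmenter(split_tags)
    parser.feed(html)
    parser.close()
    # Join fragments as-is, then collapse runs of whitespace to single spaces.
    texts = [" ".join("".join(parts).split()) for parts in parser.segments]
    # Skip segments that contain only whitespace.
    return [{"text": text} for text in texts if text]


SAMPLE = """
<h2>
My content for first document.
<p>
Is this and I really am not sure if or how to handle <b>stylistic markup</b>
</p>
</h2>
<h2>
And here is my second document.
</h2>
"""

for doc in segment_html(SAMPLE):
    print(json.dumps(doc))
```

Passing `split_tags=("h1", "h2")` would segment on either header level. Splitting on the opening tag rather than the closing one is a design choice made here so that text following a heading stays with that heading.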

Follow-up investigation on this idea:

How to handle embedded stylistic markup.

If the documents are split, should some relationship be kept between them and the source?
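One possible answer to the relationship question, sketched in Python under stated assumptions: each split document carries a reference back to its source and its position within it. The function `link_segments` and the field names `parent_document_id`, `segment_index`, and `segment_count` are hypothetical, invented for this example; they are not confirmed fields of the service.

```python
import json


def link_segments(source_id, segment_texts):
    """Wrap each segment's text in a dict that points back to its source.

    All field names here are assumptions made for illustration.
    """
    total = len(segment_texts)
    return [
        {
            "id": f"{source_id}_{index}",      # derived, stable per segment
            "parent_document_id": source_id,   # link back to the source
            "segment_index": index,            # original order in the source
            "segment_count": total,
            "text": text,
        }
        for index, text in enumerate(segment_texts)
    ]


linked = link_segments(
    "source-doc",
    ["My content for first document.", "And here is my second document."],
)
print(json.dumps(linked, indent=2))
```

Keeping the index and count would let a consumer reassemble the source in order, or query for all segments that share a parent.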