Hyphenation. Among the artifacts introduced by text layout is the hyphen marking the incidental division of a word at the end of a line. In Tagged PDF, such an incidental word division shall be represented by a soft hyphen character, which the Unicode mapping algorithm (see “Unicode Mapping in Tagged PDF” in 14.8.2.4, “Extraction of Character Properties”) translates to the Unicode value U+00AD. (This character is distinct from an ordinary hard hyphen, whose Unicode value is U+002D.) The producer of a Tagged PDF document shall distinguish explicitly between soft and hard hyphens so that the consumer does not have to guess which type a given character represents.

So okular could at least distinguish between soft and hard hyphens and treat them separately.

For some languages hyphenation changes the spelling. A straightforward merging would then introduce errors.

So I think that removing soft hyphens but keeping others is the best solution.