Converting content to DITA isn't a small undertaking, because you essentially have to retag
everything with DITA markup. There are some automated ways of converting
content, but if your source content isn't already in a DITA-friendly format (for
example, if you have lots of topics that combine lists and concepts, or that have
nested subsections with third-level headers), the conversion might require some
restructuring. Nevertheless, you can speed up the process using a combination of
HTML Tidy and Oxygen's XHTML to DITA transform.

Note: This method probably isn't suitable for large conversion projects.
If you have thousands of topics to convert, for example, take a look at Stilo or
some other automated process. You may need to write custom scripts that tag content
based on your structure. If, on the other hand, you have fewer than 100 pages to
convert, the method described here might be just fine.

Grab and clean HTML source code

First view the source code and copy the HTML inside the body tags.

Most tools, including Microsoft Word, allow you to generate an HTML version of the content. You can view the source code in a browser page by right-clicking the page and choosing View source.

Go to HTML Tidy and paste the copied content into this processor
to clean it.

The HTML Tidy page has a variety of settings, but you can just use the defaults. Paste your
source content into the HTML box, click Tidy, and
then click View Tidied HTML. You don't have to
include all the page content. The source of a page usually includes the
navigation, header, and footer, which you probably don't want to bring
over, so paste only the body content.
Tidy will supply the necessary HTML head tags to make the page valid.
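
If you'd rather not hunt for the body tags by hand, the copy step can be sketched in a few lines of Python. This is only an illustration, not part of HTML Tidy or Oxygen: the function name extract_body and the sample page are made up for the example, and it assumes the page has a single, well-formed body element. It isolates the body only, so you still need to trim any navigation or footer markup that sits inside it.

```python
import re

def extract_body(page_source: str) -> str:
    """Return the markup between <body ...> and </body>.

    If no body element is found, the input is returned unchanged
    (minus surrounding whitespace).
    """
    match = re.search(
        r"<body\b[^>]*>(.*?)</body\s*>",
        page_source,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return match.group(1).strip() if match else page_source.strip()

# Hypothetical page source, as you might see it under View source.
page = """<html><head><title>Demo</title></head>
<body class="wiki">
<h2>Install</h2>
<p>Run the installer.</p>
</body></html>"""

print(extract_body(page))
```

The printed markup is what you would paste into the HTML box on the HTML Tidy page.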

After cleaning the HTML, copy the entire output.

In OxygenXML, go to File > New, expand the New Document folder, and select HTML.

Save the file with a generic name such as "html template".

You'll use this same html template file for converting each page. When you run the XHTML to DITA transform, Oxygen will create a new file from this template.

Press Ctrl+A to highlight everything in your html template
file and delete it. Then paste in the HTML you copied from HTML Tidy and
save the file.

For the title of your document, add the title between h1 tags right below
the opening body tag.

The transform looks for the first h1 tag and uses it as the
document title. If you don't have an h1 tag, the first heading of any level
becomes the document title instead, and that heading is then
removed from the body! Therefore, it's important not to forget to add the h1
tag to your content before running the transform.
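
If you convert many pages, it's easy to forget this step. As a rough safeguard, a small Python helper could add the h1 for you before you paste the HTML into Oxygen. This is a sketch under assumptions: ensure_h1 is a made-up name, the title is whatever you pass in, and the check is a simple pattern match rather than a real HTML parse.

```python
import html
import re

def ensure_h1(source: str, title: str) -> str:
    """Insert <h1>title</h1> right below the opening body tag.

    The source is returned unchanged if it already contains an h1.
    """
    if re.search(r"<h1\b", source, flags=re.IGNORECASE):
        return source  # a title heading is already present

    heading = "\n<h1>" + html.escape(title) + "</h1>"
    # Add the heading after the first opening <body ...> tag only.
    updated, count = re.subn(
        r"(<body\b[^>]*>)",
        lambda m: m.group(1) + heading,
        source,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("no <body> tag found")
    return updated

doc = "<html><body>\n<h2>Steps</h2></body></html>"
print(ensure_h1(doc, "Install & configure"))
```

The title is escaped so that characters such as & don't break the markup that the transform reads.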

Note: If you're converting
a page with a lot of code, the transform may not recognize the code
samples unless they're wrapped in pre tags. If the
transform can't recognize the code, it may eliminate the code
section.

You could also choose Topic or Task, but if you choose Task, you'll need to make sure the content already mostly conforms to the task topic type.

Save the new file with the proper name and, if desired, choose the
.dita extension.

Compare the newly converted DITA file with the original HTML file and make
sure all the sections carried over. Before you start
post-processing, you want to be certain all the content is actually
there.
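
For longer pages, a quick script can help with this comparison. The following is a minimal sketch, not an Oxygen feature: it collects the words in the HTML and in the DITA topic and lists the HTML words that never appear in the DITA. The function names and the two sample strings are invented for the example, and it assumes the DITA file parses with Python's standard XML parser. It also counts text inside script or style tags as content, so expect some noise on real pages.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class _TextCollector(HTMLParser):
    """Collects every run of text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_words(source: str) -> list:
    parser = _TextCollector()
    parser.feed(source)
    parser.close()
    return " ".join(parser.chunks).split()

def dita_words(source: str) -> list:
    root = ET.fromstring(source)
    return " ".join(root.itertext()).split()

def missing_words(html_source: str, dita_source: str) -> list:
    """Words present in the HTML text but absent from the DITA text."""
    converted = set(dita_words(dita_source))
    return [word for word in html_words(html_source) if word not in converted]

# Hypothetical before/after pair in which the code sample was dropped.
html_page = "<body><h1>Install</h1><p>Run the installer.</p><pre>setup --quiet</pre></body>"
dita_topic = (
    "<concept id='install'><title>Install</title>"
    "<conbody><p>Run the installer.</p></conbody></concept>"
)

print(missing_words(html_page, dita_topic))  # ['setup', '--quiet']
```

An empty list doesn't prove the conversion is perfect, since the check ignores word order and repetition, but a non-empty list tells you where to look first.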

Although you've converted the content to DITA, there are still some clean-up and
other post-processing tasks to do.

Clean up the conversion notes

Look in the source code of the newly converted topics and address any warnings, notes, or other conversion problems.

When you look at the source of the newly converted DITA topics, you'll see that many of them contain sections with comments, such as this:

In this case, the original source used this class for notes. The
transform doesn't know how to map classes to note elements, so you'll
have to manually tag these sections as notes.

The transform converts HTML class attributes to outputclass attributes.
(The outputclass attribute converts back to a
class attribute when you transform your DITA content
into HTML.) However, the class names from your previous
platform most likely won't have the same meaning on your new platform.
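
If one class consistently marked notes on your old platform, the retagging could be scripted rather than done by hand. Here is a minimal sketch using Python's standard XML library. The class name warning-box is hypothetical, so substitute whatever your source used, and retag_notes is a made-up helper. Be aware that ElementTree discards comments and the DOCTYPE declaration when it rewrites a file, so treat this as an illustration and validate the result in Oxygen.

```python
import xml.etree.ElementTree as ET

def retag_notes(dita_source: str, class_name: str) -> str:
    """Turn every element carrying outputclass=class_name into a note."""
    root = ET.fromstring(dita_source)
    for elem in root.iter():
        if elem.get("outputclass") == class_name:
            elem.tag = "note"                # rename the element
            del elem.attrib["outputclass"]   # the old class is no longer needed
    return ET.tostring(root, encoding="unicode")

# Hypothetical fragment of a converted topic.
converted = (
    '<conbody><p outputclass="warning-box">Back up first.</p>'
    "<p>Then upgrade.</p></conbody>"
)

print(retag_notes(converted, "warning-box"))
```

Elements whose outputclass doesn't match are left alone, so you can run the helper once per class you want to map.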

You can delete content across all DITA files at once by going to Find > Find/Replace in Files.

This is a handy way to clean up all of the conversion notes in one pass.
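
The same clean-up can also be done outside the editor. The sketch below is one possible approach, not the Oxygen feature itself: it removes every XML comment from each .dita file under a folder. The function names are invented for the example, and the pattern deletes all comments, including any you might want to keep, so run it on a copy of your files first.

```python
import re
from pathlib import Path

# Matches any XML comment, including comments that span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(text: str) -> str:
    """Return the text with every XML comment removed."""
    return COMMENT.sub("", text)

def strip_comments_in_tree(folder: str) -> list:
    """Rewrite each .dita file under folder and return the files that changed."""
    changed = []
    for path in sorted(Path(folder).rglob("*.dita")):
        original = path.read_text(encoding="utf-8")
        cleaned = strip_comments(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path)
    return changed

sample = "<p>Keep</p><!-- unmapped class\n from source --><p>Also keep</p>"
print(strip_comments(sample))
```

Calling strip_comments_in_tree on your project folder (the path is up to you) returns the list of files it rewrote, which gives you a quick record of what was touched.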

Find opportunities for re-use (DITA)

One of the reasons for converting to DITA is to harness its content re-use
capabilities. Now you should extract redundant content into separate files for
re-use. This is the tricky part. If you migrated content out of Confluence, and
you were using Multiexcerpt include macros to single source content into multiple
files, you'll want to assess the content and figure out how you want to single
source the material.

You have a couple of options for re-using content:

Conref. You could create a generic file to store common content,
and then use conref tags where you want to insert this content. See
Conref (re-use of content) for more details. Using conref makes
sense especially for notes and other small chunks that are re-used
across many different files.

Conditionalization. You could conditionalize the content so that
you have attributes corresponding to different outputs on parts of the
page. See DITA: Conditional profiling for more details.
Conditional profiling makes sense when you have a few variations of the
same topic for different audiences.
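
To make the second option more concrete, here is a small Python sketch that imitates the effect of conditional profiling on a topic. In a real build, the DITA Open Toolkit does this filtering for you based on a DITAVAL file. The function below is only an illustration with made-up names and sample content, and it assumes the audience attribute holds a space-separated list of values.

```python
import xml.etree.ElementTree as ET

def filter_audience(dita_source: str, audience: str) -> str:
    """Drop elements whose audience attribute excludes the given audience.

    Elements without an audience attribute are always kept.
    """
    root = ET.fromstring(dita_source)
    for parent in list(root.iter()):
        for child in list(parent):
            values = child.get("audience", "").split()
            if values and audience not in values:
                # Removing the child also drops any text that directly
                # follows it, which is fine for block-level elements.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

# Hypothetical topic body with content profiled for two audiences.
topic = (
    '<conbody><p audience="admin">Restart the server.</p>'
    '<p audience="admin user">Sign in.</p>'
    "<p>Read the release notes.</p></conbody>"
)

print(filter_audience(topic, "user"))
```

With "user" as the audience, the admin-only paragraph is dropped while the shared and unprofiled paragraphs remain, which is the kind of variation conditional profiling is meant to handle.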