This article has been re-written with the help of 2 years of feedback, and
the new source code has benefited from all of the fixes and developments during
that time period. See release notes below.

Introduction

Often you don't want to invest in learning a complex XML tool just to
implement a little bit of XML processing in your application. It's so
easy! Just add Markup.cpp and Markup.h to your Visual C++ MFC
project, #include "Markup.h", and begin using it. There are no
other dependencies.

Features

Light: one small class that maintains one single document string with
a simple array of indexes

Fast: the parser builds the index array in one quick pass

Simple: EDOM methods make it ridiculously easy to create or process
XML strings

Independent: compiles into your program without requiring MSXML or
any tokenizer

UNICODE: can be compiled for UNICODE for Windows CE and NT/XP
platforms (define _UNICODE)

UTF-8: when not in UNICODE or MBCS builds, it works with UTF-8,
ASCII, or Windows extended sets

MBCS: can be compiled for Windows double-byte character sets such as
Chinese GB2312 (define _MBCS)

XML for Everyday Data

We often need to store and/or pass information in a file, or send a block of
information from computer A to computer B. And the issue is always the same:
How shall I format this data? Before XML, you might have
considered "env" style e.g. PATH=C:\WIN95; "ini" style (grouped in sections);
comma-delimited or otherwise delimited; or fixed character lengths. XML is now
the established answer to that question, but programmers are sometimes
discouraged by the size and complexity of XML solutions when all they need is
something convenient to help parse and format angle brackets. For good
minimalist reading on the syntax rules for XML tags, I recommend Beginning XML -
Chapter 2: Well-Formed XML posted here on the Code Project.

XML is better because of its flexible and hierarchical nature, plus its wide
acceptance. Although XML uses more characters than delimited formats, it
compresses down well if needed. The flexibility of XML becomes apparent when you
want to expand the types of information your document can contain without
requiring every consumer of the information to rewrite processing logic. You can
keep the old information identified and ordered the same way it was while adding
new attributes and elements.

CMarkup Lite Methods

CMarkup is based on the "Encapsulated" Document Object Model (EDOM), the key to
simple XML processing. It is a set of methods for XML processing with the same
general purpose as DOM (Document Object Model). But while DOM has numerous
types of objects, EDOM defines only one object, the XML document. EDOM harks
back to the original attraction of XML, which was its simplicity. To keep
overhead low, CMarkup takes a very light, non-conforming, non-validating
approach to XML, and it does not verify that the XML is well-formed.

The CMarkup "Lite" in this article is the free version of the
CMarkup product sold at firstobject.com. CMarkup
Lite implements a subset of EDOM methods for creating and parsing XML document
strings. The Lite methods also encompass some modification functionality such as
setting an attribute or adding additional elements to an existing XML document,
but not changing the data of, or removing, XML elements. See the EDOM specification to compare
the full CMarkup with CMarkup Lite. The full
CMarkup is available in Evaluation (Educational) and licensed
Developer versions with many more methods, STL and MSXML versions, Base64, and
additional documentation. But this Lite version here at Code Project is more
than adequate for parsing and creating simple XML strings in MFC.

The CMarkup Lite methods are grouped into Creation and
Navigation categories listed below.

GetDoc is used to get the document string after adding elements
and setting attributes. The AddAttrib and SetAttrib
methods do the same thing as each other (as do AddChildAttrib and
SetChildAttrib). They will change the attribute's value if it
already exists, and add the attribute if it doesn't.

When you call SetDoc it parses the szDoc string and
populates the CMarkup object. If it fails, it returns
false, and you can call GetError for an error
description. The IsWellFormed method returns true if
the CMarkup object has at least a root element; it does not verify
well-formedness.

Using CMarkup

The CMarkup class encapsulates the XML document text, structure,
and current positions. It has methods both to add elements and to navigate and
get element attributes and data. The locations in the document where operations
are performed are governed by the current position and the current child
position. This current positioning allows you to work with the XML document
without instantiating additional objects that point into the document. At all
times, the object maintains a string representing the text of the document which
can be retrieved using GetDoc.

Check out the free
firstobject XML editor which generates C++ source code for creating and
navigating your own XML documents with CMarkup Lite.

Creating an XML Document

To create an XML document, instantiate a CMarkup object and call
AddElem to create the root element. At this point, if you called
AddElem("ORDER") your document would simply contain the empty ORDER
element <ORDER/>. Then call
AddChildElem to create elements under the root element (i.e.
"inside" the root element, hierarchically speaking). The following example code
creates an XML document and retrieves it into a CString:

This code generates the following XML. The root is the ORDER element; notice
that its start tag <ORDER> is at the beginning and
end tag </ORDER> is at the bottom. When an element
is under (i.e. inside or contained by) a parent element, the parent's start tag
is before it and the parent's end tag is after it. The ORDER element contains
one ITEM element. That ITEM element contains 3 child elements: SN, NAME, and
QTY.

As shown in the example, you can create elements under a child element by
calling IntoElem to move your current main position to where the
current child position is so you can begin adding under what was the child
element. CMarkup maintains a current position in order to keep your
source code shorter and simpler. This same position logic is used when
navigating a document.

Navigating an XML Document

The XML string created in the above example can be parsed into a
CMarkup object with the SetDoc method. You can also
navigate it right inside the same CMarkup object where it was
created; just call ResetPos if you want to reset the current
position back to the beginning of the document.

In the following example, after populating the CMarkup object
from the csDoc string, we loop through all ITEM elements under the
ORDER element and get the serial number and quantity of each item:

For each item we find, we call IntoElem before interrogating its
child elements, and then OutOfElem afterwards. As you get
accustomed to this type of navigation you will know to check in your loops to
make sure there is a corresponding OutOfElem call for every
IntoElem call.
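
If the position bookkeeping feels abstract, it may help to model it outside of CMarkup. The following is a minimal sketch, not CMarkup's source: it assumes elements are linked by parent, child and next indexes, and the names Links, Cursor, FindNext, FindNextChild, Into and OutOf are made up for illustration. It mirrors two behaviors described in this article: going into an element when there is no current child leaves the main position before the first child, and going out of an element makes that element the current child.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index links for one element; 0 means "none".
// Slot 0 is a virtual document node whose child is the root element.
struct Links { int parent; int child; int next; };

// Toy model of the three positions (hypothetical, for illustration only).
struct Cursor {
    int parent = 0;  // element whose children are being visited
    int pos = 0;     // main position (0: before the first element at this level)
    int child = 0;   // child position (0: before the first child)
};

// Move the main position to the next sibling under cursor.parent.
bool FindNext(const std::vector<Links>& e, Cursor& c) {
    assert(c.parent >= 0 && static_cast<std::size_t>(c.parent) < e.size());
    int next = (c.pos == 0) ? e[c.parent].child : e[c.pos].next;
    if (next == 0) return false;
    c.pos = next;
    c.child = 0;  // a new main position has no current child yet
    return true;
}

// Move the child position to the next child of the main position.
bool FindNextChild(const std::vector<Links>& e, Cursor& c) {
    if (c.pos == 0) return false;  // this sketch requires a main position
    int next = (c.child == 0) ? e[c.pos].child : e[c.child].next;
    if (next == 0) return false;
    c.child = next;
    return true;
}

// Make the current child the main position; with no current child,
// the main position ends up "before the first child".
void Into(Cursor& c) {
    c.parent = c.pos;
    c.pos = c.child;
    c.child = 0;
}

// Go back up one level, remembering the element we left as the current child.
bool OutOf(const std::vector<Links>& e, Cursor& c) {
    if (c.parent == 0) return false;  // already at document level
    c.child = c.pos;
    c.pos = c.parent;
    c.parent = e[c.pos].parent;
    return true;
}
```

Because OutOf remembers the element you left as the current child, the next FindNextChild call continues with its sibling instead of starting over at the first child. This is also why every Into call in a loop needs a matching OutOf call.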

Adding Elements and Attributes

The above example for creating a document only created one ITEM element. Here
is an example that creates multiple items loaded from a previously populated
data source, plus a SHIPMENT information element in which one of the elements
has an attribute. This code also demonstrates that instead of calling
AddChildElem, you can call IntoElem and
AddElem. It means more calls, but some people find this more
intuitive.

This code generates the following XML. The root ORDER element contains 2 ITEM
elements and a SHIPMENT element. The ITEM elements both contain SN, NAME and QTY
elements. The SHIPMENT element contains a POC element which has a type
attribute, and NAME and TEL child elements.

Finding Elements

The FindElem and FindChildElem methods go to the
next sibling element. If the optional tag name argument is specified,
then they go to the next element with a matching tag name. The element that is
found becomes the current element, and the next call to Find will go to the next
sibling or matching sibling after that current position.

When you cannot assume the order of the elements, you must reset the position
in between calling the Find method. Looking at the ITEM element in the above
example, if someone else is creating the XML and you cannot assume the SN
element is before the QTY element, then call ResetChildPos() before
finding the QTY element.

To find the item with a particular serial number, you can loop through the
ITEM elements and compare the SN element data to the serial number you are
searching for. This example differs from the original navigation example by
calling IntoElem to go into the ORDER element and use
FindElem("ITEM") instead of FindChildElem("ITEM");
either way is fine. And notice that by specifying the "ITEM" element tag name in
the Find method we ignore all other sibling elements such as the SHIPMENT
element.

Encodings

ASCII refers to the character codes under 128 that we have come to depend on
when programming in English. Conveniently, if you are only using ASCII, UTF-8
encoding is the same as your common ASCII set.

If you are using a character set not corresponding to one of the Unicode sets
UTF-8, UTF-16 or UCS-2, you really should declare it in your XML declaration for
the sake of interoperability and viewing it properly in Internet Explorer.
Character sets like ISO-8859-1 (Western European) assign characters to the
values in a byte between 128 and 255, so that every character still only uses
one byte. Windows double-byte character sets such as GB2312, Shift_JIS and
EUC-KR use one or two bytes per character. For these Windows charsets, put
_MBCS in your preprocessor definitions and make sure your user's
Operating System is set to the corresponding code page.

To prefix your XML document with an XML declaration such as <?xml version="1.0" encoding="ISO-8859-1"?>, pass it to
SetDoc or the CMarkup constructor. Include a CRLF at
the end so that the root element goes on the next line.
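
Here is a small sketch of building such a declaration string in plain C++. The helper name MakeXmlDeclaration is hypothetical and not part of CMarkup; it only assembles the text, including the trailing CRLF. In a non-UNICODE build the result could then be handed to SetDoc before adding the root element.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of CMarkup): builds an XML declaration
// for the given encoding name, terminated by CRLF so that the root
// element starts on the next line.
std::string MakeXmlDeclaration(const std::string& encoding) {
    assert(!encoding.empty());
    return "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\r\n";
}

// Possible use, assuming a non-UNICODE build where SetDoc takes a char string:
//   CMarkup xml;
//   xml.SetDoc(MakeXmlDeclaration("ISO-8859-1").c_str());
//   xml.AddElem("ORDER");
```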

Depth First Traversal

You can use the following code to loop through every element in your XML
document. In the part of the code where you process the element, every element
in the document (except the root element) will be encountered in depth first
order. For illustrative purposes, it gets the tag name of the element. If you
were searching for a particular element tag name you could break out of the loop
at this point. "Depth first" means that it traverses all of an element's
children before going to its sibling.
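
As a library-independent illustration of the same visiting order, here is a minimal sketch that walks an array of elements linked by parent, child and next indexes. It does not use the CMarkup methods; the Links struct and the DepthFirstOrder function are assumptions made for this example. Like the loop described above, it skips the root element itself and visits all of an element's children before moving on to its sibling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index links for one element; 0 means "none" (slot 0 is unused as an element).
struct Links { int parent; int child; int next; };

// Returns the indexes of all elements below root in depth first order.
// No recursion is needed: we go down when there is a child, otherwise we
// climb until an element with a next sibling is found.
std::vector<int> DepthFirstOrder(const std::vector<Links>& elems, int root) {
    assert(root > 0 && static_cast<std::size_t>(root) < elems.size());
    std::vector<int> order;
    int pos = elems[root].child;
    while (pos != 0) {
        order.push_back(pos);  // "process" the element here
        if (elems[pos].child != 0) {
            pos = elems[pos].child;  // go into the element
            continue;
        }
        // No children: go out of elements until one has a next sibling.
        while (pos != root && elems[pos].next == 0)
            pos = elems[pos].parent;
        pos = (pos == root) ? 0 : elems[pos].next;
    }
    return order;
}
```

For an ORDER root at index 1 containing an ITEM (index 2, with children 3, 4 and 5) followed by a SHIPMENT (index 6), the elements are visited in the order 2, 3, 4, 5, 6. To search for a particular element, you would return from the loop at the point where the element is processed.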

Loading and Saving Files

CMarkup Lite does not have Load and
Save methods. To load a file, look in the
CMarkupDlg::OnButtonParse method which loads a file into a string.
Once you have it in a string, you can put it into the CMarkup
object using SetDoc. To save it to a file, call GetDoc
to get the string and then implement your own code to write the string to your
file. When you need to implement any of your own project specific I/O error
handling, streaming, permissions/locking, and charset conversion, it is actually
good software design to keep this outside of the CMarkup class
allowing CMarkup to remain a generic class.
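
If you are not using the MFC file classes, a minimal sketch of such load and save helpers in standard C++ might look like the following. The function names LoadFileToString and SaveStringToFile are hypothetical, and the sketch assumes a byte-oriented (non-UNICODE) document with no charset conversion; real projects will likely want more error detail than a bool.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Reads the whole file into 'out' as raw bytes. Returns false if the
// file cannot be opened.
bool LoadFileToString(const std::string& path, std::string& out) {
    assert(!path.empty());
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    std::ostringstream buf;
    buf << in.rdbuf();
    out = buf.str();
    return true;
}

// Writes 'doc' to the file, replacing any previous content.
bool SaveStringToFile(const std::string& path, const std::string& doc) {
    assert(!path.empty());
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(doc.data(), static_cast<std::streamsize>(doc.size()));
    out.flush();
    return out.good();
}

// Possible use with CMarkup in a non-UNICODE build:
//   std::string text;
//   if (LoadFileToString("order.xml", text)) xml.SetDoc(text.c_str());
//   ...
//   SaveStringToFile("order.xml", std::string(xml.GetDoc()));
```

Opening the files in binary mode keeps the CRLF line endings in the document exactly as they are in the string.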

The Test Dialog

The Markup.exe test bed for CMarkup is a Visual Studio
6.0 MFC project (it also compiles in VS .NET). When the dialog starts, it
performs diagnostics in the RunTest function to test
CMarkup in the context of the particular build options that have
been selected. You can step through the RunTest function to see a
lot of examples of how to use CMarkup. Use the Open and Parse
button in the dialog to test a file.

In the following illustration, the Build Version is shown as "CMarkup Lite
6.5 Debug Unicode." This means that it is the debug version built with
_UNICODE defined. The RunTest completed successfully.
A parse error was encountered in the order_e.xml file. It also shows the
load and parse times, and file size.

The Test Dialog keeps track of the last file parsed and the dialog screen
position for convenience. This is kept in the registry under
HKEY_CURRENT_USER\Software\First Objective Software\Markup\Settings.

How CMarkup Works

The CMarkup strategy is to leave the data in the document string and maintain
a hierarchical arrangement of indexes mapping out the document.

increase speed: parse in one pass and maintain hierarchy of indexes

reduce overhead: do not copy or break up the text of the
document

CMarkup parses the 250k play.xml sample document in about
40 milliseconds (1/25th of a second) on a 500 MHz machine, holding it as a single
string, and allocating about 200k for a map of the 6343 elements. From then on,
navigation does not require any parsing. As a rule of thumb, the map of indexes
takes up approximately the same amount of memory as the document, so the memory
footprint of the CMarkup object should settle down around 2 times
the size of the document. For each element in the document a struct of eight
integers (32 bytes) is maintained.

Look at the start and end tags in <QTY>1</QTY>. The struct contains the offsets of the
left and right of both the start and end tags (i.e. all the < and >
signs). The reserved integer is not currently used but could be used for a
delete flag and/or level (i.e. depth) in the hierarchy to support indentation.
The other three integers are indexes to the structs for the parent, child and
next elements.
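
To make the layout concrete, here is a sketch of what such a struct could look like. The field names are my own guesses for illustration and are not taken from the CMarkup source; only the meaning of each integer follows the description above. The ElemData helper shows how the data of an element can be read straight out of the document string using nothing but the stored offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical field names; eight 32-bit integers per element.
struct ElemPos {
    std::int32_t startTagLeft;   // offset of '<' of the start tag
    std::int32_t startTagRight;  // offset of '>' of the start tag
    std::int32_t endTagLeft;     // offset of '<' of the end tag
    std::int32_t endTagRight;    // offset of '>' of the end tag
    std::int32_t reserved;       // not currently used
    std::int32_t parent;         // index of the parent element's struct
    std::int32_t child;          // index of the first child element's struct
    std::int32_t next;           // index of the next sibling element's struct
};
static_assert(sizeof(ElemPos) == 32, "eight 32-bit integers should take 32 bytes");

// Returns the text between the start tag and the end tag without parsing.
// This sketch assumes an element with separate start and end tags.
std::string ElemData(const std::string& doc, const ElemPos& e) {
    assert(e.startTagRight < e.endTagLeft);
    const std::size_t first = static_cast<std::size_t>(e.startTagRight) + 1;
    const std::size_t count = static_cast<std::size_t>(e.endTagLeft) - first;
    return doc.substr(first, count);
}
```

For the document string <QTY>1</QTY>, the four offsets are 0, 4, 6 and 11, and ElemData returns the single character 1 that sits between offsets 4 and 6.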

When the document is first parsed an array of these structs is built, and
then as elements are modified and inserted in the XML, the structs are modified
and added. Rather than allocating structs individually, they are allocated in an
array using a "grow-by" mechanism to reduce the number of allocations to a
handful. That is why integer array indexes rather than pointers are used for the
links. Once an element is assigned an index in the array, that index does not
change. So the index can be used as a way of referring to and locating an
element.
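
The idea of stable indexes in a block-grown array can be sketched as follows. This is not the CMarkup implementation: the ElemTable class, its growBy parameter and the use of std::vector are assumptions chosen to keep the example short. The point it illustrates is that an index handed out by AddChild stays valid when the array is reallocated, whereas a pointer into the array would not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index links for one element; 0 means "none".
struct Links { int parent; int child; int next; };

class ElemTable {
public:
    explicit ElemTable(std::size_t growBy = 256) : growBy_(growBy) {
        elems_.reserve(growBy_);
        elems_.push_back(Links{0, 0, 0});  // slot 0: virtual document node
    }

    // Appends a new element as the last child of 'parent' and returns its
    // index, which never changes afterwards.
    int AddChild(int parent) {
        assert(parent >= 0 && static_cast<std::size_t>(parent) < elems_.size());
        if (elems_.size() == elems_.capacity())
            elems_.reserve(elems_.capacity() + growBy_);  // grow by a block
        const int idx = static_cast<int>(elems_.size());
        elems_.push_back(Links{parent, 0, 0});
        int last = elems_[parent].child;
        if (last == 0) {
            elems_[parent].child = idx;  // first child of this parent
        } else {
            while (elems_[last].next != 0) last = elems_[last].next;
            elems_[last].next = idx;  // link after the last sibling
        }
        return idx;
    }

    const Links& At(int idx) const { return elems_[idx]; }

private:
    std::size_t growBy_;
    std::vector<Links> elems_;
};
```

Growing by a block at a time means that a document with thousands of elements needs only a handful of allocations for its map.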

Release Notes

The public methods of this release 6.5 of CMarkup Lite are backwards
compatible with the previous release 6.1 posted here in August 2001 except for
one rare usage of IntoElem. In 6.1, if you called
IntoElem without a current child element, it would find the first
child element. Now in 6.5 when there is no current child position,
IntoElem puts the main position before the first child element so
that a subsequent call to FindElem will not bypass the first
element. So, the quick way to check this when upgrading is to scan all
occurrences of IntoElem and make sure the CMarkup navigation call
immediately before each one is FindChildElem. If the child
element was just created with AddChildElem then it's okay, because
that sets the current child position too. For full details on this, see the IntoElem Changes in
Release 6.3.

Other major changes since 6.1:

Fix: MBCS double-byte text in x_TextToDoc (thanks, knight_zhuge)

Performance: parsing is roughly twice as fast

Debugging: see m_pMainDS and m_pChildDS class
members while debugging to see string pointers showing current main and child
positions

License

CMarkup Lite is free for compiling into your commercial,
personal and educational applications. Modify it as much as you like, but retain
the copyright notice in the source code remarks. Redistribution of the modified
or unmodified CMarkup Lite class source code is limited to your own
development team and it cannot be made publicly available or distributable as
part of any source code library or product, even if that offering is free. For
source code products that derive from or utilize CMarkup Lite,
please refer users to this article to obtain the source files for themselves.
You are encouraged to discuss this source code and share enhancements here in
the discussion board under this article. Enjoy!

Comments and Discussions

First of all, many thanks for this really great and useful class.
With reference to the infamous line 728 _tcscpy(&pDest[nLen],pFound);
how should I replace _tcscpy so that I don't get the following warning message when I compile your code with VS 2013:
warning C4996: 'wcscpy': This function or variable may be unsafe. Consider using wcscpy_s instead.
Unfortunately I can't just replace _tcscpy with wcscpy_s because wcscpy_s needs an extra size parameter. For the time being I've disabled the warning, but I'd prefer to have clean code.

I'm currently using your CMarkup lite class in a private project, and I'm planning to release it as an open source project in the near future. Is it possible to include your source code into my open source distribution? If so, are there any restrictions/requirements?

yes, but actually 2 gigs is the limit because of the signed int. I doubt there are 64-bit issues. More recent versions have been tested for 64-bit and I have only added casts and tweaks mostly to get rid of warnings since I wanted to keep the 32-bit integers and not double the memory consumption of the indexes.

CMarkup will not load XML from the web. You would need to use something else to HTTP GET the XML and then put it into CMarkup using SetDoc. Unfortunately there is no convenient tool to do this. On Windows you can use MSXML.

CMarkup *Lite* does not have the node methods to get to it. Yes, "mixed content" is allowed, though not recommended for this purpose, but I understand if you have to scan a file containing it, then you are stuck.

Hello Mr. Bryant. My company has incorporated your CMarkup class (which we downloaded from this website) into one of our products.
We would like to purchase the licensed version. Is the licensed version called "CMarkup Developers License" and is the website www.firstobject.com a valid location to obtain the same code that we got from CodeProject?

I just want to make sure we get your code, which has worked well for us.

But I still have a problem.
When I am traversing all the children of a node and find a child, I need to get its parent's tag name.
If I use your method, then when I come back to the child level I cannot return to the previous position; I still get the first child, so this goes into a loop.
For example,

Parent
--child1
--child2
...
--childn.

When I find child1, I need to get "Parent", and then find child2 (which also needs to get "Parent"), and so on.

Actually I considered that. The OutOfElem method recalls the child element as the current child position, so that when you go back IntoElem, you are at the one you went out of, not back at the first child. There is also an alternative way of doing the looping you described where you keep the parent position and loop through the children with FindChildElem.