I found that the callback registered with xml_set_character_data_handler can be called more than once for the same element, even when the content is only a few characters long (this happened on Windows).

So check whether a value has already been stored and, if so, append to it; the same may be needed for long strings too. E.g.:

<?php
xml_set_character_data_handler($this->parser, "cdata");
//...
function cdata($parser, $cdata) {
    // ...
    if (isset($this->data[$this->currentItem][$this->currentField])) {
        // A previous call already stored part of this text: append to it
        $this->data[$this->currentItem][$this->currentField] .= $cdata;
    } else {
        $this->data[$this->currentItem][$this->currentField] = $cdata;
    }
}
?>

Rather than concatenating the data based on whether the current tag name has changed from the previous tag name, I suggest always concatenating, as in the following, with the $catData variable being unset in the endElement function:

<?php

function endElement ($parser, $name) {
    global $catData;

    // Because we are at an element end we know any splitting is finished
    unset($GLOBALS['catData']);
}

function characterData ($parser, $data) {
    global $catData;

    // Concatenate data in case splitting is taking place
    $catData .= $data;
}

?>

This got me around a problem with data like the following where, because characterData is not called for empty tags, the previous and current tag names were the same even though splitting was not taking place.
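As a minimal sketch of the "always concatenate, reset at the element boundary" approach from this note, the version below keeps the buffer in an object instead of a global. The class name `TextCollector` and the function `collectText` are hypothetical names chosen for illustration, and the sketch assumes elements contain either text or child elements, not both.

```php
<?php
// Hypothetical collector: buffers character data and stores the complete
// text of an element when its end tag is reached.
final class TextCollector
{
    /** @var array<int, array{0: string, 1: string}> completed [tag, text] pairs */
    public array $items = [];
    private string $buffer = '';

    public function startElement($parser, string $name, array $attrs): void
    {
        // A new element starts: discard anything buffered so far.
        $this->buffer = '';
    }

    public function characterData($parser, string $data): void
    {
        // Always append; the parser may deliver the text in several pieces.
        $this->buffer .= $data;
    }

    public function endElement($parser, string $name): void
    {
        // At an element end, any splitting is finished.
        $this->items[] = [$name, $this->buffer];
        $this->buffer = '';
    }
}

function collectText(string $xml): array
{
    $collector = new TextCollector();
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler($parser, [$collector, 'startElement'], [$collector, 'endElement']);
    xml_set_character_data_handler($parser, [$collector, 'characterData']);
    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $collector->items;
}

$items = collectText('<items><item>Hello &amp; goodbye</item><empty/></items>');
```

Because the buffer is reset in both the start and end handlers, an empty tag such as `<empty/>` yields an empty string rather than inheriting text from the previous element.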

It would be nice if someone could complete the documentation of this function. I think the "splitting" behaviour should at least be mentioned in the documentation, if not explained (please!). I'm not sure whether the cut comes after every 1024 bytes/chars of data.

My experience looks as follows:

[xmlFile]
...
<label>slo|žka</label>
<comment>koment|á&#345; složky</comment>
...
[/xmlFile]

(Places where the character data got split are marked with pipes. Also, the file contained a Latin small letter 'r' with caron in place of &#345;.)

Since the splitting is not mentioned in the documentation, one could assume that it is a bug, especially when you work with UTF-8 and the cuts come right before some special characters. (Should concatenating $cData be considered the proper and 'final' way of processing character data?)

Also, I'd suggest adding another line to "Description" when a function has an alternate usage (instead of hiding it within the "Note" :o); in this particular case I'd prefer this:

The text function only receives 1024 characters at a time, even if the text is 4000 characters long. In fact, the parser seems to split the data into pieces of 1024 characters. The way to handle that is to concatenate them.

Example: if you have an XML tag called UNIPROT_ABSTRACT containing a 4000-character protein description:

function textfunction($parser, $text) {
    // Both variables live outside the handler, so they must be imported
    global $last_tag_read, $uniprot;
    if ($last_tag_read == 'UNIPROT_ABSTRACT') $uniprot .= $text;
}

The function is called 4 times and receives 1024+1024+1024+928 characters, which are concatenated in the $uniprot variable using the ".=" concatenation operator.
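The following is a rough sketch of the same situation when the document is fed to the parser piece by piece, as one would do when reading a large file with fread(). The function name `parseInChunks` and the sample data are made up for illustration. How many times the handler fires depends on the underlying parser library and the chunk size, so the sketch only counts the calls and does not assume a particular number.

```php
<?php
// Hypothetical helper: feeds $xml to the parser in pieces of $chunkSize
// bytes and reports how often the character data handler was called.
function parseInChunks(string $xml, int $chunkSize): array
{
    $calls = 0;
    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$calls, &$text): void {
            $calls++;
            $text .= $data; // concatenate every piece
        }
    );

    $chunks = str_split($xml, $chunkSize);
    $last = count($chunks) - 1;
    foreach ($chunks as $i => $chunk) {
        // The third argument tells the parser whether this is the final piece.
        if (xml_parse($parser, $chunk, $i === $last) !== 1) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    return ['calls' => $calls, 'text' => $text];
}

// Made-up 4000-character description standing in for real data.
$description = str_repeat('ACDEFGHIKL', 400);
$result = parseInChunks("<UNIPROT_ABSTRACT>$description</UNIPROT_ABSTRACT>", 1024);
```

Whatever value `$result['calls']` ends up with, `$result['text']` should hold the full description, which is the point of concatenating.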

re: jason at omegavortex dot com below, another way to deal with whitespace issues is:

function charData($parser, $data) {
    $char_data = trim($data);

    // If the chunk is not pure whitespace, keep it but collapse runs of spaces
    if ($char_data !== '') $char_data = preg_replace('/ +/', ' ', $data);

    $this->cdata .= $char_data;
}

This means that:

<p>here is my text <a href="something">my text</a> and here is some more after some spaces at the beginning of the line</p>

comes out properly. You could do further replacements if you want to deal with tabs in your files; I only ever use spaces. If you only use trim(), you would lose the space before the <a> tag above, but trim() is a good way to check for completely empty character data; after that, just replace runs of more than one space with a single space. This preserves a single space at the beginning and end of the cdata.
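An alternative sketch, not the approach of the note above: concatenate the raw pieces first and normalise whitespace once, on the complete text. This also covers tabs and newlines. The helper name `normalizeWhitespace` and the sample pieces are hypothetical.

```php
<?php
// Hypothetical helper: collapse every run of whitespace (spaces, tabs,
// newlines) into a single space, then trim the ends of the complete text.
function normalizeWhitespace(string $text): string
{
    // preg_replace() returns null only on a regex error; fall back to the input.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Pieces as a character data handler might receive them for the <p> example.
$pieces = ["here is my text ", "my text", " and here is\n", "    some more"];

$buffer = '';
foreach ($pieces as $piece) {
    $buffer .= $piece; // concatenate first, as with any split data
}
$normalized = normalizeWhitespace($buffer);
```

Normalising after concatenation means a space that happens to fall at the edge of one piece is not lost, which is the problem with calling trim() on each piece.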

This is an addition to the note posted by wiart at yahoo dot com (22-Aug-2003 05:31), which is located below.

I had similar problems manually creating XML docs and adding new-lines within my node data, e.g.

<root>
  ...
  <node attribute="something">
    Here is some data. There is a lot of data, and I want to be able to read the
    data from a terminal window, so I add
    newlines to fit everything within 80 columns.
  </node>
  ...
</root>

So, given the above example, my data handler gets called 3 times and the result left in my variable is:

"newlines to fit everything within 80 columns."

Instead of all of the data within "node", which I was expecting.

However, by using the concatenation operator, as suggested by the note mentioned above, I was able to get what I needed, which is of course:

"Here is some data. There is a lot of data, and I want to be able to read the data from a terminal window, so I add newlines to fit everything within 80 columns."

I just want to mention that I ran into a problem when parsing an XML file using the character data handler. If your XML data file contains a string that is also the name of an internal PHP function, and you want to output it as a string, the parser doesn't seem to recognize it. I found a way around this problem. In my case I was storing a string with the value read, and I could not output the data. To work around the problem I added a backslash for every character in the data element.

e.g. <xml> from <element>read</element> to <element>////read</element>

I don't know whether anyone else has run into this problem, but I thought I would put it here in case someone is getting stuck on it.

Thanks to Christian Stocker for clearing up my entity issues, where some entities are parsed correctly and others not.

The problem is that "wide" entities with a large numeric code simply cannot fit in a single byte, and a single-byte encoding is the default for both source input to the parser and data output from the parser. So the parser puts out a "?" to say it could not store the code value. One could argue that if the input has a &#1234; the output should simply copy it as &#1234; instead of the "?", but that would still mean the parser behaves in two different ways depending on the code value, and anyway it doesn't do that.

So, we need UTF-8 encoding for the output, and the not entirely obvious way to say so is

$xml_parser = xml_parser_create ("UTF-8");

which means BOTH source input and data output are UTF-8. Remember that UTF-8 is a superset of basic ASCII but not of extended ASCII, so your input can contain e.g. &eacute; spelled out, but a native (single-byte) eacute character is wrong here. Just utf8_encode your input to be sure.
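A small sketch of the UTF-8 setup described above, using numeric character references for code points that do not fit in one byte. The function name `parseTextAsUtf8` and the sample element are illustrative only. The target encoding is also set explicitly here, which should be redundant after `xml_parser_create("UTF-8")` but makes the intent visible.

```php
<?php
// Hypothetical helper: parse $xml with UTF-8 as both the source and the
// target encoding and return all character data concatenated.
function parseTextAsUtf8(string $xml): string
{
    $text = '';
    $parser = xml_parser_create('UTF-8');
    // Ask for UTF-8 output so wide characters are not replaced by "?".
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data; // the text may arrive in several pieces
        }
    );
    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $text;
}

// &#225; is a with acute, &#345; is r with caron, &#382; is z with caron.
$comment = parseTextAsUtf8('<comment>koment&#225;&#345; slo&#382;ky</comment>');
```

The character references in this input are plain ASCII, so the input itself needs no conversion; only the output contains multi-byte UTF-8 sequences.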

The handler function is called several times while the parser works through the character data. It doesn't receive the entire string at once, as the documentation suggests. There are special cases that will always force the parser to stop scanning and call the character data handler. This is when:

- The parser runs into an entity reference, such as &amp; (&) or &apos; (')
- The parser finishes parsing an entity
- The parser runs into the new-line character (\n)
- The parser runs into a series of tab characters (\t)
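To see where the splits fall on a given installation, one option is to record every piece the handler receives. The sketch below does that. The function name `characterDataPieces` and the sample input are invented for illustration, and the exact boundaries will vary with the parser library PHP was built against, so no particular split is assumed.

```php
<?php
// Hypothetical helper: return each piece of character data exactly as the
// handler received it, in order.
function characterDataPieces(string $xml): array
{
    $pieces = [];
    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$pieces): void {
            $pieces[] = $data; // one array entry per handler call
        }
    );
    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $pieces;
}

// Sample text containing an entity reference, a newline and a tab.
$pieces = characterDataPieces("<p>fish &amp; chips\nand\tpeas</p>");

// Joining the pieces always gives back the full text, however it was split.
$joined = implode('', $pieces);
```

Passing `$pieces` to var_dump() shows the individual pieces for inspection.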