Created attachment 18852[details]
Preliminary patch removing ICU dependency
I've replaced the ICU dependency by the unicode functionality of GLib, Pango, and fribidi. The attached patch is experimental. It's probably quite a bit slower as I had to add many conversions between UTF-8 and UTF-16. TextBoundariesGtk.cpp implementation is still missing, so at least text entry won't work.
Pango already contains a copy of fribidi, however, the Pango API doesn't provide access to Bidi Type information. That's why I had to add a direct dependency on fribidi. Maybe this could be eliminated by extending the API in a future version of Pango.

Created attachment 20824[details]
Updated patch
I've updated the patch to not depend on fribidi anymore using the new function available in pango 1.21.0. I've also implemented TextBoundariesGtk.cpp, so selection in text input elements basically work again.
The patch still needs work, though. It doesn't register all available encodings and the semantics of the text break iterator implementation doesn't seem to completely match the ICU iterator.

In r33380 (http://trac.webkit.org/changeset/33380) the appendOmittingBOM function was removed. So this patch can't be cleanly applied to tip of trunk currently (appendOmittingBOM is used for example in TextCodecGtk::decode()). The solution I came up with so far was to explicitly specify UTF-16LE as the target encoding in g_iconv_open() of TextCodecGtk::createIConvDecoder(). Currently I only used LE, but Glib's byte-order macros could be used to determine the correct target encoding including its byte order at compile time.
This stops libiconv from generating BOMs for the result of the conversion. Consequently, there's no need for removing the BOMs when appending them to the Vector<UChar>, allowing us to just use result.append(buffer, count / sizeof(UChar)).
What is your opinion on this approach? Do you think that's the right direction to go or do you see any risks?

> This stops libiconv from generating BOMs for the result of the conversion.
Keep in mind that while it won't *add* a BOM, it will *keep* existing BOMs (zero-width no-break space to be precise) if the original stream contains them, wherever they are located (start of the stream or not)

I have downloaded r34824 and tried the patch. I have updated it so that it can be applied except I am new to this and have no idea about the now missing appendOmittingBOM. For now to get webkit building and hopefully running on GTK-DirectFB I have just commented it out. However I was wonderig what progress had been made with this patch?
If I get it running on my platform I will be happy to try out updated patches.
For now it is just compiling and then when I run it crashes so ill sort that first but the idea of removing ICU support is great hope you can finish what you started.
dan

(In reply to comment #10)
> I have downloaded r34824 and tried the patch. I have updated it so that it can
> be applied except I am new to this and have no idea about the now missing
> appendOmittingBOM. For now to get webkit building and hopefully running on
> GTK-DirectFB I have just commented it out. However I was wonderig what
> progress had been made with this patch?
> If I get it running on my platform I will be happy to try out updated patches.
> For now it is just compiling and then when I run it crashes so ill sort that
> first but the idea of removing ICU support is great hope you can finish what
> you started.
> dan
>
(In reply to comment #10)
> I have downloaded r34824 and tried the patch. I have updated it so that it can
> be applied except I am new to this and have no idea about the now missing
> appendOmittingBOM. For now to get webkit building and hopefully running on
> GTK-DirectFB I have just commented it out. However I was wonderig what
> progress had been made with this patch?
> If I get it running on my platform I will be happy to try out updated patches.
> For now it is just compiling and then when I run it crashes so ill sort that
> first but the idea of removing ICU support is great hope you can finish what
> you started.
> dan
>
I amtrying to remove ICU lib from webkit dep.
I have applied patch to r35806 version.
Commented out appendOmittingBOM from TextCodecGtk::decode of TextCodecGtk.cpp.
When i try to lauch GtkLauncher, there is not text on my GtkLauncher, though i have set google.com as url.
If any one suggests where is problem. I could try to resolve it.

Replacing ICU lib with Glib is a great idea.
Are there anyone success with Unicode support with Glib instead of ICU lib ?
I have applied the Glib patch for Unicode, But there is no text display on my browser(gtklauncher).

(In reply to comment #11)
> I amtrying to remove ICU lib from webkit dep.
>
> I have applied patch to r35806 version.
> Commented out appendOmittingBOM from TextCodecGtk::decode of TextCodecGtk.cpp.
> When i try to lauch GtkLauncher, there is not text on my GtkLauncher, though i
> have set google.com as url.
> If any one suggests where is problem. I could try to resolve it.
Yes, it does work, I can confirm that - although the patch attachment here is not up to date with trunk. Look at my comment #8. Commenting out appendOmittingBOM is not sufficient for getting it to work. The comment should give you pointers on how to resolve this.
I contacted Jürg via Email to get more information for setting up a development environment that supports pango 1.21 (which he used to get rid of a fribidi dependency) - no reply set. I was trying to work on an updated patch that is compatible with current trunk.

(In reply to comment #6)
> The patch still needs work, though. It doesn't register all available encodings [...]
Regarding listing the encodings, I'm currently considering two alternatives:
1) Probe g_iconv_open with all the supported encodings listed in the libiconv's list of supported encodings (http://www.gnu.org/software/libiconv/). For each probe, if successful, register the respective encoding with webcore. This avoids a direct dependency to libiconv but isn't very elegant.
2) Introduce a dependency to libiconv and use its iconvlist() function directly. This function takes a callback as a parameter and calls it for each of the available encodings. Introducing this dependency doesn't necessarily mean much extra trouble, as http://library.gnome.org/devel/glib/stable/glib-building.html#dependencies lists libiconv as a dependency of glib on many platforms.
Any other suggestions? Is it acceptable to introduce the libiconv dependency in the --with-unicode-backend=glib case?

Another possible approach would be to redesign WebCore text conversion classes so that they wouldn't need to enumerate all supported encodings. We've discussed this briefly before, but at that point, all conversion libraries we needed did provide encoding enumeration APIs.

I would think that a dependency on libiconv is ok as this is commonly a dependency of glib. My only concern would be to make sure that libiconv cross-compiles better than icu otherwise we have substituted one bad package for another.

(In reply to comment #17)
> I would think that a dependency on libiconv is ok as this is commonly a
> dependency of glib. My only concern would be to make sure that libiconv
> cross-compiles better than icu otherwise we have substituted one bad package
> for another.
Note that Glib on Win32 is built without iconv these days, so it is only required for unix platforms. In that respect avoiding the dependency would be preferrable.

Created attachment 23031[details]
Overview and Comparison of Text Codecs in Safari Win/Mac and libiconv
Libiconv becomes a less favorable choice for enumerating the available encodings given the fact that glib does not rely on libiconv on all platforms but uses the standard libc iconv on a number of Unices or win32 functions on Windows.
I have created an overview of Safari Win & Mac encodings compared to the ones supported by libiconv (and most likely supported by standard libc iconv, too, as from what I understand, these implementations are often very similar). This table shows that almost all Safari encodings can be covered by using these encoding names.
The table does not yet list all the aliases for each canonical encoding. For a start, a list of typical MIME encoding names could be extracted from: http://www.iana.org/assignments/character-sets.
The canonical encoding names in the libiconv column of this table could be used for probing whether glib's g_iconv_open supports this encoding on a particular platform.
Otherwise, the only way out would be to redesign WebCore's text codecs part, as Alexey pointed out in comment #16.

Created attachment 23124[details]
patch updated for probing for iconv encodings, updated to revision 36025
I updated Jürg's patch for probing the list of codecs that are supported by iconv (as stated on the libiconv web page). TIS-620, WINDOWS-1251, EUC-KR, WINDOWS-1253 added manually. This patch now works without a dependency to libiconv, but relies on hard-coded codec tables. It could serve as an interim solution until it's decided to change the text codecs architecture.

Comment on attachment 23124[details]
patch updated for probing for iconv encodings, updated to revision 36025
When asking for review, please mark patches as review?, not review+ (which means that a review has been granted).

I have used patch-36025 for Webkit version-36053 (using glib). I tried to open few html pages and some url using GtkLauncer on fedora-8, but facing following problems:
1) html page with english language is opening fine.
2) html page with korean or japanese language(utf-16) not showing any text.
3) tried to open url like google.com, these pages are not coming up.i put some debug statements in code so find out when launcher try to access the .js pages it hung there itself and not proceeding further.

Created attachment 23167[details]
replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes
(In reply to comment #25)
The UTF-8 troubles in the ChangeLogs are now hopefully fixed.
In addition, I tried being more coding style compliant.

(In reply to comment #26)
> 2) html page with korean or japanese language(utf-16) not showing any text.
It would be helpful if you could post the URLs you're having problems with. In my opinion, UTF-16 source-encoded web pages are really rare as using this encoding inflates the content size.
I tried some pages as found on the Alexa.com top 100 pages by country or language.
For Korea, for example http://www.dreamwiz.com/, http://www.gmarket.co.kr/ - These ones are EUC-KR encoded. They are okay from an encoding point of view, however I cannot verify rendering completely as I don't have Korean fonts on my system.
For Japanese I found that certain pages are in EUC-JP which might not be available. It depends on whether your system's iconv supports EUC-JP. My installation of iconv for example cannot instantiate a codec for this encoding. If you do a debug build, you will be able to see if there are problems instantiating a certain encoding by console error messages like: "EUC-JP not decodable => not available!" or "EUC-JP not encodable => not available!".
> 3) tried to open url like google.com, these pages are not coming up.i put some
> debug statements in code so find out when launcher try to access the .js pages
> it hung there itself and not proceeding further.
google.com, google.co.kr, google.co.jp display fine here. It would be helpful if you could provide more details and specify the exact URLs or procedure that lead to the problems you are observing.
Have you tried the same in a built of the same revision but without applying the patch? - Just compare whether it's a new issue introduced by this page or one that might have existed previously.

A TextCodec implementation is supposed to keep partial bytes in it internal state, unless decode() is called with flush argument. In the latter case, an incomplete character is an error. This is handled by ICU internally, so TextCodecICU doesn't need any such logic (but TextCodecUTF16 has it).

Its strange, but i tried on diffrent versions of webkit by applying patches(glib) on that, still iam unable to open any url on gtklauncher. Following are the version which right now iam using:
webkit->r36477
glib-patch->Patch updated for 36268
pango->pango-1.21.4
glib->glib-2.17.6
iconv->libiconv-1.12
I tried to open following url :
www.google.co.in
www.google.co.kr
www.daum.net
www.ana.co.jp
none of the above url opened in GtkLuancher.
Is some modification or some other packages are required?
Also all these url's opened well with ICU.
if some more information required please tell.

Comment on attachment 23263[details]
Replacing ICU with glib
There are a bunch of webkit-style violations in this patch, which make it difficult to r+.
It concerns me that you'd need to copy all those macros from the ICU headers. Does GLIB not provide any equivalents?
Things like where * and & go, relative to the variable name.
Single line if/else statements don't use {}
Use of c-style casts (Foo*)foo instead of static_cast<Foo*>(foo)
Each variable gets its own declaration line, no int foo, bar;
4-space indents
0 instead of NULL when referring to pointers in c++ code
true and false instead of TRUE and FALSE when dealing with c++ "bool" types
Header #define names should be identical to the header, like #define UnicodeGLib_h
Anything copied from Glib should really be in its own header file, IMO. Separating things out into their own file makes it easier to update them if they ever change upstream, or for platforms which have the necessary headers to not use our local copies of that data.
Probably the same goes of the ICU macros. UnicodeMacrosFromICU.h would be one possible name. :)
WebKit source code files are generally UT8, I'm not sure what encoding you're editing with:
+ * Copyright (C) 2008 Jürg Billeter <j@bitron.ch>
There is a lot of code in UnicodeGLib.h, much of that could be moved into a corresponding .cpp file.
:( Even though this code touches Gtk, it would be best if we could avoid manual-menory management. Manual memory management is *evil* (IMO) and gets you into all sorts of trouble. I think there was a GOwnPtr landed recently, no? That would know how to call g_free on its contents when it went out of scope? If so, one could use that instead of all the manual g_free calls.
Thanks for the patch, but I think we need another round of cleanup to this one.

(In reply to comment #34)
Thanks for your remarks, Eric. I will have a closer look at it once I find the time, maybe Jürg can support in this if he's available.
Regarding the GOwnPtr, I will try to incorporate it. I saw the discussions but it wasn't landed at the time.

Created attachment 25314[details]
Replacing ICU with glib
Another round, now updated to apply cleanly to 38619:
* lots of coding style fixes, with special focus on the ones Eric pointed out
* tried to use GOwnPtr instead of manual memory management
* externalized GLib casefolding table and macros copied from ICU to separate files
* Changes to GNUMakefile.am in JSC depending on choice of unicode backend
This is a preliminary version. Before asking for a review I need to do some more testing on pages with exotic encodings.
Kalle, if you have some time, could have a go with this one? Any feedback is welcome.

(In reply to comment #41)
> Does this pass the "fast/encoding" tests? You can run them with
> ./WebKitTools/Scripts/run-webkit-tests --gtk (--debug|--release) fast/encoding.
Unfortunately not (yet). I've started looking into it. What's the expected outcome? Would this patch have to pass all of them?
2 or 3 test cases (admittedly the easier ones) depend on a certain canonical name for the encoding which seems different in ICU and GLib (for example windows-1256 != CP1256). Would it be acceptable to modify the test cases here, e.g. doing the actual result check with a more tolerant RegExp in JS and print PASS/FAIL?
Also, some of the tests specifically checking the decoding result for some exotic encodings will not pass because libiconv underlying to the GLib routines does not support all of them. How can we account for that?

(In reply to comment #42)
> 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical
> name for the encoding which seems different in ICU and GLib (for example
> windows-1256 != CP1256). Would it be acceptable to modify the test cases here,
> e.g. doing the actual result check with a more tolerant RegExp in JS and print
> PASS/FAIL?
We need to decide on a case by case basis. In particular, windows-1256 is a registered IANA name, unlike CP1256, so changing from former to latter is not an improvement. In fact, some cpXXX names do not even match windows-XXX ones with the same number (e.g. windows-949 vs. cp949).
I don't have an opinion on whether the issues you described should be dealt with now, or left for later.

(In reply to comment #42)
> Unfortunately not (yet). I've started looking into it. What's the expected
> outcome? Would this patch have to pass all of them?
The regression tests tell you what behavior you're changing. Policy for what can be right and wrong in a given port is a different question. We have a mechanism for having port-specific results for various tests, so you can have expected failures in some ports rather than others.
> 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical
> name for the encoding which seems different in ICU and GLib (for example
> windows-1256 != CP1256). Would it be acceptable to modify the test cases here,
> e.g. doing the actual result check with a more tolerant RegExp in JS and print
> PASS/FAIL?
The names actually used on the web are the ones that matter. Also, consistent availability of the same names for encoding across platforms. But this is not yet extensively tested -- the test coverage is sparse.
The text encoding libraries often don't match what's needed on the web very closely. Back early in the WebKit project, when we used the Mac-specific "Text Encoding Converter" library we had a long list of encoding names that had to be added explicitly; we didn't even try to use the names in the library itself. When we switched to the ICU library, we discovered that it had a much better list of names in the library, so the list of encoding names that need to be hardcoded in WebKit itself for ICU is a lot shorter. I imagine that GLib omits many encoding names that are needed for practical compatibility with the web, so you'll need code to deal with this.
To see how this is implemented for ICU, look at the TextCodecICU::registerExtendedEncodingNames function.
To see the remnants of the former mechanism used in older versions of WebKit, look at the TextCodecMac::registerEncodingNames function and at the character-sets.txt, mac-encodings.txt, and make-charset-table.pl files in the platform/text/mac directory.
> Also, some of the tests specifically checking the decoding result for some
> exotic encodings will not pass because libiconv underlying to the GLib routines
> does not support all of them. How can we account for that?
Those test failures are specific examples of how a version of WebKit using the GLib functions would be deficient compared to versions based on other text decoding libraries. There are at least three things we can do about this:
1) Put expected test results in a platform-specific test result directory, reflecting the known shortcomings of the GLib-based support.
2) Supplement the GLib decoding with some other mechanism so we can support more encodings without pulling in an entire library, such as ICU.
3) Take the existing tests and break them into multiple tests to separate "important" from "exotic" encodings, so that the tests that cover important encodings don't have to have platform-specific results. However, this will require discussion of what encodings are important enough.

(In reply to comment #39)
> Kalle, if you have some time, could have a go with this one? Any feedback is
> welcome.
I took it for a spin, and visually it worked for me :)
Valgrind disagreed though, there were two leaks introduced. Fortunately fixing those was trivial:
--- a/WebCore/platform/graphics/gtk/FontGtk.cpp
+++ b/WebCore/platform/graphics/gtk/FontGtk.cpp
@@ -139,6 +139,7 @@ static gchar* convertUniCharToUTF8(const UChar* characters, gint length, int fro
pos += start;
len -= start;
}
+ g_free(utf8);
return g_string_free(ret, FALSE);
}
The contents of utf8 is appended to a gstring which copies the data, so the original needs to be freed.
--- a/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp
+++ b/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp
@@ -67,6 +67,8 @@ static TextBreakIterator* setUpIterator(bool& createdIterator, TextBreakIterator
iterator->m_type = type;
iterator->m_length = length;
+ if (createdIterator)
+ g_free(iterator->m_log_attrs);
iterator->m_log_attrs = g_new0(PangoLogAttr, length + 1);
iterator->m_index = -1;
pango_get_log_attrs(utf8.get(), utf8len, -1, 0, iterator->m_log_attrs, length + 1);
Here the PangoLogAttr array was leaked for each setup after the first one. Since this seems to be done a lot, I wonder if using the g_slice API would be a good idea performance-wise... That would require profiling to prove it's hurting though.

(In reply to comment #45)
> (In reply to comment #39)
> I took it for a spin, and visually it worked for me :)
Thanks for your feedback and taking a look, Kalle!
> Valgrind disagreed though, there were two leaks introduced. Fortunately fixing
> those was trivial:
>
> --- a/WebCore/platform/graphics/gtk/FontGtk.cpp
> +++ b/WebCore/platform/graphics/gtk/FontGtk.cpp
> @@ -139,6 +139,7 @@ static gchar* convertUniCharToUTF8(const UChar* characters,
> gint length, int fro
> pos += start;
> len -= start;
> }
> + g_free(utf8);
> return g_string_free(ret, FALSE);
> }
>
> The contents of utf8 is appended to a gstring which copies the data, so the
> original needs to be freed.
Interesting that you found this one. The patch actually doesn't touch FontGtk.cpp AFAICS. So this leak must have been there before. I think it would make sense to file this one as a separate bug. FontGtk.cpp looks like it would profit from being reworked using GOwnPtr, so another option would be to just add this problem to bug 21594.
> Here the PangoLogAttr array was leaked for each setup after the first one.
> Since this seems to be done a lot, I wonder if using the g_slice API would be a
> good idea performance-wise... That would require profiling to prove it's
> hurting though.
So far, I just incorporated your fix proposal.

Created attachment 25627[details]
Replacing ICU with glib, IDN support
New iteration:
* incorporated one of the leak fixes from Kalle for TextBreakIteratorGtk.cpp
* added IDN support using libidn (autoconf check included)
(this is required to pass /fast/encoding/url-hostname-non-ascii.html)
* reordered some encoding names, added some aliases and changed some letter cases in order to pass the testcases that check for specific encoding names
Now passes a lot more of the test cases, the remaining failing ones are mostly GBK and Hebrew encoding cases, which I am unable to test on my machine as my libiconv doesn't have them.

Comment on attachment 25627[details]
Replacing ICU with glib, IDN support
Do all these changes have to go in one big patch? It seems that the codec changes are entirely separate from the other changes. By making the patch big, you make it a lot harder to review, so you'll end up waiting longer for someone to have time to review the whole thing.
Roughly speaking, it looks good, but I don't see any reason to tie the codec changes with the others.

(In reply to comment #51)
> Roughly speaking, it looks good, but I don't see any reason to tie the codec
> changes with the others.
Darin, first of all, thanks for the positive feedback.
I am not sure if I understand how you see the codec changes separate from the rest of the patch. Could you explain in a little more detail?
TextCodecGtk.cpp and TextCodecGtk.h are new in this patch, not modifying old code and essential to the overall functioning of this patch. On the one hand, on their own in a separate patch they wouldn't add anything functional and on the other hand, the changes in TextEncodingRegistry.cpp for example wouldn't work without these new files.
Maybe my Changelog entry
+ TextCodecGtk now supports partially transmitted multibyte characters.
is misleading. TextCodecGtk was only existing in prior versions of the patch, it didn't exist in the trunk before.
So from my point view the patch cannot be split into codec changes / other changes - but I might have misunderstood your comment.

(In reply to comment #53)
> (In reply to comment #51)
> > Roughly speaking, it looks good, but I don't see any reason to tie the codec
> > changes with the others.
>
> Darin, first of all, thanks for the positive feedback.
> I am not sure if I understand how you see the codec changes separate from the
> rest of the patch. Could you explain in a little more detail?
He probably means that the changes could be split to allow individual review, not the whole 78kB of it :)
On another issue, I guess related to comment #50, there's something that introduces a seemingly infinite loop with the glib code. I haven't been able to make real sense of it, but at least it's easy to reproduce:
1. Load google.com
2. type something to the search field (I think it needs to be > somecount chars, but not sure. 1234567890 should work)
3. go back to the beginning of the line and try typing additional text there
4. consistently, glib backend goes in to a busyloop while icu does not
I've tried to print some debugs from the iterator constructor, but haven't seen anything exceptionally funny in there (but I admit that I'm not terribly well aware of what is actually happening there...).

(In reply to comment #53)
> I am not sure if I understand how you see the codec changes separate from the
> rest of the patch. Could you explain in a little more detail?
> TextCodecGtk.cpp and TextCodecGtk.h are new in this patch, not modifying old
> code and essential to the overall functioning of this patch. On the one hand,
> on their own in a separate patch they wouldn't add anything functional and on
> the other hand, the changes in TextEncodingRegistry.cpp for example wouldn't
> work without these new files.
I mean that the TextCodecGtk and TextEncodingRegistry changes together can stand alone, and don't rely on the WTF changes. I see nothing where it requires the WTF changes.
Similarly, the IDN changes don't have to go with either the codec changes or the WTF changes.
Same for the text break iterator changes.
These seem like they could be done in at least 4 separate pieces, each easier to review, and each helpful in making progress.

Can I ask if libiconv is a definate requirement now?
I have libglib without libiconv support and when I run with an older version of this patch and webkit it crashes.
If I build glib with libiconv=gnu option then webkit runs. However to get this to work for me requires some hacking so ideally I would not have a requirement to build libiconv.
Hope someone can give a yes/no answer
Cheers
Dan

libiconv is not a direct requirement of this patch. The patch in its current form requires GLib and libidn. It needs at least GLib functions g_convert_with_iconv, g_iconv_open, g_iconv_close to be available and functional. So the patch indirectly depends on what your GLib's using to implement those functions.

(In reply to comment #54)
> On another issue, I guess related to comment #50, there's something that
> introduces a seemingly infinite loop with the glib code. I haven't been able to
> make real sense of it, but at least it's easy to reproduce:
>
> 1. Load google.com
> 2. type something to the search field (I think it needs to be > somecount
> chars, but not sure. 1234567890 should work)
> 3. go back to the beginning of the line and try typing additional text there
> 4. consistently, glib backend goes in to a busyloop while icu does not
Yesterday I took time to debug this further and found the following (excerpt from mail conversation):
I did some digging with gdb and found that the busyloop
revolves inside VisiblePosition::leftVisuallyDistinctCandidate() in
WebCore/editing/VisiblePosition.cpp. There's two while(true) loops
there and the inner one never exits. After a while of
head-scratching, comparison with ICU implementation and reading ICU
reference, I came to the conclusion that the following should be the
correct fix:
@@ -197,7 +197,7 @@ int textBreakPrevious(TextBreakIterator* bi)
return i;
}
}
- return textBreakFirst(bi);
+ return TextBreakDone;
}
The icu docs for ubrk_preceding() say:
"The value returned is always smaller than offset, or UBRK_DONE."
but the glib implementation was never returning DONE (which is -1)
even for 0 since it calls the BreakFirst which (as the BreakLast
comments also state) always returns the first character of the text.
Also note that BreakNext also returns DONE instead of the last
character already.
---
Dominik agreed and will incorporate this into the next petch/patches.

Thanks to Darin's suggestions, for splitting the patch into smaller pieces, my plan is as follows. I will create 4 patches, the first one for moving WTF Unicode to GLib, the second one for moving Text Codecs from ICU to GLib, the third for switching TextBreakIterator to GLib and the fourth one for adding IDN support.
For the first two modificaitons, I have created two mostly independent patches which can be reviewed separately. Unfortunately, they overlap in changes for /configure.ac and /GNUMakefile.am. This is because I introduced a compile flag for compiling a hybrid WebKit version that has a mixed ICU/GLib Unicode backend.
Patch 2/4 can be applied after 1/4 if the changes to configure.ac and GNUMakefile.am and JavaScriptCore/wtf/unicode/Unicode.h are ignored/skipped.
I can quickly clean up the second patch in this regard once the first is landed.
My plan is that once these two would be landed, I will provide 3/4 and 4/4 containing the TextBreakIterator change and the IDN support - applying cleanly to a future SVN revision that contains 1&2.
If anyone has a suggestion for improving or simplifying this staged process, please let me know. Also, any testing/feedback on the two separated patches is very welcome.

Created attachment 26793[details]
1/4 - Moving WTF Unicode backend to GLib (v2)
(In reply to comment #62)
Darin, thanks for taking the time to review the patch. Here's another iteration that tries to address your comments.
> (From update of attachment 26711[details] [review])
> > Index: JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h
> > ===================================================================
> > [...]
>
> Is this file really needed?
I got rid of it at the price of a single-character ucs4->utf8->casefold->ucs4 conversion. See below regarding the conversion costs.
> > +} casefold_table[] = {
> This is not the normal naming for identifiers in WebKit. This would typically
> be inside the WTF::Unicode namespace and be named something like caseFoldTable
> or CaseFoldTable.
gone.
> > +#include "CasefoldTableFromGLib.h"
> Do we really need to compile in a copy of the table into each source file?
gone as well.
> > + inline UChar32 foldCase(UChar32 ch)
>
> These functions are pretty long. I don't think it makes sense to inline them.
I disabled inlining for the longer ones and moved them to a new file UnicodeGLib.cpp
> > + inline int umemcasecmp(const UChar* a, const UChar* b, int len)> Converting to UTF-8 just to do the case folding is going to be very
> inefficient. This is going to result in a quite-slow GTK port if it's used
> anywhere.
>
> In general we had to work hard to eliminate the conversion from UTF-16 and
> UTF-8 and back that used to be present in WebKit and JavaScriptCore in its core
> algorithms, and move that conversion to the external API. I think you need to
> do some performance tests if you're going to introduce memory allocation and
> UTF-8 conversion into these algorithms, unless it's OK to have the performance
> of the GTK port suffer.
Looking at the public GLib Unicode API (http://library.gnome.org/devel/glib/stable/glib-Unicode-Manipulation.html) it seems difficult to call the more interesting unicode functionality with utf-16 strings directly - casefolding, collation, normalization all require utf-8. Please see below for my my view regarding these costs.
> > +#ifndef UnicodeGLibTypes_h
> > +#define UnicodeGLibTypes_h
> > +
> > +typedef uint16_t UChar;
> > +typedef int32_t UChar32;
> > +
> > +#endif
>
> Seems a little excessive to have a separate header file for these, but it might
> be OK.
Pulled that into UnicodeGLib.h
Also added a FIXME in umemcasecmp that discusses the discrepancy to the icu implementation.
> I'm going to say review- because of some of the issues with the case folding
> table and the performance of the UTF-8 conversion.
While I am aware it's probably not a very appropriate benchmark, for a first and very rough idea, I compared the ICU build against the GLib build running sunspider. You can see a perfomance regression of 2-3% in most of the string tests. I will attach these results.
My understanding is that this patch helps to reduce ROM footprint when deploying on an embedded platform by eliminating the ICU dependency, saving approximately 10MB (as Alp reports). Currently, the glib backend is optional, icu would remain default. So in my opinion, the benefit of this patch is that it gives a choice to integrators between sacrificing a little performance in favor of the binary reduction or package ICU onto their targets. Future changes in GLib could lead to removing the utf-16 to utf-8 conversions eventually.

(In reply to comment #65)
> When will the two missing patches be published ?
As I mentioned in comment #59, the plan is to publish them as soon as 1 & 2 are landed. There is not really much new in patch 3 & 4 to come, the code is already available in patch "Replacing ICU with glib, IDN support" from 1st of December. It's just split into 4 parts now. HTH.

Comment on attachment 26712[details]
2/4 - Moving Text Codecs to GLib
No comments have been made in over a month. Marking this r- to remove it from the review queue. Please email one of the gtk reviewers directly to get comments.
Feel free to post for re-review if you have had contact with a reviewer and believe this will be reviewed in a timely manner.

Comment on attachment 26793[details]
1/4 - Moving WTF Unicode backend to GLib (v2)
No comments have been made in over a month. Marking this r- to remove it from the review queue. Please email one of the gtk reviewers directly to get comments.
Feel free to post for re-review if you have had contact with a reviewer and believe this will be reviewed in a timely manner.

Comment on attachment 26793[details]
1/4 - Moving WTF Unicode backend to GLib (v2)
> + https://bugs.webkit.org/show_bug.cgi?id=15914
> + [GTK] Implement Unicode functionality using GLib
I have gone through all the comments now, and have verified that most comments have been addressed. I think the best way to move forward is to land the patch in its current state, before it gets old enough that too much has changed, now that you have made it smaller and self-contained. I have given it another pass, and found only indentation problems, which I fixed (see UnicodeGLib.h for reference, so that you can get it right for the next patches), and two missing function implementations (including the one found by Mikkel), which I added. I also fixed header inclusion on top of UnicodeGLib.h.
I have tested that the ICU backend keeps working, and passing tests. This, along with the facts that icu keeps being the default, and that your patch touches little cross platform code makes me believe that we're ready for landing it now.

(In reply to comment #78)
> I would love to see this progress, if it weren't for the bleeding edge Glib
> requirement making it unusable on my N900 :-/
Upgrading the glib in the system should be trivial, I'm sure at worst there's a handful of patches applied to it.

(In reply to comment #80)
> Emphasis on "should". Glib can't (easily) be replaced because several system
> applications explicitly require the version that is ships with.
Upgrade your glib and then just change the version number back to the previous one. Stupid workarounds for stupid applications (and good God, file bugs).

(In reply to comment #84)
> Doing more than one patch per bug, is generally a recipe for slow reviews. In
> this case, I'm scared to review this because I don't have time to read all 83
> comments before doing my review. :(
Well, in comment #51 Darin asked me to split it into smaller portions, that's why there's more than one patch. Maybe Gustavo can help reviewing this as he reviewed 1/4?

Created attachment 42210[details]
2/4 - Moving Text Codecs to GLib (v4)
(In reply to comment #88)
> Regarding this, there's an exception to the camel case rule for acronyms, and
> indeed all usages of "utf" in this file seem to be upper-cased, so I think we
> should strive for consistency. This seems to be the only problem I am able to
> spot by now. Sorry for being nit-picky, but please upload a patch with this
> fixed so that we can use the commit queue =). Thanks!
Sure, here it is. Thanks for reviewing.

Created attachment 42465[details]
2/4 - Moving Text Codecs to GLib (v5)
(In reply to comment #92)
> Looks like the patch is rooted incorrectly. It's possible that
> svn-create-patch crawled up to the wrong repository root. If you believe that
> it's a problem with svn-create-patch, please file a bug. :)
Argh, yes, there was an orphaned .svn folder one level up, looks like svn-create-patch only stopped at this one. Here's a correctly rooted version of the patch, otherwise no changes.

Comment on attachment 42465[details]
2/4 - Moving Text Codecs to GLib (v5)
If you believe svn-create-patch to have been in error in its handling of this patch creation please do file a bug. :) But yes this patch looks good. Since all patches require both an r+ and a cq+ for the commit-queue to land, I'm adding an r+ here too, mostly based on gns's previous review.

(In reply to comment #97)
> Patches 3 and 4 are still pending, hence re-opening.
They're not marked for review? Should they be?
I really think we should close this bug and start a new ones for any remaining changes.

(In reply to comment #99)
> (In reply to comment #97)
> > Patches 3 and 4 are still pending, hence re-opening.
>
> They're not marked for review? Should they be?
No, nothing to be marked for review here.
> I really think we should close this bug and start a new ones for any remaining
> changes.
I agree, I'm creating those and adding them as dependencies. So, there won't be any new patches on this one.

The patch for sub-bug 31470 was landed, so this one here can be closed. I don't seem to have the rights for it, could somebody help me?
I created a new issue for streamlining the differences in behavior between ICU and GLib unicode backend: bug 34247.