[Configuration] Problems reading Chinese text from an XMLConfiguration


Matthias Bräuer
Hello,

I'm having problems reading Chinese data from an XMLConfiguration. The
configuration file is encoded in UTF-8. For instance, in my test file
the attribute 'name' of the element 'source' is called "我的文件"
(Chinese for "My files"). When I request this value from the
configuration I get back "�??�??�??件" which obviously is the result of
some wrong character decoding.

The 'name' attribute is used as a key for a HashMap. Consequently,
searching with the key '我的文件' (the original Chinese characters) does
not return the entry because apparently the hash code of this Unicode
string differs from what the XMLConfiguration returned. Also, printing
the name on a JLabel with a Chinese-capable font like "SimSun" gives the
wrong result listed above. However, when I write the configuration back
to a file, the correct Unicode characters are written.
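A minimal sketch of how such a symptom can arise, using only the JDK. It assumes, purely for illustration, that the UTF-8 bytes were decoded as ISO-8859-1; the charset actually involved here is not known, and the class and method names are hypothetical. The mis-decoded string is a different string, so a HashMap lookup with the real characters misses, yet with a lossless single-byte charset the original bytes survive and can be written back unchanged:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class MojibakeDemo {
    // "My files" in Chinese, written with Unicode escapes so that the
    // encoding of this source file cannot interfere with the demo.
    static final String ORIGINAL = "\u6211\u7684\u6587\u4ef6";

    // Encode as UTF-8, then decode with the wrong charset (assumed: ISO-8859-1).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverse the mistake: recover the raw bytes and decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode(ORIGINAL);

        Map<String, String> sources = new HashMap<>();
        sources.put(garbled, "entry");

        System.out.println(garbled.length());                  // 12: one char per UTF-8 byte
        System.out.println(sources.get(ORIGINAL));             // null: the keys are not equal
        System.out.println(repair(garbled).equals(ORIGINAL));  // true: the bytes were preserved
    }
}
```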

I used the following code fragment to investigate the problem:

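        XMLConfiguration config = null;

        try {
            // Load the UTF-8 encoded test file
            config = new XMLConfiguration("tests/conf/sources_chinese.xml");
        }
        catch (ConfigurationException e) {
            e.printStackTrace();
        }

        // Value as returned by the configuration
        String name = config.getString("source(0)[@name]");
        // Reference value built directly from a string literal
        String name2 = "我的文件";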
       
When I use a debugger to check the memory content I see the correct
Chinese characters in the debug view for the (manually constructed)
'name2'. However, the variable 'name' (which is read from the
XMLConfiguration) shows the garbled "�??�??�??件" string.

It appears to me that XMLConfiguration reads and writes in a different
format than UTF-8 which, however, would be a bit strange. I had no time
today to check this in the sources. Maybe someone on the list has
experienced similar problems. Please do not point me to the Java
internationalization pages, I've browsed through these a long time. :-)

Thank you very much in advance,
Kind regards from Taiwan, Matthias




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: [Configuration] Problems reading Chinese text from an XMLConfiguration

Oliver Heger-2
Matthias Bräuer wrote:

> Hello,
>
> I'm having problems reading Chinese data from an XMLConfiguration. The
> configuration file is encoded in UTF-8. For instance, in my test file
> the attribute 'name' of the element 'source' is called "我的文件"
> (Chinese for "My files"). When I request this value from the
> configuration I get back "�??�??�??件" which obviously is the result
> of some wrong character decoding.
>
> The 'name' attribute is used as a key for a HashMap. Consequently,
> searching with the key '我的文件' (the original Chinese characters)
> does not return the entry because apparently the hash code of this
> Unicode string differs from what the XMLConfiguration returned. Also,
> printing the name on a JLabel with a Chinese-capable font like
> "SimSun" gives the wrong result listed above. However, when I write
> the configuration back to a file, the correct Unicode characters are
> written.
>
> I used the following code fragment to investigate the problem:
>
>        XMLConfiguration config = null;
>
>        try {
>            config = new XMLConfiguration("tests/conf/sources_chinese.xml");
>        }
>        catch (ConfigurationException e) {
>            e.printStackTrace();
>        }
>
>        String name = config.getString("source(0)[@name]");
>        String name2 = "我的文件";
>
> When I use a debugger to check the memory content I see the
> correct Chinese characters in the debug view for the (manually
> constructed) 'name2'. However, the variable 'name' (which is read from
> the XMLConfiguration) shows the garbled "�??�??�??件" string.
>
> It appears to me that XMLConfiguration reads and writes in a different
> format than UTF-8 which, however, would be a bit strange. I had no
> time today to check this in the sources. Maybe someone on the list has
> experienced similar problems. Please do not point me to the Java
> internationalization pages, I've browsed through these a long time. :-)
>
> Thank you very much in advance,
> Kind regards from Taiwan, Matthias
>
I am no expert on the encoding of Chinese characters, so I am not sure
whether this really helps: XMLConfiguration allows you to specify the
exact encoding you want to use by calling the setEncoding() method. This
method must be called before load(). Did you try this?

Note also that in configuration 1.1 final there was a bug that the
encoding was not always taken into account
(http://issues.apache.org/bugzilla/show_bug.cgi?id=34204). So you might
want to check out the newest version from SVN.

HTH
Oliver

Re: [Configuration] Problems reading Chinese text from an XMLConfiguration

Jason Lea
In reply to this post by Matthias Bräuer
XML allows you to specify the encoding of the document in the XML
declaration; without one (and without a byte order mark) a parser
assumes UTF-8.

Files normally have something like this at the top:

<?xml version="1.0" encoding="ISO-8859-1"?>

Change it to

<?xml version="1.0" encoding="UTF-8"?>

If the declaration is missing from the XML document, or is not the very
first item, add it.
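To illustrate, here is a small JDK-only sketch (not Commons Configuration code) in which the parser is handed raw bytes and picks the charset from the XML declaration itself. The root element name "sources" and the class and method names are assumptions made for the example; only the 'source' element and its 'name' attribute come from the thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlEncodingDemo {
    // "My files" in Chinese, as Unicode escapes.
    static final String NAME = "\u6211\u7684\u6587\u4ef6";

    // Build a tiny UTF-8 document with an explicit XML declaration.
    static byte[] sampleXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<sources><source name=\"" + NAME + "\"/></sources>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Parse from a byte stream so the parser, not the caller, decides
    // how to decode the bytes (based on the declaration).
    static String readName(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            Element source = (Element) doc.getDocumentElement()
                    .getElementsByTagName("source").item(0);
            return source.getAttribute("name");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readName(sampleXml()).equals(NAME)); // true
    }
}
```

Passing a byte stream rather than a Reader matters here: with a Reader the caller has already chosen a charset and the declaration is no longer consulted for decoding.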

Matthias Bräuer wrote:

>Hello,
>
>I'm having problems reading Chinese data from an XMLConfiguration. The
>configuration file is encoded in UTF-8. For instance, in my test file
>the attribute 'name' of the element 'source' is called "我的文件"
>(Chinese for "My files"). When I request this value from the
>configuration I get back "�??�??�??件" which obviously is the result of
>some wrong character decoding.
>
>The 'name' attribute is used as a key for a HashMap. Consequently,
>searching with the key '我的文件' (the original Chinese characters) does
>not return the entry because apparently the hash code of this Unicode
>string differs from what the XMLConfiguration returned. Also, printing
>the name on a JLabel with a Chinese-capable font like "SimSun" gives the
>wrong result listed above. However, when I write the configuration back
>to a file, the correct Unicode characters are written.
>
>I used the following code fragment to investigate the problem:
>
>        XMLConfiguration config = null;
>      
>        try {
>            config = new XMLConfiguration("tests/conf/sources_chinese.xml");
>        }
>        catch (ConfigurationException e) {
>            e.printStackTrace();
>        }
>      
>        String name = config.getString("source(0)[@name]");
>        String name2 = "我的文件";
>      
>When I use a debugger to check the memory content I see the correct
>Chinese characters in the debug view for the (manually constructed)
>'name2'. However, the variable 'name' (which is read from the
>XMLConfiguration) shows the garbled "�??�??�??件" string.
>
>It appears to me that XMLConfiguration reads and writes in a different
>format than UTF-8 which, however, would be a bit strange. I had no time
>today to check this in the sources. Maybe someone on the list has
>experienced similar problems. Please do not point me to the Java
>internationalization pages, I've browsed through these a long time. :-)
>
>Thank you very much in advance,
>Kind regards from Taiwan, Matthias

--
Jason Lea




Re: [Configuration] Problems reading Chinese text from an XMLConfiguration - Problem solved, thanks

Matthias Bräuer-2
In reply to this post by Oliver Heger-2
Hello

> XML allows you to specify the encoding of the document, otherwise it
> defaults to UTF-8.
> [...] change it to <?xml version="1.0" encoding="UTF-8"?> if that
> isn't in the xml document or the very first item, add it.

The encoding was correctly specified in the XML document. The reason for
the error was indeed the bug Oliver pointed at:

> Note also that in configuration 1.1 final there was a bug that the
> encoding was not always taken into account
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=34204). So you
> might want to check out the newest version from SVN.

With the current version of Commons Configuration (1.2 dev) the encoding
is recognized correctly and all Chinese characters remain intact.
This works even without specifying the encoding beforehand (by calling
setEncoding()). However, before saving a configuration to another XML
document, setEncoding() still has to be called, or the file will not be
stored in the desired UTF-8 encoding.
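Why the charset matters when saving can be reproduced with the JDK alone. The sketch below is not Commons Configuration code: the class and method names are hypothetical, and ISO-8859-1 merely stands in for whatever non-UTF-8 charset a save might fall back to when none is set explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SaveEncodingDemo {
    // "My files" in Chinese, as Unicode escapes.
    static final String NAME = "\u6211\u7684\u6587\u4ef6";

    // Write the text through a Writer bound to the given charset and
    // return the bytes that would end up in the file.
    static byte[] save(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = save(NAME, StandardCharsets.UTF_8);
        byte[] latin1 = save(NAME, StandardCharsets.ISO_8859_1);

        // UTF-8 keeps every character (3 bytes each).
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(NAME)); // true
        // ISO-8859-1 cannot represent the characters; each becomes '?'.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));       // ????
    }
}
```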

Thanks a lot for helping me so quickly.

Best wishes,
Matthias

