[jira] Created: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters


JIRA jira@apache.org
StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters
-----------------------------------------------------------------------------------------------------

                 Key: LANG-480
                 URL: https://issues.apache.org/jira/browse/LANG-480
             Project: Commons Lang
          Issue Type: Bug
    Affects Versions: 2.4
         Environment: doesn't matter
            Reporter: Alexander Kjäll
            Priority: Minor


Characters that Java represents internally as two chars (a surrogate pair) are incorrectly converted by the function. The following test displays the problem quite nicely:

import org.apache.commons.lang.*;

public class J2 {
    public static void main(String[] args) throws Exception {
        // this is the utf8 representation of the character:
        // COUNTING ROD UNIT DIGIT THREE
        // in unicode
        // codepoint: U+1D362
        byte[] data = new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 };

        // output is: &#55348;&#57186;
        // should be: &#119650;
        System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
    }
}

Should be very quick to fix; feel free to drop me an email if you want a patch.
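The difference between the reported and the expected output can be reproduced with a small sketch. This is not the Commons Lang code: the class and method names are hypothetical, and it assumes only that non-ASCII characters are written as decimal numeric entities (the real escapeHtml also handles named entities). It relies on String.codePointAt and Character.charCount, which need Java 5 or later.

```java
// Hypothetical illustration, not the Commons Lang implementation.
final class CodePointEscapeSketch {

    /** Walks the string one char at a time, as in the reported behaviour. */
    static String escapePerChar(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 0x7F) {
                // Each half of a surrogate pair gets its own entity.
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Walks the string one code point at a time (Java 5+ API). */
    static String escapePerCodePoint(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);   // joins a surrogate pair into one value
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);    // 2 for supplementary characters, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // COUNTING ROD UNIT DIGIT THREE, U+1D362
        String rod = new String(Character.toChars(0x1D362));
        System.out.println(escapePerChar(rod));       // &#55348;&#57186;
        System.out.println(escapePerCodePoint(rod));  // &#119650;
    }
}
```

The two numbers in the first output are the high and low surrogates (0xD834 and 0xDF62) that together encode U+1D362.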

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Kjäll updated LANG-480:
---------------------------------

    Attachment: lang-480.patch

Here is a fix for the problem, I think.


[jira] Updated: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Kjäll updated LANG-480:
---------------------------------

    Attachment:     (was: lang-480.patch)


[jira] Updated: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Kjäll updated LANG-480:
---------------------------------

    Attachment: lang-480.patch

And of course you shouldn't develop before you drink your coffee. Here is a working patch for the same problem.


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665770#action_12665770 ]

Joerg Schaible commented on LANG-480:
-------------------------------------

Unfortunately this patch will have to wait until the minimum requirement for commons-lang is JDK 5. Currently commons-lang is still compatible with JDK 1.2. However, talk of a JDK 5 version has already started.


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665868#action_12665868 ]

Alexander Kjäll commented on LANG-480:
--------------------------------------

That is a bit sad.

How likely do you think the JDK 5 version is? Will it happen within this quarter?

I guess I could try to write a patch that is compatible with Java 1.2, but that would require me to do my own parsing of the format in which Java stores characters in memory (UTF-16 surrogate pairs), so I would really like to avoid having that code in a library.
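For reference, the manual parsing mentioned here amounts to recombining a UTF-16 surrogate pair by hand. The following is a minimal sketch of that arithmetic, not the attached patch; the class and method names are hypothetical, and it deliberately avoids the Java 5 code point methods.

```java
// Hypothetical sketch of decoding surrogate pairs without Java 5 APIs.
final class SurrogateDecodeSketch {

    static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    /**
     * Returns the code point starting at index. A valid high/low pair is
     * combined into one supplementary code point; anything else (including
     * an unpaired surrogate) is returned as the plain char value.
     */
    static int codePointAt(String s, int index) {
        char high = s.charAt(index);
        if (isHighSurrogate(high) && index + 1 < s.length()) {
            char low = s.charAt(index + 1);
            if (isLowSurrogate(low)) {
                // Each surrogate carries 10 bits of the value above 0x10000.
                return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        return high;
    }

    public static void main(String[] args) {
        String rod = "\uD834\uDF62"; // U+1D362 as a surrogate pair
        System.out.println(Integer.toHexString(codePointAt(rod, 0))); // 1d362
    }
}
```

A caller would advance the index by two whenever the returned value is 0x10000 or greater, and by one otherwise.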


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665887#action_12665887 ]

Sebb commented on LANG-480:
---------------------------

I've not looked at the code, so this may be nonsense -

Perhaps you could make the processing conditional - if it finds it's running under JVM 1.5+, then use the JVM Method, otherwise ignore the problem?


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665893#action_12665893 ]

James Carman commented on LANG-480:
-----------------------------------

Wouldn't you have to use reflection, then?  


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665958#action_12665958 ]

Sebb commented on LANG-480:
---------------------------

Yes, but AFAICT Class.getMethod() is available in Java 1.2.

The method could be fetched in a static block.
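A rough sketch of what such a static-block lookup might look like, for readers following the discussion. It is an illustration only, not code from the patch or from Commons Lang: the class name and the fallback behaviour are assumptions. It looks up Character.codePointAt(char[], int), which exists from Java 5 onward, and falls back to the plain char value when the method is missing.

```java
import java.lang.reflect.Method;

// Hypothetical sketch of the reflection approach discussed above.
final class ReflectiveCodePointSketch {

    /** Character.codePointAt(char[], int) if the JVM provides it, else null. */
    private static final Method CODE_POINT_AT;

    static {
        Method m = null;
        try {
            m = Character.class.getMethod(
                    "codePointAt", new Class[] { char[].class, Integer.TYPE });
        } catch (NoSuchMethodException e) {
            // Pre-Java 5 runtime: leave m as null and use the fallback.
        }
        CODE_POINT_AT = m;
    }

    static int codePointAt(char[] chars, int index) {
        if (CODE_POINT_AT == null) {
            return chars[index]; // old behaviour: one char at a time
        }
        try {
            Object result = CODE_POINT_AT.invoke(
                    null, new Object[] { chars, Integer.valueOf(index) });
            return ((Integer) result).intValue();
        } catch (Exception e) {
            // Reflection failed at call time; degrade to the old behaviour.
            return chars[index];
        }
    }

    public static void main(String[] args) {
        char[] rod = "\uD834\uDF62".toCharArray(); // U+1D362
        System.out.println(Integer.toHexString(codePointAt(rod, 0)));
    }
}
```

Note that the method name and parameter types are plain strings and class literals here, so a mistake in them would only show up at run time.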


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666149#action_12666149 ]

James Carman commented on LANG-480:
-----------------------------------

Of course it is. :)  My point was that we would be engaging in reflection nastiness and it might not be worth it.  I would suggest that if Alexander needs a release sooner that they do an "internal" release from the trunk with the changes applied and then "upgrade" when we get a newer release out.  I don't like the idea of building in the reflection stuff.  We get no compiler checking that way and it leads to unreadable code.


[jira] Commented: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666155#action_12666155 ]

Alexander Kjäll commented on LANG-480:
--------------------------------------

Just my 2 cents: I don't need a release that fixes this bug. I stumbled on it by chance and wrote a patch so that the next person who has the same problem won't have to dig through the library to understand what's going on.

I'm mainly interested in fixing this because I don't like buggy software, but I totally agree that building in reflection stuff leads to more problems than it solves in the long run.

My opinion on how to fix this is either to push for the JDK 1.5 dependency, or to write some code that parses the format in which the strings are stored in memory. The latter might sound complicated, but I think it's quite straightforward.


[jira] Updated: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated LANG-480:
-------------------------------

    Fix Version/s: 3.0


[jira] Updated: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated LANG-480:
-------------------------------

    Remaining Estimate:     (was: 4h)
     Original Estimate:     (was: 4h)


[jira] Updated: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated LANG-480:
-------------------------------

    Description:
Characters that Java represents internally as two chars (a surrogate pair) are incorrectly converted by the function. The following test displays the problem quite nicely:

import org.apache.commons.lang.*;

public class J2 {
    public static void main(String[] args) throws Exception {
        // this is the utf8 representation of the character:
        // COUNTING ROD UNIT DIGIT THREE
        // in unicode
        // codepoint: U+1D362
        byte[] data = new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 };

        // output is: &#55348;&#57186;
        // should be: &#119650;
        System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
    }
}

Should be very quick to fix, feel free to drop me an email if you want a patch.

  was:
Characters that Java represents internally as two chars (a surrogate pair) are incorrectly converted by the function. The following test displays the problem quite nicely:

import org.apache.commons.lang.*;

public class J2 {
    public static void main(String[] args) throws Exception {
        // this is the utf8 representation of the character:
        // COUNTING ROD UNIT DIGIT THREE
        // in unicode
        // codepoint: U+1D362
        byte[] data = new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 };

        // output is: &#55348;&#57186;
        // should be: &#119650;
        System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
    }
}

Should be very quick to fix, feel free to drop me an email if you want a patch.



[jira] Closed: (LANG-480) StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LANG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell closed LANG-480.
------------------------------

    Resolution: Fixed

svn ci -m "Applying Alexander Kjall's patch from LANG-480; along with a unit test made from his example. Fixes unicode conversion above U+00FFFF being done into 2 characters"

Sending        src/java/org/apache/commons/lang/Entities.java
Sending        src/test/org/apache/commons/lang/StringEscapeUtilsTest.java
Transmitting file data ..
Committed revision 749095.
