[jira] [Created] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.


Sebb (Jira)
StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.
-------------------------------------------------------------------------------------------------------------------

                 Key: LANG-720
                 URL: https://issues.apache.org/jira/browse/LANG-720
             Project: Commons Lang
          Issue Type: Bug
          Components: lang.*, lang.text.translate.*
    Affects Versions: 3.0
            Reporter: Taro Yabuki


Hello.

I use StringEscapeUtils.escapeXml(input) to escape special characters for XML.
This method outputs wrong results when input contains characters in Supplementary Planes.

String str1 = "\uD842\uDFB7" + "A";
String str2 = StringEscapeUtils.escapeXml(str1);

// str2 must be equal to str1,
// because str1 does not contain any characters to be escaped.
// However, str2 is different from str1.

System.out.println(URLEncoder.encode(str1, "UTF-16BE")); //%D8%42%DF%B7A
System.out.println(URLEncoder.encode(str2, "UTF-16BE")); //%D8%42%DF%B7%FF%FD

The cause of this problem is the bound of the loop that translates the input character by character.
In CharSequenceTranslator.translate(CharSequence input, Writer out),
the loop counter "i" is a char index, but it runs from 0 to Character.codePointCount(input, 0, input.length());
it should run from 0 to input.length().
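
The loop-bound problem described above can be reproduced without Commons Lang. The following is only a minimal sketch, not the library source and not the attached patch: the class name SurrogateLoopSketch and the methods copyBuggy and copyFixed are hypothetical, and the "translation" is reduced to copying each code point unchanged. It assumes the loop indexes chars while being bounded by a code point count, as the report states.

```java
// Simplified model of the loop described in the report.
// Hypothetical names; this is NOT the Commons Lang implementation.
final class SurrogateLoopSketch {

    // Buggy bound: i is a char index, but the limit is a code point count.
    // For a surrogate pair followed by "A" the limit is 2 while length() is 3.
    static String copyBuggy(CharSequence input) {
        StringBuilder out = new StringBuilder();
        int limit = Character.codePointCount(input, 0, input.length());
        for (int i = 0; i < limit; i++) {
            // At i == 1 this reads the low surrogate on its own.
            out.append(Character.toChars(Character.codePointAt(input, i)));
        }
        return out.toString();
    }

    // Bound by input.length() and advance by the chars each code point uses.
    static String copyFixed(CharSequence input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int codePoint = Character.codePointAt(input, i);
            out.append(Character.toChars(codePoint));
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String str1 = "\uD842\uDFB7" + "A";
        System.out.println(copyFixed(str1).equals(str1)); // prints true
        System.out.println(copyBuggy(str1).equals(str1)); // prints false
    }
}
```

In this sketch copyBuggy returns the surrogate pair followed by a second, unpaired low surrogate and never reaches "A", which is consistent with the %FF%FD (U+FFFD replacement) bytes in the reported output.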


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

     [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Taro Yabuki updated LANG-720:
-----------------------------

    Attachment: CharSequenceTranslator.java.20110714.diff

Patch for org/apache/commons/lang3/text/translate/CharSequenceTranslator.java.

> StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-720
>                 URL: https://issues.apache.org/jira/browse/LANG-720
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*, lang.text.translate.*
>    Affects Versions: 3.0
>            Reporter: Taro Yabuki
>              Labels: patch
>         Attachments: CharSequenceTranslator.java.20110714.diff

[jira] [Commented] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

    [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065387#comment-13065387 ]

Gary D. Gregory commented on LANG-720:
--------------------------------------

The patch does not break any unit test with the latest from SVN but it is missing a unit test.

Perhaps we should hold off since we are in the middle of a VOTE.


[jira] [Resolved] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

     [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Benson resolved LANG-720.
------------------------------

    Resolution: Fixed

I was also going to ask for a unit test, but wanted to improve my understanding of the situation anyway, so I adapted the posted problem code. Even though we are currently voting on the release of 3.0.0 from RC4, I don't see why we can't fix this in trunk; the RC tag is already cut. I have used the concept of the patch to rewrite the entire method in question, primarily to avoid modifying a counter variable within a for loop.

Committed revision 1146844.


[jira] [Commented] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

    [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065462#comment-13065462 ]

Gary D. Gregory commented on LANG-720:
--------------------------------------

OK, thanks for the redo.

I think we should cut another RC to pick this up.


[jira] [Commented] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

    [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065463#comment-13065463 ]

Matt Benson commented on LANG-720:
----------------------------------

was going to punt to the dev list ;)

I just used a sports metaphor.  :|


[jira] [Updated] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

     [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Benson updated LANG-720:
-----------------------------

    Fix Version/s: 3.0.1


[jira] [Commented] (LANG-720) StringEscapeUtils.escapeXml(input) outputs wrong results when an input contains characters in Supplementary Planes.

Sebb (Jira)

    [ https://issues.apache.org/jira/browse/LANG-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067496#comment-13067496 ]

Henri Yandell commented on LANG-720:
------------------------------------

Note that we'll release this in 3.0.1. 3.0 will go out with this as a known issue and 3.0.1 will follow (August).
