[jira] [Created] (CODEC-127) Non-ascii characters in test source files


[jira] [Created] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
Non-ascii characters in test source files
-----------------------------------------

                 Key: CODEC-127
                 URL: https://issues.apache.org/jira/browse/CODEC-127
             Project: Commons Codec
          Issue Type: Bug
            Reporter: Sebb


Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.

This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.

I think we should have a rule of using Unicode escapes for all such non-ascii characters.
It's particularly important for non-ISO-8859-1 characters.

Some example classes with non-ascii characters:

{code}
binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
language\ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
{code}

The characters are probably not correct above, because I used a crude perl script to find them:

{code}
perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
{code}
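
The same check as the Perl one-liner above can be sketched in Java. This is only an illustration: the class and method names are hypothetical, and the sketch assumes the lines have already been decoded with the correct charset (which is exactly the hard part in this issue).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reports which lines contain characters outside US-ASCII. */
final class NonAsciiScanner {

    /** Returns the 1-based numbers of the lines that contain a char above 0x7F. */
    static List<Integer> nonAsciiLines(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // chars() streams the UTF-16 code units; anything above 0x7F is non-ASCII
            if (lines.get(i).chars().anyMatch(c -> c > 0x7F)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "String a = \"Meyer\";",
                "String b = \"M\u00FCller\";"); // u-umlaut, written as an escape
        System.out.println(nonAsciiLines(sample)); // prints [2]
    }
}
```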

language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.

Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:

if (Character.isLetter('\ufffd'))

which is an "unknown" character.

Similarly for binary\Base64Test.java:96.

It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.

[Possibly the characters got mangled at some point, or maybe they have always been wrong]
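
One plausible way for a character to end up as \ufffd is that the file was saved in a single-byte encoding and later read as UTF-8. That is an assumption, not something confirmed for these files. The sketch below reproduces the effect; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: a byte that is valid ISO-8859-1 but malformed as UTF-8. */
final class ReplacementCharDemo {

    /** Encodes the text as ISO-8859-1, then decodes those bytes as if they were UTF-8. */
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // The String constructor replaces malformed input with U+FFFD instead of failing
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // a-umlaut is the single byte 0xE4 in ISO-8859-1, which is not valid UTF-8 on its own
        String mangled = latin1ReadAsUtf8("\u00E4");
        System.out.println(mangled.indexOf('\uFFFD') >= 0); // prints true
    }
}
```

Once the original byte has been replaced this way, the information is lost, which would explain why the correct escape cannot be recovered from the file alone.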

The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut).
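
A minimal sketch of what the escaped form with comments could look like, assuming the garbled entries in the listing stand for mönchengladbach, Müller and Gänse. The class and constant names are hypothetical and not taken from the test source.

```java
/** Hypothetical holder for test strings written with Unicode escapes. */
final class UmlautEscapes {
    // The trailing comment names each escaped character, as suggested in the report
    static final String MOENCHENGLADBACH = "m\u00F6nchengladbach"; // o-umlaut
    static final String MUELLER = "M\u00FCller";                   // u-umlaut
    static final String GAENSE = "G\u00E4nse";                     // a-umlaut

    public static void main(String[] args) {
        // The escape compiles to the same char regardless of the source file encoding
        System.out.println(MUELLER.charAt(1) == 0x00FC); // prints true
    }
}
```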

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084604#comment-13084604 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

The build deals with this by specifying the encoding in key places.

In Eclipse, I set the encoding to UTF-8 for the source folders.

Seeing the real chars in the source is nicer but means you may have to deal with your IDE.

An alternative would be to save IDE settings in SVN. How about that?




[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084611#comment-13084611 ]

Sebb commented on CODEC-127:
----------------------------

The problem is that it's not possible to see what the test data is in the IDE (apart from the German chars).

Also, unless you tell SVN the encoding (e.g. via mime-type), diff e-mails (and possibly conversion to local EOL) may suffer.

Saving IDE settings in SVN is a non-starter, because there are many different IDEs, and in any case the settings cannot be picked up automatically, as far as I know.

Have a look again at the non-ISO-8859-1 characters and see if they are correct. I suspect not, as they all appear to be the replacement character (\ufffd), at least when treated as UTF-8.



[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084619#comment-13084619 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

I see now, what a mess.



[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084743#comment-13084743 ]

Sebb commented on CODEC-127:
----------------------------

Here's the full list of lines containing non-ASCII characters:

{code}
java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // ├â┼ô
java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
test/org/apache/commons/codec/binary/Base64Test.java:96        byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
test/org/apache/commons/codec/language/ColognePhoneticTest.java:110            {"m├Ânchengladbach", "664645214"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:130        String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
test/org/apache/commons/codec/language/ColognePhoneticTest.java:137            {"Meyer", "M├╝ller"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:143            {"ganz", "G├ñnse"},
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
test/org/apache/commons/codec/language/SoundexTest.java:367        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:369                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:375            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:387        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:389                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:395            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93        String[] names = { "├ícz", "├ítz", "Ign├ícz", "Ign├ítz", "Ign├íc" };
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47                { "Nu├▒ez", "spanish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49                { "─îapek", "czech", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52                { "K├╝├º├╝k", "turkish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55                { "Ceau┼ƒescu", "romanian", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57                { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58                { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59                { "ÎøÎö΃", "hebrew", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60                { "├ícz", "any", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61                { "├ítz", "any", EXACT } });
{code}

Note the comment at ColognePhonetic.java:388 - this does not seem to make sense in any encoding, but I could be wrong.
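
One possible explanation, offered only as an assumption, is that such text was encoded as UTF-8, misread as a single-byte charset, and then saved again, possibly more than once. The sketch below shows how each round trip of that kind multiplies the characters. The names are hypothetical, and ISO-8859-1 merely stands in for whatever charset might have been involved.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of double encoding: UTF-8 bytes misread as ISO-8859-1. */
final class DoubleEncodingDemo {

    /** Encodes the text as UTF-8, then misreads those bytes as ISO-8859-1. */
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String once = utf8ReadAsLatin1("\u00DF"); // sharp s: bytes C3 9F become two chars
        String twice = utf8ReadAsLatin1(once);    // each of those becomes two more
        System.out.println(once.length() + " " + twice.length()); // prints 2 4
    }
}
```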



[jira] [Issue Comment Edited] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084743#comment-13084743 ]

Sebb edited comment on CODEC-127 at 8/14/11 12:04 AM:
------------------------------------------------------

Here's the full list of lines containing non-ASCII characters:

{code}
java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // ├â┼ô
java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
test/org/apache/commons/codec/binary/Base64Test.java:96        byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
test/org/apache/commons/codec/language/ColognePhoneticTest.java:110            {"m├Ânchengladbach", "664645214"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:130        String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
test/org/apache/commons/codec/language/ColognePhoneticTest.java:137            {"Meyer", "M├╝ller"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:143            {"ganz", "G├ñnse"},
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
test/org/apache/commons/codec/language/SoundexTest.java:367        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:369                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:375            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:387        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:389                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:395            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93        String[] names = { "├ícz", "├ítz", "Ign├ícz", "Ign├ítz", "Ign├íc" };
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47                { "Nu├▒ez", "spanish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49                { "─îapek", "czech", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52                { "K├╝├º├╝k", "turkish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55                { "Ceau┼ƒescu", "romanian", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57                { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58                { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59                { "ÎøÎö΃", "hebrew", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60                { "├ícz", "any", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61                { "├ítz", "any", EXACT } });
{code}

Note the comment at ColognePhonetic.java:388 - this does not seem to make sense in any encoding, but I could be wrong.
[You'll need to look at it in the source file itself - the Perl script I used is crude and does not display non-ASCII properly]

The other dubious entries are:

Base64Test.java:96
DoubleMetaphoneTest.java:1222
DoubleMetaphoneTest.java:1227
and most of the SoundexTest.java entries.

      was (Author: [hidden email]):
    Here's the full list of lines containing non-ASCII characters:

{code}
java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // ├â┼ô
java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
test/org/apache/commons/codec/binary/Base64Test.java:96        byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
test/org/apache/commons/codec/language/ColognePhoneticTest.java:110            {"m├Ânchengladbach", "664645214"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:130        String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
test/org/apache/commons/codec/language/ColognePhoneticTest.java:137            {"Meyer", "M├╝ller"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:143            {"ganz", "G├ñnse"},
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
test/org/apache/commons/codec/language/SoundexTest.java:367        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:369                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:375            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:387        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:389                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:395            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93        String[] names = { "├ícz", "├ítz", "Ign├ícz", "Ign├ítz", "Ign├íc" };
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47                { "Nu├▒ez", "spanish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49                { "─îapek", "czech", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52                { "K├╝├º├╝k", "turkish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55                { "Ceau┼ƒescu", "romanian", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57                { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58                { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59                { "ÎøÎö΃", "hebrew", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60                { "├ícz", "any", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61                { "├ítz", "any", EXACT } });
{code}

Note the comment at ColognePhonetic.java:388 - this does not seem to make sense in any encoding, but I could be wrong.
 

> Non-ascii characters in test source files
> -----------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.
> I think we should have a rule of using Unicode escapes for all such non-ascii characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ascii characters:
> {code}
> binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
> language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:
> if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)
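
As an illustration of the escape-plus-comment style proposed in the last paragraph, here is a minimal sketch. The class and field names are hypothetical and are not taken from the test sources; the escapes assume that the garbled characters in the listing are o-umlaut, u-umlaut and a-umlaut.

```java
public class ColognePhoneticEscapesExample {

    // Hypothetical excerpt: the same literals as in the listing,
    // written with Unicode escapes so the source stays pure ASCII.
    static final String[][] ENCODINGS = {
        {"m\u00F6nchengladbach", "664645214"},         // o-umlaut
        {"M\u00FCller-L\u00FCdenscheidt", "65752682"}, // u-umlaut (twice)
    };

    static final String[][] EQUAL_PAIRS = {
        {"Meyer", "M\u00FCller"}, // u-umlaut
        {"ganz", "G\u00E4nse"},   // a-umlaut
    };

    public static void main(String[] args) {
        // Print the code point of the second character of each escaped name
        // to confirm the compiler saw a single character per escape.
        System.out.println(Integer.toHexString(ENCODINGS[0][0].charAt(1)));
        System.out.println(Integer.toHexString(ENCODINGS[1][0].charAt(1)));
        System.out.println(Integer.toHexString(EQUAL_PAIRS[1][1].charAt(1)));
    }
}
```

Because the compiler translates the escapes before it tokenizes the source, the file compiles to the same strings whatever encoding the IDE assumes.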

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084752#comment-13084752 ]

Sebb commented on CODEC-127:
----------------------------

Just done a comparison of the various versions of ColognePhonetic.java in trunk.

The corruption of the comments on PREPROCESS_MAP occurred between r1080701 and r1087901 (April 1st, ironically).

This also corrupted other comments, and the string at line 382.
The SVN log message says "Annotate with @Override and @Deprecated" - were those added automatically perhaps?


[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084753#comment-13084753 ]

Sebb commented on CODEC-127:
----------------------------

SoundexTest appears to have been corrupted in r1075426 => r1080414.
Log comment says "Keep these files in UTF-8 encoding for proper Javadoc processing"
However, I suspect the file was originally in ISO-8859-1, not UTF-8.
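
To illustrate the suspicion above, here is a minimal sketch. It assumes a file that held a single-byte ISO-8859-1 character and was later read as UTF-8. The a-umlaut is only an example, since the actual character in SoundexTest is not known from this thread, and the class and method names are hypothetical. Java's decoder substitutes U+FFFD for the malformed byte, which would be consistent with the '\ufffd' that native2ascii reported.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Encodes the text as ISO-8859-1, then decodes those bytes as if they were UTF-8. */
    static String misreadLatin1AsUtf8(String original) {
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        // new String(byte[], Charset) replaces malformed input with U+FFFD
        // instead of throwing an exception.
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // a-umlaut is the single byte 0xE4 in ISO-8859-1; on its own that
        // byte is an incomplete multi-byte sequence in UTF-8.
        String result = misreadLatin1AsUtf8("\u00E4");
        System.out.println(Integer.toHexString(result.charAt(0)));
    }
}
```

Once the replacement character has been written back to the file, the original byte is gone, so the correct escape has to be recovered from an earlier revision rather than from the current file.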



[jira] [Commented] (CODEC-127) Non-ascii characters in test source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084845#comment-13084845 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

Sebb: Thank you for your Javadoc fixes in trunk and branches/generics.


[jira] [Updated] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

     [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary D. Gregory updated CODEC-127:
----------------------------------

    Summary: Non-ascii characters in source files  (was: Non-ascii characters in test source files)


[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084847#comment-13084847 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

Fixed:
{code:java}
java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // ├â┼ô
java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
{code}
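
The garbled trailing comments in the listing above look like double encoding: UTF-8 bytes decoded with a single-byte charset and then saved as UTF-8 again. The sketch below reproduces that pattern for the four characters in PREPROCESS_MAP. The choice of windows-1252 as the wrong charset is a guess that fits the pattern and is not confirmed anywhere in this thread; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Assumption: the tool that misread the file used windows-1252.
    private static final Charset WRONG = Charset.forName("windows-1252");

    /** Decodes the UTF-8 bytes of the text with the wrong single-byte charset. */
    static String misread(String correct) {
        return new String(correct.getBytes(StandardCharsets.UTF_8), WRONG);
    }

    /** Reverses misread() for text whose bytes all exist in windows-1252. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(WRONG), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        char[] umlauts = {'\u00C4', '\u00DC', '\u00D6', '\u00DF'};
        for (char c : umlauts) {
            String garbled = misread(String.valueOf(c));
            StringBuilder codePoints = new StringBuilder();
            for (int i = 0; i < garbled.length(); i++) {
                codePoints.append(String.format("U+%04X ", (int) garbled.charAt(i)));
            }
            // One original character becomes two characters after the misread.
            System.out.println(String.format("U+%04X -> %s", (int) c, codePoints));
        }
    }
}
```

The repair step is only a heuristic: it may not round-trip for bytes that windows-1252 leaves undefined, so replacing the literals with Unicode escapes, as done in the fix above, is the more robust route.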


[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084849#comment-13084849 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

Fixed:
{code:java}
java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
{code}


[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084850#comment-13084850 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

Sebb: "The SVN log message says "Annotate with @Override and @Deprecated" - were those added automatically perhaps?"

Yes, more than likely, using Eclipse.


[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085104#comment-13085104 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

Sebb:

I get errors when I try your perl script on Windows with the latest perl (64 bit) from ActiveState. Rather than use this space to figure out why, can you please run it again and check if we are done with this ticket?

Thank you,
Gary
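
For environments where running perl is awkward, the same check can be sketched in Java. This is a rough equivalent of the one-liner from the issue description, not a tool used in this thread; the class and method names are hypothetical, and the files to scan are passed as explicit paths because the Windows shell does not expand wildcards.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NonAsciiFinder {

    /** Returns "name:lineNumber line" for every line containing a char above 0x7F. */
    static List<String> findNonAscii(String name, List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            for (int j = 0; j < line.length(); j++) {
                if (line.charAt(j) > 0x7F) {
                    hits.add(name + ":" + (i + 1) + " " + line);
                    break; // report each line once
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            // ISO-8859-1 maps every byte to exactly one char, so decoding
            // never fails and any byte above 0x7F shows up as a non-ASCII char.
            List<String> lines = Files.readAllLines(Paths.get(arg), StandardCharsets.ISO_8859_1);
            for (String hit : findNonAscii(arg, lines)) {
                System.out.println(hit);
            }
        }
    }
}
```

As with the perl script, the printed characters may not display correctly on the console; the file name and line number are the reliable part of the output.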



[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085110#comment-13085110 ]

Sebb commented on CODEC-127:
----------------------------

What error do you get? Just curious.

I now get:

{code}
commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:110      {"m├Ânchengladbach", "664645214"},
commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:130      String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
commons-codec-generics/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222     this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
commons-codec-generics/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227     this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93 String[] names = { "ácz", "átz", "Ignácz", "Ignátz", "Ignác" };
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47           { "Nu├▒ez", "spanish", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49           { "─îapek", "czech", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52           { "K├╝├º├╝k", "turkish", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55           { "Ceau┼ƒescu", "romanian", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57           { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58           { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59           { "ÎøÎö΃", "hebrew", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60           { "├ícz", "any", EXACT },
commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61           { "├ítz", "any", EXACT } });
{code}

and

{code}
commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:110         {"m├Ânchengladbach", "664645214"},
commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:130       String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:137          {"Meyer", "M├╝ller"},
commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:143          {"ganz", "G├ñnse"},
commons-codec/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227      this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
commons-codec/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1232      this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
commons-codec/src/test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93  String[] names = { "├ícz", "├ítz", "Ign├ícz", "Ign├ítz", "Ign├íc" };
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47           { "Nu├▒ez", "spanish", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49           { "─îapek", "czech", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52           { "K├╝├º├╝k", "turkish", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55           { "Ceau┼ƒescu", "romanian", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57           { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58           { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59           { "ÎøÎö΃", "hebrew", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60           { "├ícz", "any", EXACT },
commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61           { "├ítz", "any", EXACT } });
{code}

This was produced with an updated version of the script that uses File::Find to handle directory traversal better.
(Some lines shortened above by manually removing leading spaces)
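
For readers without Perl, the same kind of scan can be sketched in Java. This is only a minimal sketch and not the script used above: the class name `NonAsciiFinder`, its `find` method, and the default `src/test` root are assumptions made for illustration. It walks a directory tree and reports every line of a `.java` file that contains a byte above 0x7F.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Commons Codec: lists lines that contain non-ASCII bytes. */
public class NonAsciiFinder {

    /** Returns "path:lineNumber" for every line of a *.java file under root that holds a byte above 0x7F. */
    public static List<String> find(Path root) throws IOException {
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(root)) {
            sources = walk.filter(Files::isRegularFile)
                          .filter(p -> p.toString().endsWith(".java"))
                          .sorted()
                          .collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path source : sources) {
            // ISO-8859-1 maps every byte to exactly one char, so reading can never fail
            // on malformed input and any char above 0x7F marks a non-ASCII byte.
            List<String> lines = Files.readAllLines(source, StandardCharsets.ISO_8859_1);
            for (int i = 0; i < lines.size(); i++) {
                if (lines.get(i).chars().anyMatch(c -> c > 0x7F)) {
                    hits.add(source + ":" + (i + 1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The default root is an assumption; pass the real test source folder as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "src/test");
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory: " + root);
            return;
        }
        find(root).forEach(System.out::println);
    }
}
```

Reading the files as ISO-8859-1 is a deliberate choice here: the scan only needs to know whether a byte is outside ASCII, not which character it encodes, so it sidesteps the question of what the real file encoding is.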

I think all the actual errors have now been fixed.

The remaining lines contain some non-ASCII characters which could be replaced by Unicode escapes for better portability.
However, that would make it harder to read the code in some cases.
So I'm thinking of using Unicode escapes in the Strings, but adding the original as an end-of-line comment.
The comments might still get mangled, but at least the code would not, and it would be easy to reconstruct the comments from the Unicode.
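
As a rough illustration of that idea (a sketch only, with a made-up class name `EscapedNames`; the strings are the ones reported by the scan above), each literal stays pure ASCII and a trailing comment says which character the escape stands for. The comments here use ASCII descriptions such as "u-umlaut"; the proposal above would put the original character there instead.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: test data written with Unicode escapes plus explanatory comments. */
public class EscapedNames {

    // Every literal is pure ASCII, so the compiled strings do not depend on the file encoding.
    public static final List<String> NAMES = Arrays.asList(
        "M\u00FCller",            // u-umlaut
        "m\u00F6nchengladbach",   // o-umlaut
        "G\u00E4nse",             // a-umlaut
        "Nu\u00F1ez"              // n-tilde
    );

    public static void main(String[] args) {
        for (String name : NAMES) {
            // Print the code points rather than the text, so the output is console-independent.
            StringBuilder sb = new StringBuilder();
            name.chars().forEach(c -> sb.append(String.format("U+%04X ", c)));
            System.out.println(sb.toString().trim());
        }
    }
}
```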

WDYT?



[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085115#comment-13085115 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

That sounds good. Today, the code is not editable/maintainable.

There does not seem to be anything I can do in Eclipse to make it display the characters correctly.

If the comments are left mangled, then they are not maintainable. If you change the code, then the comment should match. So I would not leave the comments mangled.



[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085116#comment-13085116 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

If I run the command as is, I get:
{quote}
Can't open perl script "ne": No such file or directory
{quote}



[jira] [Updated] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

     [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb updated CODEC-127:
-----------------------

    Description:
Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.

This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.

I think we should have a rule of using Unicode escapes for all such non-ascii characters.
It's particularly important for non-ISO-8859-1 characters.

Some example classes with non-ascii characters:

{code}
binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
language\ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
{code}

The characters are probably not correct above, because I used a crude perl script to find them:

{code}
perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
{code}

language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.

Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:

if (Character.isLetter('\ufffd'))

which is an "unknown" character.

Similarly for binary\Base64Test.java:96.

It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
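
One plausible way to end up with '\ufffd' (an assumption, since the history of these files is not known) is that a byte from a single-byte encoding such as ISO-8859-1 was decoded as UTF-8. The sketch below uses a hypothetical class name, `ReplacementCharDemo`, and a hypothetical input byte; it shows that Java's `String(byte[], Charset)` constructor substitutes U+FFFD for such malformed input.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: malformed UTF-8 input decodes to the replacement character U+FFFD. */
public class ReplacementCharDemo {

    /** Decodes bytes as UTF-8; the String constructor replaces malformed sequences with U+FFFD. */
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xF1 };              // n-tilde as one ISO-8859-1 byte (hypothetical input)
        byte[] utf8 = { (byte) 0xC3, (byte) 0xB1 };   // the same character encoded as UTF-8

        System.out.printf("%04x%n", (int) decodeUtf8(latin1).charAt(0)); // prints fffd
        System.out.printf("%04x%n", (int) decodeUtf8(utf8).charAt(0));   // prints 00f1
    }
}
```

Once a file has been saved in this state the original byte is gone, which is why the correct escape cannot be recovered from the file alone.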

[Possibly the characters got mangled at some point, or maybe they have always been wrong]

The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut).

  was:
Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.

This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.

I think we should have a rule of using Unicode escapes for all such non-ascii characters.
It's particularly important for non-ISO-8859-1 characters.

Some example classes with non-ascii characters:

{code}
binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
language\ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
{code}

The characters are probably not correct above, because I used a crude perl script to find them:

{code}
perl ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
{code}

language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.

Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:

if (Character.isLetter('\ufffd'))

which is an "unknown" character.

Similarly for binary\Base64Test.java:96.

It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.

[Possibly the characters got mangled at some point, or maybe they have always been wrong]

The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut).


Typo - missing hyphen for flags



[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085128#comment-13085128 ]

Sebb commented on CODEC-127:
----------------------------

If you set the container / resource / text file encoding in Eclipse to UTF-8 (since that is what the POM says), the files should display correctly, assuming they really are UTF-8.
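
To see why the editor setting matters, here is a small sketch (the class name `EncodingMismatchDemo` and the sample string are illustrative assumptions, and ISO-8859-1 merely stands in for whatever default an editor might use). It encodes a string as UTF-8 and then decodes the bytes once with the wrong charset and once with the right one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only: the same bytes read with the wrong and with the right charset. */
public class EncodingMismatchDemo {

    /** Encodes text with the charset the file really uses, then decodes it with the charset a tool assumes. */
    public static String decode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String name = "M\u00FCller"; // u-umlaut written as an escape, so this source file stays ASCII

        String wrong = decode(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        String right = decode(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // The two UTF-8 bytes of u-umlaut turn into two separate characters under ISO-8859-1.
        System.out.println(wrong.length()); // prints 7
        System.out.println(right.length()); // prints 6
        System.out.println(right.equals(name)); // prints true
    }
}
```

The bytes on disk are identical in both cases; only the charset used to interpret them differs, which is consistent with the files looking mangled under one setting and fine under another.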



[jira] [Commented] (CODEC-127) Non-ascii characters in source files

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085134#comment-13085134 ]

Gary D. Gregory commented on CODEC-127:
---------------------------------------

All better with the test source folder set to UTF-8, which I thought I had done, but obviously not.

I am now a lot less worried about maintenance because the files are editable given the right editor settings. I am inclined to leave things as is.

Perhaps each file needs a prominent Javadoc comment about using UTF-8 in editors.



