[text] Invalid unicode sequences on .substring of RandomStringGenerator


[text] Invalid unicode sequences on .substring of RandomStringGenerator

Amey Jadiye
Hi Folks,

While working on RandomStringGenerator I found that calling .substring on a
generated random string fails intermittently when the string contains
surrogate pairs.
The same bug was raised in commons-lang:
https://issues.apache.org/jira/browse/LANG-100

Is this a possible bug in RandomStringGenerator, or is this expected?

@Test
public void testSubStringWithSurrogatePair() {
    final int size = 5000;
    final Charset charset = Charset.forName("UTF-8");
    RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
    String orig = generator.generate(size).substring(0, 2500);

    final byte[] bytes = orig.getBytes(charset);
    final String copy = new String(bytes, charset);

    for (int i = 0; i < orig.length() && i < copy.length(); i++) {
        final char o = orig.charAt(i);
        final char c = copy.charAt(i);
        assertEquals("differs at " + i + "(" + Integer.toHexString(new Character(o).hashCode()) + ","
                + Integer.toHexString(new Character(c).hashCode()) + ")", o, c);
    }
}

Regards,
Amey


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: [text] Invalid unicode sequences on .substring of RandomStringGenerator

Bruno P. Kinoshita-3
Hi Amey,

You created a byte array from the original string (which may contain surrogate chars), and then created a copy with `final String copy = new String(bytes, charset);`. That is a round trip through UTF-8, and UTF-8 cannot encode an unpaired surrogate, so some values will not survive it. I suspect that is what leads to the error you reported.

If you try `final String copy = new String(bytes);` instead, the bytes will still be decoded, just with the default system charset.
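
A minimal sketch of that round trip, using only the JDK so it does not depend on RandomStringGenerator. The class name `LoneSurrogateRoundTrip` and the sample code point U+1F600 are made up for illustration. It relies on the documented behaviour of `String.getBytes(Charset)`, which substitutes the charset's replacement bytes (`?` for UTF-8) for input it cannot encode:

```java
import java.nio.charset.StandardCharsets;

final class LoneSurrogateRoundTrip {

    /** Encodes s as UTF-8 and decodes the bytes back into a String. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a Java String as the surrogate pair \uD83D \uDE00.
        String whole = "a" + new String(Character.toChars(0x1F600));
        // Cutting after the high surrogate leaves an unpaired \uD83D at the end.
        String cut = whole.substring(0, 2);

        // A complete pair survives the round trip.
        System.out.println(whole.equals(roundTrip(whole))); // true
        // A lone surrogate is replaced while encoding, so the copy differs.
        System.out.println(cut.equals(roundTrip(cut)));     // false
        System.out.println(roundTrip(cut));                 // a?
    }
}
```

So a char-by-char comparison after such a round trip only differs where the original string already held half of a pair.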

So I think the safest is to compare codepoints. Perhaps with something like this:

    @Test
    public void testSubStringWithSurrogatePair() {
        for (int j = 0; j < 10; j++) {
            final int size = 5000;
            RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
            String orig = generator.generate(size).substring(0, 2500);
            final String copy = new String(orig);
            for (int i = 0; i < orig.length() && i < copy.length(); i++) {
                final int o = orig.codePointAt(i);
                final int c = copy.codePointAt(i);
                assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
            }
        }
    }

Running the original test 10 times, I was able to consistently reproduce the initial issue: it failed in about 4 out of 10 runs. I seem to remember seeing unit tests in [rng], or in some other commons component, that use loops for randomly generated values, but I may be mistaken (I don't trust my own memory), so feel free to leave the loop out if you prefer. I also tried the code above with j going up to 1000; after a few seconds, that test passed too.

Doing `final String copy = new String(orig);` copies the value of the original string completely into the new string, so comparing the code points should do the trick. We may even want to add another assert statement before the for loop to confirm that both strings have the same length.
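
As a rough sketch of what a code point comparison with a length check could look like outside of JUnit (the class and method names here are hypothetical, and only JDK calls are used):

```java
import java.util.Arrays;

final class CodePointCompare {

    /** Returns true when both strings hold the same sequence of code points. */
    static boolean sameCodePoints(String a, String b) {
        return Arrays.equals(a.codePoints().toArray(), b.codePoints().toArray());
    }

    /** Char index of the first differing code point, or -1 if the strings match. */
    static int firstDifference(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return i;
            }
            // Advance by one or two chars so a surrogate pair is visited once.
            i += Character.charCount(cpA);
        }
        // No mismatch in the common prefix: the strings differ only if lengths do.
        return a.length() == b.length() ? -1 : i;
    }
}
```

Stepping by `Character.charCount` avoids reading the low surrogate of a pair as a separate code point, and the final length comparison plays the role of the extra assert on the string lengths.
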

Hope that helps,
Bruno


________________________________


From: Amey Jadiye <[hidden email]>
To: Commons Developers List <[hidden email]>
Sent: Monday, 11 September 2017 12:15 AM
Subject: [text] Invalid unicode sequences on .substring of RandomStringGenerator




Re: [text] Invalid unicode sequences on .substring of RandomStringGenerator

Amey Jadiye
Thanks much for checking this Bruno.

On Mon, Sep 11, 2017 at 3:05 PM, Bruno P. Kinoshita <
[hidden email]> wrote:

> Hi Amey,
>
> You created a byte array from the original string (which may contain
> surrogate chars). But then you created a copy string with `final String
> copy = new String(bytes, charset);`. There will be encoding to UTF-8, which
> may fail to encode some values, leading to the error you reported I suspect.
>

I did this on purpose to check for the LANG-100 issue, which was about
converting a string from UTF-16 (the default) to UTF-8 and back.
My expectation was that this test would pass cleanly.


> If you try `final String copy = new String(bytes);` there will be still
> encoding to the default system charset as well.
>
> So I think the safest is to compare codepoints. Perhaps with something
> like this:
>
>     @Test
>     public void testSubStringWithSurrogatePair() {
>         for (int j = 0; j < 10; j++) {
>             final int size = 5000;
>             RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
>             String orig = generator.generate(size).substring(0, 2500);
>             final String copy = new String(orig);
>             for (int i = 0; i < orig.length() && i < copy.length(); i++) {
>                 final int o = orig.codePointAt(i);
>                 final int c = copy.codePointAt(i);
>                 assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
>             }
>         }
>     }
>
> Running it 10 times, I was able to consistently reproduce the initial
> issue. It would always fail, about 4 out of 10. I think [rng] or somewhere
> in another commons component I kind of remember seeing unit tests for
> random generated values using loops? But may be mistaken (I don't trust my
> own memory). So feel free to leave that part out if you prefer. I tried the
> code above with j going up to 1000. After a few seconds, the test passed
> too.
>

Yes, I did the same kind of testing and found that it happens intermittently,
whenever a surrogate pair falls on the position where I cut the string, i.e.
substring(0, 2500): if a pair straddles index 2500 it is cut in half and the
issue appears, which seems obvious to me now. And yes, I can't keep it as is,
for the sake of not introducing LANG-100 again in commons-text.
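
One way to avoid cutting a pair in half, sketched with plain JDK calls. The helper name `truncate` and the choice to back up by one char are assumptions for illustration, not something commons-text provides, and `maxChars` is assumed to be non-negative:

```java
final class SafeCut {

    /**
     * Returns at most maxChars leading chars of s, backing up by one char
     * when the cut would otherwise fall between the two halves of a pair.
     */
    static String truncate(String s, int maxChars) {
        if (maxChars >= s.length()) {
            return s;
        }
        int end = maxChars;
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // drop the high surrogate rather than leave it unpaired
        }
        return s.substring(0, end);
    }
}
```

With this approach the result can be one char shorter than requested, which is the price of never returning an unpaired surrogate.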


> Doing `final String copy = new String(orig);` the value of the original
> string is completely copied onto the new string. So comparing the
> codepoints should do the trick. We may even want to add another assert
> statement before the for loop to confirm both strings have the same length?
>
What you suggested here works fine with no issues, and even the length is the
same. The goal of doing this was to return a string of the exact length asked
for by the user.


> Hope that helps,Bruno
>
>
Now I would like the community's advice: why was RandomStringGenerator's
.generate(int count) method designed so that it returns the given number of
code points rather than a String of that length? I'm OK with this approach as
well, but can we have one more .generate variant that returns a String of
exactly the given length? I found that when I pass 50 it returns a string of
length ~70; as a commons developer that is fine, but as an application
developer it is weird.
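
To make the difference between the two counts concrete, here is a small JDK-only sketch. The class name, the `repeat` helper, and the sample code points are hypothetical. Each code point above U+FFFF takes two chars, which is how 50 code points can give a longer String:

```java
final class CodePointsVsLength {

    /** Builds a string made of n copies of the given code point. */
    static String repeat(int codePoint, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = repeat('a', 50);        // one char per code point
        String astral = repeat(0x1F600, 50); // two chars per code point

        System.out.println(bmp.length());                              // 50
        System.out.println(astral.length());                           // 100
        System.out.println(astral.codePointCount(0, astral.length())); // 50
    }
}
```

A random mix of both kinds of code point lands somewhere between those two lengths.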

Regards,
Amey



