[jira] [Created] (CSV-58) Unicode escapes are lost if escape character is backslash

Gilles Sadowski (Jira)
Unicode escapes are lost if escape character is backslash
---------------------------------------------------------

                 Key: CSV-58
                 URL: https://issues.apache.org/jira/browse/CSV-58
             Project: Commons CSV
          Issue Type: Bug
            Reporter: Sebb


The current escape parsing converts <esc><char> to plain <char> if the <char> is not one of the special characters to be escaped.

This can affect Unicode escapes if the <esc> character is a backslash: the backslash of a \uXXXX sequence is dropped, leaving uXXXX.

One way round this is to specifically check for <char> == 'u', but it seems wrong to only do this for 'u'.

Another solution would be to leave <esc><char> as is unless the <char> is one of the special characters.

There are several possible ways to treat unrecognised escapes:
- treat it as if the escape char had not been present (current behaviour)
- leave the escape char as is
- throw an exception
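
The three options above could be sketched roughly as follows. This is only an illustration in Java, not the Commons CSV lexer: the class UnrecognisedEscapeSketch, the Policy enum and the unescape helper are hypothetical names, and the set of "special" characters is simply passed in as a string.

```java
final class UnrecognisedEscapeSketch {

    /** The three ways of treating an escape that is not recognised. */
    enum Policy { DROP_ESCAPE, KEEP_ESCAPE, FAIL }

    /**
     * Hypothetical helper, not part of Commons CSV: un-escapes the input,
     * treating only the escape character itself and the characters in
     * "special" as recognised escapes.
     */
    static String unescape(String in, char esc, String special, Policy policy) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c != esc || i + 1 == in.length()) {
                out.append(c); // ordinary character, or a trailing escape with nothing after it
                continue;
            }
            char next = in.charAt(++i);
            if (next == esc || special.indexOf(next) >= 0) {
                out.append(next); // recognised: emit the escaped character itself
                continue;
            }
            switch (policy) {
                case DROP_ESCAPE:
                    out.append(next); // current behaviour: the escape char silently disappears
                    break;
                case KEEP_ESCAPE:
                    out.append(esc).append(next); // leave the source characters untouched
                    break;
                default:
                    throw new IllegalArgumentException("Unrecognised escape at index " + (i - 1));
            }
        }
        return out.toString();
    }
}
```

With backslash as <esc> and only the comma treated as special, the input a\u0041\,b comes out as au0041,b under DROP_ESCAPE and as a\u0041,b under KEEP_ESCAPE, while FAIL throws an IllegalArgumentException.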

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (CSV-58) Unicode escapes are lost if escape character is backslash

Gilles Sadowski (Jira)

    [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229365#comment-13229365 ]

Emmanuel Bourg commented on CSV-58:
-----------------------------------

Are you sure? The unicode escape sequences are transformed before reaching the parser.
               


[jira] [Commented] (CSV-58) Unicode escapes are lost if escape character is backslash

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

    [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229415#comment-13229415 ]

Sebb commented on CSV-58:
-------------------------

If Unicode parsing is not selected, Unicode escape sequences lose their escape character and so cannot be parsed later.

This is really about more than just unicode escape sequences, though that is what alerted me to the issue.

The whole business of escape handling needs to be very carefully documented (and tested!) to ensure predictable behaviour.
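
To make the first point concrete, here is a small Java sketch. It is an illustration under assumptions, not the library code: dropEscapes models the behaviour described in this issue (the escape character is removed in front of any character), and decodeUnicode stands for a hypothetical later pass in the application that decodes a backslash, the letter u and four hex digits. Both names, and the class LateUnicodeDecodeSketch, are made up for the example.

```java
final class LateUnicodeDecodeSketch {

    /** Models the reported behaviour: the escape character is dropped before any character. */
    static String dropEscapes(String in, char esc) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == esc && i + 1 < in.length()) {
                c = in.charAt(++i); // keep only the character that followed the escape
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Returns the value of four hex digits starting at "from", or -1 if they are not all there. */
    static int hex4(String s, int from) {
        if (from + 4 > s.length()) {
            return -1;
        }
        int code = 0;
        for (int k = from; k < from + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            code = code * 16 + digit;
        }
        return code;
    }

    /** Hypothetical later pass: decodes a backslash, the letter u and four hex digits into one char. */
    static String decodeUnicode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            int code = in.startsWith("\\u", i) ? hex4(in, i + 2) : -1;
            if (code >= 0) {
                out.append((char) code);
                i += 6; // skip the whole six-character sequence
            } else {
                out.append(in.charAt(i++)); // anything else is copied unchanged
            }
        }
        return out.toString();
    }
}
```

Decoding the raw text x\u0041 gives xA, but once dropEscapes has run the text is xu0041 and the later pass has nothing left to recognise.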
               


[jira] [Commented] (CSV-58) Unicode escapes are lost if escape character is backslash

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

    [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229762#comment-13229762 ]

Emmanuel Bourg commented on CSV-58:
-----------------------------------

Understood. The whole escaping logic is dubious; there are a lot of corner cases. I'm trying to understand who actually uses Unicode and control character escapes in CSV files. It seems that at least HSQLDB accepts them when reading, but prefers using quotes when writing.
               


[jira] [Updated] (CSV-58) Unicode escapes are lost if escape character is backslash

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

     [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Bourg updated CSV-58:
------------------------------

      Component/s: Parser
    Fix Version/s: 1.0
   


[jira] [Commented] (CSV-58) Unicode escapes are lost if escape character is backslash

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

    [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231924#comment-13231924 ]

Sebb commented on CSV-58:
-------------------------

I think the default should be to retain the original source characters if the escape sequence is not recognised.
This will allow the application to take further action if necessary.

Failing that, throw an exception. Silently dropping the escape character seems the worst choice as the default.

There's also the issue of what meta-characters should be de-escaped.
It seems reasonable to include the encapsulator and CR, LF, possibly also the delimiter.

But should any escapes - apart from the encapsulator itself - be processed in an encapsulated token?
There's no need to do so.

Maybe escape handling should be overridable by the user.
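
One possible shape for such a user hook, as a rough Java sketch. Everything here is hypothetical and none of it exists in Commons CSV: the EscapeHandler interface, the defaults factory and the unescape helper are invented for illustration. The sketch also assumes that "CR, LF" means the two-character sequences <esc>r and <esc>n. Unrecognised escapes keep their original source characters, and inside an encapsulated token only the encapsulator is de-escaped.

```java
final class OverridableEscapeSketch {

    /** Hypothetical hook: returns the text that replaces esc+c, or null if the escape is not recognised. */
    interface EscapeHandler {
        String resolve(char c, boolean encapsulated);
    }

    /** One possible default, following the suggestions above (an assumption, not library behaviour). */
    static EscapeHandler defaults(char esc, char encapsulator, char delimiter) {
        return (c, encapsulated) -> {
            if (c == encapsulator) {
                return String.valueOf(c);
            }
            if (encapsulated) {
                return null; // inside an encapsulated token nothing else is de-escaped
            }
            if (c == esc || c == delimiter) {
                return String.valueOf(c);
            }
            if (c == 'r') {
                return "\r";
            }
            if (c == 'n') {
                return "\n";
            }
            return null;
        };
    }

    /** Applies the handler; unrecognised escapes are retained exactly as they appeared in the source. */
    static String unescape(String token, char esc, boolean encapsulated, EscapeHandler handler) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == esc && i + 1 < token.length()) {
                char next = token.charAt(++i);
                String replacement = handler.resolve(next, encapsulated);
                if (replacement != null) {
                    out.append(replacement);
                } else {
                    out.append(esc).append(next);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

An application that wants different rules would pass its own EscapeHandler (for example a lambda) instead of the one returned by defaults.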

               


[jira] [Updated] (CSV-58) Eascape handling needs rethinking

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

     [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb updated CSV-58:
--------------------

    Summary: Eascape handling needs rethinking  (was: Unicode escapes are lost if escape character is backslash)
   


[jira] [Updated] (CSV-58) Escape handling needs rethinking

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

     [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb updated CSV-58:
--------------------

    Summary: Escape handling needs rethinking  (was: Eascape handling needs rethinking)
   


[jira] [Commented] (CSV-58) Escape handling needs rethinking

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

    [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401180#comment-13401180 ]

Anirudha Khanna commented on CSV-58:
------------------------------------

I came across a concrete use case for rethinking the escape handling in Commons CSV. MySQL's output format represents NULL as \N. This is a special sequence that should be retained as is. With this in mind, I have made an attempt to modify the escape handling in the Commons CSV parser. I have not yet made the corresponding changes to the CSV writer, but will submit a patch for that soon too. Looking forward to your thoughts.
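
As an illustration of why the sequence has to survive parsing, here is a minimal Java sketch. It is not taken from the attached patch: the class MySqlNullSketch and its values helper are hypothetical, and the sketch assumes tab-separated fields and ignores escaped tabs inside a field.

```java
import java.util.ArrayList;
import java.util.List;

final class MySqlNullSketch {

    /** The two-character sequence (backslash followed by N) that MySQL writes for NULL. */
    static final String NULL_MARKER = "\\N";

    /**
     * Hypothetical helper: splits one tab-separated line and maps the NULL marker to null.
     * Simplification: escaped tabs inside a field are not handled here.
     */
    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        for (String raw : line.split("\t", -1)) {
            // Only a field that is exactly the marker counts as NULL; the text "N" stays a string.
            out.add(NULL_MARKER.equals(raw) ? null : raw);
        }
        return out;
    }
}
```

If the parser had already reduced \N to N, a NULL field and a field holding the letter N would be indistinguishable at this point.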
               


[jira] [Updated] (CSV-58) Escape handling needs rethinking

Gilles Sadowski (Jira)
In reply to this post by Gilles Sadowski (Jira)

     [ https://issues.apache.org/jira/browse/CSV-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anirudha Khanna updated CSV-58:
-------------------------------

    Attachment: commons-csv.diff

Patch consisting of changes to expand escape handling in CSV Parser
               
