[text][TEXT-32] Regarding more edit distances.

Rob Tompkins
Hello,

With the thought that we want more “edit distances”/“similarity scores” in the codebase for the potential 1.0 release of TEXT, I’ve opened an associated Jira (TEXT-32). Does anyone have input or further ideas?

The first idea that I stumbled upon was an edit distance based upon the longest common substring. It feels a tad coarse, but that doesn’t necessarily mean that it’s not worth including.
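To make the idea concrete, here is a minimal sketch of what such a distance might look like. Everything in it is an assumption for illustration: the class name, the method names, and especially the formula (characters of both inputs that fall outside the longest shared contiguous run) are hypothetical and not taken from TEXT-32 or any existing library.

```java
public class LongestCommonSubstringDistance {

    // Length of the longest contiguous run of characters shared by both inputs.
    // Standard dynamic programming over suffix matches, keeping two rows only.
    public static int longestCommonSubstringLength(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int best = 0;
        for (int i = 1; i <= left.length(); i++) {
            int[] curr = new int[right.length() + 1];
            for (int j = 1; j <= right.length(); j++) {
                if (left.charAt(i - 1) == right.charAt(j - 1)) {
                    // Extend the match that ended at the previous pair of characters.
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    // Hypothetical distance: total characters not covered by the shared run.
    // Identical inputs give 0; inputs with nothing in common give the sum of their lengths.
    public static int distance(CharSequence left, CharSequence right) {
        return left.length() + right.length()
                - 2 * longestCommonSubstringLength(left, right);
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
    }
}
```

The coarseness shows up quickly: a single differing character in the middle of a long string roughly halves the shared run, so the distance jumps even though the strings are nearly identical.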

Other thoughts and ideas?

Cheers,
-Rob
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: [text][TEXT-32] Regarding more edit distances.

Bruno P. Kinoshita-3
Hi Rob,

LCS can still be useful in bioinformatics/genetics, so I'd say it's worth including. In Java, if I ever needed it, I would probably look for it in BioJava (I just did, and couldn't easily find it there).


As for the other string distances, I always look at this GitHub project:

https://github.com/tdebatty/java-string-similarity

And also Talend (I think Data Quality has some string distances). However, I think having the API design and some string distances implemented could be enough for a 1.0. Then we can add more and release more versions.


Cheers
Bruno



----- Original Message -----

> From: Rob Tompkins <[hidden email]>
> To: Commons Developers List <[hidden email]>
> Sent: Monday, 19 December 2016 3:47 PM
> Subject: [text][TEXT-32] Regarding more edit distances.
