[sandbox] New sandbox component

Bruno P. Kinoshita
Hello all, 
At the moment I'm working with data matching and record linkage, and had to port some existing string comparison algorithms found in several open source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).
While doing that I noticed LANG-591 [1], which suggests a more complex Levenshtein distance algorithm. There are several other algorithms too (Damerau-Levenshtein, Jaro, Jaro-Winkler, Jaccard, Bitap, q-gram, Soundex, Metaphone). Instead of trying to put them all in, say, [lang], I'd like to experiment with a new [text] component in the sandbox, if there are no objections.
I will take a look at the existing code and its license, but most of these algorithms have good wiki pages with pseudocode available, as well as academic papers.
Maybe this component could be useful for other projects like [lang], Lucene, larsga/Duke, and Talend Open Studio. And even though my initial use case for this would be string comparison, I think it could support other use cases too.
Thoughts on this? Is anyone else interested in such a component?
Thanks!
Bruno
[1] https://issues.apache.org/jira/browse/LANG-591 
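
As a point of reference for the kind of algorithm listed in the post above, here is a minimal sketch of the classic two-row dynamic-programming Levenshtein distance in Java. It is not the code from [lang], nor the variant proposed in LANG-591; the class name `LevenshteinSketch` and the method `distance` are hypothetical names chosen only for illustration.

```java
// Illustrative sketch only: hypothetical class, not part of any Commons component.
final class LevenshteinSketch {

    private LevenshteinSketch() {
    }

    /**
     * Returns the minimum number of single-character insertions,
     * deletions and substitutions needed to turn left into right.
     */
    static int distance(CharSequence left, CharSequence right) {
        int n = left.length();
        int m = right.length();
        if (n == 0) {
            return m;
        }
        if (m == 0) {
            return n;
        }

        // previous[j] holds the distance between the first i-1 chars of left
        // and the first j chars of right; current[j] is the row being built.
        int[] previous = new int[m + 1];
        int[] current = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= n; i++) {
            current[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows so the row just computed becomes "previous".
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[m];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
    }
}
```

With this sketch, `distance("kitten", "sitting")` evaluates to 3. Keeping only two rows uses memory proportional to the length of the second string rather than the full table, which is one reason this formulation is common for record-linkage workloads.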

Re: [sandbox] New sandbox component

Benedikt Ritter-4
No objections from my side. I think this is a good idea. Just let me know
if you need help with the bootstrapping of the new project. Maybe we should
even announce this on announce@. There may be other projects interested in a
library like this (for example Apache Tika [1]).

Benedikt

[1] http://tika.apache.org/

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [sandbox] New sandbox component

Luc Maisonobe-2
On 27/10/2014 08:45, Benedikt Ritter wrote:

> 2014-10-27 0:41 GMT+01:00 Bruno P. Kinoshita <[hidden email]>:
>
>> Hello all,
>> At the moment I'm working with data matching and record linkage, and had
>> to port some existing string comparison algorithms found in several open
>> source projects (fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).

There is also an implementation of the Myers difference algorithm in [collections],
package org.apache.commons.collections4.sequence.

best regards,
Luc
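
For context on the algorithm Luc mentions: the sequence-comparison code in [collections] is, as far as I know, based on Eugene W. Myers's O(ND) difference algorithm. The following is a rough, self-contained sketch of the core greedy search from that algorithm, computing only the length of the shortest edit script (insertions plus deletions). It does not use or reproduce the [collections] API; the class name `MyersSketch` and the method `shortestEditScriptLength` are hypothetical.

```java
// Illustrative sketch only: hypothetical class, not the [collections] implementation.
final class MyersSketch {

    private MyersSketch() {
    }

    /**
     * Returns the minimum number of single-character insertions and
     * deletions (no substitutions) needed to turn a into b.
     */
    static int shortestEditScriptLength(CharSequence a, CharSequence b) {
        int n = a.length();
        int m = b.length();
        int max = n + m;

        // v[max + k] stores the furthest x reached so far on diagonal k = x - y.
        int[] v = new int[2 * max + 2];

        for (int d = 0; d <= max; d++) {
            for (int k = -d; k <= d; k += 2) {
                int x;
                if (k == -d || (k != d && v[max + k - 1] < v[max + k + 1])) {
                    // Step down from diagonal k + 1 (an insertion from b).
                    x = v[max + k + 1];
                } else {
                    // Step right from diagonal k - 1 (a deletion from a).
                    x = v[max + k - 1] + 1;
                }
                int y = x - k;

                // Follow the "snake": matching characters cost nothing.
                while (x < n && y < m && a.charAt(x) == b.charAt(y)) {
                    x++;
                    y++;
                }
                v[max + k] = x;

                if (x >= n && y >= m) {
                    return d;
                }
            }
        }
        return max; // Not reached: d = n + m always suffices.
    }

    public static void main(String[] args) {
        System.out.println(shortestEditScriptLength("ABCABBA", "CBABAC"));
    }
}
```

Unlike Levenshtein distance, this measure has no substitution operation, so replacing one character counts as a deletion plus an insertion. That is the usual convention for diff-style comparison of sequences.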



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [sandbox] New sandbox component

Bruno P. Kinoshita
In reply to this post by Benedikt Ritter-4
Hi Benedikt!

> Just let me know if you need help with the bootstrapping of the new project.

Yes, please :)

> Maybe we should even announce this on announce@. There may be other projects interested in a library like this (for example Apache Tika [1])

Good idea! Should we drop a note there once the project has been created, or after we already have some code in there?

Thanks!
Bruno



Re: [sandbox] New sandbox component

Bruno P. Kinoshita
In reply to this post by Luc Maisonobe-2
Thanks Luc! Wasn't aware of that one.
Bruno
 


Re: [sandbox] New sandbox component

Benedikt Ritter-4
In reply to this post by Bruno P. Kinoshita
2014-10-27 12:32 GMT+01:00 Bruno P. Kinoshita <[hidden email]>:

> Hi Benedikt!
> > Just let me know if you need help with the bootstrapping of the new
> > project.
> Yes, please :)
>

I'll give folks some more time to share their thoughts on this and
then create the new project.


>
> > Maybe we should even announce this on announce@. There may be other
> > projects interested in a library like this (for example Apache Tika [1])
> Good idea! Should we drop a note there once the project has been created
> or after we already have some code in there?
>

The latter seems appropriate to me.

