[Text] JaccardSimilarity


[Text] JaccardSimilarity

Alex Herbert
A quick question about the JaccardSimilarity class:

Q. Why does it round the similarity to 2 decimal places?

This is not documented.

It is also done in the complementary JaccardDistance class.

Looking at the history in git it seems to have always been that way.
First commit was 2016-11-27.

Thanks,

Alex



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [Text] JaccardSimilarity

Bruno P. Kinoshita-2
Hi Alex,

I can't recall why it was done that way. When the initial code for the edit distances was created, some Java libraries like Simmetrics, java-string-similarity, and Lucene, and also R/Python code, were used to verify the output of the edit distances.

Maybe we used Math.round just to get a test passing, which I agree should have been documented.

But it would be even better to drop the Math.round and instead update the tests to use the assertEquals(expected, actual, threshold) method with a good enough threshold.

What do you think?

Cheers
Bruno
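
For illustration, the tolerance-based comparison suggested above could look roughly like the sketch below. It is not the project's test code: `assertEqualsWithin` is a hypothetical stand-in for JUnit's `assertEquals(expected, actual, delta)` so that the example runs without a test framework, and the expected value 0.428571 is simply 3/7 truncated, as one might copy it from a reference tool.

```java
final class DeltaAssertSketch {

    /** Hypothetical stand-in for JUnit's assertEquals(expected, actual, delta). */
    static void assertEqualsWithin(double expected, double actual, double delta) {
        // Fail only when the two values differ by more than the allowed tolerance.
        if (!(Math.abs(expected - actual) <= delta)) {
            throw new AssertionError(
                "expected " + expected + " but was " + actual + " (delta " + delta + ")");
        }
    }

    public static void main(String[] args) {
        // Example ratio: 3 shared characters out of 7 distinct ones.
        double unrounded = 3.0 / 7.0;

        // Reference value written down at limited precision (an assumption for the demo).
        double expected = 0.428571;

        // Passes without any rounding inside the code under test.
        assertEqualsWithin(expected, unrounded, 1e-6);
        System.out.println("unrounded result accepted within tolerance");
    }
}
```

With this style of assertion the test controls the precision, so the class under test no longer needs to round its result to make the comparison pass.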

    On Friday, 8 March 2019, 4:49:52 am NZDT, Alex Herbert <[hidden email]> wrote:  
 
 A quick question about the JaccardSimilarity class:

Q. Why does it round the similarity to 2 decimal places?


 

Re: [Text] JaccardSimilarity

Alex Herbert
Hi Bruno,

> On 7 Mar 2019, at 21:18, Bruno P. Kinoshita <[hidden email]> wrote:
>
> Hi Alex,
> Can't recall why it was done that way. When the initial code for the edit distances was created, some Java libraries like Simmetrics, java-string-similarity, Lucene, and also R/Python code were used to verify the output of the edit distances.
> Maybe we used Math.round just to get a test passing, which I agree it had to be documented.
> But even better if we just drop the Math.round and instead update the tests with that assertEquals(expected, actual, threshold) method, with a good enough threshold.
> What do you think?

I’d favour dropping the round and recording the change in Changes.xml via a Jira ticket, so that it is noted for anyone who upgrades. Users can always restore the previous behaviour by rounding the output of the class themselves.
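
As a rough sketch of that last point, a caller could wrap any similarity function and round its result to two decimal places on their side. The names `RoundedScoreSketch` and `roundedTo2Places` are made up for this example, and the fixed-value score is only a placeholder; with Commons Text on the classpath one would presumably pass something like `new JaccardSimilarity()::apply` instead (an assumption, not tested here).

```java
import java.util.function.BiFunction;

final class RoundedScoreSketch {

    /** Wraps a similarity function so that its result is rounded to two decimal places. */
    static BiFunction<CharSequence, CharSequence, Double> roundedTo2Places(
            BiFunction<CharSequence, CharSequence, Double> score) {
        // Scale by 100, round to the nearest long, then scale back.
        return (left, right) -> Math.round(score.apply(left, right) * 100d) / 100d;
    }

    public static void main(String[] args) {
        // Placeholder score that returns a fixed unrounded value, used only for the demo.
        BiFunction<CharSequence, CharSequence, Double> unrounded = (a, b) -> 3.0 / 7.0;

        System.out.println(unrounded.apply("night", "nacht"));
        System.out.println(roundedTo2Places(unrounded).apply("night", "nacht")); // prints 0.43
    }
}
```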

If I understand the metric correctly (intersection over union), a difference in the 3rd decimal place would require the union of the two character sets to be above 200, i.e. strings containing over 200 unique characters between them, e.g.

A) 0/200 = 0
B) 1/200 = 0.005
C) 2/200 = 0.01

In this case results A and C can be distinguished, but B and C cannot, because B rounds up to 0.01.

So in practical terms it would not make a difference unless using a large character set. For ASCII strings there is no difference.
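
The A/B/C arithmetic above can be checked with a small sketch. This is not the Commons Text implementation: `JaccardRoundingSketch`, `jaccard`, `charsOf` and `roundTo2Places` are names invented for the example, the metric is assumed to be the character-set intersection over union, and the rounding is assumed to be a `Math.round` on the value scaled by 100.

```java
import java.util.HashSet;
import java.util.Set;

final class JaccardRoundingSketch {

    /** Character-set Jaccard index: |intersection| / |union|, with no rounding. */
    static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> a = charsOf(left);
        Set<Character> b = charsOf(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0d; // assumption: two empty inputs score 0
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Collects the distinct characters of a sequence. */
    static Set<Character> charsOf(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    /** Rounds to two decimal places (assumed form of the rounding under discussion). */
    static double roundTo2Places(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        // Cases A, B and C: 0, 1 and 2 shared characters with a union of 200.
        for (int shared = 0; shared <= 2; shared++) {
            double raw = shared / 200d;
            System.out.println(shared + "/200 = " + raw + " -> " + roundTo2Places(raw));
        }
        // A small everyday example: 3 shared characters out of 7 distinct ones.
        double raw = jaccard("night", "nacht");
        System.out.println(raw + " -> " + roundTo2Places(raw));
    }
}
```

Under these assumptions, cases B and C both round to 0.01 while A stays at 0. The night/nacht line shows a separate effect: a single ratio such as 3/7 also loses digits once rounded (it becomes 0.43), whatever the size of the character set.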

I’ve already written the test against the Python distance.jaccard function (from the distance library) in the PR for Text-155, so changing the test is simple. The only decision is whether to do it.

Alex




Re: [Text] JaccardSimilarity

Bruno P. Kinoshita-3
On Friday, 8 March 2019, 10:54:32 am NZDT, Alex Herbert <[hidden email]> wrote:

> I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so it is noted if someone upgrades. They can always restore functionality to as-it-was by doing a round on the output of the class.

+1

> I’ve already made the test using the python distance.jaccard function from the distance library in the PR for Text-155. So changing the test is simple. It’s just the decision on whether to do it.

I think we can aim at implementing this for 1.7 (which from the looks of it will have several bug fixes & improvements!).

Cheers
Bruno
 

Re: [Text] JaccardSimilarity

Alex Herbert


> On 8 Mar 2019, at 00:01, Bruno P. Kinoshita <[hidden email]> wrote:
>
>> I’d favour dropping the round and adding it to the Changes.xml via a Jira ticket so it is noted if someone upgrades. They can always restore functionality to as-it-was by doing a round on the output of the class.
> +1
>> I’ve already made the test using the python distance.jaccard function from the distance library in the PR for Text-155. So changing the test is simple. It’s just the decision on whether to do it.
> I think we can aim at implementing this for 1.7 (which from the looks of it will have several bug fixes & improvements!).
> Cheers
> Bruno

I'll put the changes into a Jira and PR.

Alex

