Google: we all know what it is. Searching with Google is now so common that the name has been turned into a verb: "Let's 'google' it", or "Did you try 'googling' it?" There is even an adjective, "googleable". If you search with a misspelled word, you usually find the correct spelling displayed as one of the first options in the drop-down list of suggestions. So how does Google do it? Let's take a sneak peek at how it is done. Google does it on a humongous scale, running on tons of systems, for millions of lookups on millions of terabytes of data and for pretty long sentences as well. Here, we are going to do it on a much smaller scale. We will use Java and the Lucene package to simulate a wee little bit of what Google does, using common first names of individuals.

I need your help to understand the fuzzy search functionality of the LuceneSail.

According to the Lucene query parser syntax documentation, Lucene supports fuzzy searches based on the Levenshtein distance by adding the "~" symbol to any word of the input of a full-text query. They also say that an additional parameter can be used to specify the required similarity value (between 0 and 1).

My issue is that if I specify a value for this parameter, I need to retrieve the concrete value obtained for each result, and I can't find any site explaining how it is computed (especially how the Levenshtein distance is converted to a percentage). I've tried to calculate it myself with this formula:

similarity = (longerLength(x,y) - levenshteinDistance(x,y)) / longerLength(x,y)

and it doesn't match the tests I made (e.g., I obtain 0.6 for a concrete result, but it still appears if I use a similarity value filter of 0.7).

I require all the words of the input to appear in the results, and I apply the same similarity value to all of the words. It seems that all the results I'm obtaining get the same similarity value, because if I specify a value higher than a certain threshold they are all filtered out, and all the tests with a similarity value lower than the threshold return the same number of results. With this in mind, I've also tried to calculate the average of the matches in pairs of words, e.g.:

Query = "aaa ccc" (effectively it is "+aaa~0.5 +ccc~0.5")

I calculated (similarity("aaa", "aba") + similarity("ccc", "ccd")) / 2

and the result doesn't match the behaviour of the Lucene query engine either.

Thank you very much for your attention. It would be enough if you linked me to a website that explains how the Lucene query engine works for fuzzy searches.

I still find an odd behaviour considering that explanation of how the similarity value works in fuzzy searches. In a test I've made I obtained 936 results setting the number of edits to 1, and 966 if I set it to an integer of 2 or higher (although >2 will still be 2, right?). Those results are what I expect. However, I've also tried values lower than 1 and the results are a bit confusing. I know it doesn't make sense to use those values, but they are the ones I had been trying before, because I thought they were the only allowed values. I obtained different numbers of results in different ranges between 0 and 1:

If I use a value between 0 and 0.715 I obtain 966 results.

If I use a value between 0.715 and 0.85 I obtain 936 results.

If I use a value between 0.85 and 1 I obtain 0 results.

Also, if I use a decimal value higher than 1 I get a ParseException stating that fractional edit distances are not allowed.

It is not a very important issue, because I could force the users to set only integer values (0, 1 or 2), but I would like to understand Lucene's behaviour completely, because I'd like to add to each result its similarity value (not the score, but a similarity percentage based on the edit distance).

I'm using the Lucene 5.1.0 libraries, although I'm still using Sesame 2.8.11 (I still have to migrate my tool to RDF4J). I don't know if that could cause the floating-point value in fuzzy searches to still be processed.
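To make the similarity formula from the question concrete, here is a minimal Java sketch of it, next to one possible reason why it may not line up with Lucene's filtering. The class name `FuzzySimilaritySketch` and the constant `MAX_SUPPORTED_EDITS` are made up for illustration. The method `floatToEdits` is a re-implementation from memory of the deprecated `FuzzyQuery.floatToEdits` conversion that, as far as I know, Lucene 4.x and 5.x apply to legacy fractional values. It is an assumption rather than a copy of Lucene's code, so please check it against the source of the version you run.

```java
public class FuzzySimilaritySketch {

    // Assumption: Lucene's Levenshtein automata support at most 2 edits.
    public static final int MAX_SUPPORTED_EDITS = 2;

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    public static int levenshteinDistance(String x, String y) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        for (int j = 0; j <= y.length(); j++) {
            prev[j] = j; // distance from the empty prefix of x
        }
        for (int i = 1; i <= x.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length(); j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[y.length()];
    }

    /** The formula from the question: (longerLength - distance) / longerLength. */
    public static double similarity(String x, String y) {
        int longer = Math.max(x.length(), y.length());
        if (longer == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return (longer - levenshteinDistance(x, y)) / (double) longer;
    }

    /**
     * Assumed legacy conversion from a fuzzy value to a maximum edit count.
     * Values >= 1 are read as an edit count (capped at 2); values below 1 are
     * scaled by the length of the QUERY term and truncated.
     */
    public static int floatToEdits(float minimumSimilarity, int termLength) {
        if (minimumSimilarity >= 1f) {
            return (int) Math.min(minimumSimilarity, MAX_SUPPORTED_EDITS);
        }
        if (minimumSimilarity == 0.0f) {
            return 0; // exactly 0 is assumed to mean an exact match
        }
        return Math.min((int) ((1D - minimumSimilarity) * termLength), MAX_SUPPORTED_EDITS);
    }

    public static void main(String[] args) {
        // Average from the question for "+aaa~0.5 +ccc~0.5"
        double average = (similarity("aaa", "aba") + similarity("ccc", "ccd")) / 2;
        System.out.println("average similarity = " + average);

        // Hypothetical seven-character query term: where does the edit budget change?
        for (float value : new float[] {0.5f, 0.7f, 0.72f, 0.8f, 0.9f}) {
            System.out.println(value + " -> max edits " + floatToEdits(value, 7));
        }
    }
}
```

If this conversion is what actually runs, the fractional value is never compared with a per-result similarity. It is turned into an edit budget of 0, 1 or 2 that depends only on the length of the query term, so every value inside one range returns the same result set. For a seven-character term the budget would change near 5/7 (about 0.714) and 6/7 (about 0.857), which is close to the ranges reported, although the actual query terms are not given in the question.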