Add custom Analyzer to handle misspells (by Caio, 7 years ago)
This patch introduces a PhoneticEnglishAnalyzer class that tokenizes its input into a series of stemmed tokens (just like the EnglishAnalyzer) _along_ with a phonetic representation of said tokens.

I can't seem to learn how to spell "cinnamon" correctly: I always forget to double the first "n". This analyzer is an attempt at covering this (and similar) cases without a serious performance penalty (such as the one running SpellCheck on every token would incur).

Since this analyzer outputs two tokens for every token the previous analyzer would, I expect the index to grow by about 2x (somewhat less, given that many things sound VERY similar in English and that I've added [0,99] to the stop word set).

I also expect a fairly serious performance impact on my tests, since we'll be hitting two inverted indices instead of one. While they'll match mostly the same documents, all the jumping around means cache evictions and whatnot, and that's going to be noticeable on something as heavily CPU bound as this.

Performance:

    Benchmark                            Mode  Cnt   Score   Error  Units
    QueryBenchmark.withPolicy           thrpt    5  39.419 ± 0.123  ops/s
    QueryBenchmark.withPolicyAndFacets  thrpt    5   8.886 ± 0.161  ops/s

(withPolicyAndFacets was adjusted to query for `bacon garlic`, as `pork` started matching many more documents after the change)

Still a bit too high of a cost in my opinion, so this remains tabled.
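
To illustrate the "token plus phonetic representation" idea described above, here is a minimal plain-Java sketch. It is not the PhoneticEnglishAnalyzer from this patch, whose code and phonetic encoder are not shown in this message. The class name `PhoneticSketch`, the choice of a simplified Soundex encoding, and the omission of stemming are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: not the analyzer from the patch.
final class PhoneticSketch {
    // Soundex digit for each letter a..z; '0' means "no code" (vowels, h, w, y).
    private static final String CODES = "01230120022455012623010202";

    // Simplified American Soundex: first letter plus up to three digits.
    static String soundex(String word) {
        String w = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (int i = 0; i < w.length() && out.length() < 4; i++) {
            char c = w.charAt(i);
            if (c < 'A' || c > 'Z') continue;       // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                      // keep the first letter as-is
                last = code;
            } else if (c == 'H' || c == 'W') {
                continue;                           // h/w do not separate duplicate codes
            } else {
                if (code != '0' && code != last) out.append(code);
                last = code;                        // a vowel resets the duplicate check
            }
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');   // pad to four characters
        return out.toString();
    }

    // Emits each token followed by its phonetic code, so the output holds
    // roughly two entries per input token. One- and two-digit numbers are
    // dropped, mimicking the [0,99] stop words mentioned above.
    // Codes are upper case and tokens lower case, so the two cannot collide.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (t.isEmpty() || t.matches("\\d{1,2}")) continue;
            tokens.add(t);
            String code = soundex(t);
            if (!code.isEmpty()) tokens.add(code);
        }
        return tokens;
    }
}
```

With this sketch, `PhoneticSketch.soundex("cinnamon")` and `PhoneticSketch.soundex("cinamon")` both yield `C555`, so a query containing the misspelling can still match on the phonetic token even though the literal tokens differ.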