caio.co/de/cerberus


Add custom Analyzer to handle misspells, by Caio 7 years ago
This patch introduces a PhoneticEnglishAnalyzer class that
tokenizes an input into a series of stemmed tokens (just
like the EnglishAnalyzer) _along_ with a phonetic representation
of said tokens.

I can't seem to learn how to write "cinnamon" correctly, always
forgetting to double the first "n". This analyzer is an attempt
at covering this (and similar) cases without serious performance
penalties (say, from running SpellCheck on every token).
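
To make the idea concrete, here is a minimal, dependency-free sketch of emitting each token along with a phonetic key. It is not the PhoneticEnglishAnalyzer from this patch: the commit does not say which phonetic encoder is used, so American Soundex stands in as an assumption, stemming is skipped entirely, and the `PhoneticSketch` class with its `soundex` and `analyze` methods is a hypothetical name made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the class introduced by this patch.
final class PhoneticSketch {
    // Soundex digit for each letter A..Z ('0' means "not coded").
    private static final String CODES = "01230120022455012623010202";

    private static char code(char upper) {
        return CODES.charAt(upper - 'A');
    }

    // American Soundex: first letter plus up to three digits, zero padded.
    static String soundex(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char prev = code(letters.charAt(0));
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate equal codes
            }
            char d = code(c);
            if (d != '0' && d != prev) {
                out.append(d); // new consonant sound
            }
            prev = d; // vowels reset prev, so a repeated code after a vowel is kept
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    // Emits two tokens per input token: the lowercased token and its phonetic key.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(raw.toLowerCase(Locale.ROOT));
            tokens.add(soundex(raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both spellings share the same phonetic key, so either one can match.
        System.out.println(soundex("cinnamon"));
        System.out.println(soundex("cinamon"));
        System.out.println(analyze("cinamon rolls"));
    }
}
```

With this stand-in encoder, "cinnamon" and "cinamon" produce the same key, so a query with the misspelling can still reach documents indexed with the correct spelling through the phonetic token.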

Given that this analyzer outputs two tokens for every token that
the previous analyzer would, I'm expecting a growth of about 2x
in the index size (somewhat less given that many things sound
VERY similar in English and that I've added [0,99] to the stop
word set).
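
As a rough illustration of the stop word change, the sketch below builds a numeric stop set and filters tokens against it. It assumes that "[0,99]" means the decimal strings "0" through "99"; the `NumericStopWords` class and its methods are hypothetical names, and the real patch presumably hands an equivalent set to the Lucene analyzer rather than filtering lists by hand.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not code from this patch.
final class NumericStopWords {
    // Assumption: [0,99] is the inclusive range of integers, as decimal strings.
    static Set<String> numericStopWords() {
        Set<String> stop = new HashSet<>();
        for (int i = 0; i <= 99; i++) {
            stop.add(Integer.toString(i));
        }
        return stop;
    }

    // Keeps only the tokens that are not in the stop set, preserving order.
    static List<String> dropStopWords(List<String> tokens, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = numericStopWords();
        // "2" is dropped, "100" falls outside the range and is kept.
        System.out.println(dropStopWords(List.of("2", "cups", "flour", "100"), stop));
    }
}
```

Dropping small numbers this way keeps quantities such as "2" from producing both a plain and a phonetic term, which is one reason the index growth can end up below 2x.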

I also expect a somewhat serious performance impact on my tests
since we'll be hitting two inverted indices instead of one. While
they'll be matching mostly the same documents, all the jumping
around means cache evictions and whatnot, and that's gonna be
noticeable on something that's heavily CPU-bound such as this.

Performance:

Benchmark                            Mode  Cnt   Score   Error  Units
QueryBenchmark.withPolicy           thrpt    5  39.419 ± 0.123  ops/s
QueryBenchmark.withPolicyAndFacets  thrpt    5   8.886 ± 0.161  ops/s

(withPolicyAndFacets adjusted to query for `bacon garlic` as `pork`
started matching many more documents after the change)

Still a bit too high of a cost in my opinion, so this remains tabled.


Cerberus

Cerberus is the Search and Metadata Retrieval library for gula.recipes and makes use of Lucene for searching and Chronicle-Map for metadata persistence / memory-mapping.

Build

mvn install

Test

mvn test