Log for src/
-
Experimental MoreLikeThis-based similarity 💬 by Caio 6 years ago
This patch exposes a `Searcher.findSimilar(String, int)` method that expects a plaintext recipe (name + ingredients + instructions) as input. The reason I wasn't liking the MoreLikeThis results is that since I split the FULLTEXT index field into separate fields, the logic in MoreLikeThis couldn't be applied properly given the String-based interface I decided to use. Elaborating: - MoreLikeThis works by looking at the term statistics for the queried text and compares against the frequencies of the documents stored in the index - Passing a single string to represent the recipe meant that it needed a way to know which of the given terms should only happen in the name (or only in ingredients, or instructions) So I had two approaches to tackle this: 1. Manually build a boolean query with a `MoreLikeThis.like()` clause for each field we care and boost accordingly 2. Add back a "full text" index and query on it. Alternative #1 sounded annoying to tweak, #2 sounded simple but costing an increase in the index size. I chose #2 because we have RAM to spare, so `vmtouch`-ing the index and letting Lucene fly is still very applicable. This patch adds a QA-type test where we just verify that searching for a recipe that's in the index actually returns said recipe in the top 5 results.
-
Automated code formatting run by Caio 6 years ago
-
Always compute accurate counts 💬 by Caio 6 years ago
Lucene 8.0.0 introduced some query-time optimizations that make totalHits an approximate value when the result comes straight from the IndexSearcher. References: - https://issues.apache.org/jira/browse/LUCENE-8135 - https://issues.apache.org/jira/browse/LUCENE-8060 I didn't catch these in my previous upgrade attempts because I was always using a FacetsCollector, which triggers the "unoptimized" code. This patch simply makes us *always* call `count()` so that our totalHits are always accurate.
-
Upgrade to lucene 8.0.0 (again) 💬 by Caio 6 years ago
Now with a better search path (ref: da33d428) there isn't a much noticeable throughput loss for the facetedQuery and basicQuery is actually faster[^1]. Lucene 7.7.1: Benchmark Mode Cnt Score Error Units QueryBenchmark.basicQuery thrpt 3 98.053 ± 6.045 ops/s QueryBenchmark.facetedQuery thrpt 3 10.936 ± 0.203 ops/s Lucene 8.0.0: Benchmark Mode Cnt Score Error Units QueryBenchmark.basicQuery thrpt 3 103.662 ± 3.564 ops/s QueryBenchmark.facetedQuery thrpt 3 10.328 ± 0.741 ops/s [^1]: Benchmarks were run on a noisy system, so I'm not taking these numbers as face value, but there's no indication of the previously observed 20+% throughput loss that made me block the upgrade.
-
Use a FacetsCollector IFF facets will be collected 💬 by Caio 6 years ago
Searching without FacetsCollector is drastically faster as it does not need to, erm, collect facets. Why could have guessed? :-P This patch changes the search behaviour as follows: * If query.maxFacets() == 0: A search is done the fastest way possible; That's more than 2x faster than a plain search before this patch. * If query.maxFacets() > 0: We `count()` the number of matching docs first and use it to decide whether to allow computing facets or not (SearchPolicy.shouldComputeFacets now takes an int): counting is super fast and doesn't significantly affect the timings in our current 1+M index size.
-
Persist IndexConfiguration to disk upon request 💬 by Caio 6 years ago
For some reason, FacetsConfig settings are not written and must be set correctly otherwise counts will be wrong in some cases. One alternative would be to pass the CategoryExtractor around, but this means that I always need to know about how to extract facets even if all I'm doing is querying; This patch takes an approach that I consider better: we persist the necessary FacetsConfig setting to the base directory we load from so that a Searcher will always do the right thing.
-
Remove braces surrounding lambda 💬 by Caio 6 years ago
Purely cosmetic.
-
Remove dead code by Caio 6 years ago
-
And make LadelData.children() a Map<String, Long> 💬 by Caio 6 years ago
One less useless abstraction and the result API is drastically easier to use, just look at the tests.
-
Change SearchResult.facets() from List to Map by Caio 6 years ago
-
Remove Diet-related strictness 💬 by Caio 6 years ago
I might regret this later, but right now it doesn't seem like I need to care about diet names here. The only reason I can think of now is to avoid indexing rubbish like a bunch of invalid diet names because for some reason our source data started emitting such. I'll take the opportunity to drop this awkward class. If I end up needing some strictness here, I'll do it via the client side as well, like it was done for taxonomy facets.
-
Remove forgotten println() by Caio 6 years ago
-
Drop the Serializer 💬 by Caio 6 years ago
Useless and poorly named
-
Drop the Loader 💬 by Caio 6 years ago
This is being moved to casserole since it's what controls which facets to be indexed/collected.
-
Add the "diet" category to the test indexer 💬 by Caio 6 years ago
This is all that was really needed to make the tests pass again (sans API changes from previous patches, of course). I'm happy with it!
-
IndexField.FACET_DIET doesn't exist anymore by Caio 6 years ago
-
Adjust to new Searcher.Builder exception path by Caio 6 years ago
-
Make taxonomy facets configurable by the client 💬 by Caio 6 years ago
This commit makes the Indexer now accept an optional CategoryExtractor that is responsible to extract categories and labels from any recipe. What this effectively mean is that we support generic facets now and, by default, we don't index any facet. CategoryExtractorTest has an example on how this is all fitting together. I'm not particularly happy about the state Searcher.Builder() is at right now, but that's more cosmetic...
-
Re-add support for facets 💬 by Caio 6 years ago
Taxonomy-based single-index static facets. Not super fast, but saves us from doing range collections along with taxonomy ones. I don't have performance numbers from when we would expose all facets, so this is just comparing what's in master (Just a threshold-based count for diets) with the new implementation that has everything that's available as filters _but_ doesn't compute the threshold-aware counts: Old: QueryBenchmark.facetedQuery thrpt 5 3.584 ± 0.012 ops/s New: QueryBenchmark.facetedQuery thrpt 5 4.108 ± 0.013 ops/s So, still a 50+% throughput loss when compared to `basicQuery`, but now with all the counts that casserole cares about. Now we'll be able to use the SearchPolicy to conditionally allow computing the facets so that filters finally have counts :-) Since this patch doesn't implement the threshold-aware diet facet, the tests at SearcherTest.dietThreshold() have been modified to not care about facets.
-
Revert "Upgrade to Lucene 8.0.0" 💬 by Caio 6 years ago
This reverts commit 642a2cca848815702ea74895ffbd1cc5ffed418c. This upgrade has a performance regression I'm not willing to swallow before understanding what's going on: 7.7.1: QueryBenchmark.basicQuery thrpt 5 9.090 ± 0.090 ops/s 8.0.0: QueryBenchmark.basicQuery thrpt 5 6.528 ± 0.031 ops/s (Yes, these sorry numbers are from the production box)
-
Match Indexer.addRecipe style to SearcherImpl.toLuceneQuery 💬 by Caio 6 years ago
It's a bit more verbose but I find it easier to "get it".
-
Upgrade to Lucene 8.0.0 by Caio 6 years ago
-
Get rid of FileSystem, rework Builders 💬 by Caio 6 years ago
Ever since IndexConfiguration was introduced, FileSystem stated looking (even more) awkward. So today I finally got around getting rid of it. As this was getting quite involved I decided to clean up the Indexer and Searcher builders so things would be easier to grok.
-
Add more metadata to the db, adjust model.Recipe 💬 by Caio 6 years ago
This patch changes our database format to not contain the instructions and adds some of the metadata from model.Recipe into it (namely: timing and nutrition details). The end result is a database file that's less than half the original size: 933M new/chronicle.db 2.0G old/chronicle.db
-
Make Diet.isKnown() public by Caio 6 years ago
-
Adapt to new data model version 💬 by Caio 6 years ago
Glutenfree tags were unused and Vegan is too fussy to predict reliably with what I'm doing at the moment, so we replace it with Vegetarian.
-
Make Searcher implementations package private by Caio 6 years ago
-
Extract SearcherWithPolicy 💬 by Caio 6 years ago
Now this is a LOT nicer :-)
-
Drop QueryInterpreter: Move logic inside SearchImpl 💬 by Caio 6 years ago
It bothers me that I pass around a bunch of stuff just so that I can create this QueryInterpreter that I don't have any intention of providing alternative implementation for now. The SearchPolicy addition in particular was what flagged that the setup was quite awkward. Next step: Use inheritance to build a searcher that handles policies instead of the current error prone `searchPolicy != null`.
-
Extract the Searcher interface by Caio 6 years ago