Log
-
[maven-release-plugin] prepare release 0.5.2 by Caio 7 years ago
-
Add similar_ids to the model 💬 by Caio 7 years ago
Precomputed list of similar recipe ids for the generalized case.
-
[maven-release-plugin] prepare release 0.5.1 by Caio 7 years ago
-
Drop unused index fields 💬 by Caio 7 years ago
These are unused since 7244fb1319e5fe6ee03c151a4d102b3399917267
-
[maven-release-plugin] prepare release 0.5.0 by Caio 7 years ago
-
Make SearchPolicy allow rewriting queries 💬 by Caio 7 years ago
This patch changes the query inspection interface to return a Query object, thus allowing for conditional rewrites client-side.
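A minimal sketch of the idea, not the project's actual interface: once the inspection hook returns a query instead of just looking at one, a client-side policy can conditionally swap it out. `TextQuery`, `RewritingPolicy`, `inspect`, `PolicyExample` and the short-input rule are all hypothetical stand-ins (the real code deals in Lucene `Query` objects).

```java
// Hypothetical stand-in for a parsed query; the real code passes Lucene Query objects.
final class TextQuery {
    final String field;
    final String text;

    TextQuery(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

// Hypothetical policy hook: returning a query lets the policy rewrite it.
interface RewritingPolicy {
    TextQuery inspect(TextQuery parsed);
}

final class PolicyExample {
    // Illustrative rule (an assumption): very short inputs only search the name field.
    static final RewritingPolicy SHORT_INPUT_SEARCHES_NAME =
        parsed -> parsed.text.length() < 4 ? new TextQuery("name", parsed.text) : parsed;

    // The searcher runs whatever query the policy hands back.
    static TextQuery applyPolicy(RewritingPolicy policy, TextQuery parsed) {
        return policy.inspect(parsed);
    }
}
```

A policy that has nothing to change simply returns the query it was given, so the previous inspect-only behaviour stays expressible.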
-
Only query FULL_RECIPE when searching 💬 by Caio 7 years ago
This simplifies things for some more complicated policy logic I'd like to implement. I'll leave query smartness for the graph model and generalized logic instead of queryparser magic.
-
Tune MoreLikeThis query building parameters 💬 by Caio 7 years ago
Filter stop words, use a real threshold for doc frequency instead of a percentage.
-
Drop support for `findSimilar(long, int)` 💬 by Caio 7 years ago
While I'm not particularly bothered by the index size increase that this feature requires, it's simpler and a lot more efficient to just pre-compute the similarities and persist in the db.
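A rough sketch of what the pre-computed alternative could look like on the read side, assuming the similar ids are already loaded from the db into a map. `PrecomputedSimilarity` and its in-memory map are hypothetical; the commit itself does not show how the data is stored or read.

```java
import java.util.List;
import java.util.Map;

// Hypothetical read side for pre-computed similarities (names are assumptions).
final class PrecomputedSimilarity {
    private final Map<Long, List<Long>> similarIds;

    PrecomputedSimilarity(Map<Long, List<Long>> similarIds) {
        this.similarIds = similarIds;
    }

    // Returns at most maxResults ids; unknown recipes yield an empty list.
    List<Long> findSimilar(long recipeId, int maxResults) {
        List<Long> ids = similarIds.getOrDefault(recipeId, List.of());
        return ids.subList(0, Math.min(maxResults, ids.size()));
    }
}
```

With this shape the lookup is a map access plus a truncation, with no query-time term statistics involved.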
-
Reference getFieldNameForDiet directly 💬 by Caio 7 years ago
We already do a wildcard static import of the IndexField class
-
Ensure `findSimilar(long, int)` respects maxResults 💬 by Caio 7 years ago
The only way this can happen with a MoreLikeThis-based implementation is if we somehow end up with several distinct recipes with exactly the same content, so this test is more for my future self when trying to make this faster.
-
[maven-release-plugin] prepare release 0.4.0 by Caio 7 years ago
-
Expose findSimilar(recipeId, maxResults) 💬 by Caio 7 years ago
This patch exposes a `Searcher.findSimilar(long, int)` API so that we can make use of already indexed data to find recipes similar to a given (known) recipe id. It's effectively a more efficient `Searcher.findSimilar(String, int)` for when we already know the recipe we're querying similarities for.
-
Store term vectors in the FULL_RECIPE index by Caio 7 years ago
-
Add support for searching for a recipeId 💬 by Caio 7 years ago
This patch adds a new LongPoint to the index to allow us to query based on the recipeId. This is useful for functionality that requires the (out of my control) document id that Lucene generates.
-
Merge branch 'morelikethis' by Caio 7 years ago
-
[maven-release-plugin] prepare release 0.3.1 by Caio 7 years ago
-
Experimental MoreLikeThis-based similarity 💬 by Caio 7 years ago
This patch exposes a `Searcher.findSimilar(String, int)` method that expects a plaintext recipe (name + ingredients + instructions) as input.

The reason I wasn't liking the MoreLikeThis results is that since I split the FULLTEXT index field into separate fields, the logic in MoreLikeThis couldn't be applied properly given the String-based interface I decided to use. Elaborating:

- MoreLikeThis works by looking at the term statistics for the queried text and comparing them against the frequencies of the documents stored in the index
- Passing a single string to represent the recipe meant that it needed a way to know which of the given terms should only happen in the name (or only in ingredients, or instructions)

So I had two approaches to tackle this:

1. Manually build a boolean query with a `MoreLikeThis.like()` clause for each field we care about and boost accordingly
2. Add back a "full text" index and query on it.

Alternative #1 sounded annoying to tweak; #2 sounded simple, at the cost of an increase in the index size. I chose #2 because we have RAM to spare, so `vmtouch`-ing the index and letting Lucene fly is still very applicable.

This patch adds a QA-type test where we just verify that searching for a recipe that's in the index actually returns said recipe in the top 5 results.
-
Automated code formatting run by Caio 7 years ago
-
Always compute accurate counts 💬 by Caio 7 years ago
Lucene 8.0.0 introduced some query-time optimizations that make totalHits an approximate value when the result comes straight from the IndexSearcher. References:

- https://issues.apache.org/jira/browse/LUCENE-8135
- https://issues.apache.org/jira/browse/LUCENE-8060

I didn't catch these in my previous upgrade attempts because I was always using a FacetsCollector, which triggers the "unoptimized" code. This patch simply makes us *always* call `count()` so that our totalHits are always accurate.
-
[maven-release-plugin] prepare release 0.3.0 by Caio 7 years ago
-
Upgrade to lucene 8.0.0 (again) 💬 by Caio 7 years ago
Now with a better search path (ref: da33d428) there isn't much of a noticeable throughput loss for the facetedQuery, and basicQuery is actually faster[^1].

Lucene 7.7.1:

Benchmark                     Mode  Cnt    Score   Error  Units
QueryBenchmark.basicQuery    thrpt    3   98.053 ± 6.045  ops/s
QueryBenchmark.facetedQuery  thrpt    3   10.936 ± 0.203  ops/s

Lucene 8.0.0:

Benchmark                     Mode  Cnt    Score   Error  Units
QueryBenchmark.basicQuery    thrpt    3  103.662 ± 3.564  ops/s
QueryBenchmark.facetedQuery  thrpt    3   10.328 ± 0.741  ops/s

[^1]: Benchmarks were run on a noisy system, so I'm not taking these numbers at face value, but there's no indication of the previously observed 20+% throughput loss that made me block the upgrade.
-
Use a FacetsCollector IFF facets will be collected 💬 by Caio 7 years ago
Searching without FacetsCollector is drastically faster as it does not need to, erm, collect facets. Who could have guessed? :-P

This patch changes the search behaviour as follows:

* If query.maxFacets() == 0: A search is done the fastest way possible. That's more than 2x faster than a plain search before this patch.
* If query.maxFacets() > 0: We `count()` the number of matching docs first and use it to decide whether to allow computing facets or not (SearchPolicy.shouldComputeFacets now takes an int). Counting is super fast and doesn't significantly affect the timings at our current 1M+ index size.
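The decision described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `FacetPlanner`, `Plan` and `choose` are hypothetical names, the count is modelled as an `IntSupplier` so that it is only evaluated when facets were requested, and the policy is modelled as an `IntPredicate` standing in for `SearchPolicy.shouldComputeFacets(int)`.

```java
import java.util.function.IntPredicate;
import java.util.function.IntSupplier;

// Hypothetical planner: picks the cheap path unless facets are wanted and allowed.
final class FacetPlanner {
    enum Plan { PLAIN_SEARCH, SEARCH_WITH_FACETS }

    static Plan choose(int maxFacets, IntSupplier countMatches, IntPredicate shouldComputeFacets) {
        if (maxFacets == 0) {
            // No facets requested: skip counting and facet collection entirely.
            return Plan.PLAIN_SEARCH;
        }
        // Facets requested: count first, then let the policy decide.
        int matching = countMatches.getAsInt();
        return shouldComputeFacets.test(matching)
            ? Plan.SEARCH_WITH_FACETS
            : Plan.PLAIN_SEARCH;
    }
}
```

For instance, `FacetPlanner.choose(5, () -> 500, n -> n <= 1000)` picks the faceted path under a made-up policy that only allows facets for result sets of up to 1000 matches.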
-
Persist IndexConfiguration to disk upon request 💬 by Caio 7 years ago
For some reason, FacetsConfig settings are not written and must be set correctly, otherwise counts will be wrong in some cases. One alternative would be to pass the CategoryExtractor around, but this means that I always need to know how to extract facets even if all I'm doing is querying. This patch takes an approach that I consider better: we persist the necessary FacetsConfig setting to the base directory we load from so that a Searcher will always do the right thing.
-
Remove braces surrounding lambda 💬 by Caio 7 years ago
Purely cosmetic.