Log for src/
-
Remove logback dependency by Caio 5 years ago
-
Simplify IndexConfiguration by Caio 5 years ago
-
Get rid of Indexer.Builder 💬 by Caio 5 years ago
Same reasoning as in 6935ad40bbce0da7484ef239ad93b5b9d1379e05
-
Get rid of Searcher.Builder 💬 by Caio 5 years ago
Whilst the pattern is super useful for things like SearchQuery, there's zero advantages to using it here and then the type-checking loss becomes kinda inexcusable :-)
-
SearchQuery: Replace dietThreshold() with diet() 💬 by Caio 5 years ago
This patch changes the query model to only allow a single chosen diet (and its corresponding threshold). This is being done mostly to simplify usage: the only code I have that makes use of the feature is unpublished and for reporting only - The web UI only uses one and I have no plans of allowing more than one.
-
Speed up database loading 💬 by Caio 5 years ago
Turns out that slurping a 14MB file (offsets.sdb) during initialization added 30s to the production server startup time lol. On my development box, which has an old but real SSD doing direct reads and writes, the cost of doing this is negligible. This patch simply memory-maps the offsets file before reading, so now we're back to reasonable startup times in production.
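A minimal sketch of the trick, using only java.nio (file name and record layout hypothetical): map the offsets file once, then read from the mapped buffer instead of slurping it sequentially.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedOffsets {
    // Maps the whole offsets file into memory and reads longs out of the
    // mapped buffer, avoiding a slow sequential slurp on cold storage.
    static long[] readOffsets(Path offsetsFile) throws IOException {
        try (FileChannel ch = FileChannel.open(offsetsFile, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long[] offsets = new long[(int) (ch.size() / Long.BYTES)];
            for (int i = 0; i < offsets.length; i++) {
                offsets[i] = buf.getLong();
            }
            return offsets;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny fake offsets file and read it back via the mapping.
        Path tmp = Files.createTempFile("offsets", ".sdb");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ByteBuffer out = ByteBuffer.allocate(3 * Long.BYTES);
            out.putLong(42L).putLong(7L).putLong(1061982L).flip();
            ch.write(out);
        }
        long[] offsets = readOffsets(tmp);
        System.out.println(offsets.length + " " + offsets[2]); // prints "3 1061982"
        Files.delete(tmp);
    }
}
```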
-
Remove Util's tempDir on exit too by Caio 5 years ago
-
Cleanup temporary directories after tests 💬 by Caio 5 years ago
This patch uses JUnit's `@TempDir` to replace usages of `Files.createTempDirectory(Path)` where possible. There's still one unmanaged directory in the Util class that I'll have to manually add exit handlers for if I care enough.
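A sketch of the swap, assuming JUnit 5 (class, test, and path names hypothetical):

```java
import java.io.IOException;
import java.nio.file.Path;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class IndexerTest {
    // JUnit 5 injects a fresh directory per test and deletes it afterwards,
    // replacing manual Files.createTempDirectory(...) cleanup bookkeeping.
    @TempDir
    Path tempDir;

    @Test
    void writesIndexToDisk() throws IOException {
        Path indexPath = tempDir.resolve("index");
        // ... exercise the code under test against indexPath ...
    }
}
```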
-
Drop RecipeMetadataDatabase.findAllById(List<Long>) 💬 by Caio 5 years ago
No point in supporting this at the moment.
-
SDB: Add some basic tests by Caio 5 years ago
-
Organize the new code a little bit 💬 by Caio 5 years ago
Wrapping up for today, I'll pick the tests up next round.
-
Drop ChronicleMap 💬 by Caio 5 years ago
o/
-
WIP: Simple database to replace ChronicleMap 💬 by Caio 5 years ago
The main reason I'm using ChronicleMap is because I wanted an easy Map interface with persistence and off-heap memory. Getting rid of ChronicleMap reduces complexity, increases performance (though that is negligible: db get()s are far from being the bottleneck) and allows me to push Jdk12 forward.

This patch implements a working alternative which can be simplified to a flat file with one recipe serialized (as a flatbuffer) after another. In order to speed up loading I also write an extra file which contains the total number of recipes and the recipe_id-to-offset associations. The offset lookup table is kept in heap, backed by HPPC's primitive collections.

Very little validation is done and the code is totally susceptible to bad input attacks, but I have full control of it, so :shrug:

A trivial benchmark such as:

```
public class MyBenchmark {

    @State(Scope.Benchmark)
    public static class MyState {
        RecipeMetadataDatabase chronicle;
        RecipeMetadataDatabase sdb;
        long[] ids =
            new long[] {
                289492, 707192, 1061982, 1708006, 1659287, 1653257, 901573, 1557621, 1639379
            };

        public MyState() {
            var cerberusPath = System.getProperty("cerberus");
            var sdbPath = System.getProperty("sdb");
            this.chronicle = ChronicleRecipeMetadataDatabase.open(Path.of(cerberusPath));
            this.sdb = new SimpleRecipeMetadataDatabase(Path.of(sdbPath));
        }
    }

    public MyBenchmark() {}

    private void check(RecipeMetadataDatabase db, long[] ids) {
        for (long id : ids) {
            var recipe = db.findById(id);
            assert recipe.isPresent();
            if (recipe.get().getRecipeId() != id) {
                throw new RuntimeException("oof!");
            }
        }
    }

    @Benchmark
    public void getChronicle(MyState state) {
        check(state.chronicle, state.ids);
    }

    @Benchmark
    public void getSdb(MyState state) {
        check(state.sdb, state.ids);
    }
}
```

Shows that at least things aren't broken. Yet.

> Benchmark                 Mode  Cnt        Score       Error  Units
> MyBenchmark.getChronicle  thrpt    5    62327.352 ±   214.437  ops/s
> MyBenchmark.getSdb        thrpt    5  2697423.234 ± 28008.573  ops/s
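A hypothetical, stripped-down version of the scheme described above (all names invented; the real code serializes flatbuffers and persists the offset table to disk): one flat data file of length-prefixed records, with an in-heap id-to-offset map for lookups.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class FlatFileDb {
    private final FileChannel data;
    // In-heap lookup table: recipe id -> byte offset in the data file.
    private final Map<Long, Long> idToOffset = new HashMap<>();

    FlatFileDb(FileChannel data) {
        this.data = data;
    }

    // Appends one record as [int length][payload], remembering its offset.
    void put(long id, byte[] payload) throws IOException {
        long offset = data.size();
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length);
        buf.putInt(payload.length).put(payload).flip();
        data.write(buf, offset);
        idToOffset.put(id, offset);
    }

    // Reads a record back by id via its remembered offset. A sketch: assumes
    // each positional read fills the buffer in one call (fine for local files).
    byte[] get(long id) throws IOException {
        long offset = idToOffset.get(id);
        ByteBuffer len = ByteBuffer.allocate(Integer.BYTES);
        data.read(len, offset);
        len.flip();
        ByteBuffer payload = ByteBuffer.allocate(len.getInt());
        data.read(payload, offset + Integer.BYTES);
        return payload.array();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("recipes", ".db");
        try (FileChannel ch =
                FileChannel.open(tmp, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            FlatFileDb db = new FlatFileDb(ch);
            db.put(289492L, "carrot cake".getBytes(StandardCharsets.UTF_8));
            db.put(707192L, "lentil soup".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(db.get(289492L), StandardCharsets.UTF_8));
            // prints "carrot cake"
        }
        Files.delete(tmp);
    }
}
```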
-
Search: Rewrite the query before using it 💬 by Caio 5 years ago
The Lucene query derived from SearchQuery is used for both count() and search(); each call triggers an `IndexSearcher.rewrite(Query)` call, so this patch reduces the duplicated work by rewriting the query before usage. Note that this does not mean `rewrite(Query)` won't be called again, just that subsequent calls will be cheaper.
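A sketch of the pattern against Lucene's `IndexSearcher` API (`searcher`, `parsedQuery` and `maxResults` assumed to be in scope):

```java
// Rewrite once up front; the rewrite() calls inside count() and
// search() then see an already-rewritten query and return quickly.
Query rewritten = searcher.rewrite(parsedQuery);
int total = searcher.count(rewritten);
TopDocs hits = searcher.search(rewritten, maxResults);
```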
-
SearchQuery: Expose a few helpful derived methods by Caio 5 years ago
-
Start allowing empty SearchQuery as an input 💬 by Caio 5 years ago
This patch doesn't do much by itself, but when using a searcher with a policy, one will now be able to inspect the parsed query, see that it is empty, and react however it wants (in my current case: rewrite the query as a MatchAllDocsQuery).
-
Introduce SearchPolicy.rewriteParsedSimilarityQuery 💬 by Caio 5 years ago
Works similarly to rewriteParsedFulltextQuery, but for similarity queries.
-
Remove maxDocFreq restriction from moreLikeThis 💬 by Caio 5 years ago
I'm forgoing these performance-related things in favor of search policy logic, so that direct cerberus usage doesn't need to be influenced by production performance tunings.
-
Remove moreLikeThis StopWords setup 💬 by Caio 5 years ago
This set is the same one used by the analyzer, so the tokens will never exist in the stream.
-
Add similar_ids to the model 💬 by Caio 5 years ago
Precomputed list of similar recipe ids for the generalized case.
-
Drop unused index fields 💬 by Caio 5 years ago
These are unused since 7244fb1319e5fe6ee03c151a4d102b3399917267
-
Make SearchPolicy allow rewriting queries 💬 by Caio 5 years ago
This patch changes the query inspection interface to return a Query object, thus allowing for conditional rewrites client-side.
-
Only query FULL_RECIPE when searching 💬 by Caio 5 years ago
This simplifies things for some more complicated policy logic I'd like to implement. I'll leave query smartness for the graph model and generalized logic instead of queryparser magic.
-
Tune MoreLikeThis query building parameters 💬 by Caio 5 years ago
Filter stop words, use a real threshold for doc frequency instead of a percentage.
-
Drop support for `findSimilar(long, int)` 💬 by Caio 5 years ago
While I'm not particularly bothered by the index size increase that this feature requires, it's simpler and a lot more efficient to just pre-compute the similarities and persist in the db.
-
Reference getFieldNameForDiet directly 💬 by Caio 5 years ago
We already do a wildcard static import of the IndexField class
-
Ensure `findSimilar(long, int)` respects maxResults 💬 by Caio 5 years ago
The only way this can happen with a MoreLikeThis-based implementation is if we somehow end up with several distinct recipes with exactly the same content, so this test is more for my future self when trying to make this faster.
-
Expose findSimilar(recipeId, maxResults) 💬 by Caio 5 years ago
This patch exposes a `Searcher.findSimilar(long, int)` API so that we can make use of already indexed data to find similar recipes to a given (known) recipe id. It's effectively a more efficient `Searcher.findSimilar(String, int)` for when we already know the recipe we're querying for similarities.
-
Store term vectors in the FULL_RECIPE index by Caio 5 years ago
-
Add support for searching for a recipeId 💬 by Caio 5 years ago
This patch adds a new LongPoint field to the index to allow us to query based on the recipeId. This is useful for functionality that requires the (out of my control) document id that Lucene generates.
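A sketch of both sides of the change, assuming Lucene and a hypothetical `recipe_id` field name:

```java
// Index side: add the recipe id as a LongPoint so it becomes queryable.
doc.add(new LongPoint("recipe_id", recipeId));

// Query side: exact-match lookup, e.g. to recover the Lucene-internal doc id.
Query byId = LongPoint.newExactQuery("recipe_id", recipeId);
TopDocs result = searcher.search(byId, 1);
```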