Log
-
Release tique-0.3.0 by Caio 4 years ago
-
Support for conversion into weighted queries by Caio 4 years ago
-
Upgrade to tantivy 0.12 by Caio 4 years ago
-
Benchmark keyword-based similarity 💬 by Caio 4 years ago
This patch introduces the `check_sim` command, which is what I'm using to explore the indexing quality. In a gist, for every recipe in the index it:

1. Extracts the top 20 keywords
2. Tries to find 10+1 recipes similar to it
3. Measures how many of the found recipes are also in the "canonical" `Recipe.similar_recipe_ids`

The command simply dumps a csv to STDOUT and I run analysis over it. This is an example of the output:

> 1721097,cornbread;dressing;salsa;salad;lettuce,11,5,10,0
> 862391,salmon;fillet;fillets;baked;piece,11,1,10,0
> 1206600,mousse;peppers;partly;bruised;yogurt,11,0,10,0
> 944326,watercress;oranges;sectioned;navel;tough,11,0,10,0
> 243820,roast;freeze;barbecue;cooker;freezer,11,5,10,0

And a breakdown of the first line:

> 1721097: The recipe id
> cornbread;dressing;salsa;salad;lettuce: top 5 keywords
> 11: Number of similar recipes we found (Step 2)
> 5: 5 of the found neighbors were also in the `similar_recipe_ids` set
> 10: `Recipe.similar_recipe_ids.len()`
> 0: Did we manage to find $self in the 10+1 nearest neighbors? At which index?
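
For readers who want to analyze this output themselves, here is a minimal sketch of parsing one row and computing the share of canonical neighbors that were recovered. The `SimRow` struct, its field names, and `parse_row` are hypothetical and not part of the patch. The last column is kept as a raw integer because the log does not spell out how "not found" is encoded.

```rust
/// One row of the CSV printed by `check_sim`.
/// Struct and field names are hypothetical, chosen for illustration.
#[derive(Debug, PartialEq)]
struct SimRow {
    recipe_id: u64,
    keywords: Vec<String>,
    num_found: usize,
    num_overlap: usize,
    num_canonical: usize,
    /// Raw value of the last column; its exact semantics are not assumed here.
    self_position: i64,
}

/// Parses a single line; returns `None` if a column is missing,
/// malformed, or if there are extra trailing columns.
fn parse_row(line: &str) -> Option<SimRow> {
    let mut cols = line.trim().split(',');
    let row = SimRow {
        recipe_id: cols.next()?.parse().ok()?,
        keywords: cols.next()?.split(';').map(str::to_owned).collect(),
        num_found: cols.next()?.parse().ok()?,
        num_overlap: cols.next()?.parse().ok()?,
        num_canonical: cols.next()?.parse().ok()?,
        self_position: cols.next()?.parse().ok()?,
    };
    if cols.next().is_some() {
        return None;
    }
    Some(row)
}

impl SimRow {
    /// Fraction of `similar_recipe_ids` that also showed up in the found set.
    /// `None` when the canonical set is empty.
    fn overlap_ratio(&self) -> Option<f64> {
        if self.num_canonical == 0 {
            None
        } else {
            Some(self.num_overlap as f64 / self.num_canonical as f64)
        }
    }
}
```

Feeding the first example line through `parse_row` and calling `overlap_ratio()` gives the 5-out-of-10 ratio described in the breakdown.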
-
Expose indexed ids via `DatabaseReader.ids()` by Caio 4 years ago
-
Allow iterating over sorted (by relevance) Terms 💬 by Caio 4 years ago
Knowing the ordered sequence of most relevant terms is very useful and `limit` is unlikely to be a number which makes the `into_sorted_vec` step prohibitive, so this patch simply makes Keywords hold a sorted Vec instead of a BinaryHeap.
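
A rough sketch of the trade-off, assuming integer scores and a hypothetical `top_terms` helper (this is not the actual `Keywords` code): a bounded heap selects the best `limit` entries, and a single `into_sorted_vec` call at the end yields a Vec that can be iterated in relevance order. Since the heap never holds more than `limit + 1` items, the final sort stays cheap for small limits.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keeps the `limit` highest-scored terms and returns them
/// ordered from most to least relevant.
/// Hypothetical helper; scores are integers to keep `Ord` simple.
fn top_terms(
    scored: impl IntoIterator<Item = (String, u32)>,
    limit: usize,
) -> Vec<(String, u32)> {
    // `Reverse` turns the max-heap into a min-heap, so `pop`
    // evicts the lowest-scored entry once we exceed `limit`.
    let mut heap: BinaryHeap<Reverse<(u32, String)>> =
        BinaryHeap::with_capacity(limit + 1);

    for (term, score) in scored {
        heap.push(Reverse((score, term)));
        if heap.len() > limit {
            heap.pop();
        }
    }

    // Ascending order over `Reverse<T>` is descending order over `T`,
    // so the first element is the highest-scored term.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse((score, term))| (term, score))
        .collect()
}
```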
-
Update Cargo.lock by Caio 4 years ago
-
Prepare for 0.2.0 release by Caio 4 years ago
-
Regenerate README 💬 by Caio 4 years ago
`cargo readme > README.markdown`
-
Document `tique::topterms` by Caio 4 years ago
-
Ensure fields are `text` with frequencies by Caio 4 years ago
-
Swap `visit(score, doc)` with `visit(doc, score)` 💬 by Caio 4 years ago
Aha! I made it backwards to make it easier to output consistently. The consistency part makes sense, but driving a container with score before the item being contained was too confusing.
-
Initial TopTerms implementation 💬 by Caio 4 years ago
TopTerms reads the index and extracts the most relevant terms in a given document or any arbitrary text input. You can use it to build keywords for your documents or, more interestingly, use the result as a query to find similar documents. It's pretty much a reimplementation of Lucene's MoreLikeThis. I don't particularly like this approach in prod (too many knobs, dependency on the index to formulate a query), but it yields pretty good results with little effort. Ref: https://lucene.apache.org/core/8_4_1/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
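
To illustrate the general idea, here is a minimal sketch of keyword extraction with a textbook tf-idf weighting. It is not the `TopTerms` implementation and does not claim to match Lucene's exact formula. The function name, the whitespace tokenizer, and the in-memory `doc_freq` map (standing in for the index statistics) are all assumptions.

```rust
use std::collections::HashMap;

/// Scores each known term of `text` by `tf * idf` and returns the
/// `limit` best ones, highest score first.
/// `doc_freq` maps a term to the number of documents containing it;
/// terms missing from it are ignored, mirroring the dependency on the index.
fn top_terms_tfidf(
    text: &str,
    doc_freq: &HashMap<&str, u64>,
    num_docs: u64,
    limit: usize,
) -> Vec<(String, f64)> {
    // Term frequency within the input text (naive whitespace tokenizer).
    let mut term_freq: HashMap<String, u64> = HashMap::new();
    for token in text.split_whitespace() {
        *term_freq.entry(token.to_lowercase()).or_insert(0) += 1;
    }

    let mut scored: Vec<(String, f64)> = term_freq
        .into_iter()
        .filter_map(|(term, tf)| {
            let df = *doc_freq.get(term.as_str())?;
            // Rare terms get a larger weight than common ones.
            let idf = 1.0 + (num_docs as f64 / (df as f64 + 1.0)).ln();
            Some((term, tf as f64 * idf))
        })
        .collect();

    // Highest score first; ties broken alphabetically for determinism.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.truncate(limit);
    scored
}
```

The resulting terms can then be joined into a disjunctive query to look for similar documents, which is the "use the result as a query" idea mentioned above.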
-
Expose the topk module internally by Caio 4 years ago
-
Cleanly report code generation errors by Caio 4 years ago
-
Make the README badges clickable by Caio 4 years ago
-
Integrate with the new `cantine_derive` traits by Caio 4 years ago
-
Update Cargo.lock by Caio 4 years ago
-
Move Filterable::Schema::try_from out of TryFrom by Caio 4 years ago
-
Use less clash-able names for the generated types by Caio 4 years ago
-
Map Filter-related stuff into the Filterable trait 💬 by Caio 4 years ago
I'm quite unhappy about the `<Type as Filterable>::Query` thinger to disambiguate the very unambiguous associated type, but I don't think I'll be able to get rid of it.
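
For context, a small sketch of why the qualified path shows up. The trait shape and the `Recipe`, `RecipeFilterQuery`, and `QueryOf` names below are hypothetical stand-ins, not the real `cantine_derive` definitions. In type position, stable Rust rejects `Recipe::Query` as an ambiguous associated type (E0223), so the `<Type as Trait>::Assoc` form is required; a type alias can at least hide the noise.

```rust
/// Hypothetical, simplified stand-in for the real trait.
trait Filterable {
    type Query: Default;
}

struct Recipe;

#[derive(Debug, Default, PartialEq)]
struct RecipeFilterQuery {
    max_calories: Option<u32>,
}

impl Filterable for Recipe {
    type Query = RecipeFilterQuery;
}

// Writing `-> Recipe::Query` here does not compile on stable Rust
// (error E0223), hence the fully qualified path.
fn empty_query() -> <Recipe as Filterable>::Query {
    Default::default()
}

/// An alias keeps call sites shorter without changing the trait.
type QueryOf<T> = <T as Filterable>::Query;
```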
-
Rename `Feature*` to `Aggregable*` 💬 by Caio 4 years ago
I'm still missing something about lifetimes (perhaps just practice?) for I can't seem to be able to rephrase all this into something that would allow both owned and borrowed types; I mean, without inverting the control and making the API terribly annoying to use...
-
Simplify FeatureCollector type 💬 by Caio 4 years ago
This patch adds a `Feature` restriction to the `FeatureCollector` struct to make things simpler. A smaller turbofish in public methods is always nice.
-
Use a simpler Feature trait 💬 by Caio 4 years ago
Flattening it all solves the "how to expose the generated structs" problem and simplifies the code by a *lot*. Besides, the previous incantation with support for multiple query types was cute and all, but useless in practice: a custom query means implementing business logic, and the main purpose of this thing is avoiding that.
-
More realistic AggregationQuery integration test by Caio 4 years ago
-
Split FilterQuery and AggregationQuery proc macros by Caio 4 years ago
-
Stop generating custom collector code 💬 by Caio 4 years ago
It works! My only gripe is that this new API loses the ability to use an embedded database (say, a hashmap) to read the metadata since it requires static lifetimes everywhere. I'm focusing on bytes fast fields, so requiring ownership is fine for now. It would be nice to implement `Feature<Query> for &Input` and use the same collector code tho, so we'll see...
-
Make FeatureCollector public by Caio 4 years ago
-
Extract FeatureForSegment trait by Caio 4 years ago
-
Extract FeatureForDoc trait by Caio 4 years ago