-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
context: detect if README is required #63199
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.
It feels like we should consistently match the top-level README. Do you plan to follow-up with a fix? Maybe boosting exact matches on filename higher in Zoekt? Brainstorming ...
--- a/score.go
+++ b/score.go
@@ -118,8 +118,13 @@ func calculateTermFrequency(cands []*candidateMatch, df termDocumentFrequency) m
termFreqs := map[string]int{}
for _, cand := range cands {
term := string(cand.substrLowered)
+ length := uint32(len(cand.substrLowered))
if cand.fileName {
- termFreqs[term] += 5
+ if cand.byteOffset == 0 && cand.byteMatchSz == length {
+ termFreqs[term] += 5
+ } else {
+ termFreqs[term] += 2
+ }
} else {
termFreqs[term]++
}
@@ -1113,7 +1113,7 @@ func TestNewPlanJob(t *testing.T) { | |||
`), | |||
}, | |||
{ | |||
query: `context:global repo:sourcegraph/.* what's going on'? lang:go`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a Marvin Gaye fan? :)
"explain", | ||
} | ||
|
||
func needsReadmeContext(input string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be really nice to also match queries containing only the repo name, like
- "what does batch-change-buddy/jtest do?"
- "what does go-ctags do?"
Do you think it'd be possible to pull out the repo names and also match against these? They would count as "project signifiers".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a regex to match project names on a best-effort basis. I added a couple of tests to check corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we pull out the repo names being searched and match against those? They are available in the query string.
This approach could be noisy and would also miss things like "what does go-ctags do?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I don't understand exactly what you mean with "Couldn't we pull out the repo names being searched and match against those?". I don't think we can reliably identify "go-ctags" as repository unless the user has explicitly selected it as context. We could check the DB but that is quite expensive.
I will remove the repo logic for now and follow-up once we chatted a bit more about this.
Yes, definitely. Your suggestion looks promising. I will take a look. I wonder whether we could add something that prioritizes shorter paths in general, not just exact matches. |
This adds a thin layer to detect entities in a query which are not explicity mentioned. As a first application, we detect if a README should probably be returned, following the approach in the client, see /vscode/src/chat/chat-view/context.ts. Test plan: added unit test
fdf33c2
to
1fdbf07
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Only comment: I feel like we should solidify the repo name matching, or just remove it entirely.
I'm leaning towards removing it, then in the client we can make sure to rewrite things like "What does @batch-change-buddy/jtest
do?" to "What does batch-change-buddy/jtest repo do?" I suggest that here: https://linear.app/sourcegraph/issue/SPLF-98/rewrite-mentioned-entities-to-natural-language
Cody context searches are always guaranteed to include |
This adds a thin layer to expand a query with keywords which are not explicitly mentioned. As a first application, we detect if a README should be returned. See /vscode/src/chat/chat-view/context.ts for how a similar logic is implemented in the client.
Test plan:
sourcegraph/sourcegraph
andsourcegraph/zoekt
as selected context. For Zoekt we return/README.md
in the top 3, which leads to a good answer. The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.Addresses SPLF-94