Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

context: detect if README is required #63199

Merged
merged 5 commits into from
Jun 14, 2024
Merged

context: detect if README is required #63199

merged 5 commits into from
Jun 14, 2024

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Jun 11, 2024

This adds a thin layer to expand a query with keywords which are not explicitly mentioned. As a first application, we detect if a README should be returned. See /vscode/src/chat/chat-view/context.ts for how a similar logic is implemented in the client.

Test plan:

  • new and updated unit tests
  • Cody web client: I tested the "what does this codebase do?" for sourcegraph/sourcegraph and sourcegraph/zoekt as selected context. For Zoekt we return /README.md in the top 3, which leads to a good answer. The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.

Addresses SPLF-94

@cla-bot cla-bot bot added the cla-signed label Jun 11, 2024
@github-actions github-actions bot added team/product-platform team/search-platform Issues owned by the search platform team labels Jun 11, 2024
Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.

It feels like we should consistently match the top-level README. Do you plan to follow-up with a fix? Maybe boosting exact matches on filename higher in Zoekt? Brainstorming ...

--- a/score.go
+++ b/score.go
@@ -118,8 +118,13 @@ func calculateTermFrequency(cands []*candidateMatch, df termDocumentFrequency) m
        termFreqs := map[string]int{}
        for _, cand := range cands {
                term := string(cand.substrLowered)
+               length := uint32(len(cand.substrLowered))
                if cand.fileName {
-                       termFreqs[term] += 5
+                       if cand.byteOffset == 0 && cand.byteMatchSz == length {
+                               termFreqs[term] += 5
+                       } else {
+                               termFreqs[term] += 2
+                       }
                } else {
                        termFreqs[term]++
                }

@@ -1113,7 +1113,7 @@ func TestNewPlanJob(t *testing.T) {
`),
},
{
query: `context:global repo:sourcegraph/.* what's going on'? lang:go`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a Marvin Gaye fan? :)

internal/search/codycontext/query_parser.go Outdated Show resolved Hide resolved
internal/search/codycontext/query_parser.go Outdated Show resolved Hide resolved
"explain",
}

func needsReadmeContext(input string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be really nice to also match queries containing only the repo name, like

  • "what does batch-change-buddy/jtest do?"
  • "what does go-ctags do?"

Do you think it'd be possible to pull out the repo names and also match against these? They would count as "project signifiers".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a regex to match project names on a best-effort basis. I added a couple of tests to check corner cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we pull out the repo names being searched and match against those? They are available in the query string.

This approach could be noisy and would also miss things like "what does go-ctags do?"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand exactly what you mean with "Couldn't we pull out the repo names being searched and match against those?". I don't think we can reliably identify "go-ctags" as repository unless the user has explicitly selected it as context. We could check the DB but that is quite expensive.

I will remove the repo logic for now and follow-up once we chatted a bit more about this.

@stefanhengl
Copy link
Member Author

stefanhengl commented Jun 12, 2024

It feels like we should consistently match the top-level README. Do you plan to follow-up with a fix?

Yes, definitely. Your suggestion looks promising. I will take a look. I wonder whether we could add something that prioritizes shorter paths in general, not just exact matches.

This adds a thin layer to detect entities in a query which are not
explicity mentioned. As a first application, we detect if a README
should probably be returned, following the approach in the client, see
/vscode/src/chat/chat-view/context.ts.

Test plan:
added unit test
Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Only comment: I feel like we should solidify the repo name matching, or just remove it entirely.

I'm leaning towards removing it, then in the client we can make sure to rewrite things like "What does @batch-change-buddy/jtest do?" to "What does batch-change-buddy/jtest repo do?" I suggest that here: https://linear.app/sourcegraph/issue/SPLF-98/rewrite-mentioned-entities-to-natural-language

@stefanhengl stefanhengl merged commit 92373c3 into main Jun 14, 2024
13 checks passed
@stefanhengl stefanhengl deleted the sh/context/readme branch June 14, 2024 08:30
@jtibshirani
Copy link
Member

I guess I don't understand exactly what you mean with "Couldn't we pull out the repo names being searched and match against those?"

Cody context searches are always guaranteed to include repo filters, so I was thinking we could check for those and pull out the repo names. But I am +1 that we removed it, I like the "rewrite the initial query" option better (https://linear.app/sourcegraph/issue/SPLF-98/rewrite-mentioned-entities-to-natural-language).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed team/product-platform team/search-platform Issues owned by the search platform team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants