feat: Refactor repositories download contents#4153
stevehipwell wants to merge 7 commits into google:master
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4153      +/-   ##
==========================================
- Coverage   93.83%   93.68%   -0.15%
==========================================
  Files         209      210       +1
  Lines       19685    19695      +10
==========================================
- Hits        18472    18452      -20
- Misses       1015     1047      +32
+ Partials      198      196       -2
|
|
@gmlewis can we get this merged? |
gmlewis left a comment
I'm quite concerned about this PR because it appears to me that the behavior of following redirects has been deleted and there are many unit tests that have also simply been deleted without comment or explanation. One of the great things about unit tests is that when major refactors are performed like this one, if the unit tests are left alone we can easily detect regressions. As it is in this PR, however, where a major refactor happens and unit tests are also heavily refactored and/or deleted, it is hard to tell what is actually happening.
Can this be broken down into 3 PRs?
- Update the openapi_operations.yaml file - I'll do that myself momentarily.
- Refactor the download methods without modifying unit tests
- Refactor and/or delete unit tests
|
@gmlewis let me take a look, but the main problem here is that the tests appear to be tightly coupled to the implementation, with mocks designed to make the tests pass rather than to mirror the actual API. I'll add the deleted tests back, but the mocks will need to be refactored to add the schema-required download_url to the content payload. On a slight tangent, shouldn't the mock payloads be validated against the schema? |
Yes, they probably should. I don't remember when GitHub v3 API docs started sharing schemas for endpoints, but it is possible that these were written prior to that. I think my biggest concern is following redirects because I remember a bunch of issues devoted solely to this topic, and to my shock and disappointment, I don't see any of the unit tests actually testing out following redirects and I could have sworn that it took a good deal of effort to get those unit tests to pass at one point. :-( |
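To make the idea of checking mock payloads against the schema concrete, here is a minimal sketch in Go. It only checks that required keys are present in a mock JSON object. The names missingFields and requiredFileFields are hypothetical, and the field list is an illustrative assumption; the authoritative list of required fields would have to come from GitHub's OpenAPI description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requiredFileFields is an illustrative subset chosen for this sketch.
// It is an assumption, not the schema's real required list.
var requiredFileFields = []string{"type", "name", "path", "sha", "download_url"}

// missingFields reports which required keys are absent from a JSON object payload.
func missingFields(payload []byte, required []string) ([]string, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(payload, &obj); err != nil {
		return nil, fmt.Errorf("payload is not a JSON object: %w", err)
	}
	var missing []string
	for _, key := range required {
		if _, ok := obj[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}

func main() {
	// A mock written only to satisfy a test, without download_url.
	mock := []byte(`{"type":"file","name":"f","path":"f","sha":"abc"}`)
	missing, err := missingFields(mock, requiredFileFields)
	fmt.Println(missing, err)
}
```

A check like this could run over every mock payload in a test helper, so a mock that drifts from the schema fails loudly instead of silently shaping the implementation.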
@gmlewis there are no redirects in the removed tests. The old code pattern was just ignoring the presence of
FYI, the following example snippet errors with the current code but passes with the updated code, because the last file requested is at an index greater than 1000 and is larger than 1 MB, so the API does not return its content.
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/google/go-github/v84/github"
)

// downloadContents downloads the contents of a file in a repository and returns it as a byte slice.
func downloadContents(ctx context.Context, client *github.Client, owner, repo, path, ref string) ([]byte, error) {
	rc, _, err := client.Repositories.DownloadContents(ctx, owner, repo, path, &github.RepositoryContentGetOptions{Ref: ref})
	if err != nil {
		return nil, err
	}
	defer rc.Close()

	by, err := io.ReadAll(rc)
	if err != nil {
		return nil, err
	}

	fmt.Printf("Downloaded %v/%v/%v as %d bytes\n", owner, repo, path, len(by))
	return by, nil
}

func main() {
	client := github.NewClient(nil)

	// Files to fetch; the last entry is the large file at a high directory index.
	t := []struct {
		owner string
		repo  string
		path  string
		ref   string
	}{
		{"google", "go-github", "README.md", "master"},
		{"github", "rest-api-description", "descriptions/api.github.com/api.github.com.2026-03-10.yaml", "main"},
		{"ScoopInstaller", "Main", "bucket/yq.json", "master"},
		{"stevehipwell", "scoop-main-bucket", "bucket/zzztest.bin", "test-content"},
	}

	for _, v := range t {
		if _, err := downloadContents(context.Background(), client, v.owner, v.repo, v.path, v.ref); err != nil {
			fmt.Printf("Error: %v\n", err)
			os.Exit(1)
		}
	}
}
|
@gmlewis I've added back the removed tests and undone some of the cosmetic changes, so the diff makes it clearer that none of the actual tests have changed (only the mocks have). I haven't rebased to fix the conflict yet in case you want to look at anything first. |
Thank you, @stevehipwell! |
Signed-off-by: Steve Hipwell <steve.hipwell@gmail.com>
Signed-off-by: Steve Hipwell <steve.hipwell@gmail.com>
19bc2aa to d25a1ae
|
@gmlewis I've rebased this and it should be good to go. @alexandear I've updated the example to be closer to the other patterns and to have a valid comment. |
	return nil, fileContent, resp, err
}

for _, contents := range dirContents {
After closer inspection, the docs here:
https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-repository-content
say that contents from a repo directory can be downloaded with this endpoint.
Before, I said I was concerned about losing the functionality of following redirects, specifically in these lines 204-220. However, this code is not following redirects, it is downloading the contents of a directory.
Are we losing that capability in this PR?
I'm wondering why there are no unit tests that exercise the ability to download the contents from a repo directory?
I'm not sure that I follow your concern. The code in lines 204-220 is only triggered when a file is larger than 1 MB or the input is invalid (a dir, not a file). For files larger than 1 MB, the updated code uses the download link already returned instead of making an additional API call and iterating through all of the dir files. For invalid input, the updated code errors early, while this code runs all the way to the end and then errors.
Shall I add a test to show this behaviour? As there wasn't already a test and you asked for the tests to be aligned, I didn't add one when I spotted earlier that it was missing.
I've added tests for calling a directory to show that it errors.
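The behaviour described above (use inline content when present, fall back to the returned download link for large files, and error early for a directory) could be sketched roughly as follows. This is not the PR's code: fileContent, resolveContent and errNotAFile are hypothetical names, and the payload shapes in main are assumed for illustration.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// fileContent is a hypothetical, trimmed-down view of a contents response.
type fileContent struct {
	Type        string `json:"type"`
	Encoding    string `json:"encoding"`
	Content     string `json:"content"`
	DownloadURL string `json:"download_url"`
}

var errNotAFile = errors.New("path does not point to a file")

// resolveContent returns the inline bytes when the response carries them,
// otherwise the download URL that the caller should fetch next.
func resolveContent(body []byte) ([]byte, string, error) {
	// A directory comes back as a JSON array, so reject it before decoding.
	if bytes.HasPrefix(bytes.TrimSpace(body), []byte("[")) {
		return nil, "", errNotAFile
	}
	var fc fileContent
	if err := json.Unmarshal(body, &fc); err != nil {
		return nil, "", err
	}
	if fc.Type != "file" {
		return nil, "", errNotAFile
	}
	// Small files: the content is embedded and base64-encoded.
	if fc.Encoding == "base64" && fc.Content != "" {
		data, err := base64.StdEncoding.DecodeString(fc.Content)
		return data, "", err
	}
	// Large files: no inline content, so fall back to the download link.
	if fc.DownloadURL == "" {
		return nil, "", errors.New("response has neither content nor download_url")
	}
	return nil, fc.DownloadURL, nil
}

func main() {
	small := []byte(`{"type":"file","encoding":"base64","content":"aGVsbG8="}`)
	data, _, err := resolveContent(small)
	fmt.Println(string(data), err)

	// Assumed shape of a large-file response: empty content plus a download link.
	large := []byte(`{"type":"file","content":"","download_url":"https://example.invalid/big.bin"}`)
	_, url, err := resolveContent(large)
	fmt.Println(url, err)

	_, _, err = resolveContent([]byte(`[{"type":"file","name":"a"}]`))
	fmt.Println(err)
}
```

The point of the sketch is that a single response is enough to decide what to do, so no second listing call is needed.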
If I'm reading https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-repository-content correctly, the provided link can point to a repo directory and ALL the contents of that directory will be downloaded. Am I reading that wrong? Or are you saying that even though the docs claim this feature, it doesn't actually work?
I don't have time to investigate this myself at the moment, so any insight you can provide would help tremendously.
I've investigated, and when calling the endpoint on a directory, any child directories we get back have an empty download link. If you can download a whole directory, you probably need to use the raw content type.
Also, I don't see that in the description. Where are you seeing it?
Some context:
- Add DownloadContentsWithMeta to receive RepositoryContent #1810
- Improve DownloadContents and DownloadContentsWithMeta methods #3573
Also the unit tests that are modified in this PR show that a list could historically be returned which represented the names of items within a directory.
When I'm off my phone I'll look at the official docs again and quote the part that I'm concerned about.
The API will return a list if you ask for the parent dir contents; my point here is that it's unnecessary.
The first PR you link above just copies the download function and also returns the metadata. The second PR adds a check for the content in the initial API call.
AFAIK the content API has always returned the download URL for a file, so the dir call and loop have always been unnecessary. Remember that both calls go to the same API, and I can't believe that even GitHub would skip the download URL in the specific response and make you make a second call whose response is also limited.
OK, I'm trying to write an example that lists the contents in a directory, and I'm not getting it to work.
Here are the paragraphs that concern me:
Gets the contents of a file or directory in a repository. Specify the file path or directory with the path parameter. If you omit the path parameter, you will receive the contents of the repository's root directory.
application/vnd.github.object+json: Returns the contents in a consistent object format regardless of the content type. For example, instead of an array of objects for a directory, the response will be an object with an entries attribute containing the array of objects.
If the content is a directory, the response will be an array of objects, one object for each item in the directory. When listing the contents of a directory, submodules have their "type" specified as "file". Logically, the value should be "submodule". This behavior exists for backwards compatibility purposes. In the next major version of the API, the type will be returned as "submodule".
Before we rip out functionality that someone might miss, though, I would like another set of eyes on this.
@alexandear - what are your thoughts about ripping out the for loops that are being removed in this PR?
Will anyone miss them?
If I'm reading @stevehipwell's arguments correctly, he is saying that they never actually did anything, although we have a hint of proof that at one point they did something because he had to remove parts of the unit tests (that contained objects with arrays) to get tests to pass... so that is another one of my concerns.
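For reference, the two directory representations quoted from the docs above (a bare array by default, or an object with an entries attribute under application/vnd.github.object+json) could both be decoded with something like the sketch below. The entry type and listEntries function are hypothetical and carry only two illustrative fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry is a hypothetical, minimal view of one item in a directory listing.
type entry struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// listEntries accepts either directory representation and returns its entries.
func listEntries(body []byte) ([]entry, error) {
	// Default media type: a bare array of objects.
	var asArray []entry
	if err := json.Unmarshal(body, &asArray); err == nil {
		return asArray, nil
	}
	// object+json media type: an object with an "entries" attribute.
	var asObject struct {
		Entries []entry `json:"entries"`
	}
	if err := json.Unmarshal(body, &asObject); err != nil {
		return nil, fmt.Errorf("unrecognised directory payload: %w", err)
	}
	return asObject.Entries, nil
}

func main() {
	array := []byte(`[{"name":"a.txt","type":"file"},{"name":"sub","type":"dir"}]`)
	object := []byte(`{"type":"dir","entries":[{"name":"a.txt","type":"file"}]}`)

	fromArray, err := listEntries(array)
	fmt.Println(len(fromArray), err)

	fromObject, err := listEntries(object)
	fmt.Println(len(fromObject), err)
}
```

Either way, the result is a listing of the directory's items rather than their contents, which is the distinction under discussion here.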
@gmlewis I'm not saying they didn't do anything; I'm saying the implementation was inefficient and unnecessary. The mocks needed updating because they were implemented to make the tests pass.
So, from first principles: the new mocks actually match the API schema, and the new code functions correctly and mirrors the behaviour of the old code. The only difference in functionality is that the new code doesn't fail when getting the content of a file that's larger than 1 MB and at an index greater than 1000 in its directory.
OK, thank you, @stevehipwell. Sounds good to me.
I know @alexandear already approved, but let's please just wait for one more confirmation before merging.
Thank you for your patience with me! I appreciate it.
Signed-off-by: Steve Hipwell <steve.hipwell@gmail.com>
280f864 to 52d75be
This PR refactors the behaviour of DownloadContents & DownloadContentsWithMeta, with the former now being a direct passthrough to the latter as the only difference was the signature. The code has been refactored to use the API directly instead of via an unnecessary layer of indirection.
I've added an OpenAPI update to this PR as it proves that the updated code works against GitHub.
This change is required for #4151.
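As a rough illustration of the passthrough arrangement described above, the sketch below has one function do the work and return metadata, while the other simply discards the metadata. Everything here is hypothetical and simplified: download, downloadWithMeta, fileMeta and the in-memory store stand in for the real methods, whose signatures in go-github take more parameters and also return a response value.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// fileMeta is a hypothetical stand-in for the metadata returned with a file.
type fileMeta struct {
	Path string
	Size int
}

// store is an in-memory stand-in for the remote repository.
var store = map[string]string{"README.md": "# hello"}

// downloadWithMeta does the actual work and returns the body plus metadata.
func downloadWithMeta(path string) (io.ReadCloser, *fileMeta, error) {
	body, ok := store[path]
	if !ok {
		return nil, nil, errors.New("not found: " + path)
	}
	return io.NopCloser(strings.NewReader(body)), &fileMeta{Path: path, Size: len(body)}, nil
}

// download is a direct passthrough that drops the metadata.
func download(path string) (io.ReadCloser, error) {
	rc, _, err := downloadWithMeta(path)
	return rc, err
}

// readString drains and closes the reader.
func readString(rc io.ReadCloser) (string, error) {
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}

func main() {
	rc, err := download("README.md")
	if err != nil {
		fmt.Println(err)
		return
	}
	s, err := readString(rc)
	fmt.Println(s, err)
}
```

Keeping a single implementation means the two entry points cannot drift apart, which is the motivation given in the description.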