
Performance and memory issues cloning large repositories #447

Open
osklyar opened this issue Jun 21, 2017 · 2 comments


osklyar commented Jun 21, 2017

When cloning large repositories (large mostly in terms of occupied space, and to a lesser extent in terms of number of commits), go-git appears to use a different strategy than git, resulting in a massive memory footprint and very long clone times. Cloning a repo that unpacks to 1.5 GB and contains ca. 110k commits, go-git uses up to 5 GB of RAM and takes over 4 minutes, while git uses 290 MB and finishes in about 1 minute (tested with geat, which is based on go-git):

➜  date && geat clone git@gitserver:myrepo && date            
Mit Jun 21 14:17:28 CEST 2017
=> clone: myrepo cloned from git@gitserver:myrepo into origin
Mit Jun 21 14:21:42 CEST 2017
➜  du -s myrepo 
1463492	myrepo
➜  date && git clone git@gitserver:myrepo && date
Mit Jun 21 14:23:58 CEST 2017
Cloning into 'myrepo'...
remote: Counting objects: 974758, done.
remote: Compressing objects: 100% (167444/167444), done.
remote: Total 974758 (delta 798392), reused 973743 (delta 797550)
Receiving objects: 100% (974758/974758), 791.39 MiB | 67.73 MiB/s, done.
Resolving deltas: 100% (798392/798392), done.
Mit Jun 21 14:25:09 CEST 2017
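
For reference, the clone on the go-git side boils down to something like the following minimal sketch (geat is assumed to be a thin wrapper around the v4 PlainClone API; the URL is the same placeholder as above):

package main

import (
	"log"
	"os"

	git "gopkg.in/src-d/go-git.v4"
)

func main() {
	// Minimal full clone, equivalent to `git clone git@gitserver:myrepo myrepo`.
	_, err := git.PlainClone("myrepo", false, &git.CloneOptions{
		URL:      "git@gitserver:myrepo",
		Progress: os.Stderr, // stream the server's progress messages
	})
	if err != nil {
		log.Fatal(err)
	}
}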

Memory requirements scale more or less linearly with the number of commits and the repository size; in the smaller repo with quite a lot of commits below, go-git uses about 8x more memory than git. On the performance side, growth in repository size leads to much faster degradation: for the 1.5 GB repo the difference is about 4x, while for the roughly 10x smaller repo below, go-git and git take about the same time, with git taking approximately as long as it does for the 1.5 GB repo.

Cloning github.com:moby/moby, with 32k commits and 170 MB overall unpacked size, takes about the same 1m20s with both git and go-git. Memory-wise, go-git still loses: it peaks at 320 MB (2x the repo size) versus 45 MB (0.25x the repo size) for git:

➜  date && geat clone git@github.com:moby/moby && date
Mit Jun 21 13:32:03 CEST 2017
=> clone: moby cloned from git@github.com:moby/moby into origin
Mit Jun 21 13:32:38 CEST 2017
➜  date && git clone git@github.com:moby/moby && date
Mit Jun 21 13:33:05 CEST 2017
Cloning into 'moby'...
remote: Counting objects: 229544, done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 229544 (delta 23), reused 17 (delta 14), pack-reused 229495
Receiving objects: 100% (229544/229544), 127.47 MiB | 5.22 MiB/s, done.
Resolving deltas: 100% (152573/152573), done.
Mit Jun 21 13:33:35 CEST 2017

smola commented Jun 23, 2017

@osklyar Thanks for the report.
Given that we're approaching a stable release of v4, it's time to focus on performance and fix long-standing issues on that front. So we'll be working on this soon.

@imoverclocked

I have hit some performance issues during clone as well. My repository's .git dir is ~24 MB after git gc --aggressive; git repack -a -d, but cloning with go-git takes about 1m15s on my Core i7-based MacBook. Using the standard git tools, the same clone finishes in less than 1s. Watching the clone progress via:

git.CloneOptions{
        URL:           gitDir(repo.baseDir),
        ReferenceName: ref.Target(),
        SingleBranch:  true,
        Progress:      os.Stderr,
}

shows it getting through this output in ~10s:

Counting objects: 196522, done.
Compressing objects: 100% (41983/41983), done.
Total 196522 (delta 153594), reused 196483 (delta 153563)

but then the process drags on at ~130% CPU with no further output. I grabbed a 30-second pprof profile and generated a graph:

pprof001.pdf

It seems that a large amount of time is spent in seek syscalls, ultimately coming from the packfile code.
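
For anyone trying to reproduce this, here is a rough sketch of wrapping a comparable clone in a CPU profile. This uses runtime/pprof over the whole clone rather than the 30-second profile above, and the target path, URL and branch name are placeholders standing in for my local setup:

package main

import (
	"log"
	"os"
	"runtime/pprof"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
)

func main() {
	// Profile the entire clone; inspect afterwards with `go tool pprof clone.pprof`.
	f, err := os.Create("clone.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Placeholder values standing in for gitDir(repo.baseDir) and ref.Target() above.
	_, err = git.PlainClone("/tmp/clone-target", false, &git.CloneOptions{
		URL:           "/path/to/local/repo.git",
		ReferenceName: plumbing.NewBranchReferenceName("master"),
		SingleBranch:  true,
		Progress:      os.Stderr,
	})
	if err != nil {
		log.Fatal(err)
	}
}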
