Sometimes we accidentally commit large files or credential files to the git history. If the dirty commits are deep in the history, fixing them can be scary. But large files in the git history slow down git operations (e.g., switching branches), to the extent that working with the repository becomes troublesome.
There is no way to rewrite git history without changing the commit SHAs, because each commit SHA depends on its parent's SHA and on the committed files.
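To see why, here is a minimal sketch in a throwaway repository. The temporary directory, the demo identity, and the commit messages are made up for illustration. It creates two child commits with identical content, message, and dates, so the parent is the only thing that differs:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Fix identity and dates so that the parent is the only difference (demo values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE='1600000000 +0000' GIT_COMMITTER_DATE='1600000000 +0000'

echo hello > file.txt
git add file.txt
tree=$(git write-tree)

# Two different parents that share the same file tree.
parent_a=$(git commit-tree "$tree" -m "parent A")
parent_b=$(git commit-tree "$tree" -m "parent B")

# Two children that are identical except for their parent.
child_of_a=$(git commit-tree "$tree" -p "$parent_a" -m "child")
child_of_b=$(git commit-tree "$tree" -p "$parent_b" -m "child")

echo "child of A: $child_of_a"
echo "child of B: $child_of_b"
```

The two printed SHAs differ, which is why rewriting one commit forces a new SHA on every commit that comes after it.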
Overview
The following notes cover what I encountered when fixing the dirty git commits in the libyt project. The fix covers the root repository yt-project/libyt and the downstream fork cindytsai/libyt.
If you are following this, please read the whole thing before you make any changes to your repo.
ALWAYS MAKE A BACK UP!!!
Prerequisites and General Guidelines
Prerequisites:
- Have permission to directly push to the repo.
- Install bfg-repo-cleaner
General guidelines:
- The repo to be modified should not have open pull requests. Otherwise, merging them later would pull the old git history into our new git history again.
Steps
- Clone a Bare Repo and Back Up – Clone a bare repo and back up your repository.
- Update Git History – Update the git history using bfg.
- Force Push – Force push the updated git history so that GitHub now has the updated git history. This applies to all the branches in the repo.
- Update Downstream Forks – Update all the other branches in the downstream forks.
Step 1 – Clone a Bare Repo and Back Up
Clone a bare repo, which contains only the full git history without a working copy of the source files. We will modify the git history in this bare repo. To make a backup, just copy this bare repo.
git clone --mirror <git://something/your-repo.git>
If something goes wrong, push the backup bare repo back. This requires permission to force push directly to the original repository.
git push
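The clone-and-copy step can be sketched end to end as follows. This is only an illustration: the local `source` repository stands in for the real remote, and the directory names `your-repo.git` and `your-repo-backup.git` are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Throwaway stand-in for the real remote; in practice, clone your own URL.
git init -q source
git -C source -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Clone the bare mirror, then copy the whole directory as the backup.
git clone -q --mirror source your-repo.git
cp -R your-repo.git your-repo-backup.git

# Both copies should list exactly the same refs and SHAs.
original_refs=$(git -C your-repo.git for-each-ref)
backup_refs=$(git -C your-repo-backup.git for-each-ref)
if [ "$original_refs" = "$backup_refs" ]; then
  echo "backup matches the mirror clone"
else
  echo "backup differs from the mirror clone"
fi
```

Because the backup is a plain copy of the mirror clone, it keeps the same origin and mirror settings, so running git push inside the backup directory restores the original refs.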
Step 2 – Update Git History
We can use git log to check the git history of files or folders. This helps us identify what bfg will remove later on.
For example, this lists the git history of the files in the data folder:
# Assume we are now inside the bare repo folder that contains the git history to be modified.
git log --stat --all -- data
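If it is not yet clear which paths hold the large files, one option is to list the largest blobs in the whole history. The sketch below is a generic git recipe, not something taken from the libyt clean-up; the throwaway repository and the 200 kB file `data/big.bin` are made up so that the example can run on its own.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical content: one large file under data/ and one small file.
mkdir data
head -c 200000 /dev/zero > data/big.bin
echo "small" > README
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add data"

# List every object in the history, keep the blobs, and sort them by size.
# Note: paths containing spaces are cut at the first space in this sketch.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn |
  head -n 5)
echo "$largest"
```

Each output line is a size in bytes followed by a path, largest first. The same pipeline should also work inside the bare repo folder.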
Since the large files are all under data, we use bfg to remove them from the git history:
# Assume we are now inside the bare repo folder
bfg --delete-folders data
The commit SHA of every branch (here, main and valgrind) is modified. The output also marks the dirty commits D and the subsequent modified commits m in the git history tree.
Finally, delete the dirty history:
git reflog expire --expire=now --all && git gc --prune=now --aggressive
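After the clean-up, it can be reassuring to check whether any commit still touches the removed folder. Below is a minimal sketch of such a check. The helper `path_in_history` is hypothetical, and the throwaway repository only exists to make the example runnable. Since no clean-up is run in the demo, `data` is reported as still present.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

mkdir data
echo big > data/big.bin
echo code > code.txt
git add -A
git commit -q -m "add code and data"

# path_in_history PATH: succeeds if any commit reachable from any ref touches PATH.
path_in_history() {
  [ -n "$(git log --all --format=%H -- "$1")" ]
}

if path_in_history data; then data_status="still in history"; else data_status="gone"; fi
if path_in_history no-such-folder; then other_status="still in history"; else other_status="gone"; fi
echo "data: $data_status"
echo "no-such-folder: $other_status"
```

In the cleaned bare repo, we would expect the check for data to report "gone". One caveat, as far as I know: bfg protects the latest commit on HEAD by default, so files that still exist in that commit remain in the history.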
Step 3 – Force Push
After fixing the git history in the bare repo, force push back to the origin repo. This requires permission to directly push to the origin.
git push
It’s totally fine to see “remote rejected” as long as there are no currently open pull requests. The rejected refs just belong to already merged pull requests. See below.
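To confirm that the push went through for branches and tags, we can compare the local refs with what the remote reports. The sketch below simulates the whole round trip locally. The repositories `seed`, `remote.git`, and `rewrite.git` are hypothetical stand-ins, and the "rewrite" is faked by replacing the branch tip with an unrelated commit.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Throwaway stand-in for the GitHub repository.
git init -q seed
git -C seed commit -q --allow-empty -m "old history"
git clone -q --mirror seed remote.git

# Mirror clone in which the history gets rewritten.
git clone -q --mirror remote.git rewrite.git
cd rewrite.git
branch=$(git symbolic-ref HEAD)
tree=$(git rev-parse "HEAD^{tree}")

# Simulate a rewrite: point the branch at a commit with no shared history.
git update-ref "$branch" "$(git commit-tree "$tree" -m "new history")"

# In a mirror clone, a plain `git push` force-updates every ref on the remote.
git push -q

# Compare branches and tags on both sides.
local_refs=$(git for-each-ref --format='%(objectname) %(refname)' refs/heads refs/tags | sort)
remote_refs=$(git ls-remote --refs --heads --tags origin | tr '\t' ' ' | sort)
if [ "$local_refs" = "$remote_refs" ]; then
  echo "remote branches and tags match the rewritten history"
else
  echo "remote differs from the rewritten history"
fi
```

The comparison is limited to refs/heads and refs/tags on purpose, so that read-only pull request refs do not show up as differences.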
What about the releases and tags?
The commit SHAs attached to releases or tags will also be updated. The release tags remain the same; only the commit SHA they are bound to changes.
What if there are merged pull requests based on the old git history?
There is no way to update the PRs’ references, which is why we get “remote rejected”. They are read-only refs created by GitHub and may come from other repositories. It is OK to leave them as they are, since we already merged them.
See this post for more details.
Step 4 – Update Downstream Forks
By now, we have a clean git history in the upstream, but our fork still has the messed-up git history.
The big picture:
- Find the node where your fork starts to diverge compared to the new git history.
- Use cherry-pick or rebase to move the old commits that are attached to the old git history onto the new git history.
Finding the Diverge Node
This is what my fork (cindytsai/libyt) looks like after the upstream repo yt-project/libyt has updated its git history and pushed it to GitHub. Clearly, every git commit is duplicated: one copy is from the fixed git history, and the other is from the old dirty git history.
d920af is from the old dirty git history, and a0ac34 is from the fixed git history. bc129d is the commit attached to the old dirty git history.
We need to move bc129d to a0ac34 using either cherry-pick or rebase.
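When the history is long, spotting the diverge node by eye gets tedious. One heuristic, which is my own assumption rather than part of the original fix, is to match commits by author date and subject line, since a rewrite that only removes files normally keeps both. The helpers `make_commit` and `find_diverge_node` and the two small histories below are hypothetical.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# make_commit TREE UNIX_TIME SUBJECT [PARENT]: create a commit object for the demo.
make_commit() {
  GIT_AUTHOR_DATE="$2 +0000" GIT_COMMITTER_DATE="$2 +0000" \
    git commit-tree "$1" -m "$3" ${4:+-p "$4"}
}

# Clean tree (code only) and dirty tree (code plus data/).
echo code > code.txt
git add code.txt
clean_tree=$(git write-tree)
mkdir data
echo big > data/big.bin
git add data
dirty_tree=$(git write-tree)

# Old dirty history with one fork-only commit on top.
old1=$(make_commit "$dirty_tree" 1600000000 "first")
old2=$(make_commit "$dirty_tree" 1600000100 "second" "$old1")
old3=$(make_commit "$dirty_tree" 1600000200 "fork-only work" "$old2")

# Fixed history: same commits, rewritten without data/.
new1=$(make_commit "$clean_tree" 1600000000 "first")
new2=$(make_commit "$clean_tree" 1600000100 "second" "$new1")

# find_diverge_node TIP OTHER_TIP: newest commit reachable from TIP whose
# (author date, subject) pair also appears in the history of OTHER_TIP.
find_diverge_node() {
  git log --format='%H %at %s' "$1" | while read -r sha key; do
    if git log --format='%at %s' "$2" | grep -qxF -- "$key"; then
      echo "$sha"
      break
    fi
  done
}

old_node=$(find_diverge_node "$old3" "$new2")
new_node=$(find_diverge_node "$new2" "$old3")
echo "diverge node on the old history:   $old_node"
echo "diverge node on the fixed history: $new_node"
```

In the situation above, the first result would play the role of d920af and the second the role of a0ac34. The match is only a heuristic, so it is worth confirming the two commits by hand before moving anything.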
Cherry-Pick
First, we need to check out the diverge node (here a0ac34) on the fixed git history. Then we cherry-pick the commits one by one:
# Assume the diverge node of the fixed git history is checked out.
git cherry-pick bc129d
This applies commit bc129d on top of a0ac34. Finally, our local changes or branches are attached to the new fixed git history.
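When several commits sit on top of the old diverge node, cherry-pick also accepts a range, which saves picking them one by one. Here is a minimal sketch in a throwaway repository. The branch names `old-history` and `fixed-history` and all file names are made up for the demo.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Old dirty history: one shared commit (with data/) plus two fork-only commits.
git symbolic-ref HEAD refs/heads/old-history
echo v1 > code.txt
mkdir data
echo big > data/big.bin
git add -A
git commit -q -m "first"
old_node=$(git rev-parse HEAD)
echo a > fork-a.txt
git add -A
git commit -q -m "fork work A"
echo b > fork-b.txt
git add -A
git commit -q -m "fork work B"

# Fixed history: the same first commit, but without data/.
git checkout -q --orphan fixed-history
git rm -rqf .
echo v1 > code.txt
git add -A
git commit -q -m "first"

# Apply everything after the old diverge node on top of the fixed history.
git cherry-pick "$old_node..old-history" > /dev/null

files=$(git ls-tree -r --name-only HEAD | tr '\n' ' ')
subjects=$(git log --format=%s | tr '\n' ',')
echo "files:    $files"
echo "subjects: $subjects"
```

The range `A..B` means every commit reachable from B but not from A, so the diverge node itself is not picked again. The fixed branch ends up with both fork commits and without data/.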
Rebase while Keeping the Timestamp
We can use rebase to move commits from one branch to another. See this post.
If we care about the timestamp, we can also use this in addition to rebase.
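One possible way to do both at once is `git rebase --onto` together with `--committer-date-is-author-date`. This is my own assumption and not necessarily what the linked posts describe. The sketch below runs in a throwaway repository; the branch names, files, and the fixed author date are made up for the demo.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Old dirty history: one shared commit (with data/) plus one fork-only commit.
git symbolic-ref HEAD refs/heads/fork-branch
echo v1 > code.txt
mkdir data
echo big > data/big.bin
git add -A
git commit -q -m "first"
old_node=$(git rev-parse HEAD)
echo work > fork.txt
git add -A
# Give the fork-only commit a fixed author date so that it is easy to recognize.
GIT_AUTHOR_DATE='1600000000 +0000' git commit -q -m "fork work"

# Fixed history: the same first commit, but without data/.
git checkout -q --orphan fixed-history
git rm -rqf .
echo v1 > code.txt
git add -A
git commit -q -m "first"
new_node=$(git rev-parse HEAD)

# Replay the commits after old_node onto new_node and keep the timestamps.
git rebase -q --committer-date-is-author-date --onto "$new_node" "$old_node" fork-branch

author_ts=$(git log -1 --format=%at fork-branch)
committer_ts=$(git log -1 --format=%ct fork-branch)
parent=$(git rev-parse "fork-branch^")
echo "author date: $author_ts, committer date: $committer_ts"
```

Here `--onto "$new_node" "$old_node" fork-branch` replays the commits that come after the old diverge node onto the fixed one. Without the extra flag, rebase keeps the author date but stamps the commits with the current time as the committer date.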