
[Git] Remove accidentally committed large data files or secrets in git history

Sometimes we accidentally commit large files or credential files to the git history. If the dirty commits are deep in the history, fixing them can feel scary. But large files in the git history slow down git operations (e.g., switching branches), to the point that working with the repository becomes troublesome.

There is no way to update the git history without changing the commit SHAs, because each commit's SHA depends on its parent's SHA and on the committed files.

Overview

The following notes are the things I encountered when fixing the dirty git commits in the project libyt. The fix covers the root repository yt-project/libyt and the downstream fork cindytsai/libyt.

If you are following this, please read the whole thing before you make any changes to your repo.

ALWAYS MAKE A BACK UP!!!

Prerequisites and General Guidelines

Prerequisites:

General guidelines:

  • The repo to be modified should not have any open pull requests. Otherwise, merging them would pull the old git history back into the new git history.

Steps

  1. Clone a Bare Repo and Back Up – Clone a bare repo and back up your repository.
  2. Update Git History – Update the git history using bfg.
  3. Force Push – Force push the updated git history so that GitHub now has the updated git history. This applies to all the branches in the repo.
  4. Update Downstream Forks – Update all the other branches in the downstream forks.

Step 1 – Clone a Bare Repo and Back Up

Clone a bare repo, which contains only the full git history without the source code files. We will modify the git history in this bare repo. To make a backup, just copy the bare repo.

git clone --mirror <git://something/your-repo.git>
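The backup can be a plain directory copy of the mirror clone. Here is a minimal sketch of that idea. The names are hypothetical: `origin.git` stands in for the real remote and `your-repo.git` for the mirror clone. Everything happens in a throwaway directory, so it is safe to run as-is.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the remote: a local bare repo with a single commit (hypothetical).
git init --quiet src
git -C src commit --quiet --allow-empty -m "initial commit"
git clone --quiet --bare src origin.git

# The mirror clone whose history will be rewritten, and an untouched copy of it.
git clone --quiet --mirror origin.git your-repo.git
cp -r your-repo.git your-repo-backup.git

# Both copies should list exactly the same refs.
refs_work=$(git -C your-repo.git for-each-ref)
refs_backup=$(git -C your-repo-backup.git for-each-ref)
echo "$refs_backup"
```

Comparing the ref lists is a cheap sanity check that the copy is complete before any history is rewritten.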

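If something goes wrong, push the backup bare repo back to the remote. Because the repo was cloned with --mirror, a plain git push updates every ref on the remote. We need permission to force push directly to the original repository.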

git push

Step 2 – Update Git History

We can use git log to check the git history of files or folders. This helps us identify what bfg will remove later on.

For example, this lists the files’ git history in folder data:

# Assume we are now inside the bare repo folder that contains the git history to be modified.
git log --stat --all -- data
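If it is not obvious which paths hold the large files, listing the biggest blobs in the whole history can help. The sketch below is not part of the original notes. It builds a hypothetical toy repo, where `data/big.bin` is committed and later deleted, and then prints the largest blobs reachable from any ref.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy repo (hypothetical): a large file is committed under data/ and later deleted.
git init --quiet .
mkdir data
head -c 100000 /dev/zero > data/big.bin
echo "hello" > README
git add .
git commit --quiet -m "add data"
git rm --quiet -r data
git commit --quiet -m "remove data"

# List the ten largest blobs reachable from any ref as "<bytes> <path>".
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn | head -n 10)
echo "$largest"
```

The deleted file still shows up at the top of the list, which is exactly why deleting it in a later commit does not shrink the repository. The awk step prints only the first word of each path, so paths that contain spaces would be truncated.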

Since the large files are all under data, we use bfg to remove them from the git history:

# Assume we are now inside the bare repo folder
bfg --delete-folders data

bfg rewrites the commit SHA of every branch (here, main and valgrind). Its output also shows the git history tree, with the dirty commits marked D and the modified commits that follow them marked m.

[Figure: Update Git History]

Finally, expire the reflog and run garbage collection so that the dirty history is actually deleted:

git reflog expire --expire=now --all && git gc --prune=now --aggressive
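Before pushing, it is worth checking that no commit on any ref still touches the removed folder. The sketch below is one way to do that. `commits_touching` is a hypothetical helper and not a git command. It is shown on a toy repo that has not been cleaned, so the first count is non-zero. In a repo cleaned as above, the same call for data should print 0.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Print how many commits on any ref touch path $2 in the repo at $1.
commits_touching() {
  git -C "$1" log --all --format=%H -- "$2" | wc -l | tr -d ' '
}

# Toy repo (hypothetical), not yet cleaned: data/ is added and later deleted.
repo=$(mktemp -d)
git init --quiet "$repo"
mkdir "$repo/data"
echo big > "$repo/data/big.bin"
git -C "$repo" add .
git -C "$repo" commit --quiet -m "add data"
git -C "$repo" rm --quiet -r data
git -C "$repo" commit --quiet -m "remove data"

dirty=$(commits_touching "$repo" data)         # commits that still reference data/
clean=$(commits_touching "$repo" no-such-dir)  # what a cleaned path reports
echo "data: $dirty, no-such-dir: $clean"
```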

Step 3 – Force Push

After fixing the git history in the bare repo, force push back to the origin repo. This requires permission to directly push to the origin.

git push

[Figure: Force push with releases and tags]

It’s totally fine to see “remote rejected” as long as there is no currently open pull request. The rejected refs just belong to pull requests that were already merged. See below.

What about the releases and tags?

The commit SHAs attached to releases and tags are also updated. The release tags themselves stay the same; only the commit SHA each tag is bound to changes.

Before fixing the git history: [Figure: Before Fix]

After fixing the git history: [Figure: After Fix]

What if there is merged pull-request to the old git history?

There is no way to update a PR’s reference, which is why we get “remote rejected”. These references (refs/pull/*) are read-only refs created by GitHub and may come from other repositories. It is OK to leave them as they are, since those pull requests are already merged.

See this post for more details.
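If the “remote rejected” lines are distracting, one optional workaround is to delete the local copies of the pull-request refs from the mirror clone before pushing, so that the push never tries to update them. This is not part of the original notes. The sketch below simulates a mirror clone with a hypothetical bare repo holding one branch and one `refs/pull/1/head` ref.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy bare repo (hypothetical): one branch plus one simulated pull-request ref.
repo=$(mktemp -d)
git init --quiet --bare "$repo"
tree=$(git -C "$repo" mktree </dev/null)                      # the empty tree
commit=$(git -C "$repo" commit-tree -m "initial commit" "$tree")
git -C "$repo" update-ref refs/heads/main "$commit"
git -C "$repo" update-ref refs/pull/1/head "$commit"

# Delete every local refs/pull/* ref; branches and tags are left untouched.
git -C "$repo" for-each-ref --format='delete %(refname)' refs/pull |
  git -C "$repo" update-ref --stdin

remaining=$(git -C "$repo" for-each-ref --format='%(refname)')
echo "$remaining"
```

This only changes the local mirror clone. The pull-request refs on GitHub stay as they are.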

Step 4 – Update Downstream Forks

So far, we have a clean git history in the upstream, but our fork still has the messed-up git history.

The big picture:

  1. Find the node where your fork starts to diverge from the new git history.
  2. Use cherry-pick or rebase to move the commits that are attached to the old git history onto the new git history.

Finding the Diverge Node

This is what my fork (cindytsai/libyt) looks like after the upstream repo yt-project/libyt has updated its git history and pushed it to GitHub. Clearly, we can see that every git commit is duplicated. One copy is from the fixed git history, and the other is from the old dirty git history.

[Figure: Find the Diverge Node]

d920af is from the old dirty git history, and a0ac34 is its counterpart in the fixed git history. bc129d is the commit attached to the old dirty git history.

We need to move bc129d to a0ac34 either using cherry-pick or rebase.
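When the graph is too long to inspect by eye, the counterpart of an old commit can be looked up by matching metadata. This is a sketch and not from the original notes. It assumes the rewrite kept each commit's author date and subject line. `find_counterpart` is a hypothetical helper, and the toy branches `old` and `new` play the roles of d920af and a0ac34.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Print the commit reachable from $2 whose author date and subject match commit $1.
find_counterpart() {
  key=$(git log -1 --format='%at %s' "$1")
  git log --format='%H %at %s' "$2" | while read -r hash rest; do
    if [ "$rest" = "$key" ]; then echo "$hash"; fi
  done
}

# Old dirty history: the commit on "old" carries big.bin.
git init --quiet .
git checkout --quiet -b old
echo big > big.bin
echo code > code.txt
git add .
GIT_AUTHOR_DATE='1577836800 +0000' git commit --quiet -m "add code"

# Fixed history: the same commit without big.bin, so its SHA differs.
git checkout --quiet --orphan new
git rm --quiet -f big.bin
GIT_AUTHOR_DATE='1577836800 +0000' git commit --quiet -m "add code"

match=$(find_counterpart old new)
echo "old $(git rev-parse old) -> new $match"
```

If several commits share the same author date and subject, the helper prints all of them, so the result still needs a quick manual check.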

Cherry-Pick

First, we need to check out the diverge node (here a0ac34) on the fixed git history. Then we cherry-pick the commits one by one:

# Assume the diverge node of the fixed git history is checked out.
git cherry-pick bc129d

This applies commit bc129d on top of a0ac34. Finally, we have our local changes or branches in relation to the new fixed git history.
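When the fork has several commits on top of the old history, cherry-pick also accepts a range, which saves picking them one by one. The sketch below uses a hypothetical toy repo: `old` and `new` play the roles of d920af and a0ac34, and `feature` holds two fork commits on top of the old history.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init --quiet .

# Old dirty history: branch "old" carries big.bin.
git checkout --quiet -b old
echo big > big.bin
echo code > code.txt
git add .
git commit --quiet -m "base"

# Fixed history: branch "new" is the same commit without big.bin.
git checkout --quiet --orphan new
git rm --quiet -f big.bin
git commit --quiet -m "base"

# Fork commits that still sit on top of the old history.
git checkout --quiet -b feature old
echo one > one.txt
git add one.txt
git commit --quiet -m "fork commit 1"
echo two > two.txt
git add two.txt
git commit --quiet -m "fork commit 2"

# Replay every commit after "old" up to "feature" on top of the fixed history.
git checkout --quiet -b feature-new new
git cherry-pick old..feature >/dev/null

count=$(git rev-list --count feature-new)
files=$(git ls-tree -r --name-only feature-new | tr '\n' ' ')
echo "$count commits, files: $files"
```

The range `old..feature` means "commits reachable from feature but not from old", so the dirty base commit itself is never replayed.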

Rebase while Keeping the Timestamp

We can use rebase to move commits from one branch to another. See this post.

If we care about the timestamp, we can also use this in addition to rebase.
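The links above did not survive, so the following is only an assumption about what such a rebase could look like. `git rebase --onto` replays the fork commits on the fixed history, and `--committer-date-is-author-date` makes the committer date equal to the original author date (rebase already keeps the author date on its own). The branch names belong to a hypothetical toy repo, with `old` and `new` again playing the roles of d920af and a0ac34.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init --quiet .

# Old dirty history with big.bin, and the fixed history without it.
git checkout --quiet -b old
echo big > big.bin
echo code > code.txt
git add .
git commit --quiet -m "base"
git checkout --quiet --orphan new
git rm --quiet -f big.bin
git commit --quiet -m "base"

# One fork commit on the old history, authored at a fixed past date.
git checkout --quiet -b feature old
echo one > one.txt
git add one.txt
GIT_AUTHOR_DATE='1577836800 +0000' git commit --quiet -m "fork commit"

# Move the commits in old..feature onto "new", reusing the author date as committer date.
git rebase --quiet --committer-date-is-author-date --onto new old feature

dates=$(git log -1 --format='%at %ct' feature)   # author and committer timestamps
leaks=$(git log feature --format=%H -- big.bin)  # commits on feature touching big.bin
echo "dates: $dates"
```

Unlike the cherry-pick approach, this moves the existing branch instead of creating a new one, so keep a backup of the branch if the old commits might still be needed.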


This post is licensed under CC BY 4.0 by the author.