Migrating Subversion to Git

Wondering how to migrate a Subversion (SVN) repository to Git? If so, this is the blog for you. We will cover the procedure and various issues you may run into along the way. The steps discussed here are essentially a summary of Bitbucket’s step-by-step migration procedure and rely on some tools/scripts provided by Bitbucket (BB). In some situations, however, an SVN repository needs additional preparation before the BB steps will “just work”. That is discussed towards the end of this article.

Procedure

Download Bitbucket’s helper utility (a Java jar file, svn-migration-scripts.jar)
Verify that you have all the necessary command line tools installed and in your path by running: java -jar svn-migration-scripts.jar verify
If errors are returned, install the necessary tools (e.g. git).

NOTE: The migration must be done on a case-sensitive file system. Linux is fine, Windows is out, and OSX is doable. On OSX, you can use the BB scripts to create and mount a case-sensitive disk image by calling: java -jar svn-migration-scripts.jar create-disk-image <X> GitMigration, where <X> is the size of the disk image in gigabytes (make sure it’s big enough).

Extract Author Information

To convert from SVN “usernames” to Git “full names and emails” using the BB scripts, run: java -jar svn-migration-scripts.jar authors <repo-url> > authors.txt
Manually edit the authors file to tell Git how to map the names. The initial authors file will look like:
dan = dan <dan@mycompany.com>
dave = dave <dave@mycompany.com>
Edit the file so it looks like this instead:
dan = Daniel Santos <dan@219design.com>
dave = Dave Bim-Merle <dave@219design.com>
Save the new file
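Before moving on, it can help to check the edited file mechanically. The following is a minimal sketch, not part of the BB scripts: the function names `unedited_authors` and `malformed_authors` are made up for this example, and the pattern assumes the simple `username = Full Name <email>` form shown above.

```shell
# List SVN usernames whose Git name is still identical to the username,
# i.e. entries that were probably not edited yet.
unedited_authors() {
    awk -F' = ' 'NF == 2 { name = $2; sub(/ <.*$/, "", name); if (name == $1) print $1 }' "$1"
}

# Print every non-empty line that does not look like
#   username = Full Name <email@domain>
# Prints nothing when all lines are well-formed.
malformed_authors() {
    grep -Ev '^[^=]+ = [^<>]+ <[^<>@ ]+@[^<>@ ]+>$' "$1" | grep -v '^$'
    return 0
}
```

Running `unedited_authors authors.txt` and `malformed_authors authors.txt` should both print nothing once every entry has been mapped.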

Convert the Repository

If your SVN repository is using the standard SVN layout of “trunk/branches/tags”, then run the following: git svn clone --stdlayout --prefix="" --authors-file=authors.txt <svn-repo-url> <git-repo-name>
If your SVN repository is not using the standard layout, then you can pass in the specific repo paths to the trunk, branches, and tags folders. This might be the case when your trunk/branches/tags are in a “master” repo that has multiple subprojects. Then you can do: git svn clone --prefix="" --authors-file=authors.txt --trunk=<path-to-trunk> --branches=<path-to-branches1> --branches=<path-to-branches2> --tags=<path-to-tags> <svn-repo-url> <git-repo-name>

NOTE: You can have multiple branch paths specified.

NOTE: This differs from the instructions on BB, which do not specify an empty prefix. Git changed its default behavior between 1.x and 2.x. In 1.x, an svn clone used an empty prefix by default. In 2.x, the default prefix is “origin/”. This does 2 things. First, it breaks the later BB tools for converting SVN branches to Git branches (although there is technically a different way to fix that problem). Second, it adds “origin/” to the name of every branch and tag once they are converted to Git branches and tags. Whether or not the “origin/” prefix is desirable is up to you, but I decided I would rather not have the extra clutter.

NOTE: Remember to put the git repo in a case-sensitive file system (see above)

Convert Branches and Tags

The git utilities turn your SVN branches and tags into git remote branches and tags pointing to a remote with the name specified in the prefix (see above). This is useful if you continue to maintain the SVN repository, but less useful for a 1-way conversion, where the prefix just adds clutter. That is why I dropped it. The BB utilities will help you convert the remote branches and tags (pointing back to SVN) into local branches and tags that look like they were made in Git to begin with.

Using the BB scripts, run: java -Dfile.encoding=utf-8 -jar svn-migration-scripts.jar clean-git
This will do a dry run of the conversion and show you what it is about to do.
When ready (this is a 1-way destructive process after which you cannot support both SVN and Git together), run: java -Dfile.encoding=utf-8 -jar svn-migration-scripts.jar clean-git --force
This will do the conversion.

NOTE: If you didn’t specify an empty prefix in the git svn clone step, the script will first create all of your branches and then, at the end, delete every branch it created. It also seems to mangle the tags. You can get around the deletion by passing the “--no-delete” option after the force option, but this still doesn’t fix the issues with tags. Also, the conversion will keep all branches ever made, so you’ll have to go into Git and manually delete the branches that had already been merged and deleted in the SVN repository.

Dealing With Very Non-Standard SVN Layout

The above steps should work fine for both standard SVN layouts and simple, non-standard layouts where the location of the trunk, branches, and tags folders never changed. If you ever changed, moved, or renamed the trunk, branches, or tags folders, you are going to have serious issues with the conversion. At this point, it is a good idea to consider whether you really need to maintain the history, or whether it would be ok to start a fresh git repository from the latest SVN version as if it were the first commit into git. However, if you really want to maintain history, there is a way to do it, described below.

How To Maintain History

Basically, you need to “massage” your old SVN repository data into a new SVN repository structure that looks as if you started with the standard trunk, branches, and tags layout and never changed it. The basic outline goes as follows:

Create an SVN dump file
Modify that file manually (or preferably with scripts so you can save work/progress)
Recreate a new SVN repo from the dump file
1. An SVN dump file is a human-readable (and editable) file that contains all the information necessary to recreate an SVN repo exactly (if you want to).
2. Each commit has the meta-data as human-readable for adding/deleting/renaming directories (and other stuff too).
3. You can fix trunk, branches, and tags modifications (moves, renames, etc.) by editing the commits that did those changes. It’s easiest to do this on the actual SVN server, at least the dump and recreate portions.

Create the Initial Dumpfile

Start by creating the initial SVN dump file. On the SVN server run: svnadmin dump <path-to-repo> > <dumpfile-name>
This will dump the entire repo to the dump file. However, if the stuff you want is in subprojects of a master repo, then you will only want to filter for those subdirectories. You can do that with: svnadmin dump <path-to-repo> | svndumpfilter include <single-quoted-list-of-subdirectories-in-the-repo> > <dumpfile-name>

When you use an include filter, the dump will end up with empty revisions for every commit that didn’t affect your project. It’s generally better to drop those and renumber the rest. The catch is that any references to specific SVN revision numbers (e.g. in commit messages or documentation) will no longer match. I would still recommend it if you can live without them.

To do that: svnadmin dump <path-to-repo> | svndumpfilter include <list> --drop-empty-revs --renumber-revs > <dumpfile-name>
Also, you may run into issues with missing merge sources if you moved your branches around, so the full recommended command to run is: svnadmin dump <path-to-repo> | svndumpfilter include <list> --drop-empty-revs --renumber-revs --skip-missing-merge-sources > <dumpfile-name>
This will create the starting point dumpfile that you will then start editing
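A quick sanity check on the result is to look at the revision numbers the dumpfile actually records. The sketch below is only an illustration: the function names are made up, and it assumes no versioned file content happens to contain a line starting with “Revision-number: ”.

```shell
# Print the revision numbers recorded in an SVN dumpfile, one per line.
# -a makes grep treat the dump as text even if it holds binary file content.
dump_revisions() {
    LC_ALL=C grep -a '^Revision-number: ' "$1" | cut -d' ' -f2
}

# Succeed only if the revisions form an unbroken ascending sequence,
# which is what you would expect after renumbering.
revisions_are_contiguous() {
    dump_revisions "$1" |
        awk 'NR > 1 && $1 != prev + 1 { bad = 1 } { prev = $1 } END { exit bad }'
}
```

For example, `dump_revisions <dumpfile-name> | tail -n 1` shows the last revision in the dump, and `revisions_are_contiguous <dumpfile-name> && echo ok` confirms there are no gaps.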

Massaging the Dumpfile

First, make a backup of the starting dumpfile. You will likely need to make multiple attempts at correcting all the issues you want and creating the dumpfile can sometimes take a long time, depending on the size of your repository.

Next, start making a script that is going to use the Linux/Unix “sed” command to find and replace text in the dumpfile that will correct things so it looks like your repository always just had the plain old trunk, branches, and tags structure.

There are 2 types of lines we need to start replacing in the SVN dumpfile:

Type 1: lines that start with “Node-path: <x>”
Type 2: lines that start with “Node-copyfrom-path: <x>”

The first type is related to lines in the dumpfile that are basically telling SVN where the operation is occurring.

An example of how this might get used in the dump file is:

Node-path: trunk
Node-kind: dir
Node-action: add

A section like this is what you would see for an initial commit that creates the trunk folder in the root directory structure.

The second type of line (the copyfrom) occurs when you move a file or folder in SVN. You might see something like:

Node-path: trunk/folder1/file1.h
Node-kind: file
Node-action: add
Node-copyfrom-rev: X
Node-copyfrom-path: trunk/folder2/file1.h

A section like this is what you would see when you moved file1.h from folder2 to folder1. The SVN dumpfile keeps track of where and at what revision the file came from.
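To see which top-level folders a dumpfile actually touches (and therefore which paths need rewriting), you can summarize both kinds of header lines. This is a rough sketch with a made-up function name. It assumes folder names without “/” tricks in file content, meaning that a versioned file containing a line that starts with “Node-path: ” would be counted too.

```shell
# Count how often each top-level folder appears in the Node-path and
# Node-copyfrom-path headers of a dumpfile, most frequent first.
dump_top_level_paths() {
    LC_ALL=C grep -aE '^Node-(copyfrom-)?path: ' "$1" |
        sed -E 's#^Node-(copyfrom-)?path: ##; s#/.*$##' |
        sort | uniq -c | sort -rn
}
```

Running `dump_top_level_paths <dumpfile>` before and after each editing pass is a cheap way to confirm that only trunk, branches, and tags remain at the top level.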

Let’s start with a “trivial” case as an example

Let’s say that you have a project that you want to migrate that is in a subfolder of a “master” repository. In this particular case, you could just use the built-in “git svn clone” options to specify the paths to the trunk, branches, and tags, but let’s edit the dumpfile instead to achieve the same thing. Let’s suppose that in your repository you have the following structure:

Project
- trunk
- branches
- tags

You’ve already created a dumpfile that only includes these folders (see above). Now we want to edit the dumpfile so it looks like none of the commits ever knew about the top-level folder.

We do that by doing the following in the script file. The sed expressions use “|” as the delimiter because the paths themselves contain “/”, and “^” so that only lines that start with the header are touched:

sed -i 's|^Node-path: Project/trunk|Node-path: trunk|' <dumpfile>

sed -i 's|^Node-copyfrom-path: Project/trunk|Node-copyfrom-path: trunk|' <dumpfile>

sed -i 's|^Node-path: Project/branches|Node-path: branches|' <dumpfile>

sed -i 's|^Node-copyfrom-path: Project/branches|Node-copyfrom-path: branches|' <dumpfile>

sed -i 's|^Node-path: Project/tags|Node-path: tags|' <dumpfile>

sed -i 's|^Node-copyfrom-path: Project/tags|Node-copyfrom-path: tags|' <dumpfile>

What we are doing is replacing the paths to every single folder/file in every single commit in the dumpfile with a new path that strips the first folder out. We need to also include the “copyfrom” versions so that when things are moved, it matches the new directory structures we are creating.
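The same replacements can also be sketched as a single function that reads the dump on stdin and writes the result to stdout, so the original file is never modified in place. The function name `strip_project_prefix` is made up for this example, and the prefix is used as a regular expression, so this assumes a plain folder name without special characters. As with the commands above, a versioned file whose content contains a line starting with “Node-path: Project/trunk” would be altered as well, which would break its recorded length and checksum.

```shell
# Strip a leading "<prefix>/" from trunk, branches and tags paths in
# Node-path and Node-copyfrom-path headers.
# The trailing (/|$) keeps names such as "Project/trunkish" untouched.
strip_project_prefix() {
    LC_ALL=C sed -E "s#^(Node-(copyfrom-)?path): $1/(trunk|branches|tags)(/|\$)#\1: \3\4#"
}
```

Usage would look like `strip_project_prefix Project < original.dump > cleaned.dump`, which leaves the backup of the original dump intact for the next attempt.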

Fixing Non-trivial Issues

That was a relatively trivial example, but you can fix far less trivial issues too. Let’s say you actually started with a root trunk/branches/tags and then you moved it into a sub-folder. So this:

trunk
branches
tags

Got converted to this with some folder SVN-moves:

Project
- trunk
- branches
- tags

This is where things start to get complicated. Using “git svn clone” you cannot specify that the trunk got moved around, so you must fix it in the dumpfile first. The good news is that you can do this. The script to do this is actually the same as the script above for the trivial case. However, there are now some additional tweaks we need to make to the dumpfile.

At the revision where you moved these 3 folders (let’s assume you did it all in one revision), the dumpfile records the move as copyfrom operations. After we run the cleaning script, there is going to be some stuff that looks like:
Node-path: trunk
Node-kind: dir
Node-action: add
Node-copyfrom-rev: X
Node-copyfrom-path: trunk
Because we stripped off the “Project” top-level folder, it’s going to look like we copied everything from trunk back into trunk. Now, it turns out that is ok and SVN can handle that in a dump file. The problem is that the same commit also contains the deletion of the top-level trunk folder, in a section that looks like:
Node-path: trunk
Node-kind: dir
Node-action: delete
That is exactly what we did in the original history. We need to go in and manually delete the deletion part. We could also delete the copyfrom part, and if we did all of that, we would end up with a commit that has a log message but does absolutely nothing, which is ok. But we don’t have to delete the copyfrom part if we don’t want to.
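In a large dumpfile, finding the deletion by hand can be tedious. The sketch below only locates candidates and does not edit anything. The function name `find_deletes` is made up, and it assumes the “Node-action: delete” header belongs to the most recent “Node-path:” header above it, which is how node records are laid out.

```shell
# Print "revision:line" for every node that deletes exactly the given path.
# "line" is the line number of the node's Node-path header in the dumpfile.
find_deletes() {
    LC_ALL=C awk -v target="Node-path: $1" '
        /^Revision-number: / { rev = $2 }
        /^Node-path: /       { path = $0; line = NR }
        /^Node-action: delete$/ && path == target { print rev ":" line }
    ' "$2"
}
```

For example, `find_deletes trunk <dumpfile>` lists where the top-level trunk folder is deleted, so you can jump straight to those lines in an editor.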

Things can get even more complicated than this as well. Imagine if you moved a file to a location that was filtered out in the original dump because it’s not part of the project you want to recreate. But then the file got moved back into the project that you do want to recreate. In those cases, you actually need to create a dump that contains the parts that you don’t (eventually) want, at least for now. One suggestion is to move those folder locations using the tricks above to a temporary folder in the new “trunk”, as in “trunk/svn-git-migration/”, which is a folder you will delete later. This will allow you to preserve the file moves during the recreation of the repo. After you make the new repo (see below), you can then SVN-delete the temporary folder right before you convert to git.

Using These Tricks to Deal With Externals

You can also use these tricks to deal with externals. This section describes how to handle the case where you moved some stuff around in a “master” repo to share it between some projects. Git has submodules and subtrees if you want to preserve an SVN “external” as its own repo, but let’s suppose that instead you want to make it look like the contents of the external never moved around at all, so you can create one new repo with all the history of the externals too. Let’s suppose you started with:

Project1
- trunk
- branches
- tags

Then, you took some stuff from Project1 and started moving it to a shared section in your repo (to share with Project2) and added an external in your trunk to pull it back into Project1. So now your repo looks like:

Project1
- trunk
  - Library
- branches
- tags
Library
- trunk
- branches
- tags

You can use the same path replacement techniques to make it look like all you did was reorganize some files that were in other areas of “trunk” into a new subfolder you named “Library”. Just do the following with some sed commands:

sed -i 's|^Node-path: Library/trunk|Node-path: trunk/Library|' <dumpfile>

sed -i 's|^Node-copyfrom-path: Library/trunk|Node-copyfrom-path: trunk/Library|' <dumpfile>

There is one other thing that you need to do though. Since “trunk/Library” is created via an external, the actual “Library” folder is never created in the dumpfile. So when recreating the repo, it’s going to fail when it tries to move stuff into the Library folder because it doesn’t exist yet. You will need to go into the dumpfile and manually “add” the directory in an appropriate existing commit (probably the one right before you need to use it). This is similar to how we had to manually delete some parts of the dumpfile, as discussed above. You can do this with something like:

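Node-path: trunk/Library
Node-kind: dir
Node-action: add
Prop-content-length: 10
Content-length: 10

PROPS-END

The 10 bytes declared by both length headers are the line “PROPS-END” plus its newline, which marks an empty property list.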
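If you need to add several directories this way, generating the node text avoids typos in the length headers. This is a minimal sketch with a made-up function name. It emits a directory with no properties, where the body is the line “PROPS-END” plus its newline (exactly the 10 bytes that both length headers declare).

```shell
# Emit a dumpfile node that adds an empty directory with no properties.
# The two trailing blank lines separate this node from whatever follows.
dir_add_node() {
    printf 'Node-path: %s\nNode-kind: dir\nNode-action: add\nProp-content-length: 10\nContent-length: 10\n\nPROPS-END\n\n\n' "$1"
}
```

You could run `dir_add_node trunk/Library > node.txt` and paste the contents of node.txt into the chosen revision of the dumpfile.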

You generally only need to do this for the “trunk”. However, this may not completely preserve the state of the Library in any branches you created, since you probably pinned the external to a specific revision of Library in those branches. You could (if you want) continue using these tricks to fix it all up.

The Last Part

Things can also get tricky sometimes regarding the order of the path replacement operations. For example, consider a folder structure that evolved as such:

Project

Project-trunk
Project-branches
Project-tags

Project
- trunk
- branches
- tags

If you first try to replace “Project” with “trunk”, it will also turn “Project-trunk” into “trunk-trunk”. That will screw up future replacement operations. So, generally speaking, do the replacement operations from most-specific to least-specific. In this example, the order might be:

Project/trunk → trunk
Project/branches → branches
Project/tags → tags
Project-trunk → trunk
Project-branches → branches
Project-tags → tags
Project → trunk
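The ordering above can be sketched as one sed invocation, since sed applies its -e expressions to each line in the order they are given. The function name is made up for this example, and the folder names are the ones from the example layout above. Each rule is anchored to the header and to a full path component, so a line rewritten by an earlier rule no longer matches a later one.

```shell
# Rewrite Node-path / Node-copyfrom-path headers, most specific rule first.
# Reads a dump on stdin and writes the rewritten dump to stdout.
rewrite_most_specific_first() {
    LC_ALL=C sed -E \
        -e 's#^(Node-(copyfrom-)?path): Project/(trunk|branches|tags)(/|$)#\1: \3\4#' \
        -e 's#^(Node-(copyfrom-)?path): Project-(trunk|branches|tags)(/|$)#\1: \3\4#' \
        -e 's#^(Node-(copyfrom-)?path): Project(/|$)#\1: trunk\3#'
}
```

With the rules in this order, “Project-trunk/a.c” becomes “trunk/a.c” rather than “trunk-trunk/a.c”, and only paths from the earliest layout fall through to the final “Project” rule.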

The End

Happy Migrating!