Migrating Subversion to Git | 219 Design

Migrating Subversion to Git

Dan Santos
08/08/2015

Wondering how to migrate a Subversion (SVN) repository to a Git repository? If so, this is the post for you. We will cover the migration itself and various issues you may encounter along the way. The steps discussed here summarize Bitbucket’s step-by-step migration procedure and rely on some tools/scripts provided by Bitbucket (BB). Sometimes the SVN repository needs additional preparation first; that is discussed at the end of this article.

Procedure

  1. Download BitBucket’s helper utility (Java jar)
  2. Verify that you have all the necessary command line tools installed and in your path by running: java -jar svn-migration-scripts.jar verify
  3. If errors are returned, install the necessary tools (e.g. git).
    1. NOTE: the migration must be done on a case-sensitive file system. Linux is fine, Windows is out, and OSX is do-able. For OSX, you can use the BB scripts to mount a case-sensitive file system by calling: java -jar svn-migration-scripts.jar create-disk-image <X> GitMigration
    2. Here <X> is the size of the disk image to create, in gigabytes (make sure it’s big enough).
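If you are not sure whether a given directory is on a case-sensitive file system, a quick probe can tell you before you start. The following is a minimal sketch, not part of the BB scripts; the function name `check_case_sensitive` is made up for illustration. It creates a lowercase file in a throwaway directory and checks whether the uppercase spelling resolves to it.

```shell
#!/bin/sh
# Report whether the directory given as $1 (default: current directory)
# sits on a case-sensitive file system.
check_case_sensitive() {
    dir="${1:-.}"
    # Throwaway probe directory inside the directory being tested.
    probe=$(mktemp -d "$dir/case-probe.XXXXXX") || return 2
    touch "$probe/casetest"
    # On a case-insensitive file system the uppercase name finds the same file.
    if [ -e "$probe/CASETEST" ]; then
        result="case-insensitive"
    else
        result="case-sensitive"
    fi
    rm -rf "$probe"
    echo "$result"
}

check_case_sensitive .
```

Run it from the directory (or mounted disk image) where you plan to put the Git repository.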

Extract Author Information

  1. To convert from SVN “usernames” to Git “full names and emails” using the BB scripts, run: java -jar svn-migration-scripts.jar authors <repo url> > authors.txt
  2. Manually edit the authors file to tell git how to map the names. The initial authors file will look like:
    1. dan = dan <[email protected]>
    2. dave = dave <[email protected]>
  3. Edit the file so it looks like this instead:
    1. dan = Daniel Santos <[email protected]>
    2. dave = Dave Bim-Merle <[email protected]>
  4. Save the new file
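With many committers it is easy to miss an entry while editing. The sketch below is one possible sanity check, assuming each line should have the shape `svnuser = Full Name <email>`. The helper name `check_authors` and the sample addresses at example.com are hypothetical. It flags lines that do not match that shape, and lines where the full name is still identical to the SVN username.

```shell
#!/bin/sh
# Hypothetical sample; a real file comes from the "authors" command above.
workdir=$(mktemp -d)
cat > "$workdir/authors.txt" <<'EOF'
dan = Daniel Santos <dan@example.com>
dave = dave <dave@example.com>
broken line without mapping
EOF

# Print entries that look malformed or unedited (name equals the SVN username).
check_authors() {
    awk '
        # Expected shape: svnuser = Full Name <email>
        !/^[^=]+ = .+ <[^<>]+>$/ { print "malformed: " $0; next }
        {
            user = $0; sub(/ = .*/, "", user)
            name = $0; sub(/^[^=]+ = /, "", name); sub(/ <[^<>]+>$/, "", name)
            if (user == name) print "unedited: " $0
        }
    ' "$1"
}

check_authors "$workdir/authors.txt"
```

An empty result means every line has the expected shape and a name that differs from the username.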

Convert the Repository

  1. If your SVN repository is using the standard SVN layout of “trunk/branches/tags”, then run the following: git svn clone --stdlayout --prefix="" --authors-file=authors.txt <svn repo url> <git repo name>
  2. If your SVN repository is not using the standard layout, then you can pass in the specific repo paths to the trunk, branches, and tags folders. This might be the case when your trunk/branches/tags are in a “master” repo that has multiple subprojects. Then you can do: git svn clone --prefix="" --authors-file=authors.txt --trunk=<path-to-trunk> --branches=<path-to-branches1> --branches=<path-to-branches2> --tags=<path-to-tags> <svn repo url> <git repo name>

NOTE: you can have multiple branch paths specified.

NOTE: this differs from the instructions on BB (as of 2018-05-31). The instructions on BB do not include specifying the prefix as empty. Git changed its default behavior from 1.x to 2.x. In 1.x the default when doing a git svn clone was an empty prefix. In 2.x the default is to add the prefix “origin/”. This does 2 things. First, it screws up later BB tools for converting SVN branches to Git branches (although there is technically a different way to fix that). Second, it adds the prefix “origin/” to the names of all of your branches and tags once they are converted to Git branches and tags. That may be desirable in some instances, since it will remind you that those branches and tags came from SVN, but it also adds unnecessary decoration if your true goal is to switch to Git permanently. Using an empty prefix is the recommended (Dan Santos) approach.

NOTE: remember to put the git repo in a case-sensitive file system (see above)

Convert Branches and Tags

The git utilities won’t actually convert your SVN branches and tags to true git branches and tags. The BB utilities will help you do that though.

  1. To do a dry run that shows what the conversion is about to do, use the BB scripts and run: java -Dfile.encoding=utf-8 -jar svn-migration-scripts.jar clean-git
  2. When ready (this is a one-way, destructive process after which you cannot support both SVN and Git together), run the actual conversion: java -Dfile.encoding=utf-8 -jar svn-migration-scripts.jar clean-git --force

NOTE: If you didn’t specify an empty prefix in the git svn clone step, then you will see all your branches converted/created first, and then at the end the script will delete all of the created branches. It also seems to screw up the tags. You can get around this by passing the “--no-delete” option after the force option. However, as of this writing (2018-05-31) this still doesn’t fix the issues with tags. Also, the conversion will keep all branches ever made. You’ll then have to go into Git and manually delete all branches that had already been merged back and deleted in the SVN repository.

Dealing With Very Non-Standard SVN Layout

The above steps should work fine for both standard SVN layouts and simple non-standard layouts where the location of the trunk, branches, and tags folders never changed. If you ever changed, moved, or renamed the trunk, branches, or tags folders, you are going to have serious issues with the conversion. It is a good idea to consider whether you really need to maintain history, or whether it would be ok to just start a Git repository fresh from the latest SVN version, as if it were the first commit into Git. However, if you really want to maintain history, there is a way to do it, described below.

How To Maintain History

Basically, you need to “massage” your old SVN repository data into a new SVN repository structure that looks as if you started with the standard trunk, branches, and tags layout and never changed it.

  1. Create an SVN dump file.
  2. Modify that file manually (or preferably with scripts so you can save work/progress).
  3. Recreate a new SVN repo from the dump file.
    1. An SVN dump file is a human-readable (and editable) file that contains all the information necessary to recreate an SVN repo exactly (if you want to).
    2. Each commit carries human-readable metadata for adding/deleting/renaming directories (and other stuff too).
    3. You can fix trunk, branches, and tags modifications (moves, renames, etc.) by editing the commits that made those changes. The dump and recreate portions must be done on the SVN server; the dump file itself can be edited anywhere.
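The dump and recreate ends of that loop can be sketched with the standard `svnadmin dump`, `svnadmin create`, and `svnadmin load` subcommands. This is only an outline under assumptions: the function names `dump_repo` and `load_repo` and the paths in the comments are placeholders, not something the BB procedure prescribes.

```shell
#!/bin/sh
# Dump an existing repository to a file.
# $1 = path to the old repo on the server, $2 = dump file to write.
dump_repo() {
    svnadmin dump "$1" > "$2"
}

# Create a brand-new, empty repository and load the (edited) dump into it.
# $1 = dump file to read, $2 = path of the new repo to create.
load_repo() {
    svnadmin create "$2" && svnadmin load "$2" < "$1"
}

# Typical sequence (paths are placeholders):
#   dump_repo /srv/svn/old-repo repo.dump
#   ... edit repo.dump ...
#   load_repo repo.dump /srv/svn/new-repo
```

Loading into a fresh repository each time keeps a failed attempt from polluting the next one: delete the new repo, fix the dump file, and load again.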

Create the Initial Dumpfile

  1. Start by creating the initial SVN dump file. On the SVN server run: svnadmin dump <path-to-repo> > <dumpfile-name>
  2. This will dump the entire repo to the dump file. However, if the stuff you want is in subprojects of a master repo, then you will want to keep only those subdirectories. You can do that with: svnadmin dump <path-to-repo> | svndumpfilter include <single-quoted-list-of-subdirectories-in-the-repo> > <dumpfile-name>

When you use an include filter, revisions that didn’t touch your project are left behind as empty revisions. It’s generally better to drop them and renumber what remains. The cost is that any references to specific SVN revision numbers (e.g. in commit messages or documentation) will no longer match. If you can live with that, dropping them is the recommended route.

  1. To do that: svnadmin dump <path-to-repo> | svndumpfilter include <list> --drop-empty-revs --renumber-revs > <dumpfile-name>
  2. Also, you may run into issues with missing merge sources if you moved your branches around, so the full recommended command to run is: svnadmin dump <path-to-repo> | svndumpfilter include <list> --drop-empty-revs --renumber-revs --skip-missing-merge-sources > <df-name>
  3. This will create the starting point dump file that you will then start editing.

Massaging the Dumpfile

First, make a backup of the starting dumpfile. You will likely need to make multiple attempts at correcting all the issues you want to correct and creating the dumpfile can take a long time.

Next, start making a script that is going to use the “sed” command to find and replace text in the dumpfile that will correct things so it looks like your repository always just had the plain old trunk, branches, and tags structure.

There are 2 types of lines we need to start replacing in the SVN dump file. Type 1: lines that start with “Node-path: <x>” and Type 2: lines that start with “Node-copyfrom-path: <x>”.

The first type of line tells SVN where in the repository an operation is occurring.

An example of how this might get used in the dump file is:

Node-path: trunk
Node-kind: dir
Node-action: add

A section like this is what you would see for an initial commit that creates the trunk folder in the root directory structure.

The second type of line (the copyfrom) occurs when you move a file or folder in SVN. You might see something like:

Node-path: trunk/folder1/file1.h
Node-kind: file
Node-action: add
Node-copyfrom-rev: X
Node-copyfrom-path: trunk/folder2/file1.h

A section like this is what you would see when you moved file1.h from folder2 to folder1. The SVN dumpfile keeps track of where and at what revision the file came from.
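Before writing any replacements, it can help to survey which top-level folders actually occur in those two kinds of lines. The sketch below runs against a tiny hand-written stand-in for a dump file (headers only, with a made-up revision number). The helper name `top_level_paths` is hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Tiny hand-written stand-in for a real dump file (headers only).
cat > "$workdir/sample.dump" <<'EOF'
Node-path: Project/trunk
Node-kind: dir
Node-action: add

Node-path: Project/trunk/folder1/file1.h
Node-kind: file
Node-action: add
Node-copyfrom-rev: 3
Node-copyfrom-path: trunk/folder2/file1.h
EOF

# Count how often each top-level folder appears in either kind of path line.
# grep -a treats the dump as text even if it contains binary file contents.
top_level_paths() {
    grep -a -E '^Node-(copyfrom-)?path: ' "$1" |
        sed -E 's/^Node-(copyfrom-)?path: //' |
        cut -d/ -f1 |
        sort | uniq -c
}

top_level_paths "$workdir/sample.dump"
```

On a real dump, any unexpected top-level name in the output points to a move or rename that will need its own replacement rule.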

 

Let’s start with a “trivial” case as an example

Let’s say that you have a project that you want to migrate that is in a subfolder of a “master” repository. In this particular case, you can use the built-in “git svn clone” options to just specify the paths to the trunk, branches, tags, but let’s instead use editing the dump file to achieve the same thing. Let’s suppose that in your repository you have the following structure:

  • Project
    • trunk
    • branches
    • tags

You’ve already created a dumpfile that only includes these folders (see above). Now we want to edit the dumpfile so it looks like none of the commits ever knew about the top-level folder.

We do that by doing the following in the script file:

sed -i 's/Node-path: Project\/trunk/Node-path: trunk/g' <dumpfile>
sed -i 's/Node-copyfrom-path: Project\/trunk/Node-copyfrom-path: trunk/g' <dumpfile>
sed -i 's/Node-path: Project\/branches/Node-path: branches/g' <dumpfile>
sed -i 's/Node-copyfrom-path: Project\/branches/Node-copyfrom-path: branches/g' <dumpfile>
sed -i 's/Node-path: Project\/tags/Node-path: tags/g' <dumpfile>
sed -i 's/Node-copyfrom-path: Project\/tags/Node-copyfrom-path: tags/g' <dumpfile>

What we are doing is replacing the paths to every single folder/file in every single commit in the dumpfile with a new path that strips the first folder out. We need to also include the “copyfrom” versions so that when things are moved, it matches the new directory structures we are creating.
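Since every replacement comes as a Node-path/Node-copyfrom-path pair, the script can be condensed into one helper. The sketch below is one way to do it, with `rewrite_prefix` as a hypothetical name. It differs from the commands above in three ways: it anchors the match to the start of the line, it only matches a whole path component (so `Project/trunk` does not match `Project/trunk-old`), and it writes through a temporary file instead of using `sed -i`, because BSD/macOS sed expects an argument after `-i`.

```shell
#!/bin/sh
# Rewrite a leading path prefix on both kinds of path lines.
# $1 = old prefix, $2 = new prefix, $3 = dump file.
# Assumption: prefixes contain no "#" and no characters special to sed regexes.
rewrite_prefix() {
    # LC_ALL=C keeps sed from choking on non-UTF-8 bytes in file contents.
    LC_ALL=C sed -E "s#^(Node-(copyfrom-)?path: )$1(/|\$)#\1$2\3#" "$3" > "$3.tmp" &&
        mv "$3.tmp" "$3"
}

# Hand-written sample (headers only) to show the effect.
workdir=$(mktemp -d)
cat > "$workdir/sample.dump" <<'EOF'
Node-path: Project/trunk/file1.h
Node-copyfrom-path: Project/trunk/file0.h
Node-path: Project/trunk-old/file2.h
EOF

rewrite_prefix 'Project/trunk' 'trunk' "$workdir/sample.dump"
cat "$workdir/sample.dump"
```

Anchoring lowers the chance of touching file contents that happen to contain the same text, but it does not eliminate it, so keep the backup of the original dumpfile around.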

 

Fixing Non-trivial Issues

Now, that is a relatively trivial example, but you can fix far less trivial issues too. Let’s say you actually started with a root trunk/branches/tags and then you moved it into a sub-folder. So this:

  • trunk
  • branches
  • tags

Got converted to this with some folder SVN-moves:

  • Project
    • trunk
    • branches
    • tags

This is where things start to get complicated. With “git svn clone” you cannot specify that the trunk got moved around, so you must fix it in the dumpfile first. The good news is that you can do this. The script to do this is actually the same as the script above for the trivial case. However, there are now some additional tweaks we need to make to the dumpfile.

  1. At the revision where you moved these 3 folders (let’s assume you did it all in one revision), the dumpfile will contain some copyfrom operations. After we run the cleaning script, there is going to be some stuff that looks like:
    1. Node-path: trunk
    2. Node-kind: dir
    3. Node-action: add
    4. Node-copyfrom-rev: X
    5. Node-copyfrom-path: trunk
  2. Because we stripped off the “Project” top-level folder, it’s going to look like we copied everything from trunk back into trunk. It turns out that is ok, and SVN can handle that in a dump file. The problem is that the same commit also contains the deletion of the top-level trunk folder, because that is exactly what we did in the original history. It’s going to have a section that looks like:
    1. Node-path: trunk
    2. Node-kind: dir
    3. Node-action: delete
  3. We need to go in and manually delete the deletion part. We could also delete the copyfrom part, and if we did all of that we would end up with a commit that has a log message but does absolutely nothing, which is ok. But we don’t have to delete the copyfrom parts if we don’t want to.
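In a large dumpfile, finding those delete records by eye is tedious. The sketch below prints the line number of each delete record that targets exactly `trunk`, `branches`, or `tags`, so you know where to look in an editor. It runs on a hand-written sample with made-up revision numbers, and the helper name `find_toplevel_deletes` is hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Hand-written sample of what the moved-folder revision might look like
# after the path replacements have been applied.
cat > "$workdir/sample.dump" <<'EOF'
Revision-number: 7

Node-path: trunk
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 6
Node-copyfrom-path: trunk

Node-path: trunk
Node-action: delete

Node-path: trunk/old.h
Node-action: delete
EOF

# Print "line-number: path" for every delete record that targets one of the
# top-level folders, so those records can be reviewed and removed by hand.
find_toplevel_deletes() {
    LC_ALL=C awk '
        # Remember the most recent Node-path and the line it started on.
        /^Node-path: / { path = substr($0, 12); line = NR }
        /^Node-action: delete$/ && path ~ /^(trunk|branches|tags)$/ {
            print line ": " path
        }
    ' "$1"
}

find_toplevel_deletes "$workdir/sample.dump"
```

The script only reports locations. Removing the records is still a manual edit, since only you know which deletions belong to the folder move.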

Things can get even more complicated than this. Imagine you moved a file to a location that was filtered out of the original dump because it’s not part of the project you want to recreate, and then the file got moved back into the project that you do want to recreate. In those cases, you actually need to create a dump that contains the parts that you don’t (eventually) want, at least for now. Then, one suggestion is to move those folder locations, using the tricks above, to a temporary folder in the new “trunk”, as in “trunk/svn-git-migration/<folder-to-delete-later>”. This will allow you to preserve the file moves during the recreation of the repo. After you make the new repo, you can then SVN-delete the temporary folder right before you convert to Git.

Using These Tricks to Deal With Externals

You can also use these tricks to deal with externals. This section describes how to handle the case where you moved some stuff around in a “master” repo to share it between some projects. Git has submodules and subtrees if you want to preserve an SVN external as its own repo, but let’s suppose that instead you want to make it look like the contents of the external never moved around at all, so you can create one new repo with all the history of the external too. Let’s suppose you started with:

  • Project1
    • trunk
    • branches
    • tags

Then, you took some stuff from Project1 and started moving it to a shared section in your repo (to share with Project2) and added an externals folder in your trunk to get back into Project1. So now your repo looks like:

  • Project1
    • trunk
      • Library
    • branches
    • tags
  • Library
    • trunk
    • branches
    • tags

You can use the same path replacement techniques to make it look like all you did was reorganize some files that were in other areas of “trunk” into a new subfolder you named “Library”. Just do the following with some sed commands:

sed -i 's/Node-path: Library\/trunk/Node-path: trunk\/Library/g' <dumpfile>
sed -i 's/Node-copyfrom-path: Library\/trunk/Node-copyfrom-path: trunk\/Library/g' <dumpfile>

There is one other thing that you need to do though. Since “trunk/Library” is created via an external, the actual “Library” folder is never created in the dumpfile. So when recreating the repo, the load is going to fail when it tries to move stuff into the Library folder, because the folder doesn’t exist yet. You will need to go into the dumpfile and manually “add” the directory in an appropriate existing commit (probably the one right before you need to use it). This is similar to how we had to manually delete some parts of the dumpfile discussed above. You can do this with something like:

Node-path: trunk/Library
Node-kind: dir
Node-action: add
Prop-content-length: 10
Content-length: 10
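The value 10 in both length headers is not arbitrary. As far as I understand the dump format, it is the byte count of an empty property block, which consists of the terminator line PROPS-END plus its newline, and that block has to follow the headers after a blank line. The sketch below prints a complete record under that assumption. The helper name `emit_dir_add` is hypothetical, and the trailing blank lines mirror what `svnadmin dump` appears to emit between records.

```shell
#!/bin/sh
# Emit a directory-add record for the path given as $1.
# The two length headers are 10 because the body is just "PROPS-END" + newline.
emit_dir_add() {
    printf 'Node-path: %s\n' "$1"
    printf 'Node-kind: dir\nNode-action: add\n'
    printf 'Prop-content-length: 10\nContent-length: 10\n\n'
    printf 'PROPS-END\n\n\n'
}

emit_dir_add trunk/Library

# Confirm the byte count of the empty property block.
printf 'PROPS-END\n' | wc -c
```

If the load still complains about this record, compare it byte for byte with a directory-add record that `svnadmin dump` produced elsewhere in the same file.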

The Last Part

You generally only need to do this for the “trunk”, however, this may not completely preserve what the state of the Library was in any branches you created. You probably fixed the revision of Library in those branches. You could (if you want) continue using these tricks to fix it all up.

Things can also get tricky sometimes regarding the order of the path replacement operations. For example, consider a folder structure that evolved as such:

  • Project
  • Project-trunk
  • Project-branches
  • Project-tags
  • Project
    • trunk
    • branches
    • tags

If you first try to replace “Project” with trunk, it will also replace “Project-trunk” with “trunk-trunk”. That will screw up future replacement operations. So, generally speaking, do the replacement operation from most-specific to least-specific. In this example, the order might be:

  • Project/trunk → trunk
  • Project/branches → branches
  • Project/tags → tags
  • Project-trunk → trunk
  • Project-branches → branches
  • Project-tags → tags
  • Project → trunk
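The effect of the ordering is easy to see on a small sample. The sketch below applies the same three rules twice to a hand-written file, first from most-specific to least-specific and then with the generic rule first. Only Node-path lines are shown for brevity; a real script would need the matching Node-copyfrom-path rules too.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# One sample path from each stage of the folder structure's history.
cat > "$workdir/sample.dump" <<'EOF'
Node-path: Project
Node-path: Project-trunk/a.c
Node-path: Project/trunk/a.c
EOF

# Most-specific patterns first; the bare "Project" rule goes last.
LC_ALL=C sed \
    -e 's/^Node-path: Project\/trunk/Node-path: trunk/' \
    -e 's/^Node-path: Project-trunk/Node-path: trunk/' \
    -e 's/^Node-path: Project/Node-path: trunk/' \
    "$workdir/sample.dump" > "$workdir/ordered.dump"

# Same rules with the generic one first, for comparison.
LC_ALL=C sed \
    -e 's/^Node-path: Project/Node-path: trunk/' \
    -e 's/^Node-path: Project\/trunk/Node-path: trunk/' \
    -e 's/^Node-path: Project-trunk/Node-path: trunk/' \
    "$workdir/sample.dump" > "$workdir/wrong.dump"

echo "ordered:"; cat "$workdir/ordered.dump"
echo "wrong:";   cat "$workdir/wrong.dump"
```

In the second run the generic rule fires first, so the later rules never match and the paths come out as “trunk-trunk/a.c” and “trunk/trunk/a.c”.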

The End

Happy Migrating!