gitolite/doc/mirroring.mkd
Sitaram Chamarty a3ffc9d8fd (mirroring) reject non-local pushes if GL_HOSTNAME not set
We previously said all mirroring features are disabled if GL_HOSTNAME is
not set.

But what if, after mirroring has been setup, and master/slaves defined
for a repo, a slave admin fat-fingers the RC file and accidentally
comments out GL_HOSTNAME?  We might end up violating RULE NUMBER ONE!
2011-08-13 14:32:38 +05:30

mirroring gitolite servers

Mirroring a repo is simple in git; you just need code like this in a post-receive hook in each repo:

#!/bin/bash
git push --mirror slave_user@mirror.host:/path/to/repo.git

The hard part is managing this across multiple mirror sites with multiple repositories being mirrored.
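
To get a feel for why this gets unwieldy, here is a minimal sketch of the naive extension of that hook to several mirrors. It is not how gitolite's mirroring is implemented. The function name `push_to_mirrors`, the `GIT_CMD` override (there only so the loop can be dry-run), and the mirror URLs in the comment are all assumptions made for illustration.

```shell
# Naive multi-mirror push (illustrative only; not gitolite's implementation).
# GIT_CMD is a hypothetical override for dry runs; it defaults to git.
GIT_CMD=${GIT_CMD:-git}

# push_to_mirrors URL...
# Mirror-push the current repo to each URL in turn, reporting failures
# but carrying on with the remaining mirrors.
push_to_mirrors() {
    rc=0
    for url in "$@"
    do
        if ! $GIT_CMD push --mirror "$url"
        then
            echo "mirror push to $url failed" >&2
            rc=1
        fi
    done
    return $rc
}

# in a post-receive hook you might then call (hypothetical URLs):
#   push_to_mirrors slave_user@mirror1.host:/path/to/repo.git \
#                   slave_user@mirror2.host:/path/to/repo.git
```

Every repo would need its own copy of the mirror list, which is exactly the bookkeeping the rest of this document moves into gitolite's config.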

This document has been tested using a 3-server setup, all installed using the "non-root" method (see doc/1-INSTALL.mkd). However, the process is probably not going to be very forgiving of human error -- like anything that is this deep in "system admin" territory, errors are likely to be costly. If you're the kind who hits enter first and then thinks about what he typed, you're in for some fun times ;-)

On the plus side, everything we do is done using git commands, so things are never really lost until you do a git gc.



RULE NUMBER ONE!

RULE OF GIT MIRRORING: users should push directly to only one server! All the other machines (the slaves) should be updated by the master server.

If a user pushes directly to one of the slaves, those changes will get wiped out on the next mirror push from the real master server.

Corollary: if the primary went down and you effected a changeover, you must make sure that the primary does not come up in a push-enabled mode when it recovers.

what will/will not work

This process will only mirror your git repositories, using git push --mirror. It will not mirror log files, repo-specific files like gl-creator and gl-perms, or indeed anything that was manually created or added (for example, custom config entries added manually instead of via gitolite).

None of these affect actual repo contents of course, but they could be important (especially gl-creator, although if your wildcard pattern had "CREATOR" in it you can recreate those files easily enough anyway).

Mirroring has not been, and will not be, tested when gitolite is installed using the deprecated 'from-client' method. Please use one of the other methods.

Also, none of this has been tested with smart-http. I'm not even sure it'll work; http is very fiddly to get right. If you want mirroring, at least your server-to-server comms should be over ssh.

concepts and terminology

Servers can host 3 kinds of repos: master, slave, and local.

  • A repo can be a master on one and only one server. A repo on its "master" server is a "native" repo; on slaves it is "non-native".

  • A slave repo cannot be pushed to by a user. It will only accept pushes from a master server. (But see later for an exception).

  • A local repo is not involved in mirroring at all, in either direction.

setup and usage

server level setup

To start with, assign each server a short name. We will use 'frodo', 'sam', and 'gollum' as examples here.

  1. Generate ssh keys on each machine. Copy the .pub files to all other machines with the appropriate names. I.e., frodo should have sam.pub and gollum.pub, etc.

  2. Install gitolite on all servers, under some 'hosting user' (we'll use git in our examples here). You need not use the same hosting user on all machines.

    It is not necessary to use the same "admin key" on all the machines. However, if you do plan to mirror the gitolite-admin repo also, they will eventually become the same anyway. In our example, frodo does mirror the admin repo to sam, but not to gollum. (Can you really see frodo or sam trusting gollum?)

  3. Now copy hooks/common/post-receive.mirrorpush from the gitolite source, and install it as a custom hook called post-receive; see here for instructions.

  4. Edit ~/.gitolite.rc on each machine and add/edit the following lines. The GL_HOSTNAME variable must have the correct name for that host (frodo, sam, or gollum), so that will definitely be different on each server. The other line can be the same, or may have additional patterns for other git config keys you have previously enabled. See here and the description for GL_GITCONFIG_KEYS in this for details.

    $GL_HOSTNAME = 'frodo';     # will be different on each server!
    $GL_GITCONFIG_KEYS = "gitolite.mirror.*";
    

    (Remember the "rc" file is NOT mirrored; it is meant to be site-local).

    Note: if GL_HOSTNAME is undefined, you cannot push to repos which have the 'gitolite.mirror.master' config variable set. (See 'details' section below for more info on this variable).

  5. On each machine, add the keys for all other machines. For example, on frodo you'd run these two commands:

    gl-tool add-mirroring-peer sam.pub
    gl-tool add-mirroring-peer gollum.pub
    
  6. Create "host" aliases on each machine to refer to all other machines. See here for what/why/how.

    The host alias for a host (in other machines' ~/.ssh/config files) MUST be the same as the GL_HOSTNAME in the referred host's ~/.gitolite.rc. Gitolite mirroring requires this consistency in naming; things will NOT work otherwise.

    For example, if machine A's ~/.gitolite.rc says $GL_HOSTNAME = 'frodo';, then all other machines must use a host alias of "frodo" in their ~/.ssh/config files to refer to machine A.
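
As a rough sketch of what such a host alias looks like, the helper below prints an ssh config stanza. The function name `print_host_alias` and the address `frodo.example.com` are assumptions for illustration; only the `Host`, `User`, and `HostName` keywords are standard ssh config. The first argument is the alias, and it is the part that has to match the peer's GL_HOSTNAME.

```shell
# print_host_alias ALIAS USER HOSTNAME
# Print an ssh config stanza.  ALIAS must equal the GL_HOSTNAME set in the
# peer's ~/.gitolite.rc; USER is the hosting user on that peer.
print_host_alias() {
    printf 'Host %s\n    User %s\n    HostName %s\n' "$1" "$2" "$3"
}

# e.g., on sam, to refer to frodo (hypothetical address):
#   print_host_alias frodo git frodo.example.com >> ~/.ssh/config
```

If the key you generated in step 1 is not ssh's default key, you would presumably also need an `IdentityFile` line in the stanza.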

Once you've done this, each host should be able to reach the other hosts and get a response back. For example, running this on sam:

ssh frodo info

should get you

Hello sam, I am frodo.

Check this command from everywhere to everywhere else, and make sure you get expected results. Do NOT proceed otherwise.
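
If you have more than a couple of servers, a small loop saves some typing. This is only a sketch: `check_peers` and the `SSH_CMD` override (there so you can dry-run it) are hypothetical names, and the only thing it assumes about the reply is the "I am <name>" part shown above.

```shell
# SSH_CMD is a hypothetical override for dry runs; it defaults to ssh.
SSH_CMD=${SSH_CMD:-ssh}

# check_peers PEER...
# Run "ssh PEER info" for each peer and check that the reply names that
# peer, as in "Hello sam, I am frodo."  Returns non-zero if any peer fails.
check_peers() {
    rc=0
    for peer in "$@"
    do
        reply=$($SSH_CMD "$peer" info 2>&1)
        case "$reply" in
            *"I am $peer"*) echo "ok: $peer" ;;
            *)              echo "FAIL: $peer ($reply)"; rc=1 ;;
        esac
    done
    return $rc
}

# on sam you would run:
#   check_peers frodo gollum
```

Run it on each server with the names of all the others; any FAIL line means the keys or host aliases for that pair need another look.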

repository level setup

Setting up mirroring at the repository level instead of at the "entire server" level gives you a lot of flexibility (see "discussion" section below).

The basic idea is to use git config variables within each repo (gitolite allows you to create them from within the gitolite.conf file so that's convenient), and use these to specify which machine is the master and which machines are slaves for the repo.

Side note: if you just want to simulate the old mirroring scheme, despite its limitations, it's very easy. Say frodo is the master for all repos, and the other 2 are slaves. Just clone the gitolite-admin repos of all servers, add these lines to the top of each:

    repo @all
        config gitolite.mirror.master   =   "frodo"
        config gitolite.mirror.slaves   =   "sam gollum"

then commit, and push all 3. Finally, make a dummy commit on just the frodo clone and push again. You're done.

Let's say frodo and sam are internal servers, while gollum is an external (and therefore less trusted) server that has agreed to help us out by mirroring one of our high traffic repos. We want the following setup:

  • the "gitolite-admin" repo, as well as an internal project repo called "ip1", should be mastered on frodo and mirrored to sam.

  • internal project "ip2" has almost all of its developers closer to sam, so it should be mastered there, and mirrored on frodo.

  • an open source project we manage, "os1", should be mastered on frodo and mirrored on both sam and gollum.

So here's how our example would go:

  1. Clone frodo's and sam's gitolite-admin repos to your workstation, then add the following lines to both their gitolite.conf files:

    repo ip1 gitolite-admin
        config gitolite.mirror.master   =   "frodo"
        config gitolite.mirror.slaves   =   "sam"
    
    repo ip2
        config gitolite.mirror.master   =   "sam"
        config gitolite.mirror.slaves   =   "frodo"
    

    You also need normal access control lines for ip1 and ip2; I'm assuming you already have them elsewhere, at least on frodo. (What you have on sam won't matter in a few minutes, as you will see!)

    Commit and push these changes.

  2. There are a couple of quirks to keep in mind when you make changes to the gitolite-admin repo's config.

    • the first push will create the git config entries required, but by then it is too late to act on them; i.e., actually do the mirroring. If there were any older values, like a different list of slaves perhaps, then those would be in effect.

      This is largely because git invokes post-receive before post-update. In theory I can work around this but I do not intend to.

      Anyway, this means that after the 2 pushes, you have to make a dummy push from frodo:

      git commit --allow-empty -m empty; git push
      

      which gets you something like this amidst the other messages:

      remote: (25158&) frodo ==== (gitolite-admin) ===> sam
      

      telling you that frodo is sending gitolite-admin to sam in the background.

    • the second quirk is that your clone of server sam's gitolite-admin repo is now completely out of date, since frodo has overwritten it on the server. You have to 'cd' to that clone and do this:

      git fetch
      git reset --hard origin/master
      
  3. That completes the setup of the gitolite-admin and the internal project repos. We'll now set up things for the open source project, "os1".

    On frodo's gitolite-admin clone, add the following lines to conf/gitolite.conf, then commit and push:

    repo os1
        config gitolite.mirror.master   =   "frodo"
        config gitolite.mirror.slaves   =   "sam gollum"
    

    Also, send the same lines to gollum's administrator and ask him to add them into his conf/gitolite.conf file, commit, and push.
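
Since the same stanza has to end up, character for character, in more than one admin repo, you may prefer to generate it rather than retype it. A minimal sketch follows; the helper name `mirror_stanza` is hypothetical, and it emits nothing beyond the lines shown in the steps above.

```shell
# mirror_stanza MASTER SLAVES REPO...
# Print the gitolite.conf lines that declare MASTER and SLAVES for the
# given repos.  SLAVES is one space-separated string, e.g. "sam gollum".
mirror_stanza() {
    master=$1
    slaves=$2
    shift 2
    printf 'repo %s\n' "$*"
    printf '    config gitolite.mirror.master   =   "%s"\n' "$master"
    printf '    config gitolite.mirror.slaves   =   "%s"\n' "$slaves"
}

# e.g., to produce the os1 lines for mailing to gollum's admin:
#   mirror_stanza frodo "sam gollum" os1
```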

commands to (re-)sync mirrors

Sometimes there's a network problem and a mirror will not receive an update immediately on a push. When the network is back up, you can do one of these things to get it back in sync.

  1. On the master server, you can start a background job to mirror a repo. For example, this:

    gl-mirror-shell request-push ip1
    

    triggers a mirror-push of repo "ip1" to all slaves listed in that repo's "gitolite.mirror.slaves" config.

    On the other hand, this:

    gl-mirror-shell request-push ip1 gollum
    

    triggers a mirror-push of "ip1" only to the gollum server, regardless of what servers are listed as slaves in the config.

    Note that this invocation does not even check if gollum is listed as a slave for "ip1"; since you're doing it at the command line on the master server, you're allowed to push it to any slave that will accept it.

    Side note: if you want to start a foreground job, the syntax is gl-mirror-shell request-push ip1 -fg gollum. Foreground mode requires one (and only one) slave name -- you cannot send to an implicit list, nor to more than one slave.

  2. Cronjobs and custom mirroring schemes are now very easy to do. Just use the second form of the command above to push any repo to any slave, and it can form the basis of any scheme you like. Appendix A contains an example setup.

  3. Once in a while a slave will realise it needs an update, and wants to ask for one. It can run this command to do so:

    ssh sam request-push ip2
    

    If the requesting server is not one of the slaves listed in the config variable gitolite.mirror.slaves on the master, it will be rejected.

    This is always a foreground push, reflecting the fact that the slave may want to know why their push errored out or didn't work last time or whatever.
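
For instance, after a slave has been unreachable for a while, the master's admin might want to re-send several repos to just that slave. Here is a sketch using the second form of the command from item 1; `resync_slave` and the `GL_MIRROR_SHELL` override are hypothetical names, the latter there only so the loop can be dry-run.

```shell
# GL_MIRROR_SHELL is a hypothetical override for dry runs;
# it defaults to the real gl-mirror-shell command.
GL_MIRROR_SHELL=${GL_MIRROR_SHELL:-gl-mirror-shell}

# resync_slave SLAVE REPO...
# On the master, request a background mirror-push of each repo to SLAVE.
resync_slave() {
    slave=$1
    shift
    for repo in "$@"
    do
        $GL_MIRROR_SHELL request-push "$repo" "$slave" ||
            echo "request-push of $repo to $slave failed" >&2
    done
}

# e.g., on frodo, once gollum is reachable again:
#   resync_slave gollum os1
```

Remember that this form does not check whether the slave is listed for the repo, so it is up to you to name only repos that the slave will accept.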

details

the conf/gitolite.conf file

One goal I have is to minimise the code changes to "core" gitolite due to this, so all repo-specific mirror settings are stored as git config variables (you know you can specify git config variables in the gitolite config file right?). These are:

  • gitolite.mirror.master

    The name of the server which is the master for this repo. Each server will compare this with $GL_HOSTNAME (from its own rc file) to determine if it's the master or a slave. Here're the possible values:

    • undefined or local: this repo is local to this server
    • same as $GL_HOSTNAME: this server is the "master" for this repo. (The repo is "native" to this server).
    • not same as $GL_HOSTNAME: this server is a "slave" for the repo. (The repo is "non-native" on this server).
  • gitolite.mirror.slaves

    Ignored for non-native repos. For native repos, this is a space-separated list of servers to push to from the post-receive hook.

    Clearly, you can have different sets of slaves for different repos (again, see "discussion" section later for more on this).

  • gitolite.mirror.redirectOK

    See the section on "redirecting pushes"
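
The comparison described for gitolite.mirror.master can be sketched as a small function. This is an illustration of the rules as stated here, not gitolite's actual code; the name `mirror_role` and the "no-hostname" label (for the case where GL_HOSTNAME is undefined but a master is set, in which pushes are rejected) are assumptions.

```shell
# mirror_role MASTER HOSTNAME
# MASTER is the repo's gitolite.mirror.master value (may be empty);
# HOSTNAME is this server's GL_HOSTNAME (may be empty).
mirror_role() {
    master=$1
    hostname=$2
    if [ -z "$master" ] || [ "$master" = "local" ]
    then
        echo local          # not involved in mirroring at all
    elif [ -z "$hostname" ]
    then
        echo no-hostname    # master is set but GL_HOSTNAME is not
    elif [ "$master" = "$hostname" ]
    then
        echo master         # the repo is native to this server
    else
        echo slave          # the repo is non-native on this server
    fi
}

# e.g., from inside a repo on frodo:
#   mirror_role "$(git config --get gitolite.mirror.master)" frodo
```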

redirecting pushes

Please read carefully; there are security implications if you enable this for mirrors NOT under your control.

When a user pushes to a non-native repo, it is possible to transparently redirect the push to the correct master server. This is a very neat feature, because now all your users just use one URL (the mirror nearest to them). They don't need to know where the actual master is, and more importantly, if you and the other admins change it, they don't need to know it changed!

The gitolite.mirror.redirectOK config variable decides where this redirection is OK. If it is set to 'true', any valid 'slave' can redirect an incoming non-native push from a developer. Otherwise, it contains a list of slaves that are permitted to redirect pushes (this might happen if you don't trust some of your slaves enough to accept a redirected push from them).

This check needs to pass on both the master and slave servers; both have a say in deciding if this is allowed. (The master may have real reasons not to allow this; see below. I cannot think of any real reason for the slave to disable this, but it's there in case some admin doesn't like it).
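
As a sketch of the check each side makes, under the assumption that the list form is space-separated like gitolite.mirror.slaves (the text above does not say), something like this would do. The function name `redirect_ok` is hypothetical and this is not gitolite's actual code.

```shell
# redirect_ok VALUE SLAVE
# VALUE is the repo's gitolite.mirror.redirectOK setting; SLAVE is the
# server doing the redirecting.  Succeeds if VALUE permits SLAVE.
redirect_ok() {
    value=$1
    slave=$2
    [ "$value" = "true" ] && return 0
    # unquoted on purpose: split the list on whitespace (assumed separator)
    for s in $value
    do
        [ "$s" = "$slave" ] && return 0
    done
    return 1
}

# e.g.:
#   redirect_ok "$(git config --get gitolite.mirror.redirectOK)" sam &&
#       echo "sam may redirect pushes for this repo"
```

Note that 'true' only covers valid slaves, so a real check would also have to confirm the server is listed in gitolite.mirror.slaves; that part is left out here.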

There are some potential issues that you MUST consider before enabling this:

  • (security) If the slave and master server are so different or autonomous that a user, say "alice", on the slave is not guaranteed to be the same one as "alice" on the master, then the master admin should NOT enable this feature.

    This is because, in this scheme, authentication happens on the slave, but authorisation is on the master. The slave-authenticated userid (alice) is passed to the master.

    (If you know ssh well enough, you know that the ssh authentication has already happened, so all we can do is ensure authorisation happens with whatever username we know so far).

  • If your slave is out of sync with the master for whatever reason, then the user will get confusing results. A git fetch may say everything is up to date but the push fails saying it is not a fast-forward push. (Of course there's a way to fix this; see the "commands to (re-)sync mirrors" section above).

  • We cannot redirect non-git commands like ADC, setperms, etc because we don't really have a way of knowing what repo he's talking about (different commands have different syntaxes, some have more than one reponame...). Any user who needs to do that should access the end server directly. It should be easy enough to write an ADC to do the forwarding, in case the slave server is the only one that can reach the real master due to network or firewall setup.

    Ideally, I recommend that ad hoc repos not be mirrored at all. Keep mirroring for "blessed" repos only.

discussion

problems with the old mirroring model

The old mirroring model had a single server as the master for all repositories. Slaves were effectively only for load-balancing reads, or for failover if the master died.

This is not good enough for corporate setups where the developers are spread fairly evenly across the world. Some repos need to be closer to some teams (NUMA is a good analogy).

A model where different repos are "mastered" in different cities is much more efficient here.

The old model had other rigidities too, though they're not really problems, as such:

  • the slaves are just slaves; they can't have any "local" repos.

  • a slave had to carry all repos; it couldn't choose to carry just a subset.

  • it implicitly assumed all the mirrors were under the same admin, and that the gitolite-admin repo was itself mirrored too.

the new mirroring model

In the new model, servers can be (but, I hasten to add, don't have to be!) much more independent and autonomous than in the old model. This has a few pros/cons:

  • The gitolite-admin repo (and config) need not be mirrored. This allows site-local repos not meant to be mirrored, without unnecessarily creating a second gitolite install just for those.

    (Site-local repos are useful for purely local projects that need not/should not be mirrored for some reason, or ad-hoc personal repos that developers create for themselves, etc.)

    Of course, then the admin(s) need to make an effort to keep things consistent for the "blessed" repos. For example, two servers can both claim to be "master"!

  • Servers can choose to mirror a subset of the repos from one of the bigger servers.

    In the open source world, you can imagine more popular repos (or more popular parts of huge projects like KDE) having more mirrors. Or substitute "more popular" with "larger in size" if you wish (FlightGear-data anyone?)

    In the corporate world it could help with jurisdiction issues if the mirror is in a different country with different laws.

    I'm sure people will find other uses for this. And I'm positive the pros will outweigh the cons. If you don't like it, follow the suggestion in the side note somewhere up above, and just forget this feature exists :-)


appendix A: example cronjob based mirroring

Let's say you have some repos that are so active that you're pushing halfway across the world every few seconds. The slaves do not need to be that closely updated, and it is sufficient to update them once an hour instead. Here's how you might do that:

repo foo bar frob/nitz
    config gitolite.mirror.hourly = "slave1 slave2 slave3"

Then you'd write a cron job that looks like this (untested):

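#!/bin/bash

# find gitolite's settings; this assumes the script sits next to gl-query-rc
REPO_BASE=`${0%/*}/gl-query-rc REPO_BASE`
GL_BINDIR=`${0%/*}/gl-query-rc GL_BINDIR`

cd "$REPO_BASE" || exit 1
find . -type d -name "*.git" -prune | while read -r r
do
    cd "$REPO_BASE" || exit 1
    cd "$r" || continue

    # get reponame as gitolite knows it (strip leading "./" and trailing ".git")
    r=${r:2}
    r=${r%.git}

    # get slaves list; skip repos with no hourly mirrors, because an empty
    # list would push to everything in gitolite.mirror.slaves instead
    slaves=`git config --get gitolite.mirror.hourly`
    [ -z "$slaves" ] && continue

    # cron's PATH is minimal, so call the command via GL_BINDIR;
    # $slaves is unquoted on purpose so each slave is a separate argument
    "$GL_BINDIR/gl-mirror-shell" request-push "$r" $slaves

    # that command backgrounds the push, so you'd best wait a few seconds
    # before hitting the next one, otherwise you'll have all your repos
    # going out at once!
    sleep 10
done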
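
The two parameter expansions that turn a path printed by find into a repo name are easy to get wrong, so here they are pulled out into a function you can try by hand. The function name `repo_name` is hypothetical, and it uses the portable `${1#./}` form, which has the same effect on find's output as the bash-only `${r:2}` above.

```shell
# repo_name PATH
# Turn a path as printed by "find ." (e.g. ./frob/nitz.git) into the
# repo name as gitolite knows it (frob/nitz).
repo_name() {
    r=${1#./}       # drop the leading "./"
    r=${r%.git}     # drop the trailing ".git"
    echo "$r"
}

# e.g.:
#   repo_name ./frob/nitz.git
```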

appendix B: efficiency versus paranoia

If you're paranoid enough to use mirrors, you should be paranoid enough to use the receive.fsckObjects setting. However, informal tests indicate a 40-50% CPU overhead from this. If you're ok with that, make the appropriate adjustments to GL_GITCONFIG_KEYS and possibly GL_GITCONFIG_WILD in the rc file, then add this to your gitolite.conf file:

repo @all
    config receive.fsckObjects = "true"

Personally, I just set git config --global receive.fsckObjects true, since those servers aren't doing anything else anyway, and are idle for long stretches of time. It's up to you what you want to do here.