From 37ce28a43b0e2dbff17eb20d2e8bd5051dc425d6 Mon Sep 17 00:00:00 2001 From: Sitaram Chamarty Date: Fri, 12 Aug 2011 00:04:15 +0530 Subject: [PATCH] (new mirroring) documentation --- doc/mirroring.mkd | 703 +++++++++++++++++++++++++++++----------------- 1 file changed, 451 insertions(+), 252 deletions(-) diff --git a/doc/mirroring.mkd b/doc/mirroring.mkd index 94d3170..0e5b2ce 100644 --- a/doc/mirroring.mkd +++ b/doc/mirroring.mkd @@ -1,12 +1,13 @@ -## mirroring a gitolite setup +# mirroring gitolite servers -Mirroring git repos is essentially a one-liner. For each mirror you want to -update, you just add a post-receive hook that says +Mirroring a repo is simple in git; you just need code like this in a +`post-receive` hook in each repo: #!/bin/bash git push --mirror slave_user@mirror.host:/path/to/repo.git -But life is never that simple... +The hard part is managing this across multiple mirror sites with multiple +repositories being mirrored. **This document has been tested using a 3-server setup, all installed using the "non-root" method (see doc/1-INSTALL.mkd). However, the process is @@ -20,33 +21,25 @@ never *really* lost until you do a `git gc`**. ---- -**Update 2011-03-10**: I wrote this with a typical "corporate" setup in mind -where all the servers involved are owned and administered by the same group of -people. As a result, the scripts assume the servers trust each other -completely. If that is not your situation, you will have to add code into -`gl-mirror-shell` to limit the commands the remote may send. Patches welcome -:-) - ----- - In this document: * RULE NUMBER ONE! 
- * things that will NOT be mirrored by this process - * conventions in this document - * setting up mirroring - * install gitolite on all servers - * generate keypairs - * setup the mirror-shell on each server - * set slaves to slave mode - * set slave server lists - * efficiency versus paranoia - * syncing the mirrors the first time - * switching over - * the return of foo - * switching back - * making foo a slave - * URLs that your users will use + * what will/will not work + * concepts and terminology + * setup and usage + * server level setup + * repository level setup + * commands to (re-)sync mirrors + * details + * the `conf/gitolite.conf` file + * redirecting pushes + * discussion + * problems with the old mirroring model + * the new mirroring model + * appendix A: example cronjob based mirroring + * appendix B: efficiency versus paranoia + +---- @@ -62,285 +55,491 @@ Corollary: if the primary went down and you effected a changeover, you must make sure that the primary does not come up in a push-enabled mode when it recovers. - + -### things that will NOT be mirrored by this process +### what will/will not work -Let's get this out of the way. This procedure will only mirror your git -repositories, using `git push --mirror`. Therefore, certain files will not be -mirrored: - - * gitolite log files - * "gl-creator" and "gl-perms" files - * "projects.list", "description", and entries in the "config" files within - each repo +This process will only mirror your git repositories, using `git push +--mirror`. It will not mirror log files, and repo-specific files like +`gl-creater` and `gl-perms` files, or indeed anything that was manually +created or added (for example, custom config entries added manually instead of +via gitolite). None of these affect actual repo contents of course, but they could be important, (especially the gl-creator, although if your wildcard pattern had "CREATOR" in it you can recreate those files easily enough anyway). 
-Your best bet is to use rsync for the log files, and tar for the others, at -regular intervals. +Mirroring has not been, and will not be, tested when gitolite is installed +using the deprecated 'from-client' method. Please use one of the other +methods. - +Also, none of this has been tested with smart-http. I'm not even sure it'll +work; http is very fiddly to get right. If you want mirroring, at least your +server-to-server comms should be over ssh. -### conventions in this document + -The userid hosting gitolite is `gitolite` on all machines. The servers are -foo, bar, and baz. At the beginning, foo is the master, the other 2 are -slaves. +### concepts and terminology - +Servers can host 3 kinds of repos: master, slave, and local. -### setting up mirroring + * A repo can be a **master** on one and only one server. A repo on its + "master" server is a **native** repo, on slaves it is "non-native". - + * A **slave** repo cannot be pushed to by a user. It will only accept + pushes from a master server. (But see later for an exception). -#### install gitolite on all servers + * A **local** repo is not involved in mirroring at all, in either direction. - * before running the final step in the install sequence, make sure you go to - the `hooks/common` directory and rename `post-receive.mirrorpush` to - `post-receive`. See doc/hook-propagation.mkd if you're not sure where you - should look for `hooks/common`. + - * if the server already has gitolite installed, use the normal methods to - make sure this hook gets in. +### setup and usage - * Use the same "admin key" on all the machines, so that the same person has - gitolite-admin access to all of them. + - +#### server level setup -#### generate keypairs +To start with, assign each server a short name. We will use 'frodo', 'sam', +and 'gollum' as examples here. 
-Each server will be potentially logging on to one or more of the other -servers, so first generate keypairs on each of them (`ssh-keygen`) and copy -the `.pub` files to all other servers, named appropriately. So foo will have -bar.pub and baz.pub, etc. +1. Generate ssh keys on each machine. Copy the `.pub` files to all other + machines with the appropriate names. I.e., frodo should have sam.pub and + gollum.pub, etc. - +2. Install gitolite on all servers, under some 'hosting user' (we'll use + `git` in our examples here). You need not use the same hosting user on + all machines. -#### setup the mirror-shell on each server + It is not necessary to use the same "admin key" on all the machines. + However, if you do plan to mirror the gitolite-admin repo also, they will + eventually become the same anyway. In our example, frodo does mirror the + admin repo to sam, but not to gollum. (Can you really see frodo or sam + trusting gollum?) -XXX review this document after testing mirroring... +3. Now copy `hooks/common/post-receive.mirrorpush` from the gitolite source, + and install it as a custom hook called `post-receive`; see [here][ch] for + instructions. -If you installed gitolite using the from client method, run the following: +4. Edit `~/.gitolite.rc` on each machine and add/edit the following lines. + The `GL_HOSTNAME` variable **must** have the correct name for that host + (frodo, sam, or gollum), so that will definitely be different on each + server. The other line can be the same, or may have additional patterns + for other `git config` keys you have previously enabled. See [here][rsgc] + and the description for `GL_GITCONFIG_KEYS` in [this][vsi] for details. - # on foo - export GL_BINDIR=$HOME/.gitolite/src - cat bar.pub baz.pub | - sed -e 's,^,command="'$GL_BINDIR'/gl-mirror-shell" ,' >> ~/.ssh/authorized_keys + $GL_HOSTNAME = 'frodo'; # will be different on each server! 
+ $GL_GITCONFIG_KEYS = "gitolite.mirror.*"; -If you installed using any of the other 3 methods do this: + (Remember the "rc" file is NOT mirrored; it is meant to be site-local). - # on foo - export GL_BINDIR=`gl-query-rc GL_BINDIR` - cat bar.pub baz.pub | - sed -e 's,^,command="'$GL_BINDIR'/gl-mirror-shell" ,' >> ~/.ssh/authorized_keys + Note: if `GL_HOSTNAME` is undefined, all mirroring features are disabled + on that server, regardless of other settings. -Also do the same thing on the other machines. +5. On each machine, add the keys for all other machines. For example, on + frodo you'd run these two commands: -Now test this access: + gl-tool add-mirroring-peer sam.pub + gl-tool add-mirroring-peer gollum.pub - # on foo - ssh gitolite@bar pwd - # should print /home/gitolite/repositories - ssh gitolite@bar uname -a - # should print the appropriate info for that server +6. Create "host" aliases on each machine to refer to all other machines. See + [here][ha] for what/why/how. -Similarly test the other combinations. + The host alias for a host (in other machines' `~/.ssh/config` files) MUST + be the same as the `GL_HOSTNAME` in the referred host's `~/.gitolite.rc`. + Gitolite mirroring **requires** this consistency in naming; things will + NOT work otherwise. - + For example, if machine A's `~/.gitolite.rc` says `$GL_HOSTNAME = + 'frodo';`, then all other machines must use a host alias of "frodo" in + their `~/.ssh/config` files to refer to machine A. -#### set slaves to slave mode +Once you've done this, each host should be able to reach the other hosts and +get a response back. For example, running this on sam: -Set slave mode on all the *slave* servers by setting `$GL_SLAVE_MODE = 1` -(uncommenting the line if necessary). + ssh frodo info -Leave the master server's file as is. +should get you - + Hello sam, I am frodo. -#### set slave server lists +Check this command from *everywhere to everywhere else*, and make sure you get +expected results. 
**Do NOT proceed otherwise.** -On the master (foo), set the names of the slaves by editing the -`~/.gitolite.rc` to contain: + - $ENV{GL_SLAVES} = 'gitolite@bar gitolite@baz'; +#### repository level setup -**Note the syntax well; this is critical**: +Setting up mirroring at the repository level instead of at the "entire server" +level gives you a lot of flexibility (see "discussion" section below). - * **this must be in single quotes** (or you must remember to escape the `@`) - * the variable is an ENV var, not a plain perl var - * the values are *space separated* - * each value represents the userid and hostname for one server +The basic idea is to use `git config` variables within each repo (gitolite +allows you to create them from within the gitolite.conf file so that's +convenient), and use these to specify which machine is the master and which +machines are slaves for the repo. -The basic idea is that this string, should be usable in both the following -syntaxes: + - git clone gitolite@bar:repo - ssh gitolite@bar pwd +> Side note: if you just want to simulate the old mirroring scheme, despite +> its limitations, it's very easy. Say frodo is the master for all repos, +> and the other 2 are slaves. Just clone the gitolite-admin repos of all +> servers, add these lines to the top of each: -You can also use ssh host aliases. Let's say server "bar" has a non-standard -port number: + repo @all + config gitolite.mirror.master = "frodo" + config gitolite.mirror.slaves = "sam gollum" - # in ~/.ssh/config on foo - host mybar - hostname bar - user gitolite - port 2222 +> then commit, and push all 3. Finally, make a dummy commit on just the +> frodo clone and push again. You're done. - # in ~/.gitolite.rc on foo - $ENV{GL_SLAVES} = 'bar gitolite@baz'; + -And that's really all there is, unless... 
+Let's say frodo and sam are internal servers, while gollum is an external (and +therefore less trusted) server that has agreed to help us out by mirroring one +of our high traffic repos. We want the following setup: - + * the "gitolite-admin" repo, as well as an internal project repo called + "ip1", should be mastered on frodo and mirrored to sam. -### efficiency versus paranoia + * internal project "ip2" has almost all of its developers closer to sam, so + it should be mastered there, and mirrored on frodo. + + * an open source project we manage, "os1", should be mastered on frodo and + mirrored on both sam and gollum. + +So here's how our example would go: + +1. Clone frodo's and sam's gitolite-admin repos to your workstation, then add + the following lines to both their gitolite.conf files: + + repo ip1 gitolite-admin + config gitolite.mirror.master = "frodo" + config gitolite.mirror.slaves = "sam" + + repo ip2 + config gitolite.mirror.master = "sam" + config gitolite.mirror.slaves = "frodo" + + You also need normal access control lines for ip1 and ip2; I'm assuming + you already have them elsewhere, at least on frodo. (What you have on sam + won't matter in a few minutes, as you will see!) + + Commit and push these changes. + +2. There are a couple of quirks to keep in mind when you make changes to the + gitolite-admin repo's config. + + * the first push will create the `git config` entries required, but by + then it is too late to *act* on them; i.e., actually do the mirroring. + If there were any older values, like a different list of slaves + perhaps, then those would be in effect. + + This is largely because git invokes post-receive before post-update. + In theory I can work around this but I do not intend to. 
+ + Anyway, this means that after the 2 pushes, you have to make a dummy + push from frodo: + + git commit --allow-empty -m empty; git push + + which gets you something like this amidst the other messages: + + remote: (25158&) frodo ==== (gitolite-admin) ===> sam + + telling you that frodo is sending gitolite-admin to sam in the + background. + + * the second quirk is that your clone of server sam's gitolite-admin + repo is now completely out of date, since frodo has overwritten it on + the server. You have to 'cd' to that clone and do this: + + git fetch + git reset --hard origin/master + +3. That completes the setup of the gitolite-admin and the internal project + repos. We'll now set up things for the open source project, "os1". + + On frodo's gitolite-admin clone, add the following lines to + `conf/gitolite.conf`, then commit and push: + + repo os1 + config gitolite.mirror.master = "frodo" + config gitolite.mirror.slaves = "sam gollum" + + Also, send the same lines to gollum's administrator and ask him to add + them into his conf/gitolite.conf file, commit, and push. + + + +#### commands to (re-)sync mirrors + +Sometimes there's a network problem and a mirror will not receive an update +immediately on a push. When the network is back up, you can do one of these +things to get it back in sync. + +1. On the master server, you can start a **background** job to mirror a repo. + For example, this: + + gl-mirror-shell request-push ip1 + + triggers a mirror-push of repo "ip1" to all slaves listed in that repo's + "gitolite.mirror.slaves" config. + + On the other hand, this: + + gl-mirror-shell request-push ip1 gollum + + triggers a mirror-push of "ip1" *only* to the gollum server, regardless of + what servers are listed as slaves in the config. + + Note that this invocation does not even check if gollum is listed as a + slave for "ip1"; since you're doing it at the command line on the master + server, you're allowed to push it to *any* slave that will accept it. 
+ + + + > Side note: if you want to start a **foreground** job, the syntax is + > `gl-mirror-shell request-push ip1 -fg gollum`. Foreground mode + > requires one (and only one) slave name -- you cannot send to an + > implicit list, nor to more than one slave. + + + +2. Cronjobs and custom mirroring schemes are now very easy to do. Just use + the second form of the command above to push any repo to any slave, and it + can form the basis of any scheme you like. Appendix A contains an example + setup. + +3. Once in a while a slave will realise it needs an update, and wants to ask + for one. It can run this command to do so: + + ssh sam request-push ip2 + + If the requesting server is not one of the slaves listed in the config + variable gitolite.mirror.slaves on the master, it will be rejected. + + This is always a foreground push, reflecting the fact that the slave may + want to know why their push errored out or didn't work last time or + whatever. + + + +### details + + + +#### the `conf/gitolite.conf` file + +One goal I have is to minimise the code changes to "core" gitolite due to +this, so all repo-specific mirror settings are stored as `git config` +variables (you know you can specify git config variables in the gitolite +config file right?). These are: + + * `gitolite.mirror.master` + + The name of the server which is the master for this repo. Each server + will compare this with `$GL_HOSTNAME` (from its own rc file) to + determine if it's the master or a slave. Here're the possible values: + + * **undefined** or `local`: this repo is local to this server + * **same** as `$GL_HOSTNAME`: this server is the "master" for this + repo. (The repo is "native" to this server). + * **not same** as `$GL_HOSTNAME`: this server is a "slave" for the + repo. (The repo is a non-native on this server). + + * `gitolite.mirror.slaves` + + Ignored for non-native repos. For native repos, this is a space-separated + list of servers to push to from the `post-receive` hook. 
+ + Clearly, you can have different sets of slaves for different repos (again, + see "discussion" section later for more on this). + + * `gitolite.mirror.redirectOK` + + See the section on "redirecting pushes" + + + +### redirecting pushes + +**Please read carefully; there are security implications if you enable this +for mirrors NOT under your control**. + +When a user pushes to a non-native repo, it is possible to transparently +redirect the push to the correct master server. This is a very neat feature, +because now all your users just use one URL (the mirror nearest to them). +They don't need to know where the actual master is, and more importantly, if +you and the other admins change it, they don't need to know it changed! + +The `gitolite.mirror.redirectOK` config variable decides where this +redirection is OK. If it is set to 'true', any valid 'slave' can redirect an +incoming non-native push from a developer. Otherwise, it contains a list of +slaves that are permitted to redirect pushes (this might happen if you don't +trust some of your slaves enough to accept a redirected push from them). + +This check needs to pass on both the master and slave servers; both have a say +in deciding if this is allowed. (The master may have real reasons not to +allow this; see below. I cannot think of any real reason for the *slave* to +disable this, but it's there in case some admin doesn't like it). + +There are some potential issues that you MUST consider before enabling this: + + * (security) If the slave and master server are so different or autonomous + that a user, say "alice", on the slave is not guaranteed to be the same + one as "alice" on the master, then the master admin should NOT enable this + feature. + + This is because, in this scheme, authentication happens on the slave, but + authorisation is on the master. The slave-authenticated userid (alice) is + passed to the master. 
+ + (If you know ssh well enough, you know that the ssh authentication has + already happened, so all we can do is ensure authorisation happens with + whatever username we know so far). + + * If your slave is out of sync with the master for whatever reason, then the + user will get confusing results. A `git fetch` may say everything is + up-to-date but the push fails saying it is not a fast-forward push. (Of + course there's a way to fix this; see the "commands to (re-)sync mirrors" + section above). + + * We cannot redirect non-git commands like ADC, setperms, etc., because we + don't really have a way of knowing which repo the user means (different + commands have different syntaxes, some have more than one reponame...). + Any user who needs to do that should access the end server directly. It + should be easy enough to write an ADC to do the forwarding, in case the + slave server is the only one that can reach the real master due to network + or firewall setup. + + Ideally, I recommend that ad hoc repos not be mirrored at all. Keep + mirroring for "blessed" repos only. + + + +### discussion + + + +#### problems with the old mirroring model + +The old mirroring model had a single server as the master for *all* +repositories. Slaves were effectively only for load-balancing reads, or for +failover if the master died. + +This is not good enough for corporate setups where the developers are spread +fairly evenly across the world. Some repos need to be closer to some teams +(NUMA is a good analogy). + +A model where different repos are "mastered" in different cities is much more +efficient here. + +The old model had other rigidities too, though they're not really *problems*, +as such: + + * the slaves are just slaves; they can't have any "local" repos. + + * a slave had to carry *all* repos; it couldn't choose to carry just a + subset. + + * it implicitly assumed all the mirrors were under the same admin, and that + the gitolite-admin repo was itself mirrored too. 
+ + + +#### the new mirroring model + +In the new model, servers can be (but, I hasten to add, don't *have to* be!) +much more independent and autonomous than in the old model. This has a few +pros/cons: + + * The gitolite-admin repo (and config) need not be mirrored. This allows + site-local repos not meant to be mirrored, without unnecessarily creating + a second gitolite install just for those. + + (Site-local repos are useful for purely local projects that need + not/should not be mirrored for some reason, or ad-hoc personal repos that + developers create for themselves, etc.) + + Of course, then the admin(s) need to make an effort to keep things + consistent for the "blessed" repos. For example, two servers can both + claim to be "master"! + + * Servers can choose to mirror a subset of the repos from one of the bigger + servers. + + In the open source world, you can imagine more popular repos (or more + popular parts of huge projects like KDE) having more mirrors. Or + substitute "more popular" with "larger in size" if you wish + (FlightGear-data anyone?) + + In the corporate world it could help with jurisdiction issues if the + mirror is in a different country with different laws. + + I'm sure people will find other uses for this. And I'm *positive* the + pros will outweigh the cons. If you don't like it, follow the suggestion + in the side note somewhere up above, and just forget this feature exists + :-) + +---- + + + +### appendix A: example cronjob based mirroring + +Let's say you have some repos that are so active that you're pushing halfway +across the world every few seconds. The slaves do not need to be that closely +updated, and it is sufficient to update them once an hour instead. 
Here's how +you might do that: + + repo foo bar frob/nitz + config gitolite.mirror.hourly = "slave1 slave2 slave3" + +Then you'd write a cron job that looks like this (untested): + + #!/bin/bash + + REPO_BASE=`${0%/*}/gl-query-rc REPO_BASE` + GL_BINDIR=`${0%/*}/gl-query-rc GL_BINDIR` + + cd $REPO_BASE + find . -type d -name "*.git" -prune | while read r + do + cd $REPO_BASE; cd $r + + # get reponame as gitolite knows it + r=${r:2} + r=${r%.git} + + # get slaves list + slaves=`git config --get gitolite.mirror.hourly` + + gl-mirror-shell request-push $r $slaves + + # that command backgrounds the push, so you'd best wait a few seconds + # before hitting the next one, otherwise you'll have all your repos + # going out at once! + sleep 10 + done + + + +### appendix B: efficiency versus paranoia If you're paranoid enough to use mirrors, you should be paranoid enough to -like the `receive.fsckObjects` setting we now default to :-) However, informal -tests indicate a 40-50% CPU overhead from this. If you don't like that, -remove that line from the post-receive code. +use the `receive.fsckObjects` setting. However, informal tests indicate a +40-50% CPU overhead from this. If you're ok with that, make the appropriate +adjustments to `GL_GITCONFIG_KEYS` and possibly `GL_GITCONFIG_WILD` in the rc +file, then add this to your gitolite.conf file: -Please also note that we only set it on mirrors, and that too at the time the -mirrored repo is *created*. This means, when you start using your old "main" -server as a mirror (see later sections on switching over to a mirror, etc.), -it's repos do not have this setting. Repos created by previous versions of -gitolite also will not have this setting. + repo @all + config receive.fsckObjects = "true" Personally, I just set `git config --global receive.fsckObjects true`, since those servers aren't doing anything else anyway, and are idle for long stretches of time. It's upto you what you want to do here. 
- +[ch]: http://sitaramc.github.com/gitolite/doc/2-admin.html#_custom_hooks +[ha]: http://sitaramc.github.com/gitolite/doc/ssh-troubleshooting.html#_appendix_4_host_aliases +[rsgc]: http://sitaramc.github.com/gitolite/doc/gitolite.conf.html#_repo_specific_git_config_commands +[vsi]: http://sitaramc.github.com/gitolite/doc/gitolite.rc.html#_variables_with_a_security_impact -### syncing the mirrors the first time - -This is fine if you're setting up everything from scratch. But if your master -server already had some repos with commits on them, you have to manually sync -them up once. - - # on foo - gl-mirror-sync gitolite@bar - # path to "sync" program is ~/.gitolite/src if "from-client" install - - - -### switching over - -Let's say foo goes down. You want to make bar the main server, and continue -to have "baz" be a slave. - - * on bar, edit `~/.gitolite.rc` and set - - $GL_SLAVE_MODE = 0; - $ENV{GL_SLAVES} = 'gitolite@baz'; - - * **sanity check**: go to your gitolite-admin clone, add a remote for "bar", - fetch it, and make sure they are the same: - - git remote add bar gitolite@bar:gitolite-admin - git fetch bar - git branch -a -v - # check that all SHAs are the same - - * inform everyone of the new URL for their repos (see next section for more - on this) - - * make sure that if "foo" does come up, it will not immediately start - serving requests. You'll be in trouble if (a) foo comes up as it was - before, and (b) some developer still had the old URL lying around and - started pushing changes to it. - - You could jump in quickly and set `$GL_SLAVE_MODE = 1` as soon as the - system comes up. Better still, use extraneous means to block incoming - connections from normal users (out of scope for this document). - - - -### the return of foo - - - -#### switching back - -Switching back is fairly easy. - - * synchronise all repos from bar to foo. This may take some time, depending - on how long foo was down. 
- - # on bar - gl-mirror-sync gitolite@foo - # path to "sync" program is ~/.gitolite/src if "from-client" install - - * turn off pushes on "bar" by setting slave mode to 1 - * run the sync once again; this should complete quickly - - * **double check by comparing some the repos on both sides if needed**. You - could run the following snippet on all servers for a quick check: - - cd ~/repositories # or wherever $REPO_BASE is - find . -type d -name "*.git" | sort | - while read r - do - echo $r - git ls-remote $r | sort - done | md5sum - - * on foo, set the slave list (or check that it is correct) - * on foo, set slave mode off - * tell everyone to switch back - - - -#### making foo a slave - -If "foo" does come up in a controlled manner, you might not want to switch -back right away. Unless you're doing DNS tricks, users may be peeved at -having to do 2 switches. - -If you want to make foo a slave, you know the drill by now: - - * set slave mode to 1 on foo - * on bar, add foo as a slave - - # in ~/.gitolite.rc on bar - $ENV{GL_SLAVES} = 'gitolite@foo gitolite@baz'; - -I think that should cover pretty much everything. I *have* tested most of -this, but YMMV. - ----- - - - -### URLs that your users will use - -Unless you play DNS tricks, it is more than likely that your users would have -to change the URLs they use to access their repos if you change the server -they push to. - -I cannot speak for the plethora of git client software out there but for -normal git, this problem can be mitigated somewhat by doing this: - - * in `~/.ssh/config` on my workstation, I have - - host gl - hostname=primary.server.ip - user=gitolite - - * all my `git clone` commands use `gl:reponame` as the URL - - * if the primary goes down, and I have to access the secondary, I just - change the `hostname` line in `~/.ssh/config`. - -That's it. Every clone of every repo used anywhere in this userid is now -changed. 
- -To repeat, this may or may not work with all the git clients that exist (like -jgit, or any of the GUI tools, and especially if you're on Windows). - -If anyone has a better idea, something that works more universally, I'd love -to hear it.