Add slightly more cluster repository documentation

Summary: Ref T10751. There are still some missing support tools here, but explain some of this a little better.

Test Plan: Read documentation.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10751

Differential Revision: https://secure.phabricator.com/D15764
2016-04-19 20:15:39 -07:00
parent bab3690b54
commit 48b015a3fa
1 changed files with 90 additions and 4 deletions
--- a/src/docs/user/cluster/cluster_repositories.diviner
+++ b/src/docs/user/cluster/cluster_repositories.diviner
@@ -19,19 +19,19 @@ advantages of doing this are:
 This configuration is complex, and many installs do not need to pursue it.
-This configuration is not currently supported with Subversion.
+This configuration is not currently supported with Subversion or Mercurial.
 Repository Hosts
 ================
 Repository hosts must run a complete, fully configured copy of Phabricator,
-including a webserver. If you make repositories available over SSH, they must
+including a webserver. They must also run a properly configured `sshd`.
-also run a properly configured `sshd`.
 Generally, these hosts will run the same set of services and configuration that
 web hosts run. If you prefer, you can overlay these services and put web and
-repository services on the same hosts.
+repository services on the same hosts. See @{article:Clustering Introduction}
+for some guidance on overlaying services.
 When a user requests information about a repository that can only be satisfied
 by examining a repository working copy, the webserver receiving the request
@@ -57,6 +57,17 @@ If it isn't, they block the read until they can complete a fetch.
 Before responding to a write, replicas obtain a global lock, perform the same
 version check and fetch if necessary, then allow the write to continue.
 Additionally, repositories passively check other nodes for updates and
 replicate changes in the background. After you push a change to a repository,
 it will usually spread passively to all other repository nodes within a few
 minutes.
 Even if passive replication is slow, the active replication makes acknowledged
 changes sequential to all observers: after a write is acknowledged, all
 subsequent reads are guaranteed to see it. The system does not permit stale
 reads, and you do not need to wait for a replication delay to see a consistent
 view of the repository no matter which node you ask.
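The read and write rules above can be sketched as a toy model. This is only an illustration, not Phabricator's implementation: the `Cluster` and `Node` classes, their fields, and the in-process lock standing in for the global lock are all assumptions made for the example.

```python
import threading


class Cluster:
    """Toy model of the global state: one version number and one write lock."""

    def __init__(self):
        self.version = 0                    # newest acknowledged version
        self.leader = None                  # a node known to hold that version
        self.write_lock = threading.Lock()  # stands in for the global lock


class Node:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.version = 0   # version on this node's disk
        self.commits = []  # stands in for the working copy

    def _catch_up(self):
        # Version check: if this node is behind, fetch before proceeding.
        leader = self.cluster.leader
        if leader is not None and self.version < self.cluster.version:
            self.commits = list(leader.commits)
            self.version = leader.version

    def read(self):
        # Reads wait for the fetch, so they never return stale data.
        self._catch_up()
        return list(self.commits)

    def write(self, commit):
        # Writes take the global lock, catch up, then apply the change.
        with self.cluster.write_lock:
            self._catch_up()
            self.commits.append(commit)
            self.version += 1
            self.cluster.version = self.version
            self.cluster.leader = self


cluster = Cluster()
x, y = Node("X", cluster), Node("Y", cluster)
x.write("commit-1")
assert y.read() == ["commit-1"]  # Y fetches from X before answering
y.write("commit-2")
assert x.read() == ["commit-1", "commit-2"]
```

In this model a read on any node sees every acknowledged write, even though no background replication has run. Passive replication would only shorten the fetch that a later read or write has to perform.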
 HTTP vs HTTPS
 =============
@@ -84,6 +95,81 @@ Other mitigations are possible, but securing a network against the NSA and
 similar agents of other rogue nations is beyond the scope of this document.
 Monitoring Replication
 ======================
 You can review the current status of a repository on cluster nodes in
 {nav Diffusion > (Repository) > Manage Repository > Cluster Configuration}.
 This screen shows all the configured devices which are hosting the repository
 and the available version.
 **Version**: When a repository is mutated by a push, Phabricator increases
 an internal version number for the repository. This column shows which version
 is on disk on the corresponding node.
 After a change is pushed, the node which received the change will have a larger
 version number than the other nodes. The change should be passively replicated
 to the remaining nodes after a brief period of time, although this can take
 a while if the change was large or the network connection between nodes is
 slow or unreliable.
 You can click the version number to see the corresponding push logs for that
 change. The logs contain details about what was changed, and can help you
 identify if replication is slow because a change is large or for some other
 reason.
 **Writing**: This shows that the node is currently holding a write lock. This
 normally means that it is actively receiving a push, but can also mean that
 there was a write interruption. See "Write Interruptions" below for details.
 Write Interruptions
 ===================
 A repository cluster can be put into an inconsistent state by an interruption
 in a brief window immediately after a write.
 Phabricator cannot commit changes to a working copy (stored on disk) and to
 the global state (stored in a database) atomically, so there is a narrow window
 between committing these two different states when some tragedy (like a
 lightning strike) can befall a server, leaving the global and local views of
 the repository state divergent.
 In these cases, Phabricator fails into a "frozen" state where further writes
 are not permitted until the failure is investigated and resolved.
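The window described above can be illustrated with a small model. This is a sketch under stated assumptions, not how Phabricator detects the condition: the `Repo` class, the `write_in_progress` flag, and the `crash_between_steps` parameter are hypothetical names invented for the example.

```python
class Frozen(Exception):
    """Raised when a write is refused because state may have diverged."""


class Repo:
    def __init__(self):
        self.disk_version = 0     # local state (working copy on disk)
        self.global_version = 0   # global state (row in a database)
        self.write_in_progress = False

    def push(self, crash_between_steps=False):
        if self.write_in_progress:
            # An earlier write never finished: refuse further writes.
            raise Frozen("previous write was interrupted")
        self.write_in_progress = True
        self.disk_version += 1            # step 1: commit to disk
        if crash_between_steps:
            raise RuntimeError("lightning strike")
        self.global_version += 1          # step 2: commit to database
        self.write_in_progress = False


repo = Repo()
repo.push()
try:
    repo.push(crash_between_steps=True)
except RuntimeError:
    pass
# disk_version is now 2 but global_version is still 1: the views diverge,
# and any further push raises Frozen until someone resolves the state.
```

Because the two commits are separate steps, no ordering of them removes the window. The model only shows why refusing later writes is the safe response.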
 TODO: Complete the support tooling and provide recovery instructions.
 Loss of Leaders
 ===============
 A more straightforward failure condition is the loss of all servers in a
 cluster which have the most up-to-date copy of a repository. For example:
  - There is a cluster setup with two nodes, X and Y.
  - A new change is pushed to server X.
  - Before the change can propagate to server Y, lightning strikes server X
    and destroys it.
 Here, all of the "leader" nodes with the most up-to-date copy of the repository
 have been lost. Phabricator will refuse to serve this repository because it
 cannot serve it consistently, and cannot accept writes without data loss.
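As a rough sketch of that rule, the following treats the "Version" column as a mapping from node name to version number. The function names and the example version numbers are assumptions for illustration, not part of Phabricator.

```python
def leaders(versions):
    """Return the names of nodes holding the highest known version."""
    newest = max(versions.values())
    return {name for name, v in versions.items() if v == newest}


def can_serve(versions, online):
    """True if at least one leader is still reachable."""
    return bool(leaders(versions) & set(online))


# A change was pushed to X and has not yet replicated to Y.
versions = {"X": 8, "Y": 7}

assert leaders(versions) == {"X"}
assert can_serve(versions, online={"X", "Y"})
assert not can_serve(versions, online={"Y"})  # X was lost: refuse to serve
```

Once the change replicates, both nodes hold the newest version and losing either one no longer blocks the repository.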
 The most straightforward way to resolve this issue is to restore any leader to
 service. The change will be able to replicate to other nodes once a leader
 comes back online.
 If you are unable to restore a leader or unsure that you can restore one
 quickly, you can use the monitoring console to review which changes are
 present on the leaders but not present on the followers by examining the
 push logs.
 TODO: Complete the support tooling and provide recovery instructions.
 Backups
 ======