Commit Graph

15 Commits

Author SHA1 Message Date
epriestley
859b274970 Provide more information to users during git push while waiting for write locks
Summary:
Ref T13109. Make it slightly more clear what the scope of the write and read locks are, and slightly more clear that we're actively acquiring locks, not just sitting around waiting.

While waiting on another writer, show who we're waiting on so you can walk over to their desk and glare at them.

Test Plan:
Added `sleep(15)` after `willWrite()`. Pushed in two windows. Saw new, more informative messages. In the second window, saw the new guidance:

> # Waiting for hector to finish writing (on device "repo1.local.phacility.net" for 11s)...

Reviewers: asherkin

Reviewed By: asherkin

Subscribers: asherkin

Maniphest Tasks: T13109

Differential Revision: https://secure.phabricator.com/D19247
2018-03-22 13:42:18 -07:00
Dmitri Iouchtchenko
9bd6a37055 Fix spelling
Summary: Noticed a couple of typos in the docs, and then things got out of hand.

Test Plan:
  - Stared at the words until my eyes watered and the letters began to swim on the screen.
  - Consulted a dictionary.

Reviewers: #blessed_reviewers, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley, yelirekim, PHID-OPKG-gm6ozazyms6q6i22gyam

Differential Revision: https://secure.phabricator.com/D18693
2017-10-09 10:48:04 -07:00
epriestley
8034b9d819 Don't require a device be registered in Almanac to do cluster init/resync steps
Summary:
Fixes T12893. See also PHI15. This is complicated but:

  - In the documentation, we say "register your web devices with Almanac". We do this ourselves on `secure` and in the production Phacility cluster.
  - We don't actually require you to do this, don't detect that you didn't, and there's no actual reason you need to.
  - If you don't register your "web" devices, the only bad thing that really happens is that creating repositories skips version initialization, creating the bug in T12893. This process does not actually require the devices be registered, but the code currently just kind of fails silently if they aren't.

Instead, just move forward on these init/resync phases even if the device isn't registered. These steps are safe to run from unregistered hosts since they just wipe the whole table and don't affect specific devices.

If this sticks, I'll probably update the docs to not tell you to register `web` devices, or at least add "Optionally, ...". I don't think there's any future reason we'd need them to be registered.

Test Plan:
This is a bit tough to test without multiple hosts, but I added this piece of code to `AlmanacKeys` so we'd pretend to be a nameless "web" device when creating a repository:

```
if ($_REQUEST['__path__'] == '/diffusion/edit/form/default/') {
  return null;
}
```

Then I created some Git repositories. Before the patch, they came up with `-` versions (no version information). After the patch, they came up with `0` versions (correctly initialized).

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12893

Differential Revision: https://secure.phabricator.com/D18273
2017-07-25 05:12:10 -07:00
epriestley
0d5538672c Detect unsynchronizable repositories on multiple cluster hosts
Summary:
Ref T12613. Currently, the SVNTEST and HGTEST repositories are improperly configured on `secure`. These repositories use VCS systems which do not support synchronization, so they can not be served from cluster services with multiple hosts.

However, I've incorrectly configured them the same way as all the Git repositories, which support synchronization. This causes about 50% of requests to randomly fail (when they reach the wrong host).

Detect this issue and warn the user that the configuration is not valid.

It should be exceptionally difficult for normal installs to run into this.

Test Plan:
  - Mostly faked these conditions locally, verified that `secure` really has this configuration.
  - I'll push this, verify that the issue is detected correctly in production, then fix the config which should resolve the intermittent issues with SVNTEST.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12613

Differential Revision: https://secure.phabricator.com/D17774
2017-04-24 10:43:05 -07:00
Josh Cox
e0675b28d8 Pass exception to PhutilProxyException
Summary: Fixes T12243. That error occured due to network flakiness with some mounted filesystems so I'm not sure how best to simulate it. But you can look and see that the PhutilProxyException does indeed expect an exception as its second arg.

Test Plan: Look at method signature... look at callsite... now back at the method. Smile and nod.

Reviewers: #blessed_reviewers, yelirekim, epriestley

Reviewed By: #blessed_reviewers, epriestley

Subscribers: epriestley

Maniphest Tasks: T12243

Differential Revision: https://secure.phabricator.com/D17335
2017-02-08 13:24:44 -05:00
epriestley
ccff47682f Provide more useful guidance if a repository is clusterized into an existing multi-device cluster
Summary:
Fixes T12087. When transitioning into a clustered configuration for the first time, the documentation recommends using a one-device cluster as a transitional step.

However, installs may not do this for whatever reason, and we aren't as clear as we could be in warning about clusterizing directly into a multi-device cluster.

Roughly, when you do this, we end up believing that working copies exist on several different devices, but have no information about which copy or copies are up to date. //Usually// they all were already synchronized and are all up to date, but we can't make this assumption safely without risking data.

Instead, we err on the side of caution, and require a human to tell us which copy we should consider to be up-to-date, using `bin/repository thaw --promote`.

Test Plan:
```
$ ./bin/repository clusterize rLOCKS --service repos001.phacility.net
Service "repos001.phacility.net" is actively bound to more than one device
(local002.local, local001.phacility.net).

If you clusterize a repository onto this service it will be unclear which
devices have up-to-date copies of the repository. This leader/follower
ambiguity will freeze the repository. You may need to manually promote a
device to unfreeze it. See "Ambiguous Leaders" in the documentation for
discussion.

    Continue anyway? [y/N]
```

Read other changes.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12087

Differential Revision: https://secure.phabricator.com/D17169
2017-01-10 12:45:55 -08:00
epriestley
0ed767b967 Fix a couple of partition migration bugs
Summary:
Ref T11044. Few issues here:

  - The `PhutilProxyException` is missing an argument (hit this while in read-only mode).
  - The `$ref_key` is unused.
  - When you add a new master to an existing cluster, we can incorrectly apply `.php` patches which we should not reapply. Instead, mark them as already-applied.

Test Plan:
  - Poked this locally, but will initialize `secure004` as an empty master to be sure.

Reviewers: chad, avivey

Reviewed By: avivey

Maniphest Tasks: T11044

Differential Revision: https://secure.phabricator.com/D16916
2016-11-22 10:57:24 -08:00
epriestley
4dc37bcee0 Ignore repository versions on inactive devices in "Repository Servers" panel in Config
Summary:
Fixes T11590. Currently, we incorrectly consider cluster repository versions that are (or were) on devices which are no longer part of the active cluster service when building this status screen.

Instead, ignore them. This is just a display bug; the actual `ClusterEngine` already had similar logic.

Test Plan:
  - Added a bad leader record to `repository_workingcopyversion`.
  - Before patch, got a bad "Partial (1w)" sync:

{F1802292}

  - After patch, got a good "Sycnchronized":

{F1802293}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11590

Differential Revision: https://secure.phabricator.com/D16492
2016-09-05 11:10:16 -07:00
epriestley
39d4e21eec Fix a bad DiffusionCommandEngine parameter from HTTPEngine conversion
Summary:
I converted this call incorrectly in D16092. We should pass the `PhutilURI` object, not the string version of it.

Specifically, this resulted in hitting an error like this if a replica needed synchronization:

```
[2016-08-11 21:22:37] EXCEPTION: (InvalidArgumentException) Argument 1 passed to DiffusionCommandEngine::setURI() must be an instance of PhutilURI, string given, called in...
#0 PhutilErrorHandler::handleError(integer, string, string, integer, array) called at [<phabricator>/src/applications/diffusion/protocol/DiffusionCommandEngine.php:52]
#1 DiffusionCommandEngine::setURI(string) called at [<phabricator>/src/applications/diffusion/protocol/DiffusionRepositoryClusterEngine.php:601]
...
```

Test Plan: Clusterized an observed repository, demoted a node, ran `bin/repository update Rxxx` to update, saw no typehint fatal.

Reviewers: chad

Reviewed By: chad

Differential Revision: https://secure.phabricator.com/D16390
2016-08-11 16:41:09 -07:00
epriestley
55a698a28a Use HTTPEngineExtension proxy for git HTTP operations
Summary: Ref T10227. When we perform `git` http operations (fetch, mirror) check if we should use a proxy; if we should, set `http_proxy` or `https_proxy` in the environment to make `git` have `curl` use it.

Test Plan:
  - Configured a proxy extension to run stuff through a local instance of Charles.
  - Ran `repository pull` and `repository mirror`.
  - Saw `git` HTTP requests route through the proxy.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10227

Differential Revision: https://secure.phabricator.com/D16092
2016-06-09 12:17:10 -07:00
epriestley
f5f784f4c1 Version clustered, observed repositories in a reasonable way (by largest discovered HEAD)
Summary:
Ref T4292. For hosted, clustered repositories we have a good way to increment the internal version of the repository: every time a user pushes something, we increment the version by 1.

We don't have a great way to do this for observed/remote repositories because when we `git fetch` we might get nothing, or we might get some changes, and we can't easily tell //what// changes we got.

For example, if we see that another node is at "version 97", and we do a fetch and see some changes, we don't know if we're in sync with them (i.e., also at "version 97") or ahead of them (at "version 98").

This implements a simple way to version an observed repository:

  - Take the head of every branch/tag.
  - Look them up.
  - Pick the biggest internal ID number.

This will work //except// when branches are deleted, which could cause the version to go backward if the "biggest commit" is the one that was deleted. This should be OK, since it's rare and the effects are minor and the repository will "self-heal" on the next actual push.

Test Plan:
  - Created an observed repository.
  - Ran `bin/repository update` and observed a sensible version number appear in the version table.
  - Pushed to the remote, did another update, saw a sensible update.
  - Did an update with no push, saw no effect on version number.
  - Toggled repository to hosted, saw the version reset.
  - Simulated read traffic to out-of-sync node, saw it do a remote fetch.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4292

Differential Revision: https://secure.phabricator.com/D15986
2016-05-30 09:53:01 -07:00
epriestley
bb16a1b0e2 Fix a possible fatal on the first push to a cluster repository
Summary:
Fixes T11020. I think this resolves things -- `$new_version` (set above) should be used, not `$new_log` directly.

Specifically, we would get into trouble if the initial push failed for some reason (working copy not initialized yet, commit hook rejected, etc).

Test Plan: Made a bad push to a new repository. Saw it freeze before the patch and succeed afterwards.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11020

Differential Revision: https://secure.phabricator.com/D15969
2016-05-23 17:54:54 -07:00
epriestley
892a9a1f07 Make cluster repositories more resistant to freezing
Summary:
Ref T10860. This allows us to recover if the connection to the database is lost during a push.

If we lose the connection to the master database during a push, we would previously freeze the repository. This is very safe, but not very operator-friendly since you have to go manually unfreeze it.

We don't need to be quite this aggressive about freezing things. The repository state is still consistent after we've "upgraded" the lock by setting `isWriting = 1`, so we're actually fine even if we lost the global lock.

Instead of just freezing the repository immediately, sit there in a loop waiting for the master to come back up for a few minutes. If it recovers, we can release the lock and everything will be OK again.

Basically, the changes are:

  - If we can't release the lock at first, sit in a loop trying really hard to release it for a while.
  - Add a unique lock identifier so we can be certain we're only releasing //our// lock no matter what else is going on.
  - Do the version reads on the same connection holding the lock, so we can be sure we haven't lost the lock before we do that read.

Test Plan:
  - Added a `sleep(10)` after accepting the write but before releasing the lock so I could run `mysqld stop` and force this issue to occur.
  - Pushed like this:

```
$ echo D >> record && git commit -am D && git push
[master 707ecc3] D
 1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 254 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
BEGIN SLEEP
```

  - Here, I stopped `mysqld` from the CLI in another terminal window.

```
END SLEEP
# CRITICAL. Failed to release cluster write lock!
# The connection to the master database was lost while receiving the write.
# This process will spend 300 more second(s) attempting to recover, then give up.
```

  - Here, I started `mysqld` again.

```
# RECOVERED. Link to master database was restored.
# Released cluster write lock.
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
   2cbf87c..707ecc3  master -> master
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10860

Differential Revision: https://secure.phabricator.com/D15792
2016-04-25 11:37:31 -07:00
epriestley
d0b5dac36b Make cluster repositories more chatty
Summary:
Ref T10860. At least in Git over SSH, we can freely echo a bunch of stuff to stderr and Git will print it to the console, so we can tell users what's going on.

This should make debugging, etc., easier. We could tone this down a little bit once things are more stable if it's a little too chatty.

Test Plan:
```
$ echo D >> record && git commit -am D && git push
[master ca5efff] D
 1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 256 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
   8616189..ca5efff  master -> master
```

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10860

Differential Revision: https://secure.phabricator.com/D15791
2016-04-25 11:20:57 -07:00
epriestley
dc75b4bd06 Move all cluster locking logic to a separate class
Summary: Ref T10860. This doesn't change anything, it just separates all this stuff out of `PhabricatorRepository` since I'm planning to add a bit more state to it and it's already pretty big and fairly separable.

Test Plan: Pulled, pushed, browsed Diffusion.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10860

Differential Revision: https://secure.phabricator.com/D15790
2016-04-25 11:20:29 -07:00