Summary:
Ref T13109. Make it slightly more clear what the scope of the write and read locks are, and slightly more clear that we're actively acquiring locks, not just sitting around waiting.
While waiting on another writer, show who we're waiting on so you can walk over to their desk and glare at them.
Test Plan:
Added `sleep(15)` after `willWrite()`. Pushed in two windows. Saw new, more informative messages. In the second window, saw the new guidance:
> # Waiting for hector to finish writing (on device "repo1.local.phacility.net" for 11s)...
Reviewers: asherkin
Reviewed By: asherkin
Subscribers: asherkin
Maniphest Tasks: T13109
Differential Revision: https://secure.phabricator.com/D19247
Summary: Noticed a couple of typos in the docs, and then things got out of hand.
Test Plan:
- Stared at the words until my eyes watered and the letters began to swim on the screen.
- Consulted a dictionary.
Reviewers: #blessed_reviewers, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: epriestley, yelirekim, PHID-OPKG-gm6ozazyms6q6i22gyam
Differential Revision: https://secure.phabricator.com/D18693
Summary:
Fixes T12893. See also PHI15. This is complicated but:
- In the documentation, we say "register your web devices with Almanac". We do this ourselves on `secure` and in the production Phacility cluster.
- We don't actually require you to do this, don't detect that you didn't, and there's no actual reason you need to.
- If you don't register your "web" devices, the only bad thing that really happens is that creating repositories skips version initialization, creating the bug in T12893. This process does not actually require the devices be registered, but the code currently just kind of fails silently if they aren't.
Instead, just move forward on these init/resync phases even if the device isn't registered. These steps are safe to run from unregistered hosts since they just wipe the whole table and don't affect specific devices.
If this sticks, I'll probably update the docs to not tell you to register `web` devices, or at least add "Optionally, ...". I don't think there's any future reason we'd need them to be registered.
Test Plan:
This is a bit tough to test without multiple hosts, but I added this piece of code to `AlmanacKeys` so we'd pretend to be a nameless "web" device when creating a repository:
```
if ($_REQUEST['__path__'] == '/diffusion/edit/form/default/') {
return null;
}
```
Then I created some Git repositories. Before the patch, they came up with `-` versions (no version information). After the patch, they came up with `0` versions (correctly initialized).
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12893
Differential Revision: https://secure.phabricator.com/D18273
Summary:
Ref T12613. Currently, the SVNTEST and HGTEST repositories are improperly configured on `secure`. These repositories use VCS systems which do not support synchronization, so they can not be served from cluster services with multiple hosts.
However, I've incorrectly configured them the same way as all the Git repositories, which support synchronization. This causes about 50% of requests to randomly fail (when they reach the wrong host).
Detect this issue and warn the user that the configuration is not valid.
It should be exceptionally difficult for normal installs to run into this.
Test Plan:
- Mostly faked these conditions locally, verified that `secure` really has this configuration.
- I'll push this, verify that the issue is detected correctly in production, then fix the config which should resolve the intermittent issues with SVNTEST.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12613
Differential Revision: https://secure.phabricator.com/D17774
Summary: Fixes T12243. That error occured due to network flakiness with some mounted filesystems so I'm not sure how best to simulate it. But you can look and see that the PhutilProxyException does indeed expect an exception as its second arg.
Test Plan: Look at method signature... look at callsite... now back at the method. Smile and nod.
Reviewers: #blessed_reviewers, yelirekim, epriestley
Reviewed By: #blessed_reviewers, epriestley
Subscribers: epriestley
Maniphest Tasks: T12243
Differential Revision: https://secure.phabricator.com/D17335
Summary:
Fixes T12087. When transitioning into a clustered configuration for the first time, the documentation recommends using a one-device cluster as a transitional step.
However, installs may not do this for whatever reason, and we aren't as clear as we could be in warning about clusterizing directly into a multi-device cluster.
Roughly, when you do this, we end up believing that working copies exist on several different devices, but have no information about which copy or copies are up to date. //Usually// they all were already synchronized and are all up to date, but we can't make this assumption safely without risking data.
Instead, we err on the side of caution, and require a human to tell us which copy we should consider to be up-to-date, using `bin/repository thaw --promote`.
Test Plan:
```
$ ./bin/repository clusterize rLOCKS --service repos001.phacility.net
Service "repos001.phacility.net" is actively bound to more than one device
(local002.local, local001.phacility.net).
If you clusterize a repository onto this service it will be unclear which
devices have up-to-date copies of the repository. This leader/follower
ambiguity will freeze the repository. You may need to manually promote a
device to unfreeze it. See "Ambiguous Leaders" in the documentation for
discussion.
Continue anyway? [y/N]
```
Read other changes.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T12087
Differential Revision: https://secure.phabricator.com/D17169
Summary:
Ref T11044. Few issues here:
- The `PhutilProxyException` is missing an argument (hit this while in read-only mode).
- The `$ref_key` is unused.
- When you add a new master to an existing cluster, we can incorrectly apply `.php` patches which we should not reapply. Instead, mark them as already-applied.
Test Plan:
- Poked this locally, but will initialize `secure004` as an empty master to be sure.
Reviewers: chad, avivey
Reviewed By: avivey
Maniphest Tasks: T11044
Differential Revision: https://secure.phabricator.com/D16916
Summary:
Fixes T11590. Currently, we incorrectly consider cluster repository versions that are (or were) on devices which are no longer part of the active cluster service when building this status screen.
Instead, ignore them. This is just a display bug; the actual `ClusterEngine` already had similar logic.
Test Plan:
- Added a bad leader record to `repository_workingcopyversion`.
- Before patch, got a bad "Partial (1w)" sync:
{F1802292}
- After patch, got a good "Sycnchronized":
{F1802293}
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11590
Differential Revision: https://secure.phabricator.com/D16492
Summary:
I converted this call incorrectly in D16092. We should pass the `PhutilURI` object, not the string version of it.
Specifically, this resulted in hitting an error like this if a replica needed synchronization:
```
[2016-08-11 21:22:37] EXCEPTION: (InvalidArgumentException) Argument 1 passed to DiffusionCommandEngine::setURI() must be an instance of PhutilURI, string given, called in...
#0 PhutilErrorHandler::handleError(integer, string, string, integer, array) called at [<phabricator>/src/applications/diffusion/protocol/DiffusionCommandEngine.php:52]
#1 DiffusionCommandEngine::setURI(string) called at [<phabricator>/src/applications/diffusion/protocol/DiffusionRepositoryClusterEngine.php:601]
...
```
Test Plan: Clusterized an observed repository, demoted a node, ran `bin/repository update Rxxx` to update, saw no typehint fatal.
Reviewers: chad
Reviewed By: chad
Differential Revision: https://secure.phabricator.com/D16390
Summary: Ref T10227. When we perform `git` http operations (fetch, mirror) check if we should use a proxy; if we should, set `http_proxy` or `https_proxy` in the environment to make `git` have `curl` use it.
Test Plan:
- Configured a proxy extension to run stuff through a local instance of Charles.
- Ran `repository pull` and `repository mirror`.
- Saw `git` HTTP requests route through the proxy.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10227
Differential Revision: https://secure.phabricator.com/D16092
Summary:
Ref T4292. For hosted, clustered repositories we have a good way to increment the internal version of the repository: every time a user pushes something, we increment the version by 1.
We don't have a great way to do this for observed/remote repositories because when we `git fetch` we might get nothing, or we might get some changes, and we can't easily tell //what// changes we got.
For example, if we see that another node is at "version 97", and we do a fetch and see some changes, we don't know if we're in sync with them (i.e., also at "version 97") or ahead of them (at "version 98").
This implements a simple way to version an observed repository:
- Take the head of every branch/tag.
- Look them up.
- Pick the biggest internal ID number.
This will work //except// when branches are deleted, which could cause the version to go backward if the "biggest commit" is the one that was deleted. This should be OK, since it's rare and the effects are minor and the repository will "self-heal" on the next actual push.
Test Plan:
- Created an observed repository.
- Ran `bin/repository update` and observed a sensible version number appear in the version table.
- Pushed to the remote, did another update, saw a sensible update.
- Did an update with no push, saw no effect on version number.
- Toggled repository to hosted, saw the version reset.
- Simulated read traffic to out-of-sync node, saw it do a remote fetch.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T4292
Differential Revision: https://secure.phabricator.com/D15986
Summary:
Fixes T11020. I think this resolves things -- `$new_version` (set above) should be used, not `$new_log` directly.
Specifically, we would get into trouble if the initial push failed for some reason (working copy not initialized yet, commit hook rejected, etc).
Test Plan: Made a bad push to a new repository. Saw it freeze before the patch and succeed afterwards.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T11020
Differential Revision: https://secure.phabricator.com/D15969
Summary:
Ref T10860. This allows us to recover if the connection to the database is lost during a push.
If we lose the connection to the master database during a push, we would previously freeze the repository. This is very safe, but not very operator-friendly since you have to go manually unfreeze it.
We don't need to be quite this aggressive about freezing things. The repository state is still consistent after we've "upgraded" the lock by setting `isWriting = 1`, so we're actually fine even if we lost the global lock.
Instead of just freezing the repository immediately, sit there in a loop waiting for the master to come back up for a few minutes. If it recovers, we can release the lock and everything will be OK again.
Basically, the changes are:
- If we can't release the lock at first, sit in a loop trying really hard to release it for a while.
- Add a unique lock identifier so we can be certain we're only releasing //our// lock no matter what else is going on.
- Do the version reads on the same connection holding the lock, so we can be sure we haven't lost the lock before we do that read.
Test Plan:
- Added a `sleep(10)` after accepting the write but before releasing the lock so I could run `mysqld stop` and force this issue to occur.
- Pushed like this:
```
$ echo D >> record && git commit -am D && git push
[master 707ecc3] D
1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 254 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
BEGIN SLEEP
```
- Here, I stopped `mysqld` from the CLI in another terminal window.
```
END SLEEP
# CRITICAL. Failed to release cluster write lock!
# The connection to the master database was lost while receiving the write.
# This process will spend 300 more second(s) attempting to recover, then give up.
```
- Here, I started `mysqld` again.
```
# RECOVERED. Link to master database was restored.
# Released cluster write lock.
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
2cbf87c..707ecc3 master -> master
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10860
Differential Revision: https://secure.phabricator.com/D15792
Summary:
Ref T10860. At least in Git over SSH, we can freely echo a bunch of stuff to stderr and Git will print it to the console, so we can tell users what's going on.
This should make debugging, etc., easier. We could tone this down a little bit once things are more stable if it's a little too chatty.
Test Plan:
```
$ echo D >> record && git commit -am D && git push
[master ca5efff] D
1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 256 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
8616189..ca5efff master -> master
```
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10860
Differential Revision: https://secure.phabricator.com/D15791
Summary: Ref T10860. This doesn't change anything, it just separates all this stuff out of `PhabricatorRepository` since I'm planning to add a bit more state to it and it's already pretty big and fairly separable.
Test Plan: Pulled, pushed, browsed Diffusion.
Reviewers: chad
Reviewed By: chad
Maniphest Tasks: T10860
Differential Revision: https://secure.phabricator.com/D15790