Poor page load and download times for some countries #24

Open
opened 2023-02-13 09:37:34 +01:00 by Brecht Van Lommel · 25 comments

Developers from US and Canada reported:

  • Slow page load times (> 10 s), while the page generation time (< 1 s) is much lower, and the site is much faster from Amsterdam.
  • Downloading 10 MB attachments can take 2 minutes.
  • Git clone only gets a few hundred KB/s.

Some difference is expected, but this much indicates there is a problem somewhere.

Brecht Van Lommel added the bug, deployment labels 2023-02-13 09:42:17 +01:00
Author
Owner

Developers being able to efficiently use the website is important for productivity, so marking this as a bug.

I know @dmcgrath looked into this a bit in chat. Not sure if this was fully explored, if there are still things to check, if @Arnd has looked into it, or other people need to get involved for advice.

Arnd Marijnissen self-assigned this 2023-02-13 14:46:54 +01:00

It's pretty gnarly here in western Canada

```
k:\git_test>git clone https://projects.blender.org/blender/blender.git
Cloning into 'blender'...
remote: Enumerating objects: 20903, done.
remote: Counting objects: 100% (20903/20903), done.
remote: Compressing objects: 100% (1842/1842), done.
Receiving objects:   0% (19400/2211317), 5.22 MiB | 238.00 KiB/s
```

```
k:\git_test>git clone git://git.blender.org/blender.git
Cloning into 'blender'...
remote: Enumerating objects: 2207050, done.
remote: Counting objects: 100% (2207050/2207050), done.
remote: Compressing objects: 100% (261079/261079), done.
Receiving objects:   3% (66212/2207050), 15.88 MiB | 3.15 MiB/s
```

And just to validate my connection is fine

```
k:\git_test>git clone https://github.com/blender/blender.git
Cloning into 'blender'...
remote: Enumerating objects: 2207050, done.
remote: Counting objects: 100% (2658/2658), done.
remote: Compressing objects: 100% (828/828), done.
Receiving objects:  10% (220705/2207050), 155.39 MiB | 24.91 MiB/s
```

I have packet captures of both (projects.blender.org and git.b.o, available privately on request); it appears the conversation with `git.blender.org` uses a much larger TCP window size.

git.blender.org

```
8036	34.537583	82.94.226.105	192.168.0.99	TCP	1514	9418 → 60505 [ACK] Seq=9235827 Ack=42472 Win=131072 Len=1448 TSval=2341364968 TSecr=1962817993 [TCP segment of a reassembled PDU]
8037	34.537621	192.168.0.99	82.94.226.105	TCP	66	60505 → 9418 [ACK] Seq=42472 Ack=9237275 Win=4222208 Len=0 TSval=1962818160 TSecr=2341364968
```

projects.blender.org

```
2272	22.999762	82.94.226.107	192.168.0.99	TCP	1514	443 → 60185 [ACK] Seq=2015732 Ack=22519 Win=65792 Len=1448 TSval=2385712657 TSecr=1962647722 [TCP segment of a reassembled PDU]
2273	22.999813	192.168.0.99	82.94.226.107	TCP	66	60185 → 443 [ACK] Seq=22519 Ack=2017180 Win=263424 Len=0 TSval=1962647874 TSecr=2385712652
```

There's also some packet loss and retransmission going on; this happens on both projects.blender.org and git.blender.org. I have a capture from June last year, when we were looking at svn.blender.org, which does not exhibit this problem (but it does now), so this appears to be something new affecting all traffic coming from b.org to me.

![image](/attachments/78b28013-1b81-474d-b6f9-f61e0bb1a539)
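For anyone else wanting to reproduce this comparison from a capture, a rough sketch using tshark (the pcap file name is a placeholder; the fields are Wireshark's standard TCP dissector fields):

```
# Dump relative time, advertised window and retransmission flag for the HTTPS conversation
tshark -r projects_blender_org.pcap -Y 'tcp.srcport == 443' \
    -T fields -e frame.time_relative -e tcp.window_size -e tcp.analysis.retransmission | head

# Count retransmissions in the whole capture
tshark -r projects_blender_org.pcap -Y 'tcp.analysis.retransmission' | wc -l
```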


Last weekend when I took a quick peek into this problem, I found that git checkouts were fastest from within the same rack/network as the Gitea VM.

Performance at home was ~300 KB/sec doing a git clone of blender. From Scales (across the 2x1G ether-channel) it is 3-5 MiB/sec. From the Archive jail I got 17.63 MiB/sec, which is in the same 10G network as the Gitea VM.

Unfortunately, there have been constant 600G archive transfers across that link for the last week or so. I've had to redo the 600G archive backup multiple times due to changes in the repo from Sergey adjusting and rebaking, mostly since I had to go from a tarball to an rsync (which has too-long-filename issues), and today back yet again to a tarball to a non-archived (offsite) copy on the NAS.

In addition, the storage for Gitea has been getting swamped because the backups to the NAS also include all of the temporary directories, which was so unexpected that it filled the storage for the VM at least 3 times this weekend.

Add on top of this the daily backups to the studio, general outbound traffic, rsyncs of svn and the like to the studio, and it's just generally been a bad week or two to look into problems, since there are many factors to consider.

Below are some of the sysctls that we use for the machine:

```
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_congestion_control = htcp
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_fin_timeout = 7
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
```

Looking at his TCP graph, the window size looks to be staying sufficiently above the transfer. I don't know what his receive window was like at the time, or what the server disk I/O was like, but there could have been some bottlenecks between our switches. It is only a 2x1G ether-channel, after all.

That said, I can take a closer look at the link tomorrow and see if I can do anything to help out, but this has been happening with LazyDodo and others for years, before we even had to factor in the ether-channel. I do recall seeing some errors in the SNMP values for the ether-channel, but I may have just got the wrong SNMP OID plugged into Zabbix.
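As a side note, one way to watch the server-side window and congestion state while a slow clone is running would be something like the following on the Linux host (the client address here is just a documentation placeholder):

```
# Per-connection TCP internals (cwnd, rtt, advertised window) for sockets to a given client
ss -tin dst 203.0.113.45

# Confirm the sysctls above actually took effect
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max
```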

Author
Owner

No networking expert so feel free to ignore if this is obvious or wrong.

To me it seems a small TCP window size, as found by @LazyDodo, is exactly the kind of thing that would lead to a situation where low latency (in Amsterdam) is fast, and higher latency (further away) is slow. Despite there being enough throughput, it would be waiting much longer for confirmation that the TCP packets were received than actually sending data, because it's not sending enough data at a time.

Even if the server is doing a lot of things right now, seemingly it's able to respond with reasonable latency and throughput to requests from Amsterdam. Even if some latency or throughput between e.g. switches on the server would be worse than before, I would expect that to add roughly a constant amount of time that is about the same regardless of distance, rather than a multiplicative factor that increases with distance.

Though of course interactions between many components can have complicated effects, it's configuration like TCP window size that seems like it should be the first suspect.
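To put rough numbers on that reasoning (the ~256 KB window is the order of magnitude seen in the captures above; the ~140 ms RTT to western Canada is an assumption):

```
# Throughput ceiling imposed by the window: throughput <= window / RTT
awk -v win=262144 -v rtt=0.140 'BEGIN { printf "%.2f MiB/s max with a %d-byte window\n", win/rtt/1048576, win }'
# -> roughly 1.8 MiB/s, regardless of how much bandwidth the link has

# Window needed to fill a 500 Mbit/s link at the same RTT (the bandwidth-delay product)
awk -v bw=500000000 -v rtt=0.140 'BEGIN { printf "%.1f MiB window needed\n", bw/8*rtt/1048576 }'
# -> roughly 8.3 MiB
```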


Just to chime in with a status update, I gave @Arnd a shell here on a VM he can use for some tests and packet captures, since our availability overlaps by only 2 hours a day, which wasn't allowing much time to run diagnostics.

fingers crossed!


In terms of network statistics, I took another look at the switch that directly connects to the ISP. There are (still) tons of errors from the server where Phabricator/Git/Svn/Buildbot run, but this has been known and reported multiple times a year for about 7 or 8 years now since that server was replaced in the rack when the last R410 died.

## Here are the error charts:

Note: port 19 is the public interface for the git/svn/etc that was mentioned above, while ports 21 and 22 are the ether-channel interfaces.

### Frame Check Sequence errors:

![image](/attachments/bdc733bc-ffa4-4c02-96bd-c51ae83688ff)

### Late collisions:

(none on any interface)

### Frame Too Long:

(none on any interface)

### Internal Mac Receive errors:

![image](/attachments/b9334e2b-29d8-4395-87a0-6987d311b379)

### Received Pause Frames:

(none on any interface)

### Transmitted Pause Frames:

(none on any interface)

## Here are the same charts for the ether-channel to the second (10G) rack:

### Frame Check Sequence:

![image](/attachments/a2f48905-63b2-4456-b4f0-438c74162834)

### Internal Mac Receive Errors:

![image](/attachments/5f39691b-6f44-4fde-b7af-c1020102fd9e)

Note: For port-channel 5 (LACP between the racks), the charts not displayed are not reporting errors.

Now it would appear at a glance that the Dell switch doesn't like something about the link, given such a high error count. That said, ports 21 and 22, the underlying interfaces, don't show any errors.

Whatever the problem is, it would seem to be on the connection to port 21. The statistics on the other switch show no errors on anything received, nor any transmit errors.

A quick check of the configuration shows that the 10G switch seems to have the ports configured at 10G rather than 1G. Interestingly, the other ports have "speed 10000" configured explicitly, but the trunk lacks the entire definition. It may be that I presumed it was auto negotiating to speed 1000 since the link "worked", or that I looked (incorrectly) at the interface speed for the link below it in the list.

I briefly tried to set the speed to 1000, but the port-channel refused to come up in a reasonable time; that could be because, for some reason, the Dell switch LACP priority is set to 1 (it wins) and it has a long LACP timeout.

A few more tests show that the transceiver doesn't appear to like being set to gigabit:

```
Ethernet1/21 is down (SFP validation failed)
```

I did find a page that mentions a possible quirk of the transceiver where they will not come up properly without being bounced if there is a speed change. This unfortunately is a somewhat lengthy outage. Probably better to do it with more of an announcement to at least the channel and mailing list.
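As a host-side sanity check (it doesn't replace looking at the switch, and the interface name is just an example), the negotiated speed, duplex and link state on a connected Linux machine can be read with ethtool:

```
# Show what this end of the link actually negotiated
ethtool eno1 | grep -E 'Speed|Duplex|Auto-negotiation|Link detected'
```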


> No networking expert so feel free to ignore if this is obvious or wrong.
>
> To me it seems a small TCP window size, as found by @LazyDodo, is exactly the kind of thing that would lead to a situation where low latency (in Amsterdam) is fast, and higher latency (further away) is slow. Despite there being enough throughput, it would be waiting much longer for confirmation that the TCP packets were received than actually sending data, because it's not sending enough data at a time.
>
> Even if the server is doing a lot of things right now, seemingly it's able to respond with reasonable latency and throughput to requests from Amsterdam. Even if some latency or throughput between e.g. switches on the server would be worse than before, I would expect that to add roughly a constant amount of time that is about the same regardless of distance, rather than a multiplicative factor that increases with distance.
>
> Though of course interactions between many components can have complicated effects, it's configuration like TCP window size that seems like it should be the first suspect.

Indeed window size is a problem. There is a recv and a send window size though. If the recv window (LazyDodo) is 0, he would basically be telling us to stop sending. If the sender needs to send more, it would increase the size as needed, but we aren't seeing the window size lines in his graph touch.


I did look more into the interface errors, but this time from the host's point of view. It would seem that, based on the total interface errors and uptime, we are receiving about 1.173 million interface errors a day on the external-facing port for the host that serves svn/git/buildbot/phabricator.

If I were to hazard a guess, I would think that the switch was overflowing the counter and making it seem less of a problem than it actually is, given the host interface errors. My apologies for letting this go so long, but I can't easily walk to the data center :) I poked @Arnd to see about wire replacements for the host ports.

While it's not clear that the problem between the two switches is critical or affecting performance in an obvious way, it's clearly not ideal to have one switch try to stuff 10 Gb/s down the wire at another switch that can only handle gigabit. The fact this link was even in an up state baffles me. It must have slipped past my scrutiny! We can discuss this and see about trying to find better transceivers, or gigabit twinax between the racks. Perhaps even expand the LAG from 2 ports to 4 while we are at it, to help with the LACP hashing.
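For reference, the per-interface error counters on the host side can be pulled like this (the interface name is an example; exact counter names vary by NIC driver):

```
# Kernel-level RX/TX error and drop counters
ip -s link show dev eno1

# NIC/driver statistics, filtered to anything that looks like an error
ethtool -S eno1 | grep -iE 'err|drop|crc'
```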


I dug a little bit more. I tried different sysctls on the host, checked the `netstat -m` nmbcluster values and MTUs, but no matter what, I just can't seem to get the window size to go past around 256k on that host, despite sufficient (AFAICT) sysctls.

I also dropped over to the GPT bot and tried some questions. One that stood out was:

![image](/attachments/f6a5d5a4-0a83-489b-a120-553e6b4eef03)

The interesting part that I wasn't considering is that the bandwidth-delay product (as referenced in chat at https://www.switch.ch/network/tools/tcp_throughput/?mss=1460&rtt=100&loss=1e-06&Calculate=Calculate&bw=100&rtt2=80&win=64) seemed to hint that the interface errors we are experiencing may be preventing the window from scaling.

e.g.:

![image](/attachments/416969ad-227a-48dd-9d7c-8a96f730a88b)

As well, there is some intentional packet loss that we are tossing into the mix via the random early detection (RED) that I enabled on the buildbot, svn and git HTTP ports. The gist of RED is that we toss out occasional packets during congestion (key term) in order to help avoid global synchronization. Unfortunately, disabling RED had no measurable impact on my SVN tests.

Ultimately it seems that we should try to prioritize fixing our interface errors, and proceed from there.
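For context on why the loss rate matters so much here, the classic Mathis et al. approximation gives a throughput ceiling of roughly MSS / (RTT * sqrt(loss)); plugging in the calculator's inputs above (MSS 1460, RTT 100 ms) shows how quickly even modest loss caps the rate:

```
awk -v mss=1460 -v rtt=0.100 -v loss=0.000001 'BEGIN { printf "%.2f MB/s ceiling at 1e-6 loss\n", mss/(rtt*sqrt(loss))/1e6 }'
# -> about 14.6 MB/s

awk -v mss=1460 -v rtt=0.100 -v loss=0.001 'BEGIN { printf "%.2f MB/s ceiling at 1e-3 loss\n", mss/(rtt*sqrt(loss))/1e6 }'
# -> about 0.46 MB/s
```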


oops, I realize I closed this earlier. I was wondering what happened when I misclicked that!


I have the same problem, but I live in Italy.


git.blender.org got yanked from underneath me and I didn't keep any numbers, but the maintenance done last weekend has improved things on the new server; the old server ~~is~~ was still 4x faster.

(From @deadpin on chat)

> Deadpin> Well the network maintenance may have helped a bit. A 6x improvement for me during a new clone; though I'm not gonna write home about the speed still 🙂
> Receiving objects: 8% (185108/2181490), 47.24 MiB | 1.33 MiB/s
>
> LazyDodo> Jesse Y clone the old server: `git clone git://git.blender.org/blender.git` see what that gets you
>
> Deadpin> LazyDodo Oh...
> Receiving objects: 10% (235266/2207050), 59.86 MiB | 4.88 MiB/s

Something is still not right with the new server; also, I'm still seeing the packet errors.


The weekend work at the data center went pretty well. It sounds like the changes at least helped.

More specifically, swapping out the 10G SFP+ transceivers for 1G SFP ones to deal with the MAC receive errors worked perfectly, and those errors have disappeared. However, the errors on Koro's g19 port remain, despite having moved it from g19 to g17 and replacing a cable. So whatever the problem is, it is likely an issue with the NIC itself.

Another thing I noticed while I was filling in some NetBox information for the network was that despite most of the Proxmox nodes having jumbo frames set, and the switches having jumbo frames enabled, most of the interfaces and port-channels I was looking at seemed to have an MTU of 1500 set. This could potentially be a problem if the hosts are expecting jumbo frames but the underlying system isn't allowing them. Although, if the VMs aren't configured for anything past 1500 anyway, it may not be a problem for all the services, only some. Possibly HAProxy, though.
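A quick way to check whether jumbo frames actually survive end-to-end between two Linux hosts, before changing anything (the address is a placeholder; 8972 = 9000 minus 28 bytes of IP and ICMP headers):

```
# Fails with "message too long" anywhere the path MTU is below 9000
ping -c 3 -M do -s 8972 10.0.0.2

# The standard-MTU control case (1472 = 1500 - 28)
ping -c 3 -M do -s 1472 10.0.0.2
```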

At some point I will have to dig deeper into this and start slowly adjusting the settings when I get some time. I will let you know if anything changes.

Author
Owner

Another data point: from a location in China, Git clone is 20-50KB/s and SVN is 10KB/s. Meanwhile Git clone from GitHub is 5-10MB/s.


Both git clone and svn update get very slow (20-50 KB/s) in China.


Thanks for the heads up. Although I think it's pretty much agreed that anywhere beyond the Netherlands is affected.

LazyDodo (in Canada) and I did another round of extensive tests, and found that he was able to push a good few hundred MB/sec to/from us using UDP, but not TCP. We also found that after I capped our 2 primary servers down to around 300 Mbit/sec, he was able to get a good 200 Mbit/sec and an impressive 8M TCP window size. Then he repeated it again a few minutes later, after no changes were made, and the speed went back down to horrible rates again. At that time, the entire upstream link went back to a new cap of around 300 Mbit/sec. Normally we cap at around 500 Mbit/sec.
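For anyone wanting to repeat that kind of UDP vs. TCP comparison, an iperf3 sketch (the hostname and rate are placeholders):

```
# On the Amsterdam side
iperf3 -s

# From the remote side: TCP first, then UDP at a fixed offered rate
iperf3 -c iperf.example.org -t 30
iperf3 -c iperf.example.org -u -b 300M -t 30

# The TCP run's "Retr" column plus the UDP run's loss percentage help
# separate path loss from window/buffer limits.
```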

So without pointing fingers, it almost seems like something upstream of our position is dropping traffic or slowing things to keep us at a steady rate over time. All I know is that it isn't something coming from our end. At least not intentionally! We did have RED enabled on SVN, which itself drops traffic to avoid global synchronization problems, but that should only activate when that host hits its queue limit, which it never does.

As for the DC maintenance the other week, Arnd was able to get to the DC and replace the flaky 10G SFP+ with a 1G SFP, which solved the MAC receive errors just fine, but those turned out not to be the problem. The Koro host was moved to an alternate set of ports, but the errors only reduced; they never disappeared, suggesting a possible problem with the NIC itself. That said, it is a rather low error rate on closer inspection, and doesn't warrant much more work considering that the host is being decommissioned.

Since UDP seems to be better at maximizing our speeds, I could look into the possibility of enabling QUIC on the Gitea host, but there is nothing I can do about SVN due to that host's software versions. There is also more of a risk/concern with QUIC/HTTP2 being new and... buggy, in some implementations, which brings security risks, etc.

So for now, apologies for the speed, but I don't know when we will be able to find out what is causing such a horrible experience with our network.


Thanks for the explanation.
I hope it can be fixed; otherwise the problem will prevent me, and other devs in China or elsewhere, from contributing any work to Blender.

Author
Owner

This was also reported as a problem now for Arch Linux package building from source.

We indeed know many countries are affected. But I think it's good to know to what extent this is a problem. It was not obvious to me previously that it's gotten to the point that git clone with our default build instructions effectively does not work at all for various users and contributors. The workaround for now is using the GitHub mirror.

And so knowing that informs where this goes on our list of priorities, and how drastic our solutions might be (like using a CDN or hosting elsewhere).

I think we need to definitively confirm where exactly the problem is located, and go from there. The details of why and how don't need to be posted in this issue; the main reason I created it was to gather info from those affected by the problem, and to communicate that it's a known problem and that we are working on it.


> This was also reported as a problem now for Arch Linux package building from source.
>
> We indeed know many countries are affected. But I think it's good to know to what extent this is a problem. It was not obvious to me previously that it's gotten to the point that `git clone` with our default build instructions effectively does not work at all for various users and contributors. The workaround for now is using the GitHub mirror.
>
> And so knowing that informs where this goes on our list of priorities, and how drastic our solutions might be (like using a CDN or hosting elsewhere).
>
> I think we need to definitively confirm where exactly the problem is located, and go from there. The details of why and how don't need to be posted in this issue; the main reason I created it was to gather info from those affected by the problem, and to communicate that it's a known problem and that we are working on it.

I noticed an interesting case when trying to use the GitHub mirror for a speedup.
The speed is slow (50-80 KB/s) as long as I'm using Git Bash to clone the mirror.
The only case where I could get normal speed (10 MB/s) is cloning the mirror via GitHub Desktop.


> I noticed an interesting case when trying to use the GitHub mirror for a speedup.
> The speed is slow (50-80 KB/s) as long as I'm using Git Bash to clone the mirror.
> The only case where I could get normal speed (10 MB/s) is cloning the mirror via GitHub Desktop.

This almost sounds like the slow speeds are related to protocol back-and-forth over a high-latency link, versus a blob download that has time to "ramp up" the connection speed.


> This was also reported as a problem now for Arch Linux package building from source.
>
> We indeed know many countries are affected. But I think it's good to know to what extent this is a problem. It was not obvious to me previously that it's gotten to the point that `git clone` with our default build instructions effectively does not work at all for various users and contributors. The workaround for now is using the GitHub mirror.
>
> And so knowing that informs where this goes on our list of priorities, and how drastic our solutions might be (like using a CDN or hosting elsewhere).
>
> I think we need to definitively confirm where exactly the problem is located, and go from there. The details of why and how don't need to be posted in this issue; the main reason I created it was to gather info from those affected by the problem, and to communicate that it's a known problem and that we are working on it.

I finally found out the reason for the slow clone speeds in China.

Our government has a thing that we call the GFW (Great Firewall), which blocks/limits access between the internet in mainland China and the outside.
I thought that Git shouldn't be blocked, since it's a must for programmers. But unluckily, the speed was still limited to about 30 KB/s on average.
We usually use proxies to cross the GFW; however, the VPN does not automatically cover the git command line.

After setting up proxies for Git by hand, the speed increased to 500 KB-1 MB/s.
I also manually set a proxy for SVN to improve the "make update" speed.
For other Chinese friends who have the same problem as I had, I suggest reading the following links:

* [Configuring a proxy server for the SVN client](https://blog.csdn.net/Aaronzzq/article/details/7380545)
* [Speeding up Git clone](https://github.com/0xAiKang/Note/blob/master/Skill/Git%20Clone%20%E5%A4%AA%E6%85%A2%E6%80%8E%E4%B9%88%E5%8A%9E%EF%BC%9F.md)

I still could not fully understand this problem, however.
For example, the fixes I mentioned above were unnecessary before Blender migrated from GitHub to Gitea. Also, the clone speed from Gitea via Git Bash (500 KB-1 MB/s) is still slower than cloning from a GitHub mirror via GitHub Desktop (1-10 MB/s).
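For reference, "setting up proxies by hand" amounts to roughly the following (the local proxy address and port are placeholders for whatever your VPN/proxy client listens on):

```
# Git: route HTTP(S) remotes through the local proxy
git config --global http.proxy http://127.0.0.1:7890

# Subversion: edit ~/.subversion/servers and set, under [global]:
#   http-proxy-host = 127.0.0.1
#   http-proxy-port = 7890
```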


> I still could not fully understand this problem, however.

I think I have an idea of what is happening there.

Can you run `git config --get remote.origin.url` in both clones and show the output?


It's been 6 months; I know structural progress takes time, but what's happening in this department?


Last I heard, there were some extra ports enabled in a new rack at the DC, waiting for us to move things over, which was planned for sometime after BCON. Maybe it will help, maybe it won't.


Excellent! Thanks for the update!
