Deployment: Deal with repo-archives bloat #32

Open
opened 2023-02-14 11:54:14 +01:00 by Arnd Marijnissen · 2 comments

The data/repo-archives directory is exploding in size over the course of just a day.

This is (mostly?) due to search-engine bots scraping URLs that cause tar.gz archives, git bundles, etc. to be produced.

There are two administration tasks that seem to govern the contents of the repo-archives directory:

  • Delete all repositories' archives (ZIP,TAR.GZ,etc..)
  • Delete old repository archives

The latter is also 'cron-able':

[cron.archive_cleanup]
SCHEDULE = @every 1h
OLDER_THAN = 6h

The first item in the list, however, does not seem to have a cron-able action, and it's the one that actually deletes stuff.

To solve/control this issue, we need to do the following, in order of importance:

  • Set a robots.txt (done)
  • Control access to (certain?) resources via the nginx-lb (preferred over HAProxy) to block bots that do not honor robots.txt (can match on User-Agent)
  • Move the repo-archives data over to 'slow storage' (rbd_hdd), giving it its own pool; this prevents application failure caused by Postgres running out of disk (which should not happen)
  • Move the Postgres data to its own pool (too)
  • Get controls in Gitea that permit cleanup of data based on age/usefulness
  • Get cron controls for this type of data
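
Until Gitea offers age-based cleanup with a cron hook, the idea could be approximated outside of Gitea. Below is a minimal sketch, not a tested deployment script: it assumes the archives are plain files somewhere under data/repo-archives and that file modification time is a good enough proxy for age. The function names are hypothetical. It defaults to a dry run, because Gitea may also track archives in its database, and deleting files behind its back could leave stale records (worth verifying before enabling deletion).

```python
import time
from pathlib import Path
from typing import List, Optional


def find_stale_archives(root: Path, max_age_hours: float,
                        now: Optional[float] = None) -> List[Path]:
    """Return all files under root whose mtime is older than max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def prune_archives(root: Path, max_age_hours: float, dry_run: bool = True) -> int:
    """Return the total size in bytes of stale archives.

    Files are only deleted when dry_run is False.
    """
    total = 0
    for path in find_stale_archives(root, max_age_hours):
        total += path.stat().st_size
        if not dry_run:
            path.unlink()
    return total
```

Calling `prune_archives(Path("data/repo-archives"), 6)` would only report how many bytes are older than six hours (mirroring the OLDER_THAN = 6h setting above); passing `dry_run=False` would actually remove them.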
Arnd Marijnissen added the help wanted, gitea feature request, deployment labels 2023-02-14 11:55:03 +01:00

I closed #29 which was about the same issue, and has some other details.


Gitea.com has the following robots.txt: https://gitea.com/robots.txt

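
To check which URLs a candidate robots.txt rule would cover before deploying it, a small matcher can help. This is a sketch under assumptions: it assumes archive downloads live under paths shaped like /<owner>/<repo>/archive/<ref>.<ext>, and the rule in `DISALLOW` is a hypothetical example, not a copy of the Gitea.com file. It implements only the `*` wildcard and the trailing `$` anchor from the robots.txt standard (RFC 9309). Python's built-in `urllib.robotparser` does plain prefix matching, so it is not used here.

```python
import re
from typing import Iterable


def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: Iterable[str]) -> bool:
    """Return True if any Disallow pattern matches the start of the URL path."""
    return any(robots_pattern_to_regex(p).search(path) for p in disallow_patterns)


# Hypothetical rule: block archive downloads for every owner/repo pair.
DISALLOW = ["/*/*/archive/"]
```

With this rule, `is_disallowed("/infrastructure/blender-projects-platform/archive/main.tar.gz", DISALLOW)` is true, while an issue URL such as `/infrastructure/blender-projects-platform/issues/32` stays crawlable. Bots that ignore robots.txt still need to be handled at the load balancer.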
Reference: infrastructure/blender-projects-platform#32