uid-migration.txt
rollout plan:
✓ update replace directives for core-go
	- commit & push core-go changes
	- go get -u git.sr.ht/~sircmpwn/core-go
✓ tag & push changes with -o skip-ci
	- core.sr.ht first
✓ submit custom package builds without the deploy step
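
For reference, the replace directive in question looks roughly like this in each
service's go.mod (module and path are illustrative, not copied from the actual
files); "updating" it means dropping or retargeting the directive before
tagging, so the published core-go commit is what gets recorded:

```
module git.sr.ht/~sircmpwn/git.sr.ht

require git.sr.ht/~sircmpwn/core-go v0.0.0

// Development pins core-go to a local checkout; before tagging, remove or
// retarget this, then `go get -u git.sr.ht/~sircmpwn/core-go` records the
// pushed commit in go.mod/go.sum.
replace git.sr.ht/~sircmpwn/core-go => ../core-go
```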

deployment:
✓ open status.sr.ht incident
✓ disable git dispatch
✓ stop all services save meta.sr.ht
	✓ should we make meta.sr.ht read-only? perhaps by disabling POST
	  requests in nginx?
✓ dump old database IDs for reference
✓ snapshot the database filesystem
✓ roll out deployments according to id-unification.md
✓ start everything back up
✓ re-enable mutable requests in nginx
✓ re-enable git dispatch
✓ re-enable webhooks
✓ re-enable monitoring
✓ re-enable automatic migrations
✓ smoke tests
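
The read-only approach floated above (disabling mutating requests in nginx)
might look like the following sketch; the server name is real, but the backend
address and overall config shape are assumptions:

```nginx
server {
    server_name meta.sr.ht;
    location / {
        # Migration-only: reject anything that could write; reads keep working.
        # Removed again at the "re-enable mutable requests" step.
        if ($request_method !~ ^(GET|HEAD|OPTIONS)$) {
            return 503;
        }
        proxy_pass http://127.0.0.1:5000;  # assumed backend address
    }
}
```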

clean up:
✓ close out status.sr.ht incident
✓ uncomment deploy step in hg.sr.ht
✓ update patchsets on lists.sr.ht
✓ update ops/id-unification.md; notify sr.ht-admins
✓ remove hot patch for null uid issue
✓ remove /etc/hosts changes

problems:
- transient DNS failures during the migration can cause meta.sr.ht to be
  unreachable, aborting the migration
  workaround:
    add meta.sr.ht to /etc/hosts:
    173.195.146.143 meta.sr.ht
  follow-up: we have three DNS resolvers configured on each host, why are these
  failures happening in the first place?
- migrations are slow
  it would have been faster to dump meta.sr.ht IDs to a CSV and import from
  that, instead of doing the migration online; noted for future migrations
  back of napkin suggests 10-15 minutes per migration at ~30 users/second
  (≈18,000-27,000 rows each)
- users unknown to meta.sr.ht cause migrations to fail
  most likely cause is botched manual account deletions
  solution: hot patch the migration scripts during deployment to set these
  remote_ids to -old_id; post-migration we'll audit and likely drop these rows
  third-party instances are unlikely to have similar issues, so the fix does
  not need to be committed
- pages.sr.ht does not seem to have GraphQL webhook tables; should fix that.
  I guess no one uses them
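
The CSV idea and the remote_id hot patch above could be sketched in psql
roughly as follows; the table and column names are my assumptions about the
schema, not taken from the actual migration scripts:

```sql
-- On meta.sr.ht: dump user IDs once, instead of resolving each user online.
\copy (SELECT id, username FROM "user") TO 'meta-users.csv' WITH CSV

-- On each service database: import the dump and join locally.
CREATE TEMPORARY TABLE meta_user (remote_id integer, username varchar);
\copy meta_user FROM 'meta-users.csv' WITH CSV
UPDATE "user" u SET remote_id = m.remote_id
FROM meta_user m WHERE m.username = u.username;

-- Hot patch for users unknown to meta.sr.ht: store the negated old ID for
-- the post-migration audit, as described above.
UPDATE "user" SET remote_id = -id WHERE remote_id IS NULL;
```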

migration status:
✓ git.sr.ht
✓ hg.sr.ht
✓ builds.sr.ht
✓ lists.sr.ht
✓ todo.sr.ht
✓ paste.sr.ht
✓ man.sr.ht
- hub.sr.ht
✓ pages.sr.ht

packaging status:
✓ git.sr.ht
✓ hg.sr.ht
✓ builds.sr.ht
✓ lists.sr.ht
✓ todo.sr.ht
✓ hub.sr.ht
✓ paste.sr.ht
✓ man.sr.ht
✓ pages.sr.ht

. update go.mod
. go get
. amend commit
. semver patch
. git push -o skip-ci
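
The per-package steps above might be scripted roughly as follows; the exact
flags and the plain `major.minor.patch` tag format are my assumptions, not the
actual procedure:

```shell
#!/bin/sh -e
# Run from a service checkout. Steps touching the remote are left as
# comments so this sketch is safe to execute:
#   go get -u git.sr.ht/~sircmpwn/core-go && go mod tidy   # update go.mod, go get
#   git commit --amend --no-edit go.mod go.sum             # amend commit
#   git push --follow-tags -o skip-ci                      # push, skipping CI

# "semver patch": bump the last component of the current tag.
next_patch() {
    base=${1%.*}      # "0.58.2" -> "0.58"
    patch=${1##*.}    # "0.58.2" -> "2"
    echo "$base.$((patch + 1))"
}
# e.g.: git tag "$(next_patch "$(git describe --tags --abbrev=0)")"
next_patch 0.58.2  # -> 0.58.3
```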

testing status:
✓ git.sr.ht
✓ hg.sr.ht
✓ builds.sr.ht
✓ lists.sr.ht
✓ todo.sr.ht
✓ pages.sr.ht
✓ paste.sr.ht
✓ man.sr.ht
✓ hub.sr.ht