# uid-migration.log

[10:56:33] <ddevault> all set for the rollout
[10:57:32] <bitfehler> i'll get another tea, but otherwise ready
[11:00:43] <ddevault> bitfehler: can you rustle up an nginx config which disables POST requests for meta.sr.ht
[11:01:37] <bitfehler> aye. is the one in sr.ht-nginx the one actually running in prod?
[11:01:42] <ddevault> the one in prod
[11:01:46] <ddevault> though they should be the same
[11:01:58] <bitfehler> so if i work based on that it should do?
[11:02:08] <ddevault> yeah
[11:02:14] <bitfehler> ack
[11:02:41] <ddevault> if you would also be so kind as to log in and shut up metrics.sr.ht afterwards that would be nice
[11:03:03] <bitfehler> log in where?
[11:03:06] <ddevault> metrics.sr.ht
[11:03:16] <ddevault> sircmpwn@metrics.sr.ht
[11:03:17] <bitfehler> i don't think i can?
[11:03:25] <ddevault> you should be good now
[11:03:57] <bitfehler> aye, thanks
[11:10:49] <ddevault> figured out how to disable POST
[11:11:02] <bitfehler> https://paste.sr.ht/~bitfehler/3cb46a6c78505778f2dc1c0c0357478abaa2c60e
[11:11:06] <bitfehler> oh, ok :)
[11:11:09] <ddevault> ty
[11:11:23] <ddevault> lmk when metrics is out of the picture and we'll get started
[11:12:19] <bitfehler> it is
[11:12:24] <ddevault> here goes
[11:18:36] <ddevault> these migration scripts are kind of slow, it's going to take a while
[11:19:43] <ddevault> might have been good to dump the meta IDs to a CSV while meta.sr.ht was online and work from that instead
[11:20:52] <ddevault> seeing transient DNS failures causing the migration to abort partway through
[11:20:55] <ddevault> added meta.sr.ht to /etc/hosts as workaround
[11:21:51] <bitfehler> is that a coincidence?
[11:22:10] <ddevault> not sure, not the time to investigate
[11:22:15] <ddevault> we have 3 DNS servers in /etc/resolv.conf on all systems
[11:22:20] <ddevault> could be that python's resolver does not query them all
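One way to make a script tolerate transient DNS failures like these is to retry the lookup a few times before giving up. This is a minimal sketch, not the migration script's actual code: `resolve_with_retry`, its parameters, and the retry counts are hypothetical, and it does not settle whether Python's resolver queries every server in /etc/resolv.conf.

```python
import socket
import time


def resolve_with_retry(host, port=443, attempts=3, delay=0.5,
                       resolver=socket.getaddrinfo):
    """Resolve `host`, retrying when the lookup fails with socket.gaierror.

    `resolver` is injectable so the retry logic can be exercised without
    touching the network; by default it is socket.getaddrinfo.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            last_error = exc
            # Sleep between attempts, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Pinning the host in /etc/hosts, as done here, avoids the lookup entirely; a retry only helps if the failures really are transient.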
[11:22:34] <ddevault> might have also been nice to add tqdm here to monitor progress and give an estimate on completion time
[11:25:30] <bitfehler> do you have any indication at all for how long it might take?
[11:25:37] <ddevault> we'll see after git.sr.ht migration is done
[11:25:50] <ddevault> looks like we're getting a couple dozen users migrated per second
[11:26:21] <ddevault> 33 users per second
[11:26:25] <ddevault> I'll let you do the math
[11:26:56] <ddevault> damn, hit an exception
[11:27:03] <ddevault> user known to git.sr.ht but not to meta.sr.ht
[11:27:06] <ddevault> that's a problem
[11:28:17] <ddevault> solution: hot patching migration script, will fix it properly after maintenance window
[11:28:28] <ddevault> it's all done in a transaction so have to start over
[11:29:26] <bitfehler> ok. i'd be curious how that user got there, but we can also talk about it later
[11:29:46] <ddevault> yeah, I'm taking notes, there'll be follow ups
[11:30:22] <ddevault> back of the napkin math on users per second suggests about 10-15 minutes per migration
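The napkin math can be written out explicitly. The sketch below assumes the observed rate of 33 users per second holds steady for the whole run; the function names are illustrative, and the user counts are only what that rate implies for a 10 to 15 minute run, not figures reported by the services.

```python
RATE = 33  # users migrated per second, as observed in the log


def users_covered(rate_per_second, minutes):
    """Number of users processed in `minutes` at a constant rate."""
    return rate_per_second * minutes * 60


def eta_minutes(total_users, rate_per_second):
    """Estimated run time in minutes for `total_users` at a constant rate."""
    return total_users / rate_per_second / 60


# A 10-15 minute run at 33 users/s corresponds to this many users.
print(users_covered(RATE, 10), users_covered(RATE, 15))  # 19800 29700
```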
[11:33:06] <ddevault> I think we can start bringing each service back online independently after its migration is done, perhaps saving hub for last
[11:33:22] <ddevault> bitfehler: are you able to do smoke testing on each service after I bring it back up, so I can move on to the next?
[11:34:44] <bitfehler> sure. like, checking all my stuff is still there and such?
[11:34:52] <ddevault> yeah, just poke about and see if you notice anything strange/broken
[11:34:56] <bitfehler> can do
[11:35:05] <ddevault> though, there is a risk with switching things online mid-migration
[11:35:13] <ddevault> if the system is writable then we'll have to lose changes if we want to roll back to the snapshot
[11:35:54] <ddevault> do you want to add that nginx thing to all services and bring them online read-only until we finish all of the migrations?
[11:36:18] <bitfehler> yeah, sounds good
[11:36:27] <ddevault> cool, let me make sure your SSH access is all set up
[11:37:32] <ddevault> you should be good to go
[11:42:52] <bitfehler> nginx is running and you just shut down git.sr.ht service itself, is that correct?
[11:42:59] <ddevault> aye
[11:43:18] <ddevault> just add the limit directive to each config file's proxy location
[11:43:20] <bitfehler> so i can just restart it read-only right away?
[11:43:21] <ddevault> then reload nginx
[11:43:29] <bitfehler> yep cool
[11:43:33] <ddevault> no, don't start the services until the migrations are finished
[11:43:36] <ddevault> I'll take care of that step
[11:43:48] <ddevault> until their respective migrations*
[11:43:49] <bitfehler> sorry, i meant nginx.
[11:44:14] <ddevault> doas nginx -s reload is safer than service nginx restart
[11:44:32] <ddevault> nginx -s reload will check for errors and tell you about them before applying changes, restart will shut off nginx if there are errors
[11:44:46] <ddevault> bah, another error in the migration, due presumably to my workaround
[11:44:51] <bitfehler> ok, done
[11:44:55] <bitfehler> oh no
[11:44:58] <ddevault> null in non-nullable remote-id column
[11:45:15] <ddevault> what should we do here, hmm...
[11:45:31] <ddevault> set ID to -old_id?
[11:45:37] <ddevault> then we can drop all rows with negative IDs later
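The negative-ID idea can be sketched with an in-memory SQLite database. Everything here is an assumption made for illustration: the table names (`old_user`, `new_user`), the columns, the sample users, and the `meta_ids` lookup are hypothetical and do not reflect the real sr.ht schema or migration scripts. The sketch also assumes IDs issued by meta are always positive, so a placeholder of `-old_id` can never collide with a real one.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical pre-migration layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE old_user (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
        CREATE TABLE new_user (
            id INTEGER PRIMARY KEY,
            username TEXT NOT NULL,
            remote_id INTEGER NOT NULL UNIQUE
        );
        INSERT INTO old_user VALUES (1, 'alice'), (2, 'bob'), (3, 'ghost');
        """
    )
    return conn


def migrate_users(conn, meta_ids):
    """Copy users across, using -old_id when meta does not know the user."""
    rows = conn.execute("SELECT id, username FROM old_user ORDER BY id").fetchall()
    with conn:  # one transaction: commit on success, roll back on error
        for old_id, username in rows:
            remote_id = meta_ids.get(username)
            if remote_id is None:
                # Placeholder that satisfies NOT NULL and stays unique,
                # because old IDs are unique and real remote IDs are positive.
                remote_id = -old_id
            conn.execute(
                "INSERT INTO new_user (id, username, remote_id) VALUES (?, ?, ?)",
                (old_id, username, remote_id),
            )


def drop_placeholders(conn):
    """Delete every row that still carries a negative placeholder ID."""
    with conn:
        cur = conn.execute("DELETE FROM new_user WHERE remote_id < 0")
    return cur.rowcount
```

Calling `migrate_users(conn, {"alice": 101, "bob": 102})` on the demo database leaves the user unknown to meta with `remote_id = -3`, and a later `drop_placeholders(conn)` removes exactly that row.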
[11:46:21] <bitfehler> users that are in some service but not in meta are users that ought to be deleted?
[11:46:25] <ddevault> yeah
[11:46:31] <ddevault> that's my thinking
[11:46:36] <ddevault> won't fully flesh out that line of thought until post migration
[11:46:52] <ddevault> probably a botched user deletion request
[11:46:54] <ddevault> since those are done by hand
[11:47:54] <bitfehler> i guess negative might work then
[11:48:10] <ddevault> I don't have any better ideas
[11:48:21] <ddevault> we have a snapshot, let's go with this and if it raises eyebrows in smoke testing we can revisit the choice
[11:48:34] <ddevault> migration resumed
[11:48:51] <ddevault> you know, I had thought we would be wise to test these against production database dumps, but part of me never wants to look at production database dumps
[11:49:38] <ddevault> I suspect that we don't actually need to commit the hot fixes for the migration scripts
[11:49:46] <ddevault> since third-party instances are unlikely to have our hacky manual account deletion process
[11:50:00] <ddevault> 99% sure that's what causes these, I remember a couple of instances where I deleted meta.sr.ht accounts but not other services
[11:50:03] <bitfehler> maybe they have even worse ones... :)
[11:50:15] <ddevault> hah, well, they made their bed
[11:50:21] <ddevault> unsupported use-case(TM)
[11:50:46] <bitfehler> i'm sure if someone runs into this we'll hear from them, so might as well wait for that...
[11:50:51] <ddevault> aye
[11:53:02] <ddevault> lol just tried to use my paste.sr.ht script
[11:54:01] <bitfehler> hehe, yeah, working w/o sr.ht can be difficult...
[11:54:31] <ddevault> ah, forgot to do anything with dispatch.sr.ht
[11:54:35] <ddevault> just shutting it off at least
[11:54:50] <ddevault> adnano confirmed that we don't need to bother with a migration for it
[11:55:18] <bitfehler> easy win
[11:55:24] <ddevault> just in time for christmas https://metrics.sr.ht/targets
[11:55:40] <bitfehler> lol
[11:56:41] <ddevault> I might paste up the logs from this channel during the migration and share them for transparency/third-party reference if no one is opposed
[11:59:07] <bitfehler> fine with me
[12:00:50] <ddevault> it would be nice if we could cut the services over to a maintenance mode in the future
[12:01:01] <ddevault> show a page explaining the situation
[12:01:21] <bitfehler> yeah
[12:05:02] <ddevault> hrm, my remote_id null fix does not seem to have worked
[12:05:12] <ddevault> ah, I see why, simple mistake
[12:05:13] <ddevault> here we go again
[12:05:18] <ddevault> took 981 seconds to run the migration btw
[12:07:00] <bitfehler> the full one?
[12:07:07] <ddevault> yeah, for git.sr.ht
[12:07:13] <ddevault> which has more users than any other service iirc
[12:22:26] <ddevault> git.sr.ht done
[12:22:32] <ddevault> service is online
[12:22:35] <bitfehler> \o/
[12:22:38] <ddevault> bitfehler: go ahead with smoke tests
[12:22:40] <ddevault> moving on to hg.sr.ht
[12:22:45] <bitfehler> ack
[12:24:58] <ddevault> there are 3 users with negative UIDs now
[12:26:01] <bitfehler> still poking, but at first glance git.sr.ht looks pretty fine, and POSTs are forbidden as expected
[12:26:07] <ddevault> nice
[12:29:23] <bitfehler> meh. my home internet acted up. git.sr.ht is fine
[12:29:37] <ddevault> woohoo!
[12:32:11] <ddevault> hg.sr.ht up
[12:32:19] <ddevault> moving on to builds
[12:34:35] <bitfehler> hg.sr.ht looks ok
[12:34:49] <bitfehler> nginx on builds is also ready
[12:35:02] <ddevault> you can go ahead and install all of the nginx changes ahead of my work
[12:35:32] <bitfehler> what's the order?
[12:35:41] <ddevault> builds, lists, todo, pages, paste, man, hub
[12:35:46] <bitfehler> ack
[12:39:13] <bitfehler> hmm, pages..
[12:39:24] <ddevault> we'll just leave it offline until the end
[12:39:29] <ddevault> I don't see a read-only mode being feasible for it
[12:39:37] <bitfehler> same here, ok
[12:39:42] <ddevault> or at least not easy
[12:43:40] <ddevault> https://news.ycombinator.com/item?id=33341716
[12:44:16] <bitfehler> lol
[12:45:22] <bitfehler> nginx running read-only on all those services
[12:45:25] <ddevault> ty
[12:51:27] <ddevault> builds.sr.ht online
[12:52:41] <ddevault> starting lists.sr.ht
[12:53:12] <bitfehler> builds looking good
[12:57:37] <adnano> note that dispatch should be modified to run on the old version of core.sr.ht
[12:57:39] <ddevault> moving pages to the end of the list
[12:57:45] <ddevault> adnano: just not going to update it, ez
[12:58:00] <ddevault> it's being decommissioned in 2 months
[13:06:56] <ddevault> lists up
[13:08:15] <ddevault> todo underway
[13:10:16] <bitfehler> lists looking good
[13:11:16] <ddevault> thought for next time
[13:11:32] <ddevault> we could have left all the services online in read-only mode and only turned off whichever one is actively being migrated
[13:16:17] <bitfehler> did you disable the lists.sr.ht worker?
[13:16:21] <ddevault> yes
[13:16:39] <bitfehler> so mail was accepted and will be backfilled, right?
[13:16:43] <ddevault> yep
[13:16:45] <bitfehler> cool
[13:16:52] <ddevault> ditto for emails to todo.sr.ht
[13:17:03] <bitfehler> yep
[13:25:25] <ddevault> todo up
[13:27:18] <ddevault> paste underway
[13:29:37] <bitfehler> todo lookin good
[13:30:48] <ddevault> paste up
[13:31:37] <bitfehler> looks sane as well
[13:32:51] <ddevault> man underway
[13:39:04] <bitfehler> hm, why are the prometheus endpoints still down, is that expected?
[13:39:14] <ddevault> seem to be up to me
[13:39:24] <ddevault> the APIs are still shut off
[13:39:31] <ddevault> but the service endpoints are working
[13:39:35] <bitfehler> ah, ok
[13:39:48] <bitfehler> yeah, meant the apis
[13:39:58] <ddevault> they can't be made read-only
[13:40:58] <bitfehler> aren't all requests POSTs anyways?
[13:41:04] <bitfehler> to the api i mean?
[13:41:11] <ddevault> yeah
[13:47:53] <ddevault> man also has a lot of users, this'll take a minute
[13:48:00] <ddevault> all users were once redirected to man post signup
[13:50:02] <ddevault> speak of the devil
[13:50:04] <ddevault> man is up
[13:51:44] <ddevault> starting hub
[13:53:59] <bitfehler> man looking good
[13:54:08] <bitfehler> i guess hub will also take some time, right
[13:54:10] <bitfehler> ?
[13:54:17] <ddevault> maybe
[13:54:33] <ddevault> users haven't been redirected there post-signup for as long as they were to man
[14:02:00] <ddevault> we have now officially exceeded our maintenance window estimate
[14:02:29] <ddevault> good news is that we're pretty close
[14:13:08] <ddevault> bah!
[14:13:14] <bitfehler> oh no?
[14:13:16] <ddevault> forgot the services have to be writable for hub to rewrite webhook URLs
[14:13:22] <ddevault> *nearly* removed the nginx directives in time
[14:13:27] <ddevault> now we have to re-run the whole migration
[14:13:48] <bitfehler> so we have to make all of them writable again?
[14:13:51] <ddevault> yeah
[14:13:53] <ddevault> working on it now
[14:13:59] <bitfehler> we could allow from just our own network?
[14:14:07] <ddevault> nah, I was going to suggest making them writable at this point anyway
[14:14:13] <bitfehler> alright
[14:14:18] <ddevault> would have got it done in time if I had correctly guessed the order in which the migration would use them
[14:14:35] <bitfehler> lol. sounds like you don't need any help then?
[14:14:39] <ddevault> nah, thanks though
[14:15:40] <ddevault> now writable on all services which are up
[14:15:44] <ddevault> going to start up the APIs also
[14:15:47] <ddevault> but leave webhooks until hub is finished
[14:15:54] <ddevault> they'll backfill
[14:16:13] <adnano> the migration is split into two files. did the first one complete at least?
[14:16:25] <ddevault> yeah
[14:16:32] <ddevault> but they're all run in a transaction so it got rolled back
[14:16:42] <adnano> I see
[14:16:51] <adnano> I guess we could have done one step at a time
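Running one step at a time could look roughly like the following sketch, which commits each step in its own transaction so that a failure in a later step does not undo the earlier ones. It uses Python's built-in `sqlite3` purely for illustration; `run_steps` and the step callables are hypothetical and are not how the sr.ht migrations are actually driven.

```python
import sqlite3


def run_steps(conn, steps):
    """Run (name, callable) steps, each in its own transaction.

    Returns the names of the steps that committed and the exception that
    stopped the run, or None if every step succeeded.
    """
    done = []
    for name, step in steps:
        try:
            # The connection context manager commits on success and
            # rolls back only this step's changes on an exception.
            with conn:
                step(conn)
        except Exception as exc:
            return done, exc
        done.append(name)
    return done, None
```

With this shape, a failure in the last step leaves the first two committed, so only the failing step has to be fixed and re-run. The cost is that a rollback to the pre-migration state then needs the snapshot rather than the transaction.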
[14:16:55] <ddevault> ACTION shrugs
[14:17:13] <ddevault> APIs up
[14:17:32] <ddevault> bringing up the builders too
[14:17:34] <ddevault> and git dispatch
[14:19:46] <ddevault> bringing up lists
[14:20:06] <ddevault> I think I've brought up everything I care to at this stage
[14:20:17] <ddevault> now we're Mostly Online(TM)
[14:20:29] <bitfehler> pages still down on purpose?
[14:20:33] <ddevault> yeah
[14:20:35] <ddevault> it hasn't been migrated yet
[14:20:37] <ddevault> saving it for last
[14:20:42] <bitfehler> ok
[14:22:32] <adnano> I think we might have been able to keep pages online while the other services were migrating
[14:22:52] <ddevault> read-only, yes
[14:23:03] <ddevault> we could have kept all the services online (read-only) except for whichever one was being actively migrated, noted it earlier
[14:31:23] <ddevault> going to kick off the pages migration as well, I don't see any reason it can't run in parallel with the hub migration
[14:35:00] <ddevault> huh
[14:35:05] <ddevault> pages.sr.ht does not have a webhooks table and no one noticed
[14:35:33] <bitfehler> hehe
[14:35:34] <ddevault> pages online
[14:37:09] <ddevault> another hub issue, starting over... again
[14:37:15] <bitfehler> derp
[14:37:24] <bitfehler> nothing serious, though?
[14:37:25] <ddevault> scoping it to just do the first two upgrades
[14:37:32] <ddevault> so that we don't have to keep re-running this if the third has more problems
[14:37:38] <ddevault> yeah, nothing serious, just more missing user bugs
[14:39:18] <ddevault> also doing some closing out tasks while waiting on hub
[14:43:08] <ddevault> might as well turn dispatch.sr.ht back on, I guess
[14:43:31] <ddevault> is it obvious that I like one of my children less than the others
[14:44:52] <bitfehler> mind if i take a break at this point? it all seems pretty settled, right?
[14:47:28] <ddevault> go for it
[14:47:33] <ddevault> I don't think there's anything left for you to help with
[14:47:35] <ddevault> thanks :)
[14:48:39] <bitfehler> sure thing!
[14:58:08] <ddevault> 2/3 migrations done for hub
[14:58:14] <ddevault> bringing it back online before running the third
[14:58:32] <ddevault> third underway
[15:26:56] <ddevault> this last migration is going to take a while
[15:27:10] <ddevault> as much as a couple of seconds per user
[15:38:43] <ddevault> well, thankfully we're more or less online
[15:38:49] <ddevault> and the remaining work should mostly tend to itself
[15:53:41] <bitfehler> is it too early for a high five yet? :)
[15:55:35] <ddevault> yep
[15:55:43] <ddevault> fat lady has yet to sing
[15:56:03] <bitfehler> lolwut. is that a saying in english?
[15:56:08] <ddevault> yeah
[15:56:16] <bitfehler> i love it :D
[15:56:18] <ddevault> https://en.wikipedia.org/wiki/It_ain%27t_over_till_the_fat_lady_sings
[15:57:10] <bitfehler> amazing
[18:17:44] <ddevault> I foresee a problem when the webhooks are turned back on
[18:17:52] <ddevault> there's presumably a bunch of webhooks queued up with stale URLs for hub.sr.ht
[18:18:00] <ddevault> possible solution is just to drop them
[18:18:38] <adnano> I think that's probably the best option
[18:19:10] <adnano> It could result in a few missing events or missing deletion propagations, though
[18:19:49] <adnano> alternatively one could somehow correct the webhook URLs. not sure how that would be done
[18:21:24] <adnano> how many webhooks are queued up? it might not be that big of a deal
[18:22:45] <ddevault> metrics.sr.ht knows
[18:25:37] <adnano> looks like it's just lists.sr.ht with 58 queued webhooks?
[18:27:17] <adnano> and one builds trigger webhook
[18:27:58] <ddevault> cool
[21:41:20] <adnano> so, did the fat lady sing yet? :)
[21:50:44] <bitfehler> lol
[22:53:17] <ddevault> not yet
[23:00:01] <bitfehler> wow :/
[00:14:25] <ddevault> okay webhooks migrated
[00:14:30] <ddevault> now I have to figure out how to drop them
[00:14:35] <ddevault> could just shut off hub until the services catch up
[00:14:49] <ddevault> easiest way I think
[00:15:52] <ddevault> done
[00:16:51] <ddevault> bitfehler: can you turn monitoring back on when you have the chance
[00:17:21] <ddevault> nice work everyone
[00:20:55] <adnano> woohoo!
[00:27:16] <bl4ckb0ne> now can the fat lady sing
[00:27:21] <ddevault> sure.
[00:27:27] <ddevault> laaaa la la laaa dee doo