[10:56:33] <ddevault> all set for the rollout
[10:57:32] <bitfehler> i'll get another tea, but otherwise ready
[11:00:43] <ddevault> bitfehler: can you rustle up an nginx config which disables POST requests for meta.sr.ht
[11:01:37] <bitfehler> aye. is the one in sr.ht-nginx the one actually running in prod?
[11:01:42] <ddevault> the one in prod
[11:01:46] <ddevault> though they should be the same
[11:01:58] <bitfehler> so if i work based on that it should do?
[11:02:08] <ddevault> yeah
[11:02:14] <bitfehler> ack
[11:02:41] <ddevault> if you would also be so kind as to log in and shut up metrics.sr.ht afterwards that would be nice
[11:03:03] <bitfehler> log in where?
[11:03:06] <ddevault> metrics.sr.ht
[11:03:16] <ddevault> sircmpwn@metrics.sr.ht
[11:03:17] <bitfehler> i don't think i can?
[11:03:25] <ddevault> you should be good now
[11:03:57] <bitfehler> aye, thanks
[11:10:49] <ddevault> figured out how to disable POST
[11:11:02] <bitfehler> https://paste.sr.ht/~bitfehler/3cb46a6c78505778f2dc1c0c0357478abaa2c60e
[11:11:06] <bitfehler> oh, ok :)
[11:11:09] <ddevault> ty
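The paste itself isn't reproduced in this log; as a rough illustration of the kind of change being discussed, a read-only vhost can be built with nginx's limit_except inside the proxied location. The server block below is a minimal sketch with a placeholder upstream address, not the actual production config:

```nginx
# Hypothetical sketch: reject anything that can write (POST, PUT, DELETE, ...)
# while GET/HEAD keep being proxied to the application as usual.
server {
    server_name meta.sr.ht;

    location / {
        limit_except GET HEAD {
            deny all;                        # 403 for POST and friends
        }
        proxy_pass http://127.0.0.1:5000;    # placeholder upstream address
    }
}
```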
[11:11:23] <ddevault> lmk when metrics is out of the picture and we'll get started
[11:12:19] <bitfehler> it is
[11:12:24] <ddevault> here goes
[11:18:36] <ddevault> these migration scripts are kind of slow, it's going to take a while
[11:19:43] <ddevault> might have been good to dump the meta IDs to a CSV and work from that, instead of relying on meta.sr.ht being online during the migration
[11:20:52] <ddevault> seeing transient DNS failures causing the migration to abort partway through
[11:20:55] <ddevault> added meta.sr.ht to /etc/hosts as workaround
[11:21:51] <bitfehler> is that a coincidence?
[11:22:10] <ddevault> not sure, not the time to investigate
[11:22:15] <ddevault> we have 3 DNS servers in /etc/resolv.conf on all systems
[11:22:20] <ddevault> could be that python's resolver does not query them all
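For reference, the workaround amounts to a one-line pin in /etc/hosts on the machine running the migration; the address below is a placeholder, not the real one:

```
# Pin meta.sr.ht so transient resolver failures can't abort the run;
# remove again after the maintenance window.
192.0.2.10  meta.sr.ht
```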
[11:22:34] <ddevault> might have also been nice to add tqdm here to monitor progress and give an estimate on completion time
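On the tqdm aside: a minimal sketch of what wrapping the per-user loop might look like, assuming a plain iterable of user IDs; user_ids and migrate_user are illustrative placeholders, not the actual script:

```python
# Hypothetical sketch: tqdm wraps the loop and prints a live progress bar
# with rate and ETA, instead of guessing completion time from log output.
from tqdm import tqdm

user_ids = range(30_000)      # placeholder: the real script reads IDs from the database

def migrate_user(user_id):
    pass                      # placeholder for the real per-user migration step

for user_id in tqdm(user_ids, desc="migrating users", unit="user"):
    migrate_user(user_id)
```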
[11:25:30] <bitfehler> do you have any indication at all for how long it might take?
[11:25:37] <ddevault> we'll see after git.sr.ht migration is done
[11:25:50] <ddevault> looks like we're getting a couple dozen users migrated per second
[11:26:21] <ddevault> 33 users per second
[11:26:25] <ddevault> I'll let you do the math
[11:26:56] <ddevault> damn, hit an exception
[11:27:03] <ddevault> user known to git.sr.ht but not to meta.sr.ht
[11:27:06] <ddevault> that's a problem
[11:28:17] <ddevault> solution: hot patching migration script, will fix it properly after maintenance window
[11:28:28] <ddevault> it's all done in a transaction so have to start over
[11:29:26] <bitfehler> ok. i'd be curious how that user got there, but we can also talk about it later
[11:29:46] <ddevault> yeah, I'm taking notes, there'll be follow ups
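For context on the restart: the script runs everything in one database transaction, so any exception rolls the whole run back. A minimal psycopg2-flavoured sketch, with placeholder DSN, table, and helper names rather than the actual migration code:

```python
# Hypothetical sketch: one transaction for the entire run means a single
# bad user discards all work done so far and the migration starts over.
import psycopg2

def migrate_user(cur, user_id):
    # placeholder for the real per-user work (resolving the meta.sr.ht
    # remote_id, rewriting foreign keys, and so on)
    pass

conn = psycopg2.connect("dbname=git.sr.ht")   # placeholder DSN

with conn:                                    # commits on success, rolls back on any exception
    with conn.cursor() as cur:
        cur.execute('SELECT id FROM "user" ORDER BY id')
        for (user_id,) in cur.fetchall():
            migrate_user(cur, user_id)
```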
[11:30:22] <ddevault> back of the napkin math on users per second suggests about 10-15 minutes per migration
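(Unpacking the napkin math: at roughly 33 users per second, 10-15 minutes per service corresponds to somewhere around 20,000-30,000 users; the user counts are inferred from the quoted rate and duration, not actual figures.)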
[11:33:06] <ddevault> I think we can start bringing each service back online independently after its migration is done, perhaps saving hub for last
[11:33:22] <ddevault> bitfehler: are you able to do smoke testing on each service after I bring it back up, so I can move on to the next?
[11:34:44] <bitfehler> sure. like, checking all my stuff is still there and such?
[11:34:52] <ddevault> yeah, just poke about and see if you notice anything strange/broken
[11:34:56] <bitfehler> can do
[11:35:05] <ddevault> though, there is a risk with switching things online mid-migration
[11:35:13] <ddevault> if the system is writable then we'll have to lose changes if we want to roll back to the snapshot
[11:35:54] <ddevault> do you want to add that nginx thing to all services and bring them online read-only until we finish all of the migrations?
[11:36:18] <bitfehler> yeah, sounds good
[11:36:27] <ddevault> cool, let me make sure your SSH access is all set up
[11:37:32] <ddevault> you should be good to go
[11:42:52] <bitfehler> nginx is running and you just shut down the git.sr.ht service itself, is that correct?
[11:42:59] <ddevault> aye
[11:43:18] <ddevault> just add the limit directive to each config file's proxy location
[11:43:20] <bitfehler> so i can just restart it read-only right away?
[11:43:21] <ddevault> then reload nginx
[11:43:29] <bitfehler> yep cool
[11:43:33] <ddevault> no, don't start the services until the migrations are finished
[11:43:36] <ddevault> I'll take care of that step
[11:43:48] <ddevault> until their respective migrations*
[11:43:49] <bitfehler> sorry, i meant nginx.
[11:44:14] <ddevault> doas nginx -s reload is safer than service nginx restart
[11:44:32] <ddevault> nginx -s reload will check for errors and tell you about them before applying changes, restart will shut off nginx if there are errors
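The pattern being recommended, spelled out as commands (doas as used on these hosts):

```sh
# Validate the new config first, then signal the running master to reload.
# If the config is broken the error is reported and the old workers keep
# serving; a full "service nginx restart" with a broken config would
# instead leave nginx down.
doas nginx -t
doas nginx -s reload
```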
[11:44:46] <ddevault> bah, another error in the migration, due presumably to my workaround
[11:44:51] <bitfehler> ok, done
[11:44:55] <bitfehler> oh no
[11:44:58] <ddevault> null in non-nullable remote_id column
[11:45:15] <ddevault> what should we do here, hmm...
[11:45:31] <ddevault> set ID to -old_id?
[11:45:37] <ddevault> then we can drop all rows with negative IDs later
[11:46:21] <bitfehler> users that are in some service but not in meta are users that ought to be deleted?
[11:46:25] <ddevault> yeah
[11:46:31] <ddevault> that's my thinking
[11:46:36] <ddevault> won't fully flesh out that line of thought until post migration
[11:46:52] <ddevault> probably a botched user deletion request
[11:46:54] <ddevault> since those are done by hand
[11:47:54] <bitfehler> i guess negative might work then
[11:48:10] <ddevault> I don't have any better ideas
[11:48:21] <ddevault> we have a snapshot, let's go with this and if it raises eyebrows in smoke testing we can revisit the choice
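A minimal sketch of the stop-gap being agreed on here, with illustrative names rather than the actual migration code: users present locally but missing from meta.sr.ht get remote_id = -old_id, which satisfies the NOT NULL constraint, stays unique, and makes the affected rows trivial to find (remote_id < 0) and clean up after the window.

```python
def resolve_remote_id(meta_users, user):
    """Hypothetical helper: map a local user to their meta.sr.ht ID.

    meta_users is assumed to be a dict of username -> meta.sr.ht record.
    Users unknown to meta.sr.ht get a negative sentinel instead of NULL,
    flagging them for deletion after the maintenance window.
    """
    remote = meta_users.get(user.username)
    if remote is None:
        return -user.id
    return remote.id
```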
[11:48:34] <ddevault> migration resumed
[11:48:51] <ddevault> you know, I had thought we would be wise to test these against production database dumps, but part of me never wants to look at production database dumps
[11:49:38] <ddevault> I suspect that we don't actually need to commit the hot fixes for the migration scripts
[11:49:46] <ddevault> since third-party instances are unlikely to have our hacky manual account deletion process
[11:50:00] <ddevault> 99% sure that's what causes these, I remember a couple of instances where I deleted meta.sr.ht accounts but not other services
[11:50:03] <bitfehler> maybe they have even worse ones... :)
[11:50:15] <ddevault> hah, well, they made their bed
[11:50:21] <ddevault> unsupported use-case(TM)
[11:50:46] <bitfehler> i'm sure if someone runs into this we'll hear from them, so might as well wait for that...
[11:50:51] <ddevault> aye
[11:53:02] <ddevault> lol just tried to use my paste.sr.ht script
[11:54:01] <bitfehler> hehe, yeah, working w/o sr.ht can be difficult...
[11:54:31] <ddevault> ah, forgot to do anything with dispatch.sr.ht
[11:54:35] <ddevault> just shutting it off at least
[11:54:50] <ddevault> adnano confirmed that we don't need to bother with a migration for it
[11:55:18] <bitfehler> easy win
[11:55:24] <ddevault> just in time for christmas https://metrics.sr.ht/targets
[11:55:40] <bitfehler> lol
[11:56:41] <ddevault> I might paste up the logs from this channel during the migration and share them for transparency/third-party reference if no one is opposed
[11:59:07] <bitfehler> fine with me
[12:00:50] <ddevault> it would be nice if we could cut the services over to a maintenance mode in the future
[12:01:01] <ddevault> show a page explaining the situation
[12:01:21] <bitfehler> yeah
[12:05:02] <ddevault> hrm, my remote_id null fix does not seem to have worked
[12:05:12] <ddevault> ah, I see why, simple mistake
[12:05:13] <ddevault> here we go again
[12:05:18] <ddevault> took 981 seconds to run the migration btw
[12:07:00] <bitfehler> the full one?
[12:07:07] <ddevault> yeah, for git.sr.ht
[12:07:13] <ddevault> which has more users than any other service iirc
[12:22:26] <ddevault> git.sr.ht done
[12:22:32] <ddevault> service is online
[12:22:35] <bitfehler> \o/
[12:22:38] <ddevault> bitfehler: go ahead with smoke tests
[12:22:40] <ddevault> moving on to hg.sr.ht
[12:22:45] <bitfehler> ack
[12:24:58] <ddevault> there are 3 users with negative UIDs now
[12:26:01] <bitfehler> still poking, but at first glance git.sr.ht looks pretty fine, and POSTs are forbidden as expected
[12:26:07] <ddevault> nice
[12:29:23] <bitfehler> meh. my home internet acted up. git.sr.ht is fine
[12:29:37] <ddevault> woohoo!
[12:32:11] <ddevault> hg.sr.ht up
[12:32:19] <ddevault> moving on to builds
[12:34:35] <bitfehler> hg.sr.ht looks ok
[12:34:49] <bitfehler> nginx on builds is also ready
[12:35:02] <ddevault> you can go ahead and install all of the nginx changes ahead of my work
[12:35:32] <bitfehler> what's the order?
[12:35:41] <ddevault> builds, lists, todo, pages, paste, man, hub
[12:35:46] <bitfehler> ack
[12:39:13] <bitfehler> hmm, pages..
[12:39:24] <ddevault> we'll just leave it offline until the end
[12:39:29] <ddevault> I don't see a read-only mode being feasible for it
[12:39:37] <bitfehler> same here, ok
[12:39:42] <ddevault> or at least not easy
[12:43:40] <ddevault> https://news.ycombinator.com/item?id=33341716
[12:44:16] <bitfehler> lol
[12:45:22] <bitfehler> nginx running read-only on all those services
[12:45:25] <ddevault> ty
[12:51:27] <ddevault> builds.sr.ht online
[12:52:41] <ddevault> starting lists.sr.ht
[12:53:12] <bitfehler> builds looking good
[12:57:37] <adnano> note that dispatch should be modified to run on the old version of core.sr.ht
[12:57:39] <ddevault> moving pages to the end of the list
[12:57:45] <ddevault> adnano: just not going to update it, ez
[12:58:00] <ddevault> it's being decommissioned in 2 months
[13:06:56] <ddevault> lists up
[13:08:15] <ddevault> todo underway
[13:10:16] <bitfehler> lists looking good
[13:11:16] <ddevault> thought for next time
[13:11:32] <ddevault> we could have left all the services online in read-only mode and only turned off whichever one is actively being migrated
[13:16:17] <bitfehler> did you disable the lists.sr.ht worker?
[13:16:21] <ddevault> yes
[13:16:39] <bitfehler> so mail was accepted and will be backfilled, right?
[13:16:43] <ddevault> yep
[13:16:45] <bitfehler> cool
[13:16:52] <ddevault> ditto for emails to todo.sr.ht
[13:17:03] <bitfehler> yep
[13:25:25] <ddevault> todo up
[13:27:18] <ddevault> paste underway
[13:29:37] <bitfehler> todo lookin good
[13:30:48] <ddevault> paste up
[13:31:37] <bitfehler> looks sane as well
[13:32:51] <ddevault> man underway
[13:39:04] <bitfehler> hm, why are the prometheus endpoints still down, is that expected?
[13:39:14] <ddevault> seem to be up to me
[13:39:24] <ddevault> the APIs are still shut off
[13:39:31] <ddevault> but the service endpoints are working
[13:39:35] <bitfehler> ah, ok
[13:39:48] <bitfehler> yeah, meant the apis
[13:39:58] <ddevault> they can't be made read-only
[13:40:58] <bitfehler> aren't all requests POSTs anyways?
[13:41:04] <bitfehler> to the api i mean?
[13:41:11] <ddevault> yeah
[13:47:53] <ddevault> man also has a lot of users, this'll take a minute
[13:48:00] <ddevault> all users were once redirected to man post signup
[13:50:02] <ddevault> speak of the devil
[13:50:04] <ddevault> man is up
[13:51:44] <ddevault> starting hub
[13:53:59] <bitfehler> man looking good
[13:54:08] <bitfehler> i guess hub will also take some time, right
[13:54:10] <bitfehler> ?
[13:54:17] <ddevault> maybe
[13:54:33] <ddevault> users haven't been redirected there post-signup for as long as they were to man
[14:02:00] <ddevault> we have now officially exceeded our maintenance window estimate
[14:02:29] <ddevault> good news is that we're pretty close
[14:13:08] <ddevault> bah!
[14:13:14] <bitfehler> oh no?
[14:13:16] <ddevault> forgot the services have to be writable for hub to rewrite webhook URLs
[14:13:22] <ddevault> *nearly* removed the nginx directives in time
[14:13:27] <ddevault> now we have to re-run the whole migration
[14:13:48] <bitfehler> so we have to make all of them writable again?
[14:13:51] <ddevault> yeah
[14:13:53] <ddevault> working on it now
[14:13:59] <bitfehler> we could allow from just our own network?
[14:14:07] <ddevault> nah, I was going to suggest making them writable at this point anyway
[14:14:13] <bitfehler> alright
[14:14:18] <ddevault> would have got it done in time if I had correctly guessed the order in which the migration would use them
[14:14:35] <bitfehler> lol. sounds like you don't need any help then?
[14:14:39] <ddevault> nah, thanks though
[14:15:40] <ddevault> now writable on all services which are up
[14:15:44] <ddevault> going to start up the APIs also
[14:15:47] <ddevault> but leave webhooks until hub is finished
[14:15:54] <ddevault> they'll backfill
[14:16:13] <adnano> the migration is split into two files. did the first one complete at least?
[14:16:25] <ddevault> yeah
[14:16:32] <ddevault> but they're all run in a transaction so it got rolled back
[14:16:42] <adnano> I see
[14:16:51] <adnano> I guess we could have done one step at a time
[14:16:55] * ddevault shrugs
[14:17:13] <ddevault> APIs up
[14:17:32] <ddevault> bringing up the builders too
[14:17:34] <ddevault> and git dispatch
[14:19:46] <ddevault> bringing up lists
[14:20:06] <ddevault> I think I've brought up everything I care to at this stage
[14:20:17] <ddevault> now we're Mostly Online(TM)
[14:20:29] <bitfehler> pages still down on purpose?
[14:20:33] <ddevault> yeah
[14:20:35] <ddevault> it hasn't been migrated yet
[14:20:37] <ddevault> saving it for last
[14:20:42] <bitfehler> ok
[14:22:32] <adnano> I think we might have been able to keep pages online while the other services were migrating
[14:22:52] <ddevault> read-only, yes
[14:23:03] <ddevault> we could have kept all the services online (read-only) except for whichever one was being actively migrated, noted it earlier
[14:31:23] <ddevault> going to kick off the pages migration as well, I don't see any reason it can't run in parallel with the hub migration
[14:35:00] <ddevault> huh
[14:35:05] <ddevault> pages.sr.ht does not have a webhooks table and no one noticed
[14:35:33] <bitfehler> hehe
[14:35:34] <ddevault> pages online
[14:37:09] <ddevault> another hub issue, starting over... again
[14:37:15] <bitfehler> derp
[14:37:24] <bitfehler> nothing serious, though?
[14:37:25] <ddevault> scoping it to just do the first two upgrades
[14:37:32] <ddevault> so that we don't have to keep re-running this if the third has more problems
[14:37:38] <ddevault> yeah, nothing serious, just more missing user bugs
[14:39:18] <ddevault> also doing some closing out tasks while waiting on hub
[14:43:08] <ddevault> might as well turn dispatch.sr.ht back on, I guess
[14:43:31] <ddevault> is it obvious that I like one of my children less than the others
[14:44:52] <bitfehler> mind if i take a break at this point? it all seems pretty settled, right?
[14:47:28] <ddevault> go for it
[14:47:33] <ddevault> I don't think there's anything left for you to help with
[14:47:35] <ddevault> thanks :)
[14:48:39] <bitfehler> sure thing!
[14:58:08] <ddevault> 2/3 migrations done for hub
[14:58:14] <ddevault> bringing it back online before running the third
[14:58:32] <ddevault> third underway
[15:26:56] <ddevault> this last migration is going to take a while
[15:27:10] <ddevault> as much as a couple of seconds per user
[15:38:43] <ddevault> well, thankfully we're more or less online
[15:38:49] <ddevault> and the remaining work should mostly tend to itself
[15:53:41] <bitfehler> is it too early for a high five yet? :)
[15:55:35] <ddevault> yep
[15:55:43] <ddevault> fat lady has yet to sing
[15:56:03] <bitfehler> lolwut. is that a saying in english?
[15:56:08] <ddevault> yeah
[15:56:16] <bitfehler> i love it :D
[15:56:18] <ddevault> https://en.wikipedia.org/wiki/It_ain%27t_over_till_the_fat_lady_sings
[15:57:10] <bitfehler> amazing
[18:17:44] <ddevault> I foresee a problem when the webhooks are turned back on
[18:17:52] <ddevault> there's presumably a bunch of webhooks queued up with stale URLs for hub.sr.ht
[18:18:00] <ddevault> possible solution is just to drop them
[18:18:38] <adnano> I think that's probably the best option
[18:19:10] <adnano> It could result in a few missing events or missing deletion propagations, though
[18:19:49] <adnano> alternatively one could somehow correct the webhook URLs. not sure how that would be done
[18:21:24] <adnano> how many webhooks are queued up? it might not be that big of a deal
[18:22:45] <ddevault> metrics.sr.ht knows
[18:25:37] <adnano> looks like it's just lists.sr.ht with 58 queued webhooks?
[18:27:17] <adnano> and one builds trigger webhook
[18:27:58] <ddevault> cool
[21:41:20] <adnano> so, did the fat lady sing yet? :)
[21:50:44] <bitfehler> lol
[22:53:17] <ddevault> not yet
[23:00:01] <bitfehler> wow :/
[00:14:25] <ddevault> okay webhooks migrated
[00:14:30] <ddevault> now I have to figure out how to drop them
[00:14:35] <ddevault> could just shut off hub until the services catch up
[00:14:49] <ddevault> easiest way I think
[00:15:52] <ddevault> done
[00:16:51] <ddevault> bitfehler: can you turn monitoring back on when you have the chance
[00:17:21] <ddevault> nice work everyone
[00:20:55] <adnano> woohoo!
[00:27:16] <bl4ckb0ne> now can the fat lady sing
[00:27:21] <ddevault> sure.
[00:27:27] <ddevault> laaaa la la laaa dee doo