[10:56:33] all set for the rollout
[10:57:32] i'll get another tea, but otherwise ready
[11:00:43] bitfehler: can you rustle up an nginx config which disables POST requests for meta.sr.ht
[11:01:37] aye. is the one in sr.ht-nginx the one actually running in prod?
[11:01:42] the one in prod
[11:01:46] though they should be the same
[11:01:58] so if i work based on that it should do?
[11:02:08] yeah
[11:02:14] ack
[11:02:41] if you would also be so kind as to log in and shut up metrics.sr.ht afterwards that would be nice
[11:03:03] log in where?
[11:03:06] metrics.sr.ht
[11:03:16] sircmpwn@metrics.sr.ht
[11:03:17] i don't think i can?
[11:03:25] you should be good now
[11:03:57] aye, thanks
[11:10:49] figured out how to disable POST
[11:11:02] https://paste.sr.ht/~bitfehler/3cb46a6c78505778f2dc1c0c0357478abaa2c60e
[11:11:06] oh, ok :)
[11:11:09] ty
[11:11:23] lmk when metrics is out of the picture and we'll get started
[11:12:19] it is
[11:12:24] here goes
[11:18:36] these migration scripts are kind of slow, it's going to take a while
[11:19:43] might have been good to dump the meta IDs to a CSV while meta.sr.ht was still online and work from that instead
[11:20:52] seeing transient DNS failures causing the migration to abort partway through
[11:20:55] added meta.sr.ht to /etc/hosts as workaround
[11:21:51] is that a coincidence?
[11:22:10] not sure, not the time to investigate
[11:22:15] we have 3 DNS servers in /etc/resolv.conf on all systems
[11:22:20] could be that python's resolver does not query them all
[11:22:34] might have also been nice to add tqdm here to monitor progress and give an estimate on completion time
[11:25:30] do you have any indication at all for how long it might take?
[11:25:37] we'll see after git.sr.ht migration is done
[11:25:50] looks like we're getting a couple dozen users migrated per second
[11:26:21] 33 users per second
[11:26:25] I'll let you do the math
[11:26:56] damn, hit an exception
[11:27:03] user known to git.sr.ht but not to meta.sr.ht
[11:27:06] that's a problem
[11:28:17] solution: hot patching migration script, will fix it properly after maintenance window
[11:28:28] it's all done in a transaction so have to start over
[11:29:26] ok. i'd be curious how that user got there, but we can also talk about it later
[11:29:46] yeah, I'm taking notes, there'll be follow ups
[11:30:22] back of the napkin math on users per second suggests about 10-15 minutes per migration
[11:33:06] I think we can start bringing each service back online independently after its migration is done, perhaps saving hub for last
[11:33:22] bitfehler: are you able to do smoke testing on each service after I bring it back up, so I can move on to the next?
[11:34:44] sure. like, checking all my stuff is still there and such?
[11:34:52] yeah, just poke about and see if you notice anything strange/broken
[11:34:56] can do
[11:35:05] though, there is a risk with switching things online mid-migration
[11:35:13] if the system is writable then we'll have to lose changes if we want to roll back to the snapshot
[11:35:54] do you want to add that nginx thing to all services and bring them online read-only until we finish all of the migrations?
[11:36:18] yeah, sounds good
[11:36:27] cool, let me make sure your SSH access is all set up
[11:37:32] you should be good to go
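
The config in the paste above is not reproduced in the log. A minimal sketch of this kind of read-only nginx setup, assuming the usual limit_except directive inside each service's proxy location (the server name and upstream here are placeholders, not the production sr.ht-nginx config):

    # Hypothetical read-only server block for one service; the real config
    # lives in sr.ht-nginx and the paste linked above.
    server {
        server_name meta.sr.ht;

        location / {
            # Allow only read methods (GET implies HEAD); POST, PUT, DELETE,
            # PATCH and the rest get 403 while the site stays browsable.
            limit_except GET {
                deny all;
            }
            proxy_pass http://127.0.0.1:5000;  # placeholder upstream
        }
    }

This is presumably what the "limit directive" mentioned just below refers to; applying it is a config edit plus an nginx reload, with no application changes.
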
[11:42:52] nginx is running and you just shut down git.sr.ht service itself, is that correct?
[11:42:59] aye
[11:43:18] just add the limit directive to each config file's proxy location
[11:43:20] so i can just restart it read-only right away?
[11:43:21] then reload nginx
[11:43:29] yep cool
[11:43:33] no, don't start the services until the migrations are finished
[11:43:36] I'll take care of that step
[11:43:48] until their respective migrations*
[11:43:49] sorry, i meant nginx.
[11:44:14] doas nginx -s reload is safer than service nginx restart
[11:44:32] nginx -s reload will check for errors and tell you about them before applying changes, restart will shut off nginx if there are errors
[11:44:46] bah, another error in the migration, due presumably to my workaround
[11:44:51] ok, done
[11:44:55] oh no
[11:44:58] null in non-nullable remote_id column
[11:45:15] what should we do here, hmm...
[11:45:31] set ID to -old_id?
[11:45:37] then we can drop all rows with negative IDs later
[11:46:21] users that are in some service but not in meta are users that ought to be deleted?
[11:46:25] yeah
[11:46:31] that's my thinking
[11:46:36] won't fully flesh out that line of thought until post migration
[11:46:52] probably a botched user deletion request
[11:46:54] since those are done by hand
[11:47:54] i guess negative might work then
[11:48:10] I don't have any better ideas
[11:48:21] we have a snapshot, let's go with this and if it raises eyebrows in smoke testing we can revisit the choice
[11:48:34] migration resumed
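
The patched migration script is not shown in the log, but the placeholder approach agreed on above amounts to something like this sketch, assuming a SQLAlchemy engine and a user table with id and remote_id columns (the DSN, table, and column names are illustrative rather than the exact sr.ht schema):

    # Hypothetical sketch of the negative-ID placeholder, not the actual hot
    # patch: users known to this service but missing from meta.sr.ht get
    # remote_id = -id, which satisfies the NOT NULL constraint and leaves the
    # orphaned rows easy to find after the maintenance window.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://localhost/git.sr.ht")  # placeholder DSN

    with engine.begin() as conn:  # one transaction, as the log says the real migration runs
        conn.execute(text(
            'UPDATE "user" SET remote_id = -id WHERE remote_id IS NULL'
        ))
        # Post-window cleanup, once the botched deletions have been reviewed:
        #   DELETE FROM "user" WHERE remote_id < 0;
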
[11:48:51] you know, I had thought we would be wise to test these against production database dumps, but part of me never wants to look at production database dumps
[11:49:38] I suspect that we don't actually need to commit the hot fixes for the migration scripts
[11:49:46] since third-party instances are unlikely to have our hacky manual account deletion process
[11:50:00] 99% sure that's what causes these, I remember a couple of instances where I deleted meta.sr.ht accounts but not other services
[11:50:03] maybe they have even worse ones... :)
[11:50:15] hah, well, they made their bed
[11:50:21] unsupported use-case(TM)
[11:50:46] i'm sure if someone runs into this we'll hear from them, so might as well wait for that...
[11:50:51] aye
[11:53:02] lol just tried to use my paste.sr.ht script
[11:54:01] hehe, yeah, working w/o sr.ht can be difficult...
[11:54:31] ah, forgot to do anything with dispatch.sr.ht
[11:54:35] just shutting it off at least
[11:54:50] adnano confirmed that we don't need to bother with a migration for it
[11:55:18] easy win
[11:55:24] just in time for christmas https://metrics.sr.ht/targets
[11:55:40] lol
[11:56:41] I might paste up the logs from this channel during the migration and share them for transparency/third-party reference if no one is opposed
[11:59:07] fine with me
[12:00:50] it would be nice if we could cut the services over to a maintenance mode in the future
[12:01:01] show a page explaining the situation
[12:01:21] yeah
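
A maintenance mode like the one wished for here can be done entirely in nginx. A sketch, assuming a single self-contained explanation page (the server name and paths are placeholders, not anything that exists in sr.ht-nginx):

    # Hypothetical maintenance-mode server block: every request is answered
    # with a 503 plus a static page explaining the situation.
    server {
        server_name example.sr.ht;  # placeholder

        error_page 503 /maintenance.html;

        location = /maintenance.html {
            root /var/www/maintenance;  # placeholder path to the static page
            internal;
        }

        location / {
            return 503;
        }
    }

Switching a service in and out of maintenance would then just be an nginx reload with and without this block.
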
[12:05:02] hrm, my remote_id null fix does not seem to have worked
[12:05:12] ah, I see why, simple mistake
[12:05:13] here we go again
[12:05:18] took 981 seconds to run the migration btw
[12:07:00] the full one?
[12:07:07] yeah, for git.sr.ht
[12:07:13] which has more users than any other service iirc
[12:22:26] git.sr.ht done
[12:22:32] service is online
[12:22:35] \o/
[12:22:38] bitfehler: go ahead with smoke tests
[12:22:40] moving on to hg.sr.ht
[12:22:45] ack
[12:24:58] there are 3 users with negative UIDs now
[12:26:01] still poking, but at first glance git.sr.ht looks pretty fine, and POSTs are forbidden as expected
[12:26:07] nice
[12:29:23] meh. my home internet acted up. git.sr.ht is fine
[12:29:37] woohoo!
[12:32:11] hg.sr.ht up
[12:32:19] moving on to builds
[12:34:35] hg.sr.ht looks ok
[12:34:49] nginx on builds is also ready
[12:35:02] you can go ahead and install all of the nginx changes ahead of my work
[12:35:32] what's the order?
[12:35:41] builds, lists, todo, pages, paste, man, hub
[12:35:46] ack
[12:39:13] hmm, pages..
[12:39:24] we'll just leave it offline until the end
[12:39:29] I don't see a read-only mode being feasible for it
[12:39:37] same here, ok
[12:39:42] or at least not easy
[12:43:40] https://news.ycombinator.com/item?id=33341716
[12:44:16] lol
[12:45:22] nginx running read-only on all those services
[12:45:25] ty
[12:51:27] builds.sr.ht online
[12:52:41] starting lists.sr.ht
[12:53:12] builds looking good
[12:57:37] note that dispatch should be modified to run on the old version of core.sr.ht
[12:57:39] moving pages to the end of the list
[12:57:45] adnano: just not going to update it, ez
[12:58:00] it's being decommissioned in 2 months
[13:06:56] lists up
[13:08:15] todo underway
[13:10:16] lists looking good
[13:11:16] thought for next time
[13:11:32] we could have left all the services online in read-only mode and only turned off whichever one was actively being migrated
[13:16:17] did you disable the lists.sr.ht worker?
[13:16:21] yes
[13:16:39] so mail was accepted and will be backfilled, right?
[13:16:43] yep
[13:16:45] cool
[13:16:52] ditto for emails to todo.sr.ht
[13:17:03] yep
[13:25:25] todo up
[13:27:18] paste underway
[13:29:37] todo lookin good
[13:30:48] paste up
[13:31:37] looks sane as well
[13:32:51] man underway
[13:39:04] hm, why are the prometheus endpoints still down, is that expected?
[13:39:14] seem to be up to me
[13:39:24] the APIs are still shut off
[13:39:31] but the service endpoints are working
[13:39:35] ah, ok
[13:39:48] yeah, meant the apis
[13:39:58] they can't be made read-only
[13:40:58] aren't all requests POSTs anyways?
[13:41:04] to the api i mean?
[13:41:11] yeah
[13:47:53] man also has a lot of users, this'll take a minute
[13:48:00] all users were once redirected to man post signup
[13:50:02] speak of the devil
[13:50:04] man is up
[13:51:44] starting hub
[13:53:59] man looking good
[13:54:08] i guess hub will also take some time, right
[13:54:10] ?
[13:54:17] maybe
[13:54:33] users haven't been redirected there post-signup for as long as they were to man
[14:02:00] we have now officially exceeded our maintenance window estimate
[14:02:29] good news is that we're pretty close
[14:13:08] bah!
[14:13:14] oh no?
[14:13:16] forgot the services have to be writable for hub to rewrite webhook URLs
[14:13:22] *nearly* removed the nginx directives in time
[14:13:27] now we have to re-run the whole migration
[14:13:48] so we have to make all of them writable again?
[14:13:51] yeah
[14:13:53] working on it now
[14:13:59] we could allow from just our own network?
[14:14:07] nah, I was going to suggest making them writable at this point anyway
[14:14:13] alright
[14:14:18] would have got it done in time if I had correctly guessed the order in which the migration would use them
[14:14:35] lol. sounds like you don't need any help then?
[14:14:39] nah, thanks though
[14:15:40] now writable on all services which are up
[14:15:44] going to start up the APIs also
[14:15:47] but leave webhooks until hub is finished
[14:15:54] they'll backfill
[14:16:13] the migration is split into two files. did the first one complete at least?
[14:16:25] yeah
[14:16:32] but they're all run in a transaction so it got rolled back
[14:16:42] I see
[14:16:51] I guess we could have done one step at a time
[14:16:55] * shrugs
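
"One step at a time" would mean committing each migration file in its own transaction, so a failure in a later step no longer rolls back a step that already completed. A minimal sketch of that idea, assuming plain SQL files and a SQLAlchemy engine (file names and DSN are placeholders; this is not the actual sr.ht migration runner):

    # Hypothetical per-step migration runner: each file commits on its own,
    # so only the failing step (if any) is rolled back.
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://localhost/hub.sr.ht")  # placeholder DSN

    for path in ["migration_step_1.sql", "migration_step_2.sql"]:  # placeholder files
        with open(path) as f:
            sql = f.read()
        # engine.begin() commits when the block exits and rolls back on error,
        # scoping the transaction to this single step.
        with engine.begin() as conn:
            conn.exec_driver_sql(sql)

The trade-off is that a mid-run failure leaves the database between steps rather than cleanly rolled back, which is presumably why the single-transaction-plus-snapshot approach was preferred here.
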
[14:17:13] APIs up
[14:17:32] bringing up the builders too
[14:17:34] and git dispatch
[14:19:46] bringing up lists
[14:20:06] I think I've brought up everything I care to at this stage
[14:20:17] now we're Mostly Online(TM)
[14:20:29] pages still down on purpose?
[14:20:33] yeah
[14:20:35] it hasn't been migrated yet
[14:20:37] saving it for last
[14:20:42] ok
[14:22:32] I think we might have been able to keep pages online while the other services were migrating
[14:22:52] read-only, yes
[14:23:03] we could have kept all the services online (read-only) except for whichever one was being actively migrated, noted it earlier
[14:31:23] going to kick off the pages migration as well, I don't see any reason it can't run in parallel with the hub migration
[14:35:00] huh
[14:35:05] pages.sr.ht does not have a webhooks table and no one noticed
[14:35:33] hehe
[14:35:34] pages online
[14:37:09] another hub issue, starting over... again
[14:37:15] derp
[14:37:24] nothing serious, though?
[14:37:25] scoping it to just do the first two upgrades
[14:37:32] so that we don't have to keep re-running this if the third has more problems
[14:37:38] yeah, nothing serious, just more missing user bugs
[14:39:18] also doing some closing out tasks while waiting on hub
[14:43:08] might as well turn dispatch.sr.ht back on, I guess
[14:43:31] is it obvious that I like one of my children less than the others
[14:44:52] mind if i take a break at this point? it all seems pretty settled, right?
[14:47:28] go for it
[14:47:33] I don't think there's anything left for you to help with
[14:47:35] thanks :)
[14:48:39] sure thing!
[14:58:08] 2/3 migrations done for hub
[14:58:14] bringing it back online before running the third
[14:58:32] third underway
[15:26:56] this last migration is going to take a while
[15:27:10] as much as a couple of seconds per user
[15:38:43] well, thankfully we're more or less online
[15:38:49] and the remaining work should mostly tend to itself
[15:53:41] is it too early for a high five yet? :)
[15:55:35] yep
[15:55:43] fat lady has yet to sing
[15:56:03] lolwut. is that a saying in english?
[15:56:08] yeah
[15:56:16] i love it :D
[15:56:18] https://en.wikipedia.org/wiki/It_ain%27t_over_till_the_fat_lady_sings
[15:57:10] amazing
[18:17:44] I foresee a problem when the webhooks are turned back on
[18:17:52] there's presumably a bunch of webhooks queued up with stale URLs for hub.sr.ht
[18:18:00] possible solution is just to drop them
[18:18:38] I think that's probably the best option
[18:19:10] It could result in a few missing events or missing deletion propagations, though
[18:19:49] alternatively one could somehow correct the webhook URLs. not sure how that would be done
[18:21:24] how many webhooks are queued up? it might not be that big of a deal
[18:22:45] metrics.sr.ht knows
[18:25:37] looks like it's just lists.sr.ht with 58 queued webhooks?
[18:27:17] and one builds trigger webhook
[18:27:58] cool
[21:41:20] so, did the fat lady sing yet? :)
[21:50:44] lol
[22:53:17] not yet
[23:00:01] wow :/
[00:14:25] okay webhooks migrated
[00:14:30] now I have to figure out how to drop them
[00:14:35] could just shut off hub until the services catch up
[00:14:49] easiest way I think
[00:15:52] done
[00:16:51] bitfehler: can you turn monitoring back on when you have the chance
[00:17:21] nice work everyone
[00:20:55] woohoo!
[00:27:16] now can the fat lady sing
[00:27:21] sure.
[00:27:27] laaaa la la laaa dee doo