[01:00:46] 10DBA, 10Community-Tech, 10MediaWiki-General-or-Unknown, 10Operations, 10Stewards-and-global-tools (Temporary-UserRights): Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3655999 (10EddieGP) a:03EddieGP >>! In T176754#3655920, @Dzahn wrote: > ..of...
[02:16:48] 10DBA, 10Community-Tech, 10MediaWiki-General-or-Unknown, 10Operations, 10Stewards-and-global-tools (Temporary-UserRights): Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3656030 (10Legoktm) 05Open>03declined I agree with T176754#3636245 and am...
[05:24:56] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T177171#3656157 (10Marostegui) 05Open>03Resolved Raid back to optimal - thank you Chris!: ``` root@db1056:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[05:28:16] 10Blocked-on-schema-change, 10DBA, 10Readers-Community-Engagement, 10Community-Liaisons (Oct-Dec 2017): Help communicate read-only time for Commons for schema change required by adding 3D filetype - https://phabricator.wikimedia.org/T176883#3656159 (10Marostegui) Sure! Just let me know if you need anything...
[05:32:50] 10DBA, 10Analytics: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033#3656161 (10Marostegui) >>! In T153033#3653939, @Nuria wrote: > @marostegui: let's put them on a mediawiki-archive database, the staging database (if I am not mistaken) has open permits for everyone to delete /up...
[05:35:48] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3656163 (10Marostegui)
[05:44:24] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3656164 (10Marostegui)
[06:10:27] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3656189 (10Marostegui)
[06:11:03] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3635130 (10Marostegui)
[06:31:53] 10DBA, 10Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3656200 (10Marostegui)
[06:34:43] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#3656202 (10Paladox) Bump.
[06:36:11] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#3628916 (10Marostegui) >>! In T176532#3656202, @Paladox wrote: > Bump. Hey Paladox Chec...
[07:02:33] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3656247 (10Marostegui)
[07:15:31] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3656262 (10Marostegui)
[07:17:37] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3656264 (10Marostegui)
[07:21:57] elukey: https://phabricator.wikimedia.org/T168303#3653222 and following. ok with the plan?
[07:22:20] 1TB back!!!
[07:22:32] marostegui: and hopefully replication
[07:22:55] jynus: I am super ok, thanks!
[07:22:56] jynus: I guess your last comment means dbstore1001?
[07:23:01] (on the ticket)
[07:25:02] yes
[07:31:09] thanks a lot people for all the help on dbstore1002
[07:38:58] I always say we have no problem helping
[07:39:14] but that is the key- I am helping your team, you own the service
[08:20:12] yes completely agree
[08:21:04] BTW, marostegui s5 backups were created flawlessly on dbstore2001 for s5 during the night
[08:21:37] dbstore1001 is choking trying to do something
[08:23:11] stumbling into starting and stopping all slaves, which conflicts with itself (even if delayed replication is disabled)
[08:27:34] jynus: great news about dbstore2001!! :)
[08:29:33] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3656417 (10Marostegui)
[08:30:31] 10DBA, 10Operations, 10ops-codfw: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3656422 (10Marostegui) a:03Papaul db2010 is ready to be fully decommissioned by @Papaul
[08:40:24] so I am thinking of creating a /srv/backups/ logical + raw + binlog / in_progress + latest + 24_hours
[09:33:23] 10DBA, 10Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3656592 (10Marostegui)
[09:50:52] I am going to upgrade labsdbs to validate the new package and test the rolling restart workflow
[09:51:09] mmm
[09:51:13] let me check 1010
[09:51:16] because I was altering it
[09:51:26] oh, ok
[09:51:30] I can wait
[09:51:40] 1010 was actually upgraded already
[09:51:42] it should be done in 2-3 hours I think
[09:51:46] I am only touching 1010
[09:51:51] I was going to do 9 and 11
[09:51:53] ah
[09:51:55] then go ahead :)
[09:52:20] labsdb1009 10.1.25
[09:52:27] labsdb1010 10.1.28
[09:52:36] labsdb1011 10.1.25
[09:53:03] ah :)
[09:53:04] nice
[09:53:11] so 1009 and 1011 I am not touching
[09:53:55] see also https://gerrit.wikimedia.org/r/#/c/382144/
[09:54:17] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#3656638 (10Paladox) @Marostegui oh thanks. Is there a way we can fix this please? As it w...
[10:01:15] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#3628916 (10jcrespo) > As it was working before It wasn't working before- there was a sec...
[10:35:16] I am thinking of breaking s4 replication so that it cannot start back
[10:35:26] on dbstore1001
[10:35:45] sure
[10:35:47] each of the 900 backups that are happening runs
[10:35:49] it is lagging anyways, no?
[10:36:01] START SLAVE and STOP SLAVE on all shards
[10:36:16] and it takes 30 minutes on commons for that to happen
[10:36:19] buf
[10:36:34] 30*900, imagine our throughput
[10:36:51] if I break replication temporarily, at least that will be instant
[10:37:05] and that may at least let the backups finish
[10:37:06] yeah
[10:37:10] let's just do it
[10:37:13] it is a bug for the START SLAVE
[10:37:21] to happen on an already stopped slave
[10:37:28] on --slave-info
[10:38:00] combined with broken s5 and s4 replication threads
[10:38:15] doing full table scans instead of indexes for writes
[10:38:20] it is making things not work
[10:38:30] so I can add a row on recentchanges
[10:38:40] so replication breaks
[10:38:53] and then delete it after backups finish
[10:39:05] that should stop the sql thread for good
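The "break it on purpose" idea above can be sketched as follows. Everything concrete here is hypothetical: the column list is abbreviated and the rc_id value is made up, since it has to collide with the next row the master will actually replicate:

```sql
-- Run directly on the replica, with the binlog disabled so the bait
-- row itself does not propagate downstream. When the master's INSERT
-- for the same rc_id arrives, the SQL thread stops with a
-- duplicate-key error and stays stopped.
SET SESSION sql_log_bin = 0;
INSERT INTO recentchanges (rc_id, rc_timestamp, rc_cur_id)
    VALUES (999999999, '20171003104000', 0);

-- After the backups finish, remove the bait row and resume:
DELETE FROM recentchanges WHERE rc_id = 999999999;
START SLAVE 's4';
```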
[10:41:41] or maybe even disconnect it and then configure it
[10:41:48] but yes, breaking it would also work
[10:41:49] oh, that is actually easier
[10:41:53] thanks
[10:42:07] although it is dangerous
[10:42:18] because the async STOP/START
[10:42:39] probably I can STOP;SHOW;RESET?
[10:42:45] yeah
[10:42:51] make sure to do the set default_connection bla bla bla
[10:42:52] or rely on the error log for the coords?
[10:42:53] XD
[10:43:03] i would do a stop; show; reset
[10:43:21] if it starts in the middle, it will fail?
[10:43:44] maybe you can
[10:43:46] disable events
[10:43:50] do all the stuff
[10:43:52] and start events?
[10:43:53] events are disabled
[10:44:01] the problem are the --slave-info
[10:44:05] ah
[10:44:11] that starts the replication even if it is stopped
[10:44:18] i guess the reset would complain if the replication is working
[10:45:30] STOP SLAVE 's4'; SHOW SLAVE 's4' STATUS; RESET MASTER 's4' ALL;
[10:45:59] if it doesn't work, I will go with the breaking replication plan :-)
[10:46:10] that command looks good to me
[10:46:36] i normally use set default_master_connection at the start, just to be sure
[10:46:39] but that is a personal quirk of mine
[10:46:52] so it is a combination of what I think is a backup bug from mysqldump
[10:47:10] and s4 lagging due to full table scans for some tables
[10:47:27] I think s5 was doing that but I possibly fixed it by reconstructing some tables
[10:47:49] but I cannot do that for s4 in the middle of the backups running
[10:47:55] yeah
[10:48:07] dbstore1001 has reached its performance limit anyways
[10:48:14] with all the extra load from s5..
[10:48:39] it is mostly the crashes and tokudb
[10:48:53] plus the purging lag
[10:48:59] but dbstore2001 didn't have toku and we saw it happening
[10:49:07] crashes
[10:49:16] plus one replica set goes wrong
[10:49:19] and all are affected
[10:49:19] that is true, since the dbstore2001 crash... it was never the same
[10:49:21] because transactions
[10:49:31] affect all dbs
[10:49:38] btw
[10:49:42] did you see this
[10:49:53] ?
[10:50:03] https://phabricator.wikimedia.org/T149418#3653125
[10:50:35] yeah, I was waiting for the :-) "Time allowing, it'd be nice to try it in a test environment just to see if it actually works."
[10:50:39] haha
[10:50:47] remember I was the one to comment on the ticket "that will help us"
[10:51:10] oh yeah you are subscribed to the mariadb ticket, i forgot :)
[10:51:19] https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=99569&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-99569
[10:51:43] in fact, you commented right below
[10:52:04] either you forgot or you thought that was a different jaime
[10:52:07] :-D
[10:52:14] hahaha
[10:52:26] hey I have been on holidays!
[10:52:36] :)
[10:52:54] You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ''s4' ALL' at line 1
[10:52:56] was this jaime with beard or without beard commenting?
[10:54:02] just try with default_master_connection=s4
[10:54:06] and then the normal commands
[10:54:12] yeah, I just did that
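For reference, MariaDB's multi-source syntax takes the connection name after the verb for STOP and SHOW, but the RESET variant is RESET SLAVE, not RESET MASTER, which is what the syntax error above was complaining about. Pinning default_master_connection first, as suggested, sidesteps the whole issue:

```sql
-- Per-connection form: connection name follows the verb, and RESET
-- SLAVE ... ALL (not RESET MASTER) removes the connection.
STOP SLAVE 's4';
SHOW SLAVE 's4' STATUS\G
RESET SLAVE 's4' ALL;

-- Equivalent, and less error-prone: pin the session to one connection
-- first, then use the plain single-source syntax.
SET default_master_connection = 's4';
STOP SLAVE;
SHOW SLAVE STATUS\G
RESET SLAVE ALL;
```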
[10:55:18] I have shared the output on the typical place
[10:55:49] that should unlock the backup process
[10:56:07] although we could directly drop s4 and s5 and forget about that
[10:56:29] is s4 on dbstore2001? if yes, we could run a manual backup run
[10:56:43] I can see the output now :)
[10:56:56] "see"
[10:57:05] s4 is not in dbstore2001
[10:57:28] we could try the remote backup and test it
[10:57:49] at least by next week
[10:57:55] OR
[10:58:02] setup multi-instance now
[10:58:20] (when backup finishes)
[10:58:31] you mean copying s4 on dbstore2001?
[10:58:37] no
[10:58:52] setting up s4 instance on dbstore1001
[10:58:57] aaaah
[10:59:02] which is a bit more involved
[10:59:33] but we have been delaying it for months
[10:59:44] yeah, that is also a good idea
[10:59:49] we skipped the innodb conversion
[11:00:00] but I hope multi-instance is the way
[11:00:11] i really think it is
[11:00:36] and if some sets keep having issues like these
[11:00:38] i would start converting 1001 to multi-instance indeed
[11:00:54] it "should" be easy
[11:01:01] just a transfer
[11:01:09] yeah
[11:01:20] no filtering or engine conversion
[11:01:20] we could actually convert the whole dbstore1001 maybe in a whole week
[11:01:26] (maybe)
[11:01:45] although we have the lack of performance
[11:01:57] seen on dbstore2001
[11:02:08] yeah, but with delayed replication, i think we could be ok
[11:02:13] (which is not what we want)
[11:02:15] but for now...
[11:02:26] we should talk to mark and see his opinion on purchases
[11:02:33] we can do that tomorrow
[11:02:45] having a couple of extra replicas + dbstore could be enough
[11:02:52] as it is our bi weekly meeting with him
[11:04:26] puppet agent -tv
[11:04:33] systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
[11:04:39] systemctl start mariadb
[11:05:22] mysql_upgrade --skip-ssl
[11:05:49] on labsdb?
[11:05:51] labs
[11:06:00] on every upgrade
[11:06:14] I was thinking that if I write it I will remember it
[11:06:18] haha
[11:06:22] I do have it on my own notes
[11:06:23] XD
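The steps pasted above, written down as a runbook sketch for upgrading one labsdb host (the comments are assumptions about intent; --skip-ssl is there because mysql_upgrade's local client connection otherwise trips over TLS settings):

```shell
# Run on the labsdb host being upgraded. Puppet pulls in the new
# mariadb package; MYSQLD_OPTS keeps replication stopped across the
# restart so mysql_upgrade runs against a quiet server.
puppet agent -tv
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
systemctl start mariadb
mysql_upgrade --skip-ssl

# Once mysql_upgrade is clean, start replication again by hand.
```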
[11:06:38] I am not sure my draining method was too successful
[11:06:46] I think I hard-closed connections
[11:07:01] what did you do to drain them?
[11:07:03] I will try again with the other haproxy
[11:07:22] I tried drain
[11:07:50] but not sure it worked, so I just overwrote the haproxy config and reloaded
[11:07:54] what did you do? just disabled it on haproxy?
[11:07:56] ah right
[11:08:11] i guess some people just leave the connection open and reuse it?
[11:08:15] https://github.com/rancher/rancher/issues/8627
[11:08:32] the thing is that drain/weight 0 may not work with us
[11:08:41] because we do not do load balancing, but backup
[11:09:01] so maybe it needs 3 phases, load balancing + drain
[11:09:14] and then removed from the pool?
[11:09:34] I also need to test it with a long running connection
[11:09:49] then we can just script it
[11:10:05] at least I checked connections were moved
[11:10:32] so worst case scenario, connections dropped and reconnected, which is the "right" behaviour
[11:11:16] yeah, i guess there was no error given to an active connection
[11:11:23] just killed an open sleeping one
[11:11:49] I don't know
[11:11:59] I will wait for buffer pool to heat again
[11:12:04] and will reload the proxy
[11:12:25] labsdb1011 10.1.28 516G 6m 1s 325 Yes 38m
[11:12:31] \o/
[11:12:45] I do not want to assume you know less than me
[11:12:56] but you should get familiar with haproxy commands
[11:13:00] (I wasn't)
[11:13:15] and in an emergency, there will be little time to look at manuals
[11:13:33] yeah, I normally do some reload, check the status of a given weight etc
[11:13:44] but normally not more than that
[11:13:44] yeah, reload and check, yes
[11:13:50] exactly,
[11:13:52] same here
[11:14:36] do you normally use more than that?
[11:14:46] no, that is what I am trying to learn
[11:14:58] so this was more of a self reminder
[11:15:01] that I extended to you
[11:15:42] and if possible, have predefined script with "failover to 10" or similar
[11:15:56] oh that'd be nice indeed
[11:15:59] just a "button"
[11:16:12] with e.g. drain, wait X minutes, hard failover
[11:17:08] in the past I think sean used haproxies for master failover
[11:17:29] to minimize read only time
[11:17:38] on core, I mean
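A sketch of the drain dance discussed above, using haproxy's admin socket. The socket path and the backend/server names are hypothetical, and "set server ... state drain" needs haproxy >= 1.6:

```shell
SOCK=/run/haproxy/admin.sock

# DRAIN: existing sessions keep running, new ones are dispatched
# elsewhere (requires an admin-level stats socket).
echo "set server mariadb/labsdb1010 state drain" | socat stdio "$SOCK"

# Watch the scur (current sessions) column until it reaches 0.
echo "show stat" | socat stdio "$SOCK" | awk -F, '/labsdb1010/ {print $2, $5}'

# Then take the server out of the pool entirely.
echo "set server mariadb/labsdb1010 state maint" | socat stdio "$SOCK"
```

As noted above, with a backup (not load-balanced) topology DRAIN may never dispatch new sessions anyway, so the useful part here is checking scur before the hard cutover.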
[12:42:14] I'm cleaning up ores_classification in enwiki, the deletes will be a little high
[12:42:34] but it's almost finished, just some small things left
[12:42:54] thanks for the heads up
[12:53:42] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3657128 (10Marostegui)
[12:54:04] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3638543 (10Marostegui)
[12:55:06] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3638544 (10Marostegui)
[13:17:22] done now
[13:19:09] I think it would be great if you shrink it to reclaim some space
[13:20:47] it is currently 5.3G on the master
[13:21:05] let me see how much we can get in codfw
[13:25:13] after the optimize it goes from 5.4G to 784M
[13:25:26] I will run an optimize on codfw with replication
[13:25:33] thanks
[13:25:37] it took only 2 minutes
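The shrink itself is just a table rebuild; a sketch of both variants discussed, letting it replicate or keeping it local to one replica:

```sql
-- Rebuilds the table and reclaims the space freed by the deletes
-- (on InnoDB, OPTIMIZE maps to a rebuild plus index statistics
-- recalculation). Run plainly, the statement is written to the
-- binlog and replicates downstream:
OPTIMIZE TABLE ores_classification;

-- To rebuild a single replica without the statement replicating
-- further down the chain:
SET SESSION sql_log_bin = 0;
OPTIMIZE TABLE ores_classification;
SET SESSION sql_log_bin = 1;
```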
[13:30:39] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), and 5 others: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753#3657239 (10Marostegui) >>! In T159753#3657238, @Stashbot wrote: > {nav...
[14:44:19] "ORDER BY RAND() LIMIT 10569" ugh, I think I am going to get sick
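For the record, why that query hurts: ORDER BY RAND() assigns a random value to every row and filesorts the entire table just to keep 10569 of them. One common approximation pre-filters with RAND() so only a small candidate set gets sorted; it can come up slightly short of n rows, so over-select a bit (table name here is illustrative):

```sql
-- Pre-filter so roughly 2*n rows survive, then sort only those.
SELECT *
FROM some_big_table
WHERE RAND() < 2 * 10569 / (SELECT COUNT(*) FROM some_big_table)
ORDER BY RAND()
LIMIT 10569;
```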