[05:39:05] DBA, Operations, ops-eqiad, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) @jcrespo would you mind taking a look at the above patches ^ I have also updated our etherpad with the plan. Thanks!
[06:10:05] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[06:10:28] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui) p:Triage→Normal a:Marostegui
[06:11:35] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[06:11:37] DBA, Operations, ops-eqiad, Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[06:16:01] jynus: check the sre etherpad, I have added our goals but please double check them and re-org as you feel, line 62 to 81
[06:16:19] jynus: also added this week updates to our section (line 84)
[06:41:19] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Dzahn) Resolved→Open let's only resolve stuff that is actually resolved, not what will be resolved i...
[06:58:12] sorry, I was looking at our etherpad, and was getting very confused
[06:58:46] I have updated the SRE one only
[06:59:01] it is ok, I was like "but I don't see it" :-D
[06:59:06] did it move up or down?
[06:59:08] haha
[06:59:12] and I was like, wat?
[06:59:13] on the SRE?
[06:59:16] ah
[06:59:17] hahahah
[06:59:18] on ours
[06:59:41] ours only has the usual new entry when we do a failover
[07:02:47] I need to fix the backups before updating that, they failed but ran fast without replication
[07:03:55] yeah, I didn't add any updates there really
[07:03:58] just the structure
[07:04:02] as it was empty
[07:04:16] title, key points tasks etc
[07:04:31] thanks for doing that
[07:04:34] it is very useful
[07:09:10] no problem!
[07:23:28] I am acking the other alarm for db2044
[07:23:36] (yes, don't ask me why there are 2)
[07:24:13] thanks
[07:24:23] and apparently dbstore1001 is low again on disk
[07:26:07] I don't think it can take snapshots of s1 at the time
[07:26:13] so I will change it
[07:41:11] question, marostegui: will private data check fire wrongly this monday again, or was it created already?
[07:41:24] no, it was created fine
[07:41:27] cool
[07:41:28] And I ran it to double check
[07:41:30] thanks
[07:41:31] And it was good
[07:42:06] https://phabricator.wikimedia.org/T212625#5074335
[07:43:11] thanks, I wasn't subscribed to that one
[07:43:59] no worries!
[08:56:07] Hi, i am trying to find the right Icinga URLs for Monitoring checks, what about this one:
[08:56:10] description => 'eventlogging_sync processes',
[08:56:30] it checks if eventlogging_sync.sh is running
[08:56:38] is that more https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging or more DBA
[08:56:54] because it is in the mariadb profile
[08:57:31] profile/manifests/mariadb/misc/eventlogging/replication.pp
[08:58:30] mutante: that is eventlogging indeed, not us as DBA
[08:59:06] ok, thanks !
[08:59:22] mutante: those hosts, db1107 and db1108, are databases, but the eventlogging ones :)
[09:00:16] the other ones in this profile are 2x "haproxy_failover" a
[09:00:55] ok, yep. in this case i am only going by puppet modules/profiles and had no idea about host names
[09:01:04] makes sense
[09:02:23] so if we had an alert about haproxy failing to load balance between replicas.. that would go to https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:02:39] it also mentions haproxy there i see. so that's cool
[09:03:25] mutante: not really, it doesn't explain how that works on misc (haproxy is only on misc now)
[09:04:09] We need a new section, point it there, but we need a new section to troubleshoot it
[09:05:12] ok, well. as long as the page is the right one we can use it and add sections later
[09:09:26] mutante: we also have https://wikitech.wikimedia.org/wiki/MariaDB#HAProxy which is incomplete
[09:10:34] marostegui: cool, that seems good. it doesn't necessarily have to have all the content yet, just want to get to the place where we can make it a required param for new checks
[11:59:50] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo) With replication down, backup of s8 (with x1 concurrency, but it shouldn't matter in this case) took 4h45m: * 2h45m for the 1.5TB transference * 25s...
[12:12:37] Apr 5 12:06:03 cloudservices2002-dev mysqld[2134]: 2019-04-05 12:06:03 0 [Note] /opt/wmf-mariadb103/bin/mysqld (mysqld 10.3.8-MariaDB-log) starting as process 2134 ...
[12:12:37] Apr 5 12:06:03 cloudservices2002-dev mysqld[2134]: 2019-04-05 12:06:03 0 [ERROR] Can't find messagefile '/opt/wmf-mariadb10/share/errmsg.sys'
[12:12:37] Apr 5 12:06:03 cloudservices2002-dev mysqld[2134]: 2019-04-05 12:06:03 0 [ERROR] Aborting
[12:12:45] should I just create that file?
[12:13:42] no, you have a misconfiguration there
[12:14:24] oh I see, 103 vs 10
[12:14:35] we don't even have 10.3 packages
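The mismatch spotted above (the binary runs from a `wmf-mariadb103` prefix while the message file is looked up under `wmf-mariadb10`) can be checked mechanically. What follows is a minimal sketch, not anything used on the channel: the function names `install_prefixes` and `prefixes_mismatch` are hypothetical, and the two log lines are copied from the paste above with the syslog prefix dropped. It only detects the disagreement. Which setting still points at the old prefix (for example a `basedir` or `lc_messages_dir` entry) is an assumption it does not verify.

```python
import re

# Log lines taken from the paste above, minus the syslog prefix.
LOG_LINES = [
    "2019-04-05 12:06:03 0 [Note] /opt/wmf-mariadb103/bin/mysqld "
    "(mysqld 10.3.8-MariaDB-log) starting as process 2134 ...",
    "2019-04-05 12:06:03 0 [ERROR] Can't find messagefile "
    "'/opt/wmf-mariadb10/share/errmsg.sys'",
]


def install_prefixes(lines):
    """Return (binary_prefix, messagefile_prefix); each is None if not found."""
    binary = None
    messages = None
    for line in lines:
        # Prefix of the mysqld binary, e.g. /opt/wmf-mariadb103
        m = re.search(r"(\S+)/bin/mysqld\b", line)
        if m and binary is None:
            binary = m.group(1)
        # Prefix under which the server looked for errmsg.sys
        m = re.search(r"messagefile '(.+)/share/errmsg\.sys'", line)
        if m and messages is None:
            messages = m.group(1)
    return binary, messages


def prefixes_mismatch(lines):
    """True only when both prefixes were found and they differ."""
    binary, messages = install_prefixes(lines)
    return binary is not None and messages is not None and binary != messages


if __name__ == "__main__":
    print(install_prefixes(LOG_LINES))
    print(prefixes_mismatch(LOG_LINES))
```

A check like this points at the configuration to review rather than suggesting the missing file be created by hand.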
[12:15:13] you should probably review your profile and how it uses the mariadb module
[12:15:47] or, alternatively, not use the mariadb module, which is intended for production configuration
[12:16:01] (and can be used for non production, but needs work)
[12:17:16] blame andrewb (jk) who I think was the person that set that up in the past :-)
[12:17:20] :-P
[12:17:44] I'm staring at modules/profile/manifests/openstack/base/pdns/auth/db.pp with few clues on what to do
[12:17:57] also affected because I'm rushing :-P
[12:18:13] one trick I highly recommend
[12:18:19] no matter what you do
[12:18:38] set enable_notifications: 0 in hiera
[12:18:48] so that after install, even if there are errors
[12:19:07] they don't go off (disabled notifications by default)
[12:19:17] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (Marostegui) I think it is pretty good compared with doing it without stopping replication. Probably once the sources have been migrated to the final HW will r...
[12:19:22] mysql servers are tricky to set up because they require provisioning
[12:19:46] and that is automated outside of puppet
[12:20:55] we always do the setup with either enable_notifications: 0 or install them as spares, which has the same effect (just a tip)
[13:05:15] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (Marostegui) I have blocked a window on Tuesday to tentatively get it deployed if no one objects...
[13:07:28] Reserved also a window for the failover
[15:49:43] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (Papaul) a:Papaul→Marostegui complete
[16:21:33] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (Marostegui) Thanks, it is rebuilding ` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 54% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port...
[17:56:37] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (Marostegui) Open→Resolved Finished correctly, thanks! ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I P...
[18:12:30] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (aaron) >>! In T210725#5065110, @Marostegui wrote: > I would like to elaborate more on my idea on...
[18:14:48] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (Marostegui) >>! In T210725#5089115, @aaron wrote: >>>! In T210725#5065110, @Marostegui wrote: >>...