Thursday, November 19, 2020

MySQL replication

At one of my previous jobs, I was tasked with taking an existing master/slave MySQL DB setup, and upgrading it from v5.x to v8.x.

This also included some hardware upgrades, SSDs etc, and complete reinstalls of the OS.

I did that a year and a half ago, so I don't remember the specifics, but what I do remember, is that I did some research into the safest and most convenient way to do it.

This involved running up a couple of temporary machines, running v8, and connecting them as slaves (or backups, since master/slave is apparently outdated terminology..), and then being able to use these to pivot the other servers around.

For some confusing fun, I seem to recall that the hostnames of those (hardware) systems included "master" and "slave", however due to some previous issues, they had been swapped around, and the machine called master was actually the slave, and vice versa, and there was only one way replication going on.

So, I spun up a virtual machine, and installed v8 on it, and connected it as a slave, after first being able to lock and dump the existing slave in order to obtain a consistent dump. With that working, just for fun, and security, I added a 4th system, another VM.

So at this point, there were the original master and slave hardware systems, and my new 3rd VM system, which I configured to be a slave to both the original master and slaves. I then connected the 4th VM system to the 3rd.

This is when I found out about relay logging, vs just master/slave logging.. The 3rd machine had to be configured as a relay, or else the 4th system would get left behind. That got configured/enabled/tested, and in the process I seem to recall needing to reload the data into it, which was fine, because I was able to enact read/write locks on any of the DBs in order to get consistent dumps, other than the sole master/primary server, eg, the slave/hardware, or the 3rd/VM.

There were lots of other quirks and issues discovered in the process of connecting these new v8 temp machines to the primary v5 machine, as things have changed, such as user management, not being able to create users by granting privs to them etc, if they don't already exist, and some stuff to do with the handling of character sets, latin vs utf8 etc.

You can read up on the process and some of those traps:

So, with that all in place and working, I was then able to take the original slave server hardware offline, safe in the knowledge that we had 2 more slaves/backups to the primary running.

The hardware was upgraded, it may have involved swapping CPUs and memory into it from some other system(s), I can't exactly recall, along with the existing rust/spinner hard drives becoming secondary/additional space in the systems, with new SSDs being installed (probably a RAID 1, since that's what I do..)

The OS was reinstalled, and MySQL, v8, and then I was able to lock and dump all the data out of either the 3rd or 4th systems, load it into the new reinstalled system, now called "server01", as opposed to being a master or slave, as the intention was to implement cyclic replication, AKA master/master, and also to eventually (soon) promote this system to be the primary, so that the current master/primary could be taken offline, upgraded, wiped, reinstalled etc..

So the upgraded hardware system ("01") had all the data loaded, and was connected to the existing primary as a slave (probably along with either the 3rd of 4th system, so that it would catch up and stay up to date after the data was dumped and loaded).

I can't remember, but I think the current primary server was then connected to the new/upgraded system ("01") as a slave, so that it would stay in sync, for the brief period before it was phased out.

With that all working, all the client systems were reconfigured to connect via a DNS CNAME record, which was then pointed to server01, and restarted/reconnected.

This took the original master/primary server hardware out of the loop, at which point I was able to upgrade it, install CPUs/RAM/SSDs etc, reinstall the OS, and MySQL etc..

With it back up and running, now called "server02", it was loaded with all the data, by locking and dumping it from the 3rd or 4th systems, so as to not interfere with the production function of the DB.

It was then connected as a slave to whichever system (3rd or 4th) it had the data dumped from, and was able to catch up.

It was then connected to the new primary server ("01") as a slave, and in turn server01 was connected to the new server02 as a slave, for cyclic replication and redundancy.

Servers 3 and 4 were at some point disconnected and discarded, as they had served their purpose.

I left that job a short while after, solely because I couldn't handle the hours; needing to be at my desk for 8:30am every morning, which was seriously affecting my depression, and insomnia, and I was so tired all the time I couldn't think straight, couldn't function properly, was making stupid mistakes all the time. This was just a feedback loop into my depression.

It didn't make any sense either, as the majority of my co-workers that I worked with were in another office, in another timezone, 2 hours behind, so it wasn't like I could do much until they got into the office in the morning, and then we'd lose another 2 hours in the middle of the day because of the offset lunch breaks.

I would have been happy to have offset my entire day by 2 hours, to work in sync with the other office, but that wasn't an option, nor was working from home, which also would have helped me to be able to start work earlier in the day than what my brain prefers.

Speaking of lunch breaks, I quite often used to have to go out to my car on my lunch break and sleep in the back of it for 45 minutes or so, in order to be able to be able to get anything useful done during the day. I thought my head was just going to fall off.

I'd already received a written warning for my tardiness, the first written warning I have ever received in my career, and later, one Friday afternoon, I had been told that I was pretty much going to be fired right then, but that my manager had pleaded for them to keep me on.

I stayed around for another couple of months, but I knew that I was on incredibly thin ice, and that if I was late for any reason again, that I would probably just be fired and told to bugger off as soon as I arrived at the office.

Anyway, that's all mostly irrelevant to this story, and I don't want this to turn into one of those recipe pages where you have to read through 20 pages of someone's incredibly boring life story before you can get to the part where you find out how many cups of water you are supposed to use with a cup of rice (1 3/4), when you are trying to make fried rice.

So, getting back to the point.. I have been gone from this job for almost a year, and they get back in contact with me.

Someone was trying to do some DB upgrades, and the gist of it is that they made a mess of it.

They changed the DNS CNAME pointer (I presume), to point to the other DB, but because this isn't going to update immediately/everywhere, they ended up with a split brain scenario, and some clients writing in the primary DB still, and others moving and writing in the backup DB, which should normally be OK, and it's part of why cyclic replication is a thing, however..

In some circumstances, when there are triggers on tables, and/or the way primary keys are used etc, this can, and will, cause everything to break. Something will try to update in one DB, and sync to the other, where that has already been changed to something else, or the next primary key in an incremental count has been assigned to something else..

And then everything falls apart into a heap. Replication stops in both directions. Your tables/DBs are out of sync, and continue to get more and more out of sync.

So they contacted me, and I remoted into the 2 DB systems to work out what was going on, and how to fix it, and fix it.

Since the systems are running v8, and have been setup using GTIDs, it makes things a little more tricky to resolve, vs the old way, when you could just skip transactions and start the slave running again, skipping any of the conflicting transactions (although you still end up with data inconsistency, caused by these conflicting transactions which don't get resolved, just skipped/ignored).

The new way is similar, but it's not just a matter of telling it to skip 1 or more lines in the binlogs.

The new way, with GTIDs, is that you need to look at the slave status, and the received and executed sets of GTIDs, and then manually run dummy version of the GTID(s) which it is jamming on, because of the conflict.

So you disable the auto tracking of GTIDs, set it to point to the next GTID, run a dummy commit which doesn't do anything, except "use" that GTID, causing it to auto increment, and then you might be able to start the slave process again, and it will catch up and resume.

If you are lucky.

If you are not, then the conflicting statement will have some effect on the next statement/GTID, and it will conflict and fail.

So then you repeat the process, setting the GTID manually, dummy committing it, and trying to resume. And you do this again. And again, and again.

And eventually, you will hopefully have skipped all the affected transactions, and it will start replicating again, and catch up.

However, you now have inconsistency between the 2 tables/DBs.

Most people just ignore this and get on with their lives. Stuff is working again, so who cares?

If/when you flip the DBs over, and/or you are reading from both of them at the same time, you will now occasionally get weird results, depending on what you are querying, and if the inconsistent tuples are involved..

So how do you fix that?

You need to disconnect both systems from talking to each other, ie break the replication again, then you need to reset/wipe the settings related to the master/slave replication (I found the hard way).

Once that's done, you need a way to get a consistent dump of the data in the primary (or whichever system you deem to be the authoritative system), and load this into/over the top of the other one, to make them consistent, and then restart the replication, on both.

This seems pretty straight forward, but the issue occurs when you try to obtain that consistent dump, if you only have one system now, and it's your production system, as in order to obtain a consistent dump, it needs to be locked, so that no changes are occurring, while the data is dumped out of it.

I tried this several different ways; the old way, which is to reset the binlogs, flush and lock the tables, obtain the binlogging coordinates, and then start the dump, which should (does) work.

Once the dump is complete, you can unlock the tables and allow changes to start occurring again.

This is fine, except how do you handle this when your dump takes over 10 minutes?

I was reading that once you start the dump, you can apparently unlock the tables, and it will still work..

It should, because anything that happens after the locks are unlocked, should happen, and also get logged, so when the dump is loaded, and the log replayed, it should bring it up to sync, even if it tried to do transactions again which had already been done (though I guess that would be bad if they were incrementing statements or something).

That doesn't work. You end up with a few hundred rows missing, somehow.

Other "tricks" I found involved using --single-transaction, which is supposed to get a historical read lock on the tables at the point the dump starts, so it's basically reading them back in time to be in the state they were when the dump started, and then again, you just replay the log to catch up.. 

Doesn't work.

There's a new (to me at least) switch, of --master-data, which obtains the lock for you when the dump starts, and it writes the coordinates into the dump, and then unlocks when it's done.

This works.. except.. It's basically the original old method, of needing to flush and lock the tables and blocking changes for the entire time the dump is running. No good when you need to do this when people are trying to use the DB.

Also, the CHANGE MASTER TO statement that it includes in the dump, only includes the log file and position. So, that won't work, because the slave won't know who or where or what its master server is, nor the user/password to connect.

You may have already had these set, from before you stopped it.. but what I then found out the hard way, is that the GTID set(s) are dumped into the dump file, and when you try to load these on your backup server, it will conflict with the existing set(s) it has in the master/slave settings..

So these will conflict, and you can't set/adjust them manually (or via the load, which is manually)..

It will error out with something like "ERROR 3546 (HY000) at line 24: @@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED"

You need to reset/wipe the master/slave settings, and re-enter them all.

And you can't just set a log file name and position when you don't have any other settings (eg server, user, password) set..

So, the correct method is probably to stop the slave on the backup (and primary), reset/wipe the settings (on both?), re-renter the hosts, users, passwords, and some dummy log settings.

Then you can load the dump, and it will update the log settings correctly, and then you can start the slave, and it will replay the logs and catch up the DB to what's currently running on the primary.

This is basically what I ended up doing, except I had to grep the CHANGE MASTER statement out of the dump in order to find the log settings, and set it manually, after resetting/wiping the master/slave settings, so that the load would succeed, and because I hadn't reset the host/user/password, the log file and position wouldn't set.

Anyway, with that all in place, I was able to start the slave, and it connected, replayed the logs, and was back up to date.

The next issue is that the primary is not a slave to the backup system, so there is only one way replication.

Thankfully at this point the servers contain the same data (or should!), so there's no need to dump it out and copy it over and load it back in to the other one..

It's just a matter of obtaining the log file name and position on the backup (maybe after locking the tables briefly?) and setting those coordinates on the primary server, and starting it, and then unlocking the backup, and allowing it to continue.

At that point, the master and slave status can be checked on both ends, and the received and executed GTID sets should match, and increment together.

Oh.. that's only if you have valid replication users on both ends..

For some reason, it looks like each DB had their own replication user created locally, which worked initially, but once the DB was dumped from the primary, and loaded over the secondary, it replaces the user table.. Since the replication user allowing the primary to connect to the secondary didn't exist in the primary (which it doesn't need to), it was then missing, and the primary was unable to connect to the secondary..

On the primary, I recreated the replication user for the secondary, which doesn't make much sense initially.. however this user will replicate to the secondary, and then exist in both places (along with the counterpart user for the opposite direction replication, which already existed on both).

Once that was done, and the fiddling of making sure that the FQDN was included in the username, as that's what each DB provides to the other, and the password(s) set to something known, and configured in the master/slave settings on each, they could both successfully connect and replicate to/from each other.

Some takeaways:

It's a good idea to have a 3rd, and/or 4th systems, perhaps only as slaves, connected to your primary/backup systems, in case you need to be able to block the tables for 10 minutes or more, while you get a consistent dump of the data.

Let's say we just have a 3rd system. it would also make good sense, even if it is only a slave to both other systems, and nothing else uses/reads from it, as it could be used to ascertain quorum in the case of writes being performed in the wrong DB, causing conflicts.

There are probably many other good reasons to have a 3rd system..

I realise there's not a lot of useful commands and copy/pastas in here, this is more of a theory than a practical article in the end..

Here's some useful links though..