DIY database replication
Sometimes you need to have the same data in two different databases. A common scenario (among many) is that one database is used for production, and the other is used for reporting, and we don't want that all that reporting to put any load on the production database.
Most enterprise databases offer some solution to do just that. But what if the two databases are from different vendors? Or what if your database vendor's solution does not do exactly what you need?
In that case, there are many (usually quite expensive) solutions from various vendors, and they're not trivial to master.
In this tutorial, we're going to see that there is a much easier way for simple scenarios, but with quite a few caveats.
We'll be using SQL Server as the main database, and MySQL for the replica database, but of course any other database products would work just as well, as long as the main database is supported by Gallium Data.
This tutorial takes about 10 minutes.
Note: even though Gallium Data fully supports the ARM chipset, this tutorial does not work on Mac M1/M2 because the SQL Server Docker image does not (yet?) support platforms other than x86/AMD.
The setup
Our database client is going to connect to SQL Server through Gallium Data.
Gallium Data will intercept any requests that change the database (INSERT, UPDATE or DELETE), and will also send them to MySQL.
And, well, that should be it, actually. How hard can it be?
Start up the three servers
Everything has been set up in Gallium Data, so we're ready to go.
⇨ Run this from a command line:
docker network create gallium-net
If it gives you an error message about the network already existing, that's fine -- carry on.
⇨ Run this from a command line:
docker run -d --rm --name gallium-mssql --network gallium-net galliumdata/gallium-data-demo-mssql:4
docker run -d --rm --name gallium-mysql --network gallium-net --cap-add=sys_nice -e MYSQL_ROOT_HOST='172.%.%.%' -p 3806:3306 galliumdata/gallium-data-demo-mysql:8
docker run -d --rm --name gallium-data --network gallium-net -p 8089:8080 -p 3706:3306 -e repository_location=/galliumdata/repo_replication galliumdata/gallium-data-engine:1.8.3-1851
We now have three Docker containers running:
SQL Server: the main database
MySQL: the replica database
Gallium Data: the database proxy that does the replication
Insert some data into the main database
Let's open a SQL Server command line:
⇨ Run this from a command line:
docker exec -it gallium-mssql /opt/mssql-tools/bin/sqlcmd -S gallium-data -d GalliumDemoDB -U sa -P Password1
We are now connected to SQL Server through Gallium Data. Let's insert one row into the customers table:
⇨ Run this from the SQL Server command line:
INSERT INTO gallium_demo.customers VALUES (100, 'Alfonso', 'Alcazar', 'AR');
Important: when using the SQL Server command line, you need to type go and hit enter to execute a command.
Now let's make sure the new row is present in SQL Server:
⇨ Run this from the SQL Server command line (don't forget "go"):
SELECT * FROM gallium_demo.customers WHERE id=100;
Good, the row is there. But was it replicated in MySQL?
Look in the replica (MySQL)
Let's open a MySQL command line:
⇨ Run the following from a different command line:
docker exec -it gallium-mysql /usr/bin/mysql -h localhost -D gallium_demo -u root -pPassword1
Let's see if our new row has also been added to this database:
⇨ Run the following from the MySQL command line:
SELECT * FROM customers WHERE id >= 100;
And sure enough, the new row is there too! Gallium Data has executed the insert statement in both databases. Let's do a few more updates in SQL Server to convince ourselves, and check in MySQL that those changes are reflected there.
⇨ Run these from the SQL Server command line (don't forget "go"):
INSERT INTO gallium_demo.customers VALUES
(101, 'Bernhard', 'Blanca', 'BE'),
(102, 'Christie', 'Chen', 'CH');
DELETE FROM gallium_demo.customers
WHERE id = 100;
UPDATE gallium_demo.customers
SET last_name = 'Carravagio'
WHERE id = 102;
Now use the MySQL command line to verify that these changes in SQL Server have been reflected in MySQL.
⇨ Run the following from the MySQL command line:
SELECT * FROM customers WHERE id >= 100;
All the changes made in the SQL Server database are reflected in the MySQL database. Replication achieved!
Now let's see how it works.
The proxy (Gallium Data)
Open Gallium Data at: http://127.0.0.1:8089
There are three things that have been set up in Gallium Data:
a connection to SQL Server in the Simple replication demo project
a library (the JDBC driver for MySQL) under Libraries
a filter that does the replication, in the Simple replication demo project
The first two are fairly mundane, so let's take a look at the interesting part: the filter . You will find it in project Simple replication demo, in the Request filters area -- it's called Mirror updates.
It is a Query filter, with its Query Patterns parameter to:
regex:insert.*
regex:update.*
regex:delete.*
This means that, every time a simple request comes in from a SQL Server client, and the SQL in that request matches one of these regular expressions, the code in this filter will be executed.
Take a look at the code in the filter: it makes sure that we have a connection to MySQL open, and then just executes the incoming SQL against the MySQL database.
Obviously the two databases have been set up to have equivalent schemas.
That's all it takes! And for some scenarios, that might indeed be enough. But now come the caveats.
A long list of caveats
The setup we have does work, but there are lots of potential issues that it does not address.
Different SQL syntaxes
If the SQL sent to SQL Server uses SQL Server-specific syntax, we have to correct it in the filter. This could get difficult if there are lots of different types of SQL commands being sent to SQL Server, especially if some of the commands are complex and cannot realistically be translated to the replica's syntax (like commands with common table expressions). So we're stuck with simple SQL commands.
Generated values
What do we do if one of the values being inserted is a function like getdate()? There is an equivalent called curdate() in MySQL, but we may not always get exactly the same value, for instance if the main database and the replica database use different time zones, or if their clocks are off.
This gets even worse if the schema includes some sequences or identity columns. There is no easy way for us to know what value was actually inserted in the main database, so replicating that would be difficult.
Prepared statements
This example only handles straight SQL updates. It could be extended to support prepared statements, but it would require some work.
Stored procedures and triggers
If the client invokes a stored procedure, or does something takes causes a trigger to be executed, there is no realistic way for us to know what changes have been made to the database, so this type of replication is essentially impossible here.
What if the update is rolled back, or fails?
We can easily imagine a scenario in which a number of updates are executed (and replicated), but then the client rolls back the transaction, or the update fails. We've already sent the commands to the replica. We could set up the filter so that it does not commit to the replica until there is a commit to the main database, but that makes replication more complex, and is only possible if the replica database supports transactions.
What if the replica fails?
If the replica database becomes unresponsive, we're in a pickle. We have three options:
stop accepting update commands - the system becomes read-only until the replica database is back up
save all the commands to a file and execute them when the replica database is back up
ignore the problem
Ignoring the problem may actually be a viable solution in some cases. There are many databases that contain non-critical data, and that can afford to skip a few records.
What if the two databases get out of sync?
It's easy to imagine something happening that causes the data in the two databases to diverge. Such a divergence can easily cause increasing disparities between the databases, since some updates and deletes may not make sense in the replica.
The lowdown
The list of caveats looks scary, for good reasons, and it's not even complete. This is why database replication is hard.
But there is a class of scenarios in which all of these problems can be avoided or ignored. They typically include simple, predictable SQL commands, which are unlikely to fail, and data which does not have to be perfectly in sync between the main database and the replica.
For that kind of scenario, using a simple setup like the one shown in this tutorial may be good enough. Not everything has to be industrial-strength.
Cleanup
Once you're done with this tutorial, and you want to remove everything that was installed,
⇨ Execute the following commands from a command line:
docker stop gallium-mysql
docker stop gallium-mssql
docker stop gallium-data
docker network rm gallium-net
This will stop all the Docker containers started during this tutorial.
If you also want to remove the Docker images:
docker rmi galliumdata/gallium-data-demo-mssql:4
docker rmi galliumdata/gallium-data-demo-mysql:8
docker rmi galliumdata/gallium-data-engine:1.8.3-1851
This will remove everything installed by this tutorial.
We'd love to hear from you -- good or bad! Please drop us an email and let us know if you have any questions or comments.