How do linked servers work?
What are linked servers?
For quite some time now, Microsoft SQL Server has offered the helpful concept of linked servers. This is a powerful feature that allows you to stitch multiple databases together (not just SQL Servers) and run queries that span multiple database servers, thus creating a distributed database.
To create a linked server, you run the sp_addlinkedserver stored procedure, for instance:
@server = N'MyLinkedServer',
@srvproduct = N'',
@provider = N'SQLNCLI',
@datasrc = N'192.168.1.69,1432', -- IP address and port number of the destination server
@catalog = N'AcctDb'; -- database name
and you now have a connection to the specified SQL Server instance. There are several different types of linked servers, so there is quite a bit more to it than this, but you get the gist.
This new linked server will then appear in e.g. SSMS:
We'll skip over the question of authentication: there are many options, you can map one account to another, etc...
Once that's all in place, you can run queries that include tables both on the server itself and on linked servers, e.g.:
SELECT cust.name, acct.balance
FROM custdb.customers cust, MyLinkedServer.AcctDb.dbo.Accounts acct
WHERE acct.custid = cust.id
And it all behaves exactly as you'd expect.
The interesting thing is what happens between these two databases when you run this query. I had to look into this because one of our Gallium Data users complained that his queries to linked servers were getting mysteriously rewritten -- and they were, but not by mistake.
What's going on down there?
Looking at the network traffic between the two database servers, we see that the main SQL Server instance does a lot more than just forward that query to the linked server. It should be obvious that this would not be possible anyway with the example query above.
Let's look at the exchange between the main server and the linked server. The full exchange is not necessarily run for every request, but if this is the first request to that linked server, the flow is as follows.
First, the main SQL Server instance opens a connection to the linked server and authenticates. Depending on how authentication has been set up, there can be several exchanges of various keys and tokens, but ultimately, a connection is opened to the linked server.
After that, the main server will ask a few questions of the linked server:
what is my session ID?
what kind of collation are you using?
what is the schema information for the tables involved in the query? This includes column definitions, indices and constraints
The main server then calls the following:
sp_reset_connection to make sure the connection is clean
SET XACT_ABORT OFF to avoid rolling back the transaction if an error is detected
The main server then sends a transaction manager request TM_BEGIN_XACT to open a distributed transaction. Now we're cooking with gas! But we're not done yet.
The next call is to sp_table_statistics2_rowset, an undocumented stored procedure that (obviously) returns some statistics about the tables participating in the query.
We then have a call to SHOW STATISTICS, again to get some more information about the tables we're about to access.
Obviously, at this point, the main server is preparing a query plan, so it needs to know the lay of the land in the linked server.
The next call is a transaction manager request to abort the transaction: TM_ROLLBACK_XACT.
Then there is another call to sp_reset_connection and SET XACT_ABORT OFF
We then start another distributed transaction with TM_BEGIN_XACT
We now have a call to sp_schemalock, which is an internal stored procedure that makes sure that the schema does not change while the query is prepared and executed.
Are you still with me?
And now, finally, the main server executes our query. Except, of course, it can't be the query as submitted by the user, because that query spans multiple servers. The main server figures out which part(s) of the query need to be executed on the linked server, and formulates it in a way that usually looks something like:
SELECT "Tbl1002"."custid" "Col1001","Tbl1002"."balance" "Col1005"
FROM "AcctDb"."dbo"."Accounts" "Tbl1002"
WHERE "Tbl1002"."custid" = @P1
Obviously, your mileage will vary a great deal here. The main server is going to formulate a query that it believes is most appropriate based on the nature of the query, the statistics for the various tables involved in the query, and probably other factors. It may be a prepared statement, or it may not -- there is really no way to know in advance, because the query analyzer has a will of its own (and it usually does an excellent job).
Depending on the query, the main server may need to execute this statement many times to retrieve all the relevant data. It will join all that data as required with any other data as specified in the original query, and a result set will be assembled and returned to the client. This is all completely transparent to the client.
The formulation of the query happens even if you run a simple query that only involves the linked server. For instance:
SELECT [id], [custid], [balance]
gets sent to the linked server as:
SELECT "Tbl1002"."id" "Col1004","Tbl1002"."custid" "Col1005","Tbl1002"."balance" "Col1006"
FROM "AcctDb"."dbo"."Accounts" "Tbl1002"
This explains why our Gallium Data user was confused -- his queries, when received by the linked server, did not look like what he was executing on the main server, but that's just the way SQL Server works.
Anyway, once the query has executed, the main server closes the prepared statement(s) (assuming it used any).
After the query has completed, everything gets cleaned up:
the schema lock is released with a call to sp_releaseschemalock
the distributed transaction is closed with a call to TM_ROLLBACK_XACT (if this was just a select - obviously a statement that modifies data would usually invoke TM_COMMIT_TX)
the connection is reset with a call to sp_reset_connection
And the query is over. If you're feeling a little dizzy, I don't blame you.
The connection is usually kept open for a while, in case another request comes in that involves data in that same linked server, but it will get closed after a period of inactivity.
That's a lot of stuff!
Distributed transactions are notoriously hard to do correctly. I hope this peek behind the curtain has given you a better appreciation for how much work SQL Server does for you. This is not easy, but the folks at Microsoft have done an excellent job of making it transparent to their users. If the linked servers are full-fledged enterprise databases (Oracle, DB2, PostgreSQL, etc...), you can run distributed transactions with little difficulty.
We've only looked at two SQL Server instances working together, but linked servers can also be much more simple-minded: there are providers for CSV files and spreadsheet files, for some NoSQL databases, and so on. With these providers, distributed transactions are usually not an option, since these data sources tend not to be transactional, but they still allow you to run distributed queries easily.
It's not a bad idea to have at least a sense of how this is all done, because in your linked servers, you may see some activity that you did not expect. It's not unheard of, for instance, to get a schema lock that does not get properly released, leading to potential trouble. You may also see some queries that no one has specifically asked for -- they may be caused by another server sending queries to your server as a linked server.