суббота, 19 марта 2016 г.

Integrating R and SQL Server prior to 2016 version

Introduction

It was announced about year ago that upcoming SQL Server 2016 version would have integration with R. Now with the latest CTP candidates released by Microsoft it becomes more clear how this solution will look like.

I believe that it may be really great feature, especially for the people who extensively used both R and SQL. But what if one is not able to upgrade to 2016 easily and in the same time is eager to use ability to call R scripts from SQL Sever?

I found out that it is surprisingly easy to implement such integration by using standard means.

In this article I would like to demonstrate such proof-of-concept implementation of R integration with SQL Server versions prior to 2016.

My idea is to implement SQL Server CLR stored procedure which will use R.NET CodePlex in-process interoperability bridge to execute arbitrary R scripts from T-SQL.

The suggested stored procedure interface looks as following:

EXEC dbo.RunRScript @Script = N'
 # put your R code here
'
This interface looks similar to the new stored procedure sp_execute_external_script in SQL Server 2016 CTP, but can be deployed on any prior version which supports CLR integration (2012, 2008, etc).

Implementation

I created CLR stored procedure by using database project in Visual Studio by using the following instruction.

I loaded the latest version of R.NET CodePlex library and added references to DLL's to the project.

Then I wrote the following C# code for CLR stored procedure:

[Microsoft.SqlServer.Server.SqlProcedure]
public static void RunRScript(SqlString script)
{
    lock (_lock)
    {
        // initialize R engine
        if (_engine == null)
        {
            SqlContext.Pipe.Send("Starting R engine");
            REngine.SetEnvironmentVariables();
            _console = new SqlConsole();
            _engine = REngine.GetInstance(null, true, null, _console);
            SqlContext.Pipe.Send("R engine has been started");
        }

        // Execute R script
        SqlContext.Pipe.Send("Starting R script execution");
        SymbolicExpression res = _engine.Evaluate(script.ToString());

        // Parse and return results if any
        if (res != null && res.IsDataFrame())
        {
            DataFrame df = res.AsDataFrame();

            SqlMetaData[] cols = new SqlMetaData[df.ColumnCount];
            for (int i = 0; i < df.ColumnCount; i++)
            {
                cols[i] = new SqlMetaData(df.ColumnNames[i], SqlDbType.Variant);
            }

            SqlDataRecord rec = new SqlDataRecord(cols);

            SqlContext.Pipe.SendResultsStart(rec);

            var rows = df.GetRows();

            foreach (DataFrameRow row in rows)
            {
                for (int i = 0; i < df.ColumnCount; i++)
                {
                    rec.SetValue(i, row[i]);
                }
                SqlContext.Pipe.SendResultsRow(rec);
            }

            SqlContext.Pipe.SendResultsEnd();
        }

        SqlContext.Pipe.Send("R script execution completed");
    }
}

This stored procedure consists of three parts:

1. Initialization of R engine. I immediate made significant simplification here - there is only one R engine for all connetions which is implemented as a singletone. In case of multiple user calls there will be only one call executed at any time. Still I belive that if required it should be possible to initialize multiple R engines per SPID or use some kind of engines pool.

2. Execution of R script. Actually this is the simplies part - R script passed as an input parameter is executed by using Evaluate() function from R.NET library.

3. Return of results. If R script results are of data frame type, then they are returned to SQL Server in form of record set. Again, this code can be improved to recognize matrices and vectors (atom) values as well.

Also I am using ICharacterDevice interface to capture R script console output and redirect it to SQL messages log.

    private class SqlConsole : ICharacterDevice
    {
        public SymbolicExpression AddHistory(Language call, SymbolicExpression operation, Pairlist args, REnvironment environment) { return null; }

        public YesNoCancel Ask(string question) { return YesNoCancel.Cancel; }

        public void Busy(BusyType which) { }

        public void Callback() { }

        public string ChooseFile(bool create) { return null; }

        public void CleanUp(StartupSaveAction saveAction, int status, bool runLast) { }

        public void ClearErrorConsole() { }

        public void EditFile(string file) { }

        public void FlushConsole() { }

        public SymbolicExpression LoadHistory(Language call, SymbolicExpression operation, Pairlist args, REnvironment environment) { return null; }

        public string ReadConsole(string prompt, int capacity, bool history) { return null; }

        public void ResetConsole() { }

        public SymbolicExpression SaveHistory(Language call, SymbolicExpression operation, Pairlist args, REnvironment environment) { return null; }

        public bool ShowFiles(string[] files, string[] headers, string title, bool delete, string pager) { return false; }

        public void ShowMessage(string message) { }

        public void Suicide(string message) { }

        public void WriteConsole(string output, int length, ConsoleOutputType outputType)
        {
            SqlContext.Pipe.Send(output);
        }
    }

Deployment

First of all you will need to enable CLR execution on your server. This can be done by using the following T-SQL code:

sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
sp_configure 'clr enabled', 1;
GO
RECONFIGURE;
GO

Then you will need to decide to which database you are going to deploy assembly with stored procedure and mark it as trustworty. This is required that this assembly will need to be registered with UNSAFE option.

In my case I will create assembly in master database:

ALTER DATABASE master SET TRUSTWORTHY ON
GO

Next you will need to load R.NET and SP assemblies:

CREATE ASSEMBLY RDotNetNativeLibrary
FROM 'c:\..\RDotNet.NativeLibrary.dll' WITH PERMISSION_SET = UNSAFE;
GO
CREATE ASSEMBLY RDotNet
FROM 'c:\..\RDotNet.dll' WITH PERMISSION_SET = UNSAFE;
GO

CREATE ASSEMBLY RSqlServerIntegration
FROM 'c:\..\RSqlServerIntegration.dll' WITH PERMISSION_SET = UNSAFE;
GO

CREATE PROCEDURE dbo.RunRScript (@Script nvarchar(MAX))
AS EXTERNAL NAME
RSqlServerIntegration.StoredProcedures.RunRScript;
GO

Note: you will need to specify correct path to assembly DLL files on your disk.

The assemblies need to be registered with UNSAFE flag, since R.NET assembly is not verifiable.

Usage

Once steps below are done, you will be use store procedure dbo.RunRScript as follows.

Just executes R statements and prints outputs in SQL message log:

EXEC dbo.RunRScript @Script = N'
 cat(1 + 2)
 cat("Hello There!")
'
Performs operations on two matrices and return results as record set:

EXEC dbo.RunRScript @Script = N'
 m1 = matrix(rnorm(10), 2, 5)
 m2 = matrix(rnorm(10), 2, 5)
 as.data.frame(m1 + m2)
'

Run linear regression on bigger vector:

EXEC dbo.RunRScript @Script = N'
 x = rnorm(1e5, 1, 0.01)
 x = rnorm(1e5)
 y = 5 * x + rnorm(1e5, 0, 0.00001)
 r = lm(y ~ x)
 print(r)
'

Loads RODBC package, run SQL query to extract data, and returns record set back to SQL:

EXEC dbo.RunRScript @Script = N'
 library(RODBC);
 
 conn = odbcDriverConnect(''driver={SQL Server};server=..;database=master;trusted_connection=true'')

 sqlQuery(conn, "
  select top 3
   object_id,
   name
  from
  sys.tables
 ")
'

Note: there is an immediate drawback in the code above. In regular CLR stored procedure I would be able to establish so called context connection to SQL Server. It seems that this is not possible to do by using ODBC driver. It is hard to say what would be impact on the query performance, but it seems that the same drawback would be present in CTP release.

Drawbacks

Obviously the solution above has multiple drawbacks which restricts its usage in production environments:
  • It is not safe, anyone who has rights to execute this stored procedure will be able to easily crash the server or get unauthorized access to data.
  • It requires admin rights to deploy on SQL Server.
  • It is not optimized for parallel usage from multiple connections.
  • It may be unstable when working with big data sets.
But at the same time this solution is proof-of-concepts which demonstrates how R and SQL Server can be integrated together and it can be used relatively easily & safely for say local SQL Server Express instances.

Комментариев нет:

Отправить комментарий