WebsiteSpark

Wednesday, 25 January 2012

Replacing unwanted characters using regular expression and SQL CLR

Let’s say you want to compare two NVARCHAR strings from two different source say source1 and source2. Now the problem is source1 has got special characters other than alphabets and digits to it while source2 application filters those characters. Now if we compare these two source data, our comparison is bound to return a false i.e unequal value, but for end user these two strings are same.
Solution to this problem is Source1 and Source2 both needs to filter off these characters and feed into database, else database needs to handle the issue. Changing the source1 application would be a tedious task, involving multiple rounds of testing and regression, so better we shall take care in database (that is the objective of this post as well).
What all we need for this to work?
1)   SQL CLR assembly with regular expression
2)   A test application for testing the assembly
3)   Finally our database to host the assembly.
Let’s start by building the assembly, by creating a new class library project in VS2008 and adding the code given below

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

    public class RegEx
    {
        [Microsoft.SqlServer.Server.SqlFunction
            (IsDeterministic = true, IsPrecise = true)]
        public static SqlString ReplaceIfMatch
            (SqlString ip, SqlString exp, SqlString replace)
        {
            // some error handling
            if (exp.IsNull || ip.IsNull || replace.IsNull)
                return  SqlString.Null ;
            Regex r = new Regex(exp.ToString());
           //bool match = r.Match(ip).Success;
            return new SqlString( r.Replace(ip.ToString(),
                replace.ToString()));
        }

        [Microsoft.SqlServer.Server.SqlFunction
            (IsDeterministic = true, IsPrecise = true)]
        public static SqlBoolean IsMatch
            (SqlString ip, SqlString exp)
        {
            // some error handling
            if (exp.IsNull || ip.IsNull )
                return SqlBoolean.False;
            Regex r = new Regex(exp.ToString());
            return new SqlBoolean( r.Match(ip.ToString()).Success);
        }

        // repeating for normal windows
        public static string ReplaceIfMatchN
            (string ip, string exp, string replace)
        {
            // some error handling
            if (string.IsNullOrEmpty(exp)
                || string.IsNullOrEmpty(ip)
                || string.IsNullOrEmpty(replace))
                return string.Empty;
            Regex r = new Regex(exp);
            return r.Replace(ip, replace);
        }
    }



Here we have 3 functions, 2 defined with SQL data types and one for windows or normal application. The functions marked for SQL server will only be executed as SQL CLR. For detailed list of mapping between CLR data types and SQL server data types can be found in BOL at http://msdn.microsoft.com/en-us/library/ms131092.aspx.
Now compile the assembly in release mode and we are ready to get this deployed in SQL server. To do this, we have to first ensure CLR assemblies are enabled for our database. As usual we will be using AdventureWorks2008R2 database for our experiment.
Following SQL statements configures database for CLR

USE AdventureWorks2008R2;
GO

EXEC SP_CONFIGURE 'clr enabled',1;
GO

RECONFIGURE;
GO


Next step is to create assembly within the database

CREATE ASSEMBLY
      RegExBase
FROM
      'D:\<PATH>\RegExBase.dll'
WITH PERMISSION_SET = SAFE ;
GO


Next is to create function, which in turn invokes this assembly.

CREATE FUNCTION
      RegExReplace
            (@ip NVARCHAR(4000),@exp NVARCHAR(4000), @replace NVARCHAR(4000) )
RETURNS NVARCHAR(4000)
AS EXTERNAL
      NAME RegExBase.RegEx.ReplaceIfMatch ;
GO


Note that the data type specified for the function is NVARCHAR(4000), which corresponds to SQLString in CLR data type.

Now we can invoke the function and see the result.

SELECT
      dbo.RegExReplace(N'abc%^&123+{}|¢£µƒá',N'[^a-zA-Z0-9]',N'') ;

Result
-------------------------------------------------------------------------------------------------
abc123

(1 row(s) affected)


Note the regular expression supplied is “[^a-zA-Z0-9]”, which states any character other than a to z, A to Z and 0 to 9 and for demonstration there are some non ASCII Unicode characters added to input string as well.

Reference to regular expressions can be found at

To drop the CLR assembly, first drop the function which references that and then drop the assembly.


DROP FUNCTION RegExReplace;
GO

DROP ASSEMBLY RegExBase ;
GO




Enjoy!!!

No comments:

Post a Comment