WebsiteSpark

Friday 27 January 2012

SQL Server 2008 Bulk Insert and database recovery model

The Bulk insert is one of the bulk load TSQL commands, used to insert loads of data into a table form a verity of source. Source can be a text file, a CSV file it can also be a OPENROWSET , XML files etc. The other bulk load commands are BCP and Insert Into.

This post describes the bulk insert activity under various recovery model of database. One of the three recovery models can be set for a database based on recoverability NFR, the models are
1)   Simple recovery
Ø  Minimal logging and recovery is up to last full backup of database
2)   Full recovery and
Ø  Logs each and every activity, recovery is till the last full backup + last differential backup + last transaction log backup, uses maximum amount of transaction log space.
3)   Bulk-logged recovery model.
Ø  As similar to full recovery model, but minimally logs the bulk insert action. For details http://msdn.microsoft.com/en-us/library/ms190692.aspx
To experiment, what we need is data and database.
Let’s create a test database for our testing purpose and the best place to get data would be the Microsoft sample database AdventureWorks2008R2. When you download the installation package (available here http://sqlserversamples.codeplex.com/ ) and extract it, you can find the scripts for creating the schema, bulk insert scripts and the data files.

We will select the DimCustomers table for our experiment.
Let’s start by creating a new database BulkInsertTest and set the recovery to simple mode.

USE master;
GO


IF EXISTS (SELECT name FROM master.dbo.sysdatabases WHERE name = N'TestBulkInsert')
BEGIN
      ALTER DATABASE TestBulkInsert SET OFFLINE WITH ROLLBACK IMMEDIATE ;
      ALTER DATABASE TestBulkInsert SET ONLINE ;
      DROP DATABASE TestBulkInsert ;
END;
GO

CREATE DATABASE TestBulkInsert ;
GO

Alter database TestBulkInsert set recovery SIMPLE;
GO


Now let’s create the table for bulk insert

USE TestBulkInsert;
GO

-- from AdventureWorksDW2008R2

CREATE TABLE [dbo].[DimCustomer] (
    [CustomerKey] [int] IDENTITY(1, 1) NOT NULL,
    [GeographyKey] [int] NULL,
    [CustomerAlternateKey] [nvarchar] (15) NOT NULL,
    [Title] [nvarchar] (8) NULL,
    [FirstName] [nvarchar] (50) NULL,
    [MiddleName] [nvarchar] (50) NULL,
    [LastName] [nvarchar] (50) NULL,
    [NameStyle] [bit] NULL,
    [BirthDate] [date] NULL,
    [MaritalStatus] [nchar] (1) NULL,
    [Suffix] [nvarchar] (10) NULL,
    [Gender] [nvarchar] (1) NULL,
    [EmailAddress] [nvarchar] (50) NULL,
    [YearlyIncome] [money] NULL,
    [TotalChildren] [tinyint] NULL,
    [NumberChildrenAtHome] [tinyint] NULL,
    [EnglishEducation] [nvarchar] (40) NULL,
    [SpanishEducation] [nvarchar] (40) NULL,
    [FrenchEducation] [nvarchar] (40) NULL,
    [EnglishOccupation] [nvarchar] (100) NULL,
    [SpanishOccupation] [nvarchar] (100) NULL,
    [FrenchOccupation] [nvarchar] (100) NULL,
    [HouseOwnerFlag] [nchar] (1) NULL,
    [NumberCarsOwned] [tinyint] NULL,
    [AddressLine1] [nvarchar] (120) NULL,
    [AddressLine2] [nvarchar] (120) NULL,
    [Phone] [nvarchar] (20) NULL,
    [DateFirstPurchase] [date] NULL,
    [CommuteDistance] [nvarchar] (15) NULL
) ON [PRIMARY];
GO

ALTER TABLE [dbo].[DimCustomer] WITH CHECK ADD
    CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
    (
        [CustomerKey]
    )  ON [PRIMARY];
GO
   
ALTER TABLE [dbo].[DimCustomer] ADD
    CONSTRAINT [IX_DimCustomer_CustomerAlternateKey] UNIQUE NONCLUSTERED
    (
        [CustomerAlternateKey]
    )  ON [PRIMARY];
GO

CREATE INDEX [IX_DimCustomer_GeographyKey] ON [dbo].[DimCustomer]([GeographyKey]) ON [PRIMARY];

GO


And now let’s view our log info

DBCC LogInfo
GO



Now we are ready for Bulk insert

BULK INSERT DimCustomer FROM N'D:\<Path>\AdventureWorks2008R2\AdventureWorks 2008R2 Data Warehouse\DimCustomer.csv'
WITH (
   CODEPAGE='ACP',
   DATAFILETYPE = 'widechar',
   FIELDTERMINATOR= '\t',
   ROWTERMINATOR = '\n' ,
   KEEPIDENTITY
)


(18484 row(s) affected)



Now again looking into our log file

DBCC LogInfo
GO

/*------------------------
DBCC LogInfo
GO
------------------------*/

(67 row(s) affected)
DBCC execution completed. If DBCC printed error messages, contact your system administrator.




We can see our log file getting increased to 22 MB from 2 MB initial size. All the entries having status as 2 are active. Issue a CHECKPOINT, then we can see these getting written to respective data file and respective log VLFs getting invalidated to status 0.
Now execute the bulk insert with “TABLOCK” Hint

BULK INSERT DimCustomer FROM N'D:\<Path>\AdventureWorks2008R2\AdventureWorks 2008R2 Data Warehouse\DimCustomer.csv'
WITH (
   CODEPAGE='ACP',
   DATAFILETYPE = 'widechar',
   FIELDTERMINATOR= '\t',
   ROWTERMINATOR = '\n' ,
   KEEPIDENTITY,
   TABLOCK
)

 
Now if we look at log file, we can see that the amount of VLFs generated is just 6 instead of 67.

DBCC LogInfo
GO


Now let’s drop the database, recreate with Full recovery, then again bulk insert and then see the effect on Log file.

USE master;
GO


IF EXISTS (SELECT name FROM master.dbo.sysdatabases WHERE name = N'TestBulkInsert')
BEGIN
      ALTER DATABASE TestBulkInsert SET OFFLINE WITH ROLLBACK IMMEDIATE ;
      ALTER DATABASE TestBulkInsert SET ONLINE ;
      DROP DATABASE TestBulkInsert ;
END;
GO

CREATE DATABASE TestBulkInsert ;
GO

Alter database TestBulkInsert set recovery SIMPLE;
GO

USE TestBulkInsert;
GO
--Rest of the script remains same


Again database has logged everything to log file.
Now let’s try the same thing with “TABLOCK” hint in BULK insert command

BULK INSERT DimCustomer FROM N'D:\<Path>\AdventureWorks2008R2\AdventureWorks 2008R2 Data Warehouse\DimCustomer.csv'
WITH (
   CODEPAGE='ACP',
   DATAFILETYPE = 'widechar',
   FIELDTERMINATOR= '\t',
   ROWTERMINATOR = '\n' ,
   KEEPIDENTITY,
   TABLOCK
)

Now the VLFs has reduced to 39
Now repeat the same with Bulk logged recovery model. Result is as similar to Simple recovery model. And here too the “TABLOCK” hint effects the manner in which logging is done by SQL server.

So in summary, more than selecting a recovery model, hints are of much more importance for the particular operation for better performance. 
Enjoy!

1 comment: