Parallelize and load – Basics of splitting a table

If you are loading data from one SQL Server to another, you have options:

    1. Bcp-out the file and then Bcp-in
    2. Use SSIS to load data directly.

If the table is large you can split up the file creation or increase the data flow tasks in SSIS to load the table faster. This is a very basic way to parallelize a load.  Here is a simple trick to split up the table and it works well if the table in question has at least one somewhat distinct integer type column.

Example:

Table DDL –

CREATE TABLE [dbo].[Tablename](

[DateKey] [int] NOT NULL,

[AccountId] [int] NULL,

[OrderItemId] [bigint] NOT NULL,

[AdId] [int] NOT NULL,

[More columns]….

)

GO

Using a mod function on the distinct integer type column to split the table in 5 pieces-

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 0

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 1

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 2

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 3

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 4

This way of splitting up a table can be used in a Bcp-out call or in SSIS tasks. You can use any divisor number for the mod function.  Just remember to use a 0 based number on the other side of the equal sign.

As an example if you want to divide your table into 10 pieces –

Select * from [DatabaseName]..[Tablename] where OrderItemId%10 = 0

Select * from [DatabaseName]..[Tablename] where OrderItemId%10 = 9

Following script can help in creating Bcp-out scripts.

DECLARE @servername sysname, @dbname sysname, @tablename sysname, @outputdir sysname
        ,@splitbyCol varchar(300), @numsplits int, @totalsplits int
SELECT  @servername = @@SERVERNAME
       ,@dbname = DB_NAME()
       ,@outputdir = ‘c:\temp\’
       ,@tablename = ‘customer’
       ,@splitbyCol = ‘customerID’
       ,@numsplits = 0
       ,@totalsplits = 5

WHILE (@numsplits < @totalsplits)
BEGIN
SELECT ‘bcp "Select * from ‘ +@dbname+’.’+OBJECT_SCHEMA_NAME(object_id)+’.’+ name + ‘ Where ‘+@splitbyCol+ ‘%’+Cast(@totalsplits as varchar(5))+ ‘ = ‘+Cast(@numsplits as varchar(5)) +’" queryout ‘
       + @outputdir + name + ‘.dat.’+Cast(@numsplits as varchar(5))+’ -b 10000 -T -c -S ‘ + @servername
       FROM sys.objects
       WHERE type_desc = ‘USER_TABLE’
       AND name = @tablename

       SET @numsplits = @numsplits+1
END

Some blogsites that I follow