SQL Server Partitioning for large production tables

SQL Server introduced data partitioning in the 2005 release. It is a feature meant to improve the performance of SQL queries, data management, index management, and storage management. The greatest benefit of data partitioning is that you can keep a highly available system while extracting, moving, or removing information from tables with hundreds of millions of rows, without creating blocking.

Due to a wrong estimate of the number of rows, or no estimate at all, when a table was created, the Database Administrator can be confronted at some point with a big problem: the table has too many rows, and indexing no longer helps retrieve the data in a time the whole IT structure considers reasonable. Querying very large tables starts to cause big performance issues as a direct effect of the heavy I/O required to extract the data each time. You can even be in the situation that old data cannot be removed from the table without generating long locks on the whole table.

Also, if you migrated from a SQL Server 2000 system to SQL Server 2005, it is time to consider partitioning for your biggest tables.

So let’s get into the facts and add partitions to a new table that will replace the un-partitioned one. Of course, I don’t suggest that this can be done easily and with no downtime for the Production environment, but with good planning ahead it is possible and worth it. I had to deal with a table that had over 150 million rows, measuring 60 GB in data + index size. The actual downtime was 1 hour. Because the physical storage of a table is set in the CREATE TABLE statement, you cannot add partitioning without re-creating the table. The downtime I mentioned refers to the duration of moving the data from the old un-partitioned table to the new partitioned table.

Prerequisites

For the demonstration purpose, the table in question will have the following structure:

1. Create the table that does not have any partitions

CREATE TABLE [dbo].[Table_without_partitions](
	[Id] [bigint] IDENTITY(1,1) NOT NULL,
	[Username] [varchar](255) NULL,
	[CreatedDate] [datetime] NOT NULL,
	 CONSTRAINT [PK_Table_without_partitions] PRIMARY KEY CLUSTERED 
([Id] ASC) 
)

2. Populate with values – Insert rows for 4 days

INSERT INTO [DB_Test].[dbo].[Table_without_partitions] ([Username],[CreatedDate])
VALUES      ('USER 1',GETDATE ())
GO 20 

INSERT INTO [DB_Test].[dbo].[Table_without_partitions] ([Username],[CreatedDate])
VALUES      ('USER 2',GETDATE ()+1)
GO 20 

INSERT INTO [DB_Test].[dbo].[Table_without_partitions] ([Username],[CreatedDate])
VALUES      ('USER 3',GETDATE ()+2)
GO 20

INSERT INTO [DB_Test].[dbo].[Table_without_partitions] ([Username],[CreatedDate])
VALUES      ('USER 4',GETDATE ()+3)
GO 20

You need to select a partitioning column that will serve as the limiting interval for each partition. This should be the column most often used as a “filter” in your queries, and the most logical one for separating portions of the data. In my scenario, the [CreatedDate] column will be the partitioning column. The partitioning column can also be of another data type. For example, the price of a product can be a partitioning column: you would then have different partitions for products priced from 1 to 1000, 1000 to 2000, 2000 to 3000, and so on.
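As a quick illustration of that last point, here is a minimal sketch of a price-based partition function and scheme (the object names and boundary values are hypothetical, not part of the example that follows):

-- Hypothetical price ranges: up to 1000, 1000-2000, 2000-3000, and above
CREATE PARTITION FUNCTION [Price_PartFunc](money) AS RANGE LEFT FOR VALUES 
(1000, 2000, 3000)
GO
-- 3 boundary values produce 4 partitions, so 4 filegroups are listed
CREATE PARTITION SCHEME [Price_PartScheme] 
AS PARTITION [Price_PartFunc] TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY])
GO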

Create the partition function and the partition schema

Method 1:

First I have to create the partition function. It defines HOW SQL Server will split the data inside the table.
In this method I will create a partition function that contains all 4 boundaries. Note that with RANGE LEFT, 4 boundary values produce 5 partitions: one for each of the 4 days inside our table, plus an empty one for future data.

a) Find out what will be the function boundaries:

 select distinct convert(varchar, CreatedDate, 112) from table_without_partitions

------------------------------
20111129
20111130
20111201
20111202

b) Create the Partition Function that will contain all partition boundaries:


CREATE PARTITION FUNCTION [Table_with_partitions_PartFunc](bigint) AS RANGE LEFT FOR VALUES 
(20111129,20111130,20111201,20111202)
GO

c) Create a partition scheme that specifies a ‘NEXT USED’ filegroup:

The partition scheme is the second step in preparing the new table for partitioning, because it tells SQL Server WHERE you want to place the data. You can select a filegroup different from PRIMARY; the big improvement will be in the SPEED of data access. To achieve this, you can add more filegroups to the database, containing secondary data files (.ndf) located on different – possibly much faster – disks than the primary data file (.mdf). Once the filegroups are created, they can be assigned to different partition boundaries, taking into account which partitions are accessed most.
In my example, though, I will use the default filegroup for all boundaries: PRIMARY.

You will notice that 5 filegroups are specified instead of 4. With RANGE LEFT, the 4 boundary values actually create 5 partitions, so the scheme needs one filegroup per partition; the ALTER PARTITION SCHEME statement below then marks the filegroup that will be used by the next partition added.

CREATE PARTITION SCHEME [Table_with_partitions_PartScheme] 
AS PARTITION [Table_with_partitions_PartFunc] TO ([PRIMARY], [PRIMARY], [PRIMARY],[PRIMARY],[PRIMARY])
GO

ALTER PARTITION SCHEME [Table_with_partitions_PartScheme] 
              NEXT USED [PRIMARY]
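If you did want to spread partitions across disks, a minimal sketch could look like the one below (the filegroup name, file path, and size are hypothetical):

-- Add a filegroup backed by a data file on a separate, faster disk
ALTER DATABASE [DB_Test] ADD FILEGROUP [FG_Fast]
GO
ALTER DATABASE [DB_Test] ADD FILE 
    (NAME = 'DB_Test_Fast', FILENAME = 'E:\Data\DB_Test_Fast.ndf', SIZE = 1GB) 
    TO FILEGROUP [FG_Fast]
GO
-- The scheme could then map the most accessed partitions to the faster filegroup, e.g.:
-- CREATE PARTITION SCHEME [...] AS PARTITION [...] TO ([PRIMARY], [PRIMARY], [FG_Fast], [FG_Fast], [FG_Fast])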

Method 2:

Create the partition function for a single boundary – the lowest – and then alter the partition function to add the next needed boundaries, using a SQL script that generates the partition values based on your table’s values. I used this method while doing the partitioning in the Production environment, as there were over 200 boundaries and it was not much fun to write down a list of boundaries for the partition function.

a) Find out the lowest boundary:

select min(convert(varchar, CreatedDate, 112)) from table_without_partitions

------------------------------
20111129

b) Create the partition function that will contain only the lowest boundary:

CREATE PARTITION FUNCTION [Table_with_partitions_PartFunc](bigint) AS RANGE LEFT FOR VALUES 
(20111129)
GO

c) Create the partition scheme that specifies a ‘NEXT USED’ filegroup:

CREATE PARTITION SCHEME [Table_with_partitions_PartScheme] 
AS PARTITION [Table_with_partitions_PartFunc] TO ([PRIMARY], [PRIMARY])
GO

ALTER PARTITION SCHEME [Table_with_partitions_PartScheme] 
              NEXT USED [PRIMARY]

d) Alter the Partition Function and add the next needed boundaries:

The following script will generate and execute the SQL statements for the next 3 partition boundaries. Each SPLIT uses the filegroup already marked as NEXT USED (initially by the ALTER PARTITION SCHEME statement above) and then marks PRIMARY as NEXT USED again, ready for the following split.

DECLARE @StartDate datetime
DECLARE @EndDate datetime
DECLARE @Script nvarchar(4000)
DECLARE @PartValue bigint

-- Start one day after the lowest boundary (already created above) and stop at the newest date
SET @StartDate = (SELECT MIN(CreatedDate) + 1 FROM table_without_partitions)
SET @EndDate = (SELECT MAX(CreatedDate) FROM table_without_partitions)

WHILE @StartDate <= @EndDate 
BEGIN	
		-- Boundary value in the same yyyymmdd bigint format as the partition function
		SET @PartValue = CONVERT(bigint, CONVERT(varchar(8), @StartDate, 112))
		SET @Script = 'ALTER PARTITION FUNCTION [Table_with_partitions_PartFunc] () SPLIT RANGE ('+ CONVERT(varchar(20), @PartValue) +')
		ALTER PARTITION SCHEME [Table_with_partitions_PartScheme]
					  NEXT USED [PRIMARY] '
		PRINT @Script
		EXEC sp_executesql @statement = @Script
		SET @StartDate = @StartDate + 1
END

/*Results:
ALTER PARTITION FUNCTION [Table_with_partitions_PartFunc] () split RANGE (20111130)
ALTER PARTITION SCHEME [Table_with_partitions_PartScheme] NEXT USED [PRIMARY] 

ALTER PARTITION FUNCTION [Table_with_partitions_PartFunc] () split RANGE (20111201)
ALTER PARTITION SCHEME [Table_with_partitions_PartScheme] NEXT USED [PRIMARY] 

ALTER PARTITION FUNCTION [Table_with_partitions_PartFunc] () split RANGE (20111202)
ALTER PARTITION SCHEME [Table_with_partitions_PartScheme] NEXT USED [PRIMARY] 
*/

With both methods you end up with one partition function covering all 4 boundary values, and one partition scheme.
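To double-check the boundaries that were actually created, you can query the catalog views:

SELECT pf.name, prv.boundary_id, prv.value
FROM sys.partition_functions pf
JOIN sys.partition_range_values prv ON pf.function_id = prv.function_id
WHERE pf.name = 'Table_with_partitions_PartFunc'
ORDER BY prv.boundary_id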

Partition a table

1. Create the new table that will replace the old one without partitions

The new table will be created ON the new partition scheme, which in turn points to the partition function.
The table will have a new BIGINT computed column named [Partition_column], based on the existing [CreatedDate] column. It must be PERSISTED so it can be used as the partitioning column.

CREATE TABLE [dbo].[Table_with_partitions](
	[Id] [bigint] IDENTITY(1,1) NOT NULL,
	[Username] [varchar](255) NULL,
	[CreatedDate] [datetime] NOT NULL,
	[Partition_column]  AS (CONVERT([bigint],CONVERT([varchar],[CreatedDate],(112)),0)) PERSISTED,
 CONSTRAINT [PK_Table_with_partitions] PRIMARY KEY CLUSTERED  ([Id] ASC, [Partition_column] ASC)
 ) ON Table_with_partitions_PartScheme ([Partition_column])

Go to table properties and check that the table is partitioned:

[Image: Partitioned table]
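If you prefer T-SQL to the GUI, the same check can be done against the catalog views:

SELECT i.name AS index_name, ps.name AS partition_scheme
FROM sys.indexes i
JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
WHERE i.object_id = OBJECT_ID('dbo.Table_with_partitions')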

Migrate rows from the un-partitioned table to the partitioned table

1. Check the distribution of rows across the partitions of [Table_with_partitions]

SELECT partition_number, SUM(rows) AS sum_rows FROM sys.partitions
WHERE object_id = OBJECT_ID('Table_with_partitions') GROUP BY partition_number ORDER BY partition_number

partition_number SUM_ROWS
---------------- --------------------
1                0
2                0
3                0
4                0
5                0

2. Insert all the rows from [Table_without_partitions] to [Table_with_partitions]

This can be done by various methods:

  • SQL Server Export/Import Wizard - you may need to check the "Enable Identity Insert" option in the "Edit Mappings" window.
  • SQL statement: Insert into [...] Select from [...]
    
    SET IDENTITY_INSERT [Table_with_partitions] On
    GO
    insert into [Table_with_partitions] ([Id],[Username],[CreatedDate])
    select [Id], [Username], [CreatedDate]
    from [Table_without_partitions] 
    SET IDENTITY_INSERT [Table_with_partitions] Off
    
  • BCP utility to export and BULK INSERT to import. I noticed that BULK INSERT is faster than the BCP utility for import operations. The constant '1111' in the export query is only a placeholder field for the computed [Partition_column]: BULK INSERT ignores values supplied for computed columns and SQL Server recalculates them.
    EXEC master..xp_cmdshell 'BCP "SELECT [Id], [Username], [CreatedDate], ''1111'' as [PARTITION_COLUMN] FROM [DATABASE].[dbo].[Table_without_partitions]" queryout D:\BCP_export_file.TXT  -w -t\t -r\n  -SSERVERNAME -T'
    
    BULK INSERT [DATABASE].[dbo].Table_with_partitions FROM 'D:\BCP_export_file.txt' WITH (KEEPIDENTITY, FIELDTERMINATOR = '\t', ROWTERMINATOR ='\n')
    
    

You are free to select the one that suits you best, considering the number of rows to be migrated. From my experience, the first two methods are efficient when handling up to 5-10 million rows. When we are speaking about tens or hundreds of millions, I strongly suggest you use the BCP method. It was the fastest for me, and the safest for the SQL Server, because it doesn't hold long blocking locks on the source table.

3. Check again the distribution of rows across the partitions of [Table_with_partitions]

SELECT partition_number, SUM(rows) AS sum_rows FROM sys.partitions
WHERE object_id = OBJECT_ID('Table_with_partitions') GROUP BY partition_number ORDER BY partition_number

partition_number SUM_ROWS
---------------- --------------------
1                20
2                20
3                20
4                20
5                0

Now all rows have been distributed across the created partitions, and the 5th one is empty, waiting to be populated with the next partition's data.
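You can also ask SQL Server directly which partition a given value, or each row, falls into, using the $PARTITION function:

-- Partition number for a specific boundary value
SELECT $PARTITION.Table_with_partitions_PartFunc(20111130) AS partition_number

-- Row counts per partition, computed from the table itself
SELECT $PARTITION.Table_with_partitions_PartFunc(Partition_column) AS partition_number, COUNT(*) AS row_count
FROM dbo.Table_with_partitions
GROUP BY $PARTITION.Table_with_partitions_PartFunc(Partition_column)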

Deploy in Production

1. Add the next partition

You should create a SQL Server Agent job that runs daily and adds the partition for the next day, so that the rows inserted for 2011/12/03 are assigned to the right partition. Of course, you can also create more partitions ahead of time and schedule the job to run weekly or monthly.

ALTER PARTITION FUNCTION [Table_with_partitions_PartFunc] () split RANGE (20111203)
ALTER PARTITION SCHEME [Table_with_partitions_PartScheme]  NEXT USED [PRIMARY]
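As a sketch, the daily job step could compute the next boundary dynamically, following the same SPLIT-then-NEXT-USED pattern as the script from Method 2 (the variable names are mine):

DECLARE @NextBoundary bigint
DECLARE @Sql nvarchar(4000)

-- Tomorrow's boundary, in the same yyyymmdd bigint format as the partition function
SET @NextBoundary = CONVERT(bigint, CONVERT(varchar(8), GETDATE() + 1, 112))
SET @Sql = 'ALTER PARTITION FUNCTION [Table_with_partitions_PartFunc] () SPLIT RANGE (' + CONVERT(varchar(20), @NextBoundary) + ')
ALTER PARTITION SCHEME [Table_with_partitions_PartScheme] NEXT USED [PRIMARY]'
EXEC sp_executesql @statement = @Sql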

2. Testing

Insert 10 rows for the next day into the new partitioned table [Table_with_partitions].

INSERT INTO [DB_Test].[dbo].[Table_with_partitions] ([Username],[CreatedDate])
VALUES      ('USER 5',GETDATE ()+4)
GO 10

3. Checking

The 10 newly inserted rows are assigned to the 5th partition.

SELECT partition_number, SUM(rows) AS sum_rows FROM sys.partitions
WHERE object_id = OBJECT_ID('Table_with_partitions') GROUP BY partition_number ORDER BY partition_number

partition_number sum_rows
---------------- --------------------
1                20
2                20
3                20
4                20
5                10
6                0

That's everything you have to do in order to get rid of the old un-partitioned table.

[UPDATE....]

A small update is in order here, because I missed one detail about creating clustered indexes on partitioned tables.

Create Indexes on partitioned table

Creating Primary Key/Unique Key on partitioned table

When speaking about clustered indexes, I am referring to either primary keys or unique keys. For a partitioned table, the basic behavior is different from a non-partitioned one. Unlike a non-partitioned table, which can have a primary key composed of a single column that uniquely identifies a row, a partitioned table's primary key must also contain the partitioning column as a subset of the index key. The default behavior of a clustered index built on a partitioned table is to be aligned with the table, on the same partition scheme.

That is the reason the Primary Key from my example above is composed of [Id] and [Partition_column]. It ensures that the Index will also be partitioned.

Whereas this is required for clustered indexes, non-clustered indexes can be either PARTITIONED (aligned) or NON-PARTITIONED (non-aligned) indexes. You define this with the ON clause when creating the index. If you don't specify anything, by default the index is created as a partitioned index on the partitioning scheme. To create a non-partitioned index, you need to specify ON [PRIMARY] (or another filegroup) instead.
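For example, a hypothetical non-clustered index on [Username], sketched both ways:

-- Aligned (partitioned) index: created on the partition scheme by default
CREATE NONCLUSTERED INDEX IX_Table_with_partitions_Username
ON dbo.Table_with_partitions (Username)

-- Non-aligned (non-partitioned) index: explicitly placed on a single filegroup
CREATE NONCLUSTERED INDEX IX_Table_with_partitions_Username_NP
ON dbo.Table_with_partitions (Username) ON [PRIMARY]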

I found a very good article written by Michelle Ufford, in which she explains very nicely the performance impact of having both partitioned and non-partitioned indexes.

Creating Foreign Key referencing a partitioned table

Since the primary key on a partitioned table also contains the partitioning column, when you try to create a foreign key that references a column of the partitioned table that is part of the primary key (other than the partitioning column), you will receive the error below:

There are no primary or candidate keys in the referenced table ' ' that match the referencing column list in the foreign key ' '.

To overcome this issue, you first need to create a unique non-clustered index on the referenced column of the partitioned table. A FOREIGN KEY constraint does not have to be linked only to a PRIMARY KEY constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint or unique index in another table.
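A minimal sketch of the fix, using the example table (the child table [Orders] and its columns are hypothetical). Note that a unique index on [Id] alone cannot be aligned with the partition scheme, so it is placed ON [PRIMARY]:

-- Unique non-clustered index on [Id] alone; non-aligned, placed on a single filegroup
CREATE UNIQUE NONCLUSTERED INDEX UQ_Table_with_partitions_Id
ON dbo.Table_with_partitions (Id) ON [PRIMARY]
GO

-- Hypothetical child table whose foreign key references the unique index above
CREATE TABLE dbo.Orders (
	[OrderId] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY,
	[UserId] [bigint] NOT NULL,
	CONSTRAINT [FK_Orders_Table_with_partitions] FOREIGN KEY ([UserId])
		REFERENCES dbo.Table_with_partitions ([Id])
)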