I work for Microsoft Consulting Services in the UK. My primary focus is enterprise applications using C# and SQL Server.
However I also have an interest in other programming paradigms, hence the interest in F#.
A summary of my previous publications can be found here.
A list of posted code can be found here.

Implementing LOB Storage in Memory Optimized Tables

Memory optimized tables do not have off-row or large object (LOB) storage, and the row size is limited to 8060 bytes. Thus, storing large binary or character string values can be done in one of two ways:

This document will provide a mechanism for implementing the split of LOB values into multiple rows, and outline several mechanisms for retrieving the data.

Associated with this post is a set of scripts covering all the TSQL code presented, along with some simple testing scripts, which can be downloaded from here.

Approach Overview

The approach to be taken will allow for the processing of LOB data such that when the data length is greater than a predefined upper limit, segment size, it is broken down into smaller chunks.

If the data fits into a single segment then the process will default to operating on just a primary table with a single row for the key identifier. In this instance reading the data means just returning the data solely from the primary table.

If the data is too large for a single segment then it will be broken down into multiple segments and each part saved as a row with an associated part number; linked to the primary element via the key identifier. In this instance, reading the data then becomes a determination of how many segments the data is stored in, and either concatenating them into a single field or return the individual parts through a SELECT statement.

The procedures to support this approach would be as follows:

·Insert a LOB column when the data length is less than a predefined segment size

·Insert a LOB column when the data length is greater than the predefined segment size and a (max) column definition is needed

·Provide Upsert operations that allow an entry to be made after an insert has previously been made; or to insert data if the current state is not known

·Provide a SELECT process that returns the collection of LOB segments; taking into consideration the fact the data may not be segmented

·Read a LOB column when the size is less than a predefined segment size

·Read a LOB column when the size is greater than the predefined segment size, returning the data in a (max) datatype definition

When implementing your own solution consideration should be given to the predefined segment size. The goal should be such that this size is as small as possible but big enough that a large percentage of the applications data can possibly fit into a single segment. For small transient data, such as session data and small encrypted data elements, adjusting this value will impact the overall performance.

The other added advantage of having the data fit into a single segment is that the data handling Stored Procedures can all be natively compiled. In this case the application would need to know the size of the data it is processing, which may or may not be possible.

Of course, if you are saving MBs of data this optimization will not be feasible so the size should probably be set to 8000 bytes.

However, the primary approach outlined may not be suitable for storing very large quantities of LOB data which are accessed infrequently; say document or images. In this instance a standard file based table approach may be more suitable. However for highly transient data such as session data, encrypted authentication codes, etc. using this approach can dramatically speed up access.

Implementation

For this implementation I have created a new database, called [MemBinaryStore]. In reality the object definitions below would be placed into your own application database:

One must remember Memory Optimized tables have a restriction on the code page for (var)char columns. These columns must use code page 1252, hence the reason the collation has been specified.

Table Objects

This implementation uses two base tables. A Primary table that holds the base binary information and also all the necessary data when the LOB data can fit into a single segment. An Extended table that holds any necessary LOB data segments; multiple records existing for each primary entry when the data is split into segments.

CREATETABLE [dbo].[BinaryItem]

(

[BinaryItemId]binary(16)NOTNULL

CONSTRAINT [PK_BinaryItem] PRIMARYKEYNONCLUSTEREDHASH

WITH (BUCKET_COUNT = 1000000),

[ItemCreated]datetimeNOTNULL,

[ItemAccessed]datetimeNOTNULL,

[BinaryData]varbinary(6000)NULL,

[BinaryExParts]intNOTNULL

)

WITH (MEMORY_OPTIMIZED =ON, DURABILITY = SCHEMA_AND_DATA);

GO

CREATETABLE [dbo].[BinaryItemEx]

(

[BinaryItemId]binary(16)NOTNULL

INDEX IDX_BINARYITEMEX_ID HASH

WITH (BUCKET_COUNT = 1000000),

[BinaryPart]intNOTNULL,

[BinaryDataEx]varbinary(6000)NOTNULL,

CONSTRAINT [PK_BinaryItemEx] PRIMARYKEYNONCLUSTEREDHASH

([BinaryItemId], [BinaryPart])

WITH (BUCKET_COUNT = 20000000)

)

WITH (MEMORY_OPTIMIZED =ON, DURABILITY = SCHEMA_AND_DATA);

GO

When defining your tables care should be taken in defining the correct bucket count. As a very rough guideline, the bucket count should be set to 1-2X the maximum expected cardinality of the table.

In this instance I have chosen a key type of binary(16), rather than a varchar() or uniqueidentifier datatype. This is mostly to avoid specifying a binary collation that is needed for the primary key and index specifications; such as “nvarchar(34) collate Latin1_General_100_BIN2”.

This collation requirement is necessary because indexes on (n)(var)char columns can only be specified with BIN2 collations.

Write Procedures

The process for inserting or updating LOB items is dependent on the length of the LOB data. Calling application should determine what procedure to call based on the size of the data being passed in.

If the data is less than the selected segment size then it is just a simple INSERT operation, which can be performed using a natively compiled Stored Procedure:

You can of course define the primary key to be whatever datatype is applicable for your application.

In the instance that the data length is larger than the defined segment size you have to pass in the data using a (max) datatype definition. In this case the procedure will have to chunk the data down and INSERT it into the extensions table:

In this case the binary header element is still written but the binary block is NULL. Also the part count, [BinaryExParts], is defined as the number of segments the data has been split into.

Breaking down the LOB column is merely a simple case of using the SUBSTRING, which can be used on binary data, to define each segment.

For the INSERT/UPDATE, upsert operations, the process follows the same pattern. In the case of the data being less than the selected segment size an UPDATE operation is first performed. If no entry is found then an INSERT operation is performed:

As in the case of the insert operations, it is only for when the data is less than the selected segment size that the Stored Procedures can be natively compiled. As such, ensuring that a large portion of the data fits into this selected segment size will give the best performance gains.

Access Procedures

The most efficient way to access the binary data is just to return binary segments, through a SELECT statement, and have the application layer combine them back into a single element. The process to do this has to look at the primary element for the specified number of elements and return the corresponding data:

CREATEPROCEDURE [dbo].[GetBinaryDataParts]

(

@BinaryItemIduniqueidentifier

)

WITH NATIVE_COMPILATION,SCHEMABINDING,EXECUTEASOWNER

AS

BEGIN ATOMIC WITH

(

TRANSACTIONISOLATIONLEVEL=SNAPSHOT,

LANGUAGE=N'English'

)

DECLARE @now datetime=GETUTCDATE();

DECLARE @binaryId binary(16)=CAST(@BinaryItemId ASbinary(16));

DECLARE @binaryData varbinary(6000);

DECLARE @binaryParts int= 0;

-- Get the binary data and see if parts exists

UPDATE [dbo].[BinaryItem]

SET[ItemAccessed] = @now,

@binaryData = [BinaryData],

@binaryParts = [BinaryExParts]

WHERE [BinaryItemId] = @binaryId;

-- If parts exists then return these otherwise just return the single element

SELECT @BinaryItemId AS [BinaryItemId], 0 AS [BinaryPart], @binaryData AS [BinaryData];

END

END

The beauty of this approach is that the called Stored Procedure can be natively compiled.

Also, the application calling this Stored Procedure then has the option of better handling the necessary memory allocations for concatenating the LOB segments into a single element.

The approach I have taken here is to update the primary element with the last accessed time. This would allow for elements to be removed based on a defined life expectancy. If this is not needed one could just perform a SELECT of the primary data, rather than an update.

If a SELECT process is not optimal as in the case that the application requires the data to be concatenated one can use a Stored Procedure with an OUTPUT field.

In the advent that the application knows that data size is less than the specified segment size there is the option of directly calling a natively compiled Stored Procedure:

CREATEPROCEDURE[dbo].[GetBinaryData]

(

@BinaryItemIduniqueidentifier,

@BinaryDatavarbinary(6000) OUTPUT

)

WITHNATIVE_COMPILATION, SCHEMABINDING, EXECUTEASOWNER

AS

BEGINATOMICWITH

(

TRANSACTIONISOLATIONLEVEL = SNAPSHOT,

LANGUAGE = N'English'

)

DECLARE@nowdatetime = GETUTCDATE();

DECLARE@binaryIdbinary(16) = CAST(@BinaryItemIdASbinary(16));

-- Get the binary data for output

UPDATE[dbo].[BinaryItem]

SET[ItemAccessed] = @now,

@BinaryData = [BinaryData]

WHERE[BinaryItemId] = @binaryId;

END

In the case that the data size is unknown, or it is larger than the specified segment size, the Stored Procedure has to perform the concatenation into a max datatype, and thus cannot be natively compiled:

CREATEPROCEDURE [dbo].[GetBinaryDataEx]

(

@BinaryItemIduniqueidentifier,

@BinaryDatavarbinary(max)OUTPUT

)

AS

BEGIN

DECLARE @now datetime=GETUTCDATE();

DECLARE @binaryId binary(16)=CAST(@BinaryItemId ASbinary(16));

DECLARE @maxSegmentLength int= 6000;

DECLARE @blobSegment varbinary(6000);

DECLARE @binaryParts int;

DECLARE @count int= 0;

SET @BinaryData =NULL;

-- Get the binary primary and see if parts exists

UPDATE [dbo].[BinaryItem] WITH (SNAPSHOT)

SET[ItemAccessed] = @now,

@BinaryData = [BinaryData],

@binaryParts = [BinaryExParts]

WHERE [BinaryItemId] = @binaryId;

-- If parts exist then build the binary element

IF (@BinaryData ISNULLAND @binaryParts > 0)

BEGIN

WHILE (@count <= @binaryParts)

BEGIN

SELECT @blobSegment = [BinaryDataEx]

FROM [dbo].[BinaryItemEx] WITH (SNAPSHOT)

WHERE [BinaryItemId] = @binaryId AND [BinaryPart] = @count;

IF (@BinaryData ISNULL)

SET @BinaryData =CAST(@blobSegment ASvarbinary(max));

ELSE

SET @BinaryData = @BinaryData +CAST(@blobSegment ASvarbinary(max));

SET @count += 1;

END

END

END

In this case the LOB concatenation is performing simply using the plus operator. Of course this concatenation is only necessary if extended binary segments are found.

From the performance perspective it is probably better to take the original SELECT approach and have the application layer handle the necessary data concatenation; as it can better handle the memory allocations necessary to perform this operation.

Alternative Considerations

Error Checking and Retries

The Stored Procedures outlined above do not take into consideration error checking and retry logic. The approach to either can be that it is the responsibility of the calling application or of the actual Stored Procedures.

The associated scripts included with this post also include a version of the Stored Procedures that perform both error checking and retry logic. As an example here is the [UpsertBinaryItem] Stored Procedure:

This code covers the common errors associated with using an optimistic concurrency control mechanism, as used by the In Memory OLTP engine.

In this case I have set the retry count to be 3, but this can easily be configured to suit your application and environment requirements.

Durability

In the samples above I have defined the tables to have a durability option of SCHEMA_AND_DATA. In the case of transient session data one could again improve performance using just the SCHEMA_ONLY option.

File Based Extension

In addition to taking the approach outlined above there is the object that if the data size is larger than the specified segment size a normal file based table can be used for saving the LOB data. In this case the table definitions would be more like the following:

CREATETABLE [dbo].[BinaryItem]

(

[BinaryItemId]binary(16) NOTNULL

CONSTRAINT [PK_BinaryItem] PRIMARYKEYNONCLUSTEREDHASH

WITH (BUCKET_COUNT = 1000000),

[ItemCreated]datetimeNOTNULL,

[BinaryData]varbinary(6000)NULL,

[IsExtended]intNOTNULL

)

WITH (MEMORY_OPTIMIZED =ON, DURABILITY = SCHEMA_AND_DATA);

GO

CREATETABLE [dbo].[BinaryItemEx]

(

[BinaryItemId]binary(16)NOTNULL,

[TotalLength]intNOTNULL,

[BinaryDataEx]varbinary(max)NOTNULL,

CONSTRAINT [PK_BinaryItemEx] PRIMARYKEYCLUSTERED

([BinaryItemId])

)

ON [DEFAULT] TEXTIMAGE_ON [MemBinaryStore_Binary];

In this case the number of segments is not required, but rather just an indicator of whether extended data exists.

One could even take the approach of defining the database such that the LOB data is saved into a dedicated performant filegroup: