Journal

In publishing, and I’m sure many other industries, we get data at both the daily level and the weekly level. To properly tie these two pieces of data you need to aggregate the daily data and adjust the date to align with the weekly date. Our weekly data is set to the Saturday of the week.

As an example, all daily data acquired between 10/21/2018 and 10/27/2018 must have a WeekEndingDate of 10/27/2018. This can be done with a case statement, but there is an inline formula that will come in handy when doing projections (YTD, previous 6 months, etc.). I'll present the formula and then show modifications to handle variations.

select dateadd(dd,7-datepart(dw,'10/21/2018'),'10/21/2018')

The above formula determines how many days away from day 7 the given date is and simply adds that many days. We use 7 because, with SQL Server's default DATEFIRST setting, Saturday is the 7th day of the week. I'll break it down.

datepart(dw,'10/21/2018')

The above snippet returns a value of 1, because 10/21/2018 is a Sunday; it would return 7 for a Saturday. You subtract that value from 7 to determine how many days you have to move forward. In the example, you need to move 6 days from Sunday to make it to Saturday. This gives you the number of days to add. Reducing the example:

select dateadd(dd,6,'10/21/2018')
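
To roll daily rows up to the weekly level, the same expression can be used in both the select list and the group by. Here is a sketch, assuming a hypothetical daily_sales table with SaleDate and Units_Sold columns and the default DATEFIRST setting:

select dateadd(dd, 7 - datepart(dw, SaleDate), SaleDate) as WeekEndingDate, -- align each day to its Saturday
sum(Units_Sold) as Units_Sold
from daily_sales -- hypothetical daily-level table
group by dateadd(dd, 7 - datepart(dw, SaleDate), SaleDate)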

This comes in very handy when you are looking to move backward in time a certain number of days, weeks, months, or years but align to a Saturday. An example is below, which goes back 1 month.
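
One way to write that, as a sketch assuming getdate() as the anchor date and the default DATEFIRST setting:

-- step back 1 month from today, then align to that week's Saturday
select dateadd(dd, 7 - datepart(dw, dateadd(mm, -1, getdate())), dateadd(mm, -1, getdate()))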

Here is a simple method of finding the median value using SQL. This will be a simple example that finds the median number of units sold for a given book (ISBN) over a period of time. There will be 3 sections of code (that will be combined using CTEs). First we need to find the position of the median, that is, for a given book, the middle row of its sales. The sales_data table contains 3 columns: ISBN, SaleDate, Units_Sold.

select ISBN,
ceiling(count(*)/2.0) as median
from sales_data
group by ISBN

The use of ceiling in the above expression accounts for the fact that if we have only 1 sale for an ISBN, then 1/2=0, whereas ceiling(1/2.0)=1. (Note that with an even number of rows this picks the lower of the two middle values rather than averaging them.) The need for the median position to start at 1 and not 0 will become evident in the text below.

select ISBN,
Units_Sold,
ROW_NUMBER() OVER(Partition by ISBN Order by Units_Sold) as r_num
from sales_data

Since we want to know the median of Units_Sold we must order by Units_Sold. Using the ROW_NUMBER() function we get a sequence of numbers, starting at 1 (here is the need for ceiling above) and continuing until we reach a new ISBN. Once we reach a new ISBN, the numbering restarts at 1. Let's say that our test data looks like this:

ISBN Units_Sold r_num

9780000000001 10 1

9780000000001 13 2

9780000000001 20 3

Our median position will be ceiling(3/2.0)=2. That means the median value of Units_Sold for ISBN=9780000000001 is 13. The full code is below. Please note I create this as a view, so there is no need for a ; before the first with.

Create view MedianUnitsSold
as
with base
as
(
select ISBN,
ceiling(count(*)/2.0) as median
from sales_data
group by ISBN
),
population
as
(
select ISBN,
Units_Sold,
ROW_NUMBER() OVER(Partition by ISBN Order by Units_Sold) as r_num
from sales_data
)
select p.ISBN,
p.Units_Sold
from population p
join base b on (b.ISBN=p.ISBN
and b.median=p.r_num)
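
Once the view exists, you can query it like any other table. Here is a quick check against the sample data above (the literal ISBN is just the sample value, quoted on the assumption the column is stored as text):

select ISBN,
Units_Sold
from MedianUnitsSold
where ISBN = '9780000000001' -- should return 13 for the sample data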

As always, I hope this post has been able to help you solve a problem that you’ve had or provided you with some new insight. Feel free to post comments below.

I came across a situation where I needed unique numbers for both a large and a small set of data. The standard approach is to create a table with an identity column and do your inserts. For a table with tens of thousands of records, this might be the best approach, but there is another one: ROW_NUMBER().

The key to using the ROW_NUMBER() function this way is to partition by a constant so you get a single contiguous sequence across all rows. Let's look at some sample data.

Format
AB
AL
CD
CE
GG
LI

Let's look at the function:

ROW_NUMBER() OVER (Partition by <group> order by <columns>)

I can first construct the function as follows:

select Format, ROW_NUMBER() OVER(PARTITION BY Format ORDER BY FORMAT) as id

However, this will give us the following results:

id Format
1 AB
1 AL
1 CD
1 CE
1 GG
1 LI

The reason is that PARTITION BY Format restarts the numbering for each format, and since each format appears only once, every row gets an id of 1. To solve this problem you need to partition by a value that will never change over the life of the select. I use 1.

select Format, ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY FORMAT) as id
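
With the constant partition, the numbering now runs across the entire result set:

id Format
1 AB
2 AL
3 CD
4 CE
5 GG
6 LI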

After a rather long course change (I was doing little work with SQL Server and was working with other RDBMS systems), I'm back working on Microsoft SQL Server and a few other Microsoft products. Here is an interesting problem presented to me a few days ago.

A colleague of mine needed timed data grouped by 15-minute intervals. I found this to be an interesting problem and didn't see much online, so I thought I'd post this quick tidbit for others to find. The technique can be used to group by any interval with a little tweaking.

The field in question is a datetime field. The code needed to group data in 15 min intervals is this:

(datepart(mi,C001_TimeEnd)/15)*15

Let's break it down. I take the minutes of the datetime field and divide them by 15, which is my interval. That will give me a number between 0 and 3, because integer division truncates the remainder. I then multiply that by 15. The distribution is as follows:

(00-14)/15 = 0, 0*15 = 0

(15-29)/15 = 1, 1*15 = 15

(30-44)/15 = 2, 2*15 = 30

(45-59)/15 = 3, 3*15 = 45

For example, datepart(mi,'2018-05-09 11:48:33')/15 = 48/15 = 3, and 3*15 = 45. Now add the date and hours back in.
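
Putting it together, here is a sketch of the full grouping (the call_log table name is hypothetical; the date and hour are kept by truncating to the hour with the common dateadd/datediff idiom):

select dateadd(mi, (datepart(mi, C001_TimeEnd)/15)*15, dateadd(hh, datediff(hh, 0, C001_TimeEnd), 0)) as IntervalStart, -- date and hour plus the 15-minute bucket
count(*) as Records
from call_log -- hypothetical source table
group by dateadd(mi, (datepart(mi, C001_TimeEnd)/15)*15, dateadd(hh, datediff(hh, 0, C001_TimeEnd), 0))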

I decided to revisit my framework to see if I could improve the performance and possibly increase the number of scenarios this framework could be used in. Turns out I was able to do both. In this post I will outline the changes I've made and go through those new scenarios.

Version 3 of the framework relies less on the components provided by SSIS and more on actual C# coding. The new framework is rooted in the Script Task, .NET 3.5, and the System.IO and System.Threading libraries.

The core of the framework is parallel distribution of work. To do this we enlist the help of C# threads. In simple terms, a thread is a way to assign work to one of the processors your program runs on. SSIS does this behind the scenes by assigning threads to the various components in your package. If you have ever created 2 or more components and executed the package on a system with multiple processors, you will have seen many of those components turn yellow and become active at the same time. The number of components active simultaneously depends on a few SSIS settings and the number of logical processors on your machine. We are going to bypass those settings and limitations by creating our own pool of threads. This new design was put to the test by IMPLAN.

IMPLAN

They have a need to import data contained in Microsoft Access databases. To further complicate this process, each database contains 162 different tables.

One year's worth of data amounts to 3,757 Microsoft Access databases, each containing 162 tables for a total of 3.1 billion records. At the time I was designing this, they had a backlog of 12 years of data, with the caveat that any year could be reloaded at any time.

Current Process

When ingesting even a single year of data, the process took between 72 and 96 hours to complete, and it failed often.

ETL solution:

SQL Server and SSIS were running on separate machines, but with identical configurations:

16GB memory

64-bit Windows operating system

Intel Xeon CPU E5-2690 v2 @ 3GHz (6 processors)

Results

The custom solution can run 1 year's worth of data (3.1 billion records) in approximately 11 hours, averaging 79,000 records/second. The process runs to completion without error and can be rerun whenever needed.

To note: the 79,000 records/second limitation seems to be the maximum rate the SQL Server configuration could handle. The framework itself is limited by only three factors:

Maximum amount of data the destination server can ingest.

Maximum rate the source data can be read.

Maximum number of threads the framework can run on the SSIS machine.

If you think this framework would be of benefit to your company, please feel free to send me an email.