Thursday, December 31, 2015

The most friendly database tool, Management Studio, was released with SQL Server 2005 and has become the right hand of both database administrators and developers. SQL Server Management Studio is an integrated environment for all types of operations and management related to SQL Server, and it is a rich tool equipped with many functionalities and an editor. Although we have been using it for a long time, Query Shortcuts are still unknown to many. Here is a post on them; you will surely find them useful.

Query Shortcuts allow you to configure key combinations for executing your common commands. By default, three shortcuts are already configured and can be used immediately. Viewing the configured shortcuts and adding new ones can be done through the Options dialog, which can be opened from the Tools menu.

As you see, three shortcuts are already configured. If you press Ctrl+1 in the Query Editor, the sp_who procedure is executed automatically and you will see its result.

In addition to that, we can configure our own shortcuts. Here are some examples;

"SELECT * FROM " is configured with Ctrl+5 and "EXEC" is configured with Ctrl+6. This allows us to highlight either a table name or a stored procedure name and have it queried or executed using the assigned shortcut. For example, if Person.Person is selected and Ctrl+5 is pressed, the statement SELECT * FROM Person.Person will be executed.

Make sure you do not assign shortcuts to all operations. Operations like DELETE should not be configured, because once configured you can accidentally delete records.

There are many ways of collecting statistics about SQL Server, such as CPU usage, IO performed, and reads and writes. One facility given by SQL Server is a system stored procedure, sp_monitor. This procedure can be used for seeing statistics related to resource usage, as SQL Server keeps this information using system statistical functions. It shows values accumulated since SQL Server was restarted as well as values since sp_monitor was last run.
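Calling the procedure itself is straightforward; a minimal sketch:

```sql
-- sp_monitor reports cumulative and since-last-run statistics such as
-- cpu_busy, io_busy, packets received/sent, and total reads/writes/errors.
-- Note: running it requires membership in the sysadmin fixed server role.
EXEC sp_monitor;
```

Each run resets the "since last run" interval, so scheduling it at a fixed interval gives comparable deltas.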

For more details: https://msdn.microsoft.com/en-us/library/ms188912.aspx

Here is the result of the procedure;

This has mainly been given to DBAs for a quick look. Because of that, collecting these values regularly for future analysis is a bit difficult. The format is not user-friendly and the procedure returns multiple result sets, hence calling it from T-SQL and saving the output in a table is not easy. However, it can be called and the result saved using simple C#.NET code. Let's create a table for collecting the values first.
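The original table definition is not shown here, so the following is only a sketch of what such a collection table could look like; the column names are assumptions based on sp_monitor's typical output.

```sql
-- Hypothetical table for storing sp_monitor output collected by an
-- external program (for example, the C#.NET code mentioned above).
CREATE TABLE dbo.MonitorStatistics
(
    CollectedAt datetime NOT NULL DEFAULT GETDATE()
    , CpuBusySeconds int NOT NULL
    , IoBusySeconds int NOT NULL
    , IdleSeconds int NOT NULL
    , PacketsReceived int NOT NULL
    , PacketsSent int NOT NULL
    , PacketErrors int NOT NULL
    , TotalReads int NOT NULL
    , TotalWrites int NOT NULL
    , TotalErrors int NOT NULL
    , Connections int NOT NULL
);
```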

If you continuously run this code, or schedule it for every hour, you will have enough data to see how busy your SQL Server is and whether it needs more resources. Here is a simple chart created using SSRS, based on the saved data.

We either author or consume reports, which makes us part of some reporting scenario. Regardless of the activities they perform, users fall into common roles based on their involvement. Here are some of the roles noticed and identified; this will be useful especially for documentation and proposals.

Database/Application Developer

A professional software engineer who works mainly on databases and building applications. This person creates reports as part of application development and will work more on formal/tabular reports.

BI Developer

A professional BI developer creates special reports such as analytical reports, interactive reports and dashboard reports, connecting with data marts and data warehouses. A BI developer works more on data warehousing and BI than on general OLTP database solutions.

IT Professional

This is a technical role that always involves infrastructure such as servers and networks. It includes job types such as database administrator and IT administrator, and these people rarely work with reports. However, they might be involved in creating reports for their own work or for requirements received from other users, specifically business users.

Power User

A power user is not considered a technical professional but a business user who has thorough knowledge of the domain and business. This role is played by professionals like business analysts or accountants who do not have much development knowledge but use specialized tools for creating their own reports, applying the knowledge they have to the analysis they want to perform.

Information Worker

Just like the Power User, an Information Worker is a business user who is involved only with consumption. This person uses the data available in reports for various analyses, mainly for decision-making, and holds no knowledge of their technical implementation.

There can be many other roles and classifications in terms of report development and consumption. Any missing common ones?

Wednesday, December 30, 2015

This is based on a question I was asked today about concurrent connections. When running some procedures, how can we make sure that no one except a few users can access the server while they are running? In other words, how do we instruct SQL Server to allow only n connections?

Simply, if you want to limit your instance to a limited number of connections, it can be configured with a property in the Connections section;

By default, it is 0, which means unlimited, though still bound by the number of connections allowed by the version and edition. If you want to make sure that only 3 concurrent connections can be made, set this to 3 and restart the service. The same can be achieved using the sp_configure stored procedure.
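The equivalent sp_configure call would look roughly like this; 'user connections' is an advanced option, and as noted above the new limit takes effect only after a service restart.

```sql
-- 'user connections' is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Limit the instance to 3 concurrent user connections.
EXEC sp_configure 'user connections', 3;
RECONFIGURE;
-- Restart the SQL Server service for the change to take effect.
```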

Tuesday, December 29, 2015

While I was writing some Hive queries for extracting words from a lengthy string, I came across a useful extended stored procedure that allows us to split the values in a string using space as the separator. It has some limitations, but it can be useful in some scenarios.

Monday, December 28, 2015

My first post related to GUIDs discussed the main issue we face when using a GUID as the key, specifically as the clustered key. Here is another concern we should consider when maintaining a table with a GUID as the key. This issue does not need to be considered critical if the table contains a low number of records. But if it contains millions of records, this is something we certainly need to consider.

Let's create two tables just as in the first post. The code below creates two tables named dbo.Sales_WithGUID and dbo.Sales_WithoutGUID and loads the same data set into both.

As you see, the first table, which is keyed on a GUID, consumes 10.7MB for data and 80KB for indexes. The second table, keyed on int+char(2), consumes 10.0MB for data and 56KB for indexes. Now, let's add non-clustered indexes to both tables.
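The original index-creation code is not shown here; a hypothetical sketch, assuming both tables carry an OrderDate column, would be:

```sql
-- Add one non-clustered index per table on the same (assumed) column,
-- so the only difference between the two indexes is the clustered key
-- each non-clustered row carries.
CREATE NONCLUSTERED INDEX IX_Sales_WithGUID_OrderDate
    ON dbo.Sales_WithGUID (OrderDate);

CREATE NONCLUSTERED INDEX IX_Sales_WithoutGUID_OrderDate
    ON dbo.Sales_WithoutGUID (OrderDate);
```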

Focus on the space used by the indexes. The first table uses 11MB while the second uses only 7MB. Note that this is only for 60,000 records; if the table had more records, the gap would be significantly larger.

How can this happen when we have the same number of records and the same types of indexes? This is because every non-clustered index row carries the clustered key with it. For the first table, since the clustered key is 16 bytes, each non-clustered key value is 16+4 bytes. For the second table, it is 6+4 bytes. This is the reason for the higher space usage of the non-clustered indexes.

Here is the code that shows the number of data pages used by each index. A larger number of pages increases IO operations and resource usage.
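The original query is not shown here; a sketch of how the per-index page counts can be checked, assuming both tables live in the current database, is:

```sql
-- Report in-row data page counts for every index on the two tables.
SELECT OBJECT_NAME(i.object_id) AS TableName
    , i.name AS IndexName
    , s.in_row_data_page_count
FROM sys.dm_db_partition_stats AS s
    INNER JOIN sys.indexes AS i
        ON s.object_id = i.object_id
        AND s.index_id = i.index_id
WHERE i.object_id IN (OBJECT_ID('dbo.Sales_WithGUID'), OBJECT_ID('dbo.Sales_WithoutGUID'));
```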

Here are two ways of deleting duplicate records. Both work fine; however, the first method might give better performance over a large data set.

-- Method 1: delete duplicates using a CTE and ROW_NUMBER
WITH cte AS
(
    SELECT *
        , ROW_NUMBER() OVER (PARTITION BY ProductCode ORDER BY ProductCode) AS RowNumber
    FROM dbo.ProductCode
)
DELETE FROM cte
WHERE RowNumber > 1;

-- Checking records
SELECT * FROM dbo.ProductCode
ORDER BY ProductCode;

-- Method 2: delete duplicates one row at a time using SET ROWCOUNT
SET ROWCOUNT 1;
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.ProductCode
    WHERE ProductCode IN (SELECT ProductCode
                          FROM dbo.ProductCode
                          GROUP BY ProductCode
                          HAVING COUNT(*) > 1);
    IF @@ROWCOUNT = 0
        BREAK;
END
SET ROWCOUNT 0;

-- Checking records
SELECT * FROM dbo.ProductCode
ORDER BY ProductCode;

Saturday, December 26, 2015

While analyzing a database belonging to one of the companies, I noticed that the key of most tables is based on a GUID (uniqueidentifier), which drastically hinders performance unless it is managed properly. Unfortunately, in my consultancy experience, 90% of databases that use GUIDs as table keys are not properly managed, and many are unaware of the damage it does even though they experience poor performance in data manipulation and retrieval.

Most reason that they need a GUID as the key because of the need for uniqueness among records, specifically when records are combined from multiple companies or regions. But in many cases, this can be handled without a uniqueidentifier, using a composite key formed from an integer value and a code that represents the company or region. Understanding its cost and how it slows down your queries will surely guide you to the right decision; hence this post.

Let's try to understand the most common issue related to GUIDs as keys. The following code creates a table called Sales_WithGUID and loads data from AdventureWorks. The key of the table, SalesKey, is a GUID. Note that I do not load records as a batch, but one by one, simulating the real operation.

Let's create another table called Sales_WithoutGUID and load the same data. Note that its primary key is a combination of an int and a char column. This is what I explained above and suggested to the company: using two columns as a composite primary key rather than a uniqueidentifier. With this composite key, the company can be identified and records can still be uniquely identified.
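The original table definitions are not shown here; a hypothetical sketch of the two shapes being compared, with assumed column names, is:

```sql
-- Table keyed on a 16-byte GUID clustered key.
CREATE TABLE dbo.Sales_WithGUID
(
    SalesKey uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWID()
    , OrderDate datetime NOT NULL
    , SalesAmount money NOT NULL
);

-- Table keyed on a 6-byte composite clustered key (int + char(2)).
CREATE TABLE dbo.Sales_WithoutGUID
(
    SalesId int NOT NULL                -- 4 bytes
    , CompanyCode char(2) NOT NULL      -- 2 bytes: identifies the company/region
    , OrderDate datetime NOT NULL
    , SalesAmount money NOT NULL
    , CONSTRAINT PK_Sales_WithoutGUID PRIMARY KEY (SalesId, CompanyCode)
);
```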

Now let's analyze. The code below has two SELECT statements that check the fragmentation of both tables. As you see, the first table is externally fragmented and the second table is not. This is the biggest and most common issue with GUIDs. Since the generated values are not sequential, when inserting records SQL Server has to split data pages to maintain the clustered key order, moving records to new pages and causing external fragmentation (read more on fragmentation at: http://dinesql.blogspot.com/2013/08/understanding-index-fragmentation-ss.html). Because of this, reads have to access a larger number of data pages, slowing down the query and consuming resources.
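A sketch of such a fragmentation check using sys.dm_db_index_physical_stats, assuming the tables are in the current database:

```sql
-- High avg_fragmentation_in_percent on the clustered index (index_id = 1)
-- indicates external (logical) fragmentation caused by page splits.
SELECT OBJECT_NAME(object_id) AS TableName
    , index_id
    , avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Sales_WithGUID'), NULL, NULL, 'LIMITED');

SELECT OBJECT_NAME(object_id) AS TableName
    , index_id
    , avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Sales_WithoutGUID'), NULL, NULL, 'LIMITED');
```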

The other issue is the space required. As you see, the second table requires only 10MB for maintaining records while the first table requires 15MB.

There is another considerable impact on non-clustered indexes too; I will make another post on it.

Now that you see the disadvantages of GUIDs, try to minimize their usage if possible. There is a workaround if the ordering of the key is the only issue: the function NEWSEQUENTIALID can be used instead of NEWID for generating GUIDs, but the second issue discussed still remains unsolved.
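A minimal sketch of the workaround; note that NEWSEQUENTIALID can only be used in a DEFAULT constraint, not called directly in a SELECT. The table name here is illustrative.

```sql
-- Sequential GUIDs avoid page splits on insert because each new value
-- sorts after the previous one, but the key is still 16 bytes wide.
CREATE TABLE dbo.Sales_WithSequentialGUID
(
    SalesKey uniqueidentifier NOT NULL
        PRIMARY KEY DEFAULT NEWSEQUENTIALID()
    , OrderDate datetime NOT NULL
);
```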

Friday, December 25, 2015

We have been working with databases and SQL for a long time, but some of the important theories and concepts are purposely ignored, assuming they are not crucial and that the implementation can be completed without any issue. But understanding them and applying them to our implementations makes our solutions sounder and less error-prone, because the operations implemented flow the correct way, producing the right result. Did you know that terms like Set Theory and Predicate Logic are not well known among engineers, including database engineers?

Let's try to understand Predicate Logic in its simple form. It is a mathematical basis for the relational database model and, as per the theory, a predicate is an expression or property that is either true or false. It is sometimes considered a Boolean expression too.

Although the theory describes two possible outcomes (true or false), unknown values are not uncommon in database world, hence it needs to be considered. This extends the outcomes of Predicate Logic to three: True, False or Unknown.

How can we apply this to a database? Definitely for comparison; simply as an expression that evaluates to true, false or unknown. Remember, unknown is null in the database model.
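A tiny illustration of this three-valued logic: a predicate that compares anything with NULL evaluates to unknown, so the row is filtered out.

```sql
-- A predicate must evaluate to true for the row to be returned.
SELECT 'returned' AS Result WHERE 1 = 1;        -- true: returns one row
SELECT 'returned' AS Result WHERE NULL = NULL;  -- unknown: returns no rows
SELECT 'returned' AS Result WHERE NULL IS NULL; -- true: returns one row
```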

Predicates do not stand alone when implemented with SQL Server databases. A predicate is always part of a query, and the following are the possible roles it can play;

Thursday, December 24, 2015

If you are working with Azure SQL Databases, you are well aware that Azure SQL Server needs your IP address to allow you to access an Azure SQL database. Although it is considered a one-time setting, it is not if the IP address of your machine changes frequently or is dynamically assigned by a service like DHCP. In a situation like that, we always have to open the portal and add the IP address via the Firewall Settings of the Azure SQL Server, which is time-consuming and inconvenient.

Good news! Now this setting is available in Management Studio: if it detects that the current IP address is no longer valid for access, it allows you to log in to your Azure subscription and add the current IP address to the Firewall Settings. This is a really good facility and saves time. Remember, it is available only with the Management Studio December 2015 Preview.

I think this is only related to the SSMS November 2015 Preview, because there is no issue with SQL Server 2012 Management Studio. However, if you want to use the latest version, install the December 2015 Preview; it sorts out the issue.

Tuesday, December 22, 2015

It is no surprise that we still see much dynamic SQL that is wrongly implemented, allowing smart hackers to get into unauthorized databases and manipulate data as they want. There are many SQL injection techniques that are famous and widely discussed, but some are tiny and unnoticed, hence ignored. Here is something like that.

WHERE ""="" is always true. This is something you have to remember. If you have not parameterized your code or used a relevant technique for executing your dynamically constructed T-SQL code, you open your database to unauthorized people. Have a look at the stored procedure below.
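The original procedure is not shown here, so this is only a hypothetical sketch of the vulnerable pattern; the procedure and column names are assumptions. The flaw is that user input is concatenated directly into the dynamic string.

```sql
-- VULNERABLE: the parameter value is concatenated into the statement,
-- so anything the caller passes becomes part of the executed SQL.
CREATE PROCEDURE dbo.GetOrderDetails @PurchaseOrderNumber varchar(50)
AS
BEGIN
    DECLARE @Sql varchar(500);
    SET @Sql = 'SELECT * FROM Sales.SalesOrderHeader WHERE PurchaseOrderNumber = '''
               + @PurchaseOrderNumber + '''';
    EXEC (@Sql);
END
```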

This stored procedure returns the details of the order based on the supplied purchase order number. If you execute the code (directly using SSMS or via an interface developed on top of it), you get the desired result;

Now what if someone has no purchase order number but still wants to see orders? This is where the WHERE ""="" injection can be used. See the code below;

If someone passes a value like this via an interface, he might be able to see records he is not supposed to see. This can be controlled by parameterizing the query and executing the dynamic code using sp_executesql instead of EXEC.
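A sketch of the parameterized version, reusing the hypothetical names from the vulnerable example above: because the input is passed as a typed parameter, injected text can no longer change the shape of the query.

```sql
-- SAFE: the value travels as a parameter, never as part of the statement.
CREATE PROCEDURE dbo.GetOrderDetails_Safe @PurchaseOrderNumber varchar(50)
AS
BEGIN
    DECLARE @Sql nvarchar(500) =
        N'SELECT * FROM Sales.SalesOrderHeader WHERE PurchaseOrderNumber = @PO';
    EXEC sp_executesql @Sql, N'@PO varchar(50)', @PO = @PurchaseOrderNumber;
END
```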

Monday, December 21, 2015

The SQL Server Management Studio December 2015 release is available for download, with a few enhancements over the November release and some bug fixes. Once installed, you should see the version upgraded from 12.0.2269.0 to 13.0.900.73. Here is the link for downloading;

Saturday, December 19, 2015

Here is this week's Brain Basher. This is based on the Analysis Services Multidimensional model, which allows you to create models based on dimension and fact tables. The biggest advantage of the multidimensional model is that it creates aggregates (once enabled) based on the hierarchies defined and stores them, improving the performance of data retrieval from the model. The creation of aggregations is done during processing: it extracts data from the sources (preferably from the relational data warehouse), creates aggregates if enabled, and then transfers and stores dimension and fact data (row level, based on the grain) including aggregates in the model. Now the question is;

Which dimension type does not calculate aggregates based on the hierarchies defined when the cube is processed?

Calculating aggregates is not applicable to all dimension types. If the dimension is a Parent-Child dimension, which is an unbalanced and changing dimension, the hierarchy structure is calculated at run time, not during processing. This makes the dimension write-enabled, meaning that records in the dimension table can be updated and the updates can be seen in the model without processing the dimension.

Remember, even though this reduces the processing time, it increases the time it takes for retrieving data. If you see this as a disadvantage, you need to handle the dimension as a regular dimension.

Friday, December 18, 2015

One key thing we look at in performance tuning of T-SQL is the number of data pages read for retrieving data, either from memory or disk. We always trust SET STATISTICS IO ON and take decisions based on its result. But it does NOT always give an accurate value, and it might mislead you into taking a wrong action.

Let's check this and see. The following code creates a scalar function that returns the number of products related to an order, accepting the SalesOrderId. It reads the SalesOrderDetail table for getting the count.
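The original code is not shown here; a hypothetical sketch of such a function, with names assumed from AdventureWorks, would be:

```sql
-- Scalar function that counts the detail rows for a given order.
-- When used in a query over SalesOrderHeader, the reads it performs on
-- SalesOrderDetail are not reflected by SET STATISTICS IO.
CREATE FUNCTION dbo.GetProductCount (@SalesOrderId int)
RETURNS int
AS
BEGIN
    DECLARE @Count int;

    SELECT @Count = COUNT(*)
    FROM Sales.SalesOrderDetail
    WHERE SalesOrderID = @SalesOrderId;

    RETURN @Count;
END
```

A query such as `SELECT SalesOrderID, dbo.GetProductCount(SalesOrderID) FROM Sales.SalesOrderHeader;` would then exercise both tables while STATISTICS IO reports only the header table.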

Since we access both tables, the number of pages read should definitely be more than 689, but it looks like only the pages related to SalesOrderHeader are shown. How can this happen? I do not have an answer now but will surely find the reason. Now the question is: how can we see the accurate value? Is it possible?

Yes, it is possible. It can be clearly seen with Profiler. If you analyze the SQL:BatchCompleted and SQL:StmtCompleted events, you can get the correct value. Here is the accurate value for the query, taken from Profiler.

Just be aware of this; it can happen with statements like TOP too. In a way, it is always better to use Profiler for seeing pages read, but STATISTICS IO can still be used for small queries with no functions.

Thursday, December 17, 2015

While I was reading a newsletter from Hortonworks, I noticed an article related to data titled Data Wrangling. Across my experience of dozens of business intelligence implementations, although I have worked with heterogeneous data sets, I have never used this term, even when discussing ETLing. I asked a few people, and it is unknown to many, hence this post, discussing how I see Data Wrangling and how it differs from the well-known ETL.

ETL (Extract, Transform and Load) is a common technique used in Data Integration (DI), Data Warehousing and Business Intelligence. It works more on structured data with well-known data sources and mostly with automated tools. It extracts data from various scattered systems, prepares the data to be rich and consumable, and loads it to the destination, typically a data warehouse. Data Wrangling does the same, with a few differences.

Data Wrangling works more on unorganized, unstructured, large data sets rather than on a set of structured data. It describes a manual process that converts data from one raw form to another format that is more readable and organized for analyzing the data. As per the articles I read, the term was introduced by Trifacta, which offers a tool to help with this process. Also, the person who carries out this process is called a Data Wrangler.

Wednesday, December 16, 2015

When you create a Tabular data model project, it creates a workspace database in an Analysis Services Tabular instance for maintaining all the data and metadata while the model is being developed. This database is an in-memory database and, with the default setting, it gets dropped from the instance when the project is closed. However, this default can be changed by setting the workspace database retention property, adjusting the behavior of the workspace database. That is why it is important.

By default, workspace retention is set to Unload from memory. When you create a project, a database is created on disk (in the default location of the SSAS instance) as well as an in-memory database in the SSAS Tabular instance. The name of the database takes the form ProjectName_UserName_GUID.

Here is the default setting.

Database created in the disk;

Database created in the SSAS Tabular instance;

With this setting, when the project is closed, the in-memory database in the instance is removed. But the database on disk remains, and it is used for re-creating the in-memory database when the project is opened again. The image below shows the instance once the project is closed. Note that the in-memory database has been dropped.

This setting frees up memory, allowing the system to use it for other applications, but it might slow down re-opening the project as the database has to be built again.

If needed, you can keep it in memory without it getting dropped. This is useful if you are continuously working with the project. However, be cautious: it holds the memory consumed without releasing it.

If you rarely open the project (usually after development is complete), it is better to set workspace retention to Delete workspace, as it deletes the database completely.

Local or remote workspace database?
If possible, developers should use a local instance for the workspace database because it provides the best performance. However, a remote server can also be used, with the following limitations;

Monday, December 14, 2015

When creating calculated columns or measures in Analysis Services Tabular mode, we use the DAX expression language. DAX offers a number of functions, and almost all general functions such as SUM, COUNT and AVERAGE exist too. However, some functions behave differently when compared with their behavior in other tools. Here is an example;

Assume that you need to get a count of two columns in a loaded data table. Say we have loaded the Internet Sales table from AdventureWorksDW2014 and need to create Count of Promotion Key and Count of Sales Order Number (of course, there is no business logic in these two counts; they are only for demo purposes); then we add two measures as below;

CountSalesOrderNumber:=COUNT([SalesOrderNumber])

CountPromotionKey:=COUNT([PromotionKey])

Once added, you will see that CountPromotionKey works fine but CountSalesOrderNumber does not. The reason is that the COUNT function in DAX only works with numbers and dates. For counting non-numeric values, COUNTA has to be used instead of COUNT.

CountSalesOrderNumber:=COUNTA([SalesOrderNumber])

Note that neither function counts blank values. If you want the count including blank values, use the COUNTROWS function. In addition to these, COUNTX and COUNTAX are available too; they get a count based on an expression evaluated over a table.

Sunday, December 13, 2015

Everyone knows that SELECT * is not a best practice and we always try to avoid it. Some considerable issues related to it are;

Unnecessary columns will be transferred from server to client consuming resources.

Applications may face issues when columns are dropped or introduced.

However, when the table is read by the SQL Server engine, in most cases there is no difference in terms of performance. The reason is that it has to read all data pages to get the records, regardless of the number of columns referred to in the statement. Have a look at the following query and its output.
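The original query is not shown here; a sketch of this kind of comparison, assuming an AdventureWorks table, would be:

```sql
-- Compare logical reads for all columns vs one column.
-- For a table stored entirely in data pages, both statements report the
-- same number of logical reads, because every page must be read either way.
SET STATISTICS IO ON;

SELECT * FROM Sales.SalesOrderHeader;            -- all columns
SELECT SalesOrderID FROM Sales.SalesOrderHeader; -- one column

SET STATISTICS IO OFF;
```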

As you see, both statements require the same number of data pages to be read to satisfy the requests. However, this is not always true. Have a look at this;

Now it is different. The reason is the types of data pages used by this table. The table has an xml column, which is maintained not in data pages but in text/image pages. Because of this, when reading columns excluding ones like xml, max-type columns, text and image, the engine needs to read only data pages, reducing the number of IO operations. This introduces a third reason for not using SELECT *: do not use SELECT * when the table has columns that require page types other than data pages.

Saturday, December 12, 2015

SQL Server 2014 introduced In-Memory OLTP tables, and many of you know the benefits you get with them. This feature allows you to create tables in memory, either durable or non-durable, and it is only for tables. Do we create in-memory databases? Can we create in-memory databases? Here is this week's question, based on that.

Which SQL Server component creates in-memory databases that are visible to our implementation?

SQL Server Engine

SQL Server Analysis Services

SQL Server Integration Services

SQL Server Reporting Services

This question can be answered quickly if you have worked on Business Intelligence solutions. Yes, it is related to Analysis Services, but for one specific type only. Analysis Services allows us to create two types of models: Multidimensional and Tabular. When working with a Tabular data model project, as you may have experienced, it needs a workspace database, which is created in an Analysis Services Tabular instance. This database is an in-memory database that keeps all data and metadata for the project while it is being developed. When the project is closed, it is removed from memory by default; however, this can be controlled by setting the Workspace Retention property.

Sunday, November 29, 2015

It is very common to see an Analysis Services database (or a cube) that holds multiple measure groups. If your cube holds multiple measure groups, then you have many dimensions, and they might not be linked to all measure groups. In that case, what happens if an unlinked (or unrelated) dimension is used when analyzing a measure in a particular measure group?

By default, if an unrelated dimension is used with a measure, the dimension is forced to use its top level for showing the value. For example, if Sales Amount from the Sales Summary measure group is analyzed with the Reseller dimension, which is unrelated, the result will be as below;

This is controllable. We can stop showing values when a measure is analyzed by an unrelated dimension. All we have to do is set a measure group property called IgnoreUnrelatedDimensions to false.

Once processed, you can test the same and result will be something like below;

Saturday, November 28, 2015

Have you noticed that a new measure group is automatically created when you create a new measure with Distinct Count usage? Microsoft Analysis Services creates this for one reason: improving performance for specific analyses such as how many sales reps have sold a particular product.

You can create one with Cube Designer -> Cube Structure.

And once the measure is created, a separate measure group is created;

And I will be able to see how many employees have sold a particular product in each year;

Friday, November 27, 2015

The Microsoft SQL Server team used, and uses, various code names for SQL Server versions and some of its features; here are some of them found on the web. I tried to match each with the best image I could find on the web; I am not exactly sure whether every one is the right picture, hence please let me know if I have used any wrong ones.

About Me

Dinesh Priyankara (MSc IT) is an MVP – Data Platform (Microsoft Most Valuable Professional) in Sri Lanka with 16 years’ experience in various aspects of database technologies including business intelligence. He has done many local and foreign business intelligence implementations and has worked as a subject matter expert on various database and business intelligence related projects. He is the Founder/Principal Architect of dinesQL (Pvt) Ltd and he consults, teaches and runs training programs on data related solutions and subjects.