Archive for: December 27th, 2016

Our group is distributing a detailed write up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

The book I’ve been working on these pasts months (you can read about it here, and read it for free here) is now available on Leanpub! You can grab a copy and read it on your ebook reader or on your computer, and what’s even better is that it is available for free (but you can also decide to buy it if you really like it). Here is the link on Leanpub.

In the book, I show you the basics of functional programming, unit testing and package development for the R programming language. The end goal is to make your data tidy in a reproducible way!

Unfortunately this doesn’t make any objects in the cube that are not visible, like measures or dimensions, visible again – it just makes the cube itself visible. However, if you’re working on the Calculations tab of the Cube Editor in SSDT it is possible to make all hidden objects visible as I show here.

Important Notes:

Start-dfs.sh will start NameNode, SecondaryNamenode, DataNode on master and DataNode on all slaves node.

Start-yarn.sh will start NodeManager, ResourceManager on the master node and NodeManager on slaves.

Perform Hadoop namenode -format only once otherwise you will get an incompatible cluster_id exception. To resolve this error clear temporary data location for datanode i.e, remove the files present in $HADOOP_HOME/dfs/name/data folder.

If you’d like to set up your own Hadoop cluster rather than using one of the big vendors (Hortonworks, Cloudera, MapR) or a PaaS solution like HDInsight or ElasticMapReduce, this will give you a head start.

Aggregate Pushdown: This performance feature often gives a 2X-4X query performance gain by pushing qualifying aggregates to the SCAN node, which reduces the number of rows coming out of that iterator.

Index Build/Rebuild: Enterprise Edition can build/rebuild columnstore indexes with multiple processor cores, while Standard Edition only uses one processor core. This has a pretty significant effect on elapsed times for these operations, depending on your hardware.

Local Aggregates: Enterprise Edition can use local aggregations to filter the number of rows passing out of a SCAN node, reducing the amount of work that needs to be done by subsequent query nodes. You can confirm this by looking for the “ActualLocallyAggregatedRows” attribute in the XML of the execution plan for the query.

Glenn’s focus is around columnstore indexes and DBCC CHECKDB, but there are additional benefits as well, with the separator being improved performance rather than different feature surface areas.

Regardless of using unique constraint or unique index, the field can accept null values, however the uniqueness will result in only accepting a single row with null value.

The solution to allow nulls in unique fields is create a unique filtered index excluding the nulls of the index, due to that the uniqueness of the nulls will not be validated and multiple rows with nulls will be accepted.

Click through for the code. I enjoy asking this as an interview question. It’s a non-trivial problem with a non-trivial solution and isn’t a trick question.

SQL Server was struggling to compile the procedure in time and the application wouldn’t let it catch its breath. The query optimizer was attempting to create statistics automatically that it needed for optimizing the query, but after thirty seconds, the application got impatient and cancelled the query.

So the compilation of the procedure was cancelled and this caused two things to happen. First, the creation of the statistics was cancelled. Second, the next session in line was allowed to run. But the problem was that the next session had already spent 28 seconds blocked by the first session and only had two seconds to try to compile a query before getting cancelled itself.

The frequent calls to the procedure meant that nobody had time to compile this query. And we were stuck in an endless cycle of sessions that wanted to compile a procedure, but could never get enough time to do it.

There are two important lessons here: how Michael solved the problem and also a reminder that plan cache entries are dependent upon specific application settings.

If you click “view differences”, be aware that ReadyRoll opens a tab in Visual Studio but doesn’t switch focus to it automatically. Clicking the “Import and Generate Script” button will apply the changes to our ReadyRoll project.

Because let’s face it whole books are written on the subject and yet it’s one of the very first things a DBA should learn. Because it is one of those subjects everyone has to learn one way or another I had a large number of responses (which explains my delay in getting this rollup out, sorry about that). However, the large number of responses makes this list an excellent course on backup and recovery. It’s by no means comprehensive but if you read each of these posts you will have a great start into what’s necessary and what’s possible.

Here is the error message that is quite possible to encounter while creating principals.

Msg 15006, Level 16, State 1, Line 6‘SomeDOmain\jason’ is not a valid name because it contains invalid characters.

At first look, this error makes absolutely no sense. The error states there is an invalid character somewhere in the string “SomeDomain\jason”, yet every character in that string is supported and normal for the collation. This can be a head-scratcher for sure.