Archive for: May 23rd, 2016

An EMR 4.6 cluster running Spark 1.6.1 will still use Python 2.7 as the default interpreter. If you want to change this, you will need to set the environment variable: PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying the configuration shown in the snippet below:

I’m more of a SQL and Scala guy, but if you like Python and are on the Python 3 side of the divide, here’s a solution for you.

You see the CHECKSUM on the backup along with the RESTORE VERIFYONLY. The code was generated by right clicking on the database, selecting Tasks, then Backup, plug in the parameters, and select Script. I put it in a new query window as I may back up several databases in the same job. Sometimes I’ll just do a find/replace for the other databases since my backup. The Restore Verifyonly gives you some confidence that your backup is recoverable: NEVER assume that just because your backup ran that the database is restorable! The ONLY way to know is to actually restore it to another file! You don’t want to accidentally clobber your production that probably has newer data in it.

Corruption is a serious event when your entire job revolves around protecting data. Be prepared.

But sometimes you want to run a series of statements or procedures where you only want the execution plan for some of the statements. Here’s how:

The actual execution plan is enabled by turning on SET STATISTICS XML., not unlike enabling STATISTICS IO or TIME. And just like SET NOCOUNT, the SET statements apply to the current context, which could be a stored procedure, a session, etc. When this context ends, the setting reverts to that of the parent context.

I see code snippets with STATISTICS IO and TIME fairly regularly, but almost never see STATISTICS XML; instead, I see people (including myself) hit Ctrl-M or select the “Include Actual Execution Plan” button when generating execution plans is desirable.

This sounds paranoid, but as an example, here’s a corruption case I had recently: shortly after detecting corruption, the team realized they’d have to revert to a backup of the database from a few days ago. Rather than telling users about that possibility, they let the users keep adding data into the already-corrupt database while the DBAs did troubleshooting. Several days later, as the corruption got worse, even Microsoft couldn’t repair the corruption – and the affected tables went completely offline, permanently. If the users would have been alerted earlier, they could have avoided even more data loss.

That’s right; SQL Server will just pick any child. It will not update the parent row for each child. When the MERGE statement runs into this problem, it raises an error and rolls back the UPDATE but our query will silently pick any value and move on. We don’t want to update data like this, we want our query to dictate the logic explicitly. This will require some changes to our query.

I prefer to write these types of updates with the use of a CTE. This allows us to easily highlight the SELECT in the CTE and execute it to see which rows will be updated (@variables can cause problems here though). Adding a COUNT can help to identify the problem in the previous query.

Highlight and execute the code inside the CTE below to check for duplicates. If this returns 0 rows then you are good to remove the COUNT, GROUP BY and HAVING then add the name columns before executing the whole statement.

When writing T-SQL updates with joins, it’s important to consider whether the grain changes, and if that change can make a difference in your update set.

The core RC4 algorithm is well-known, and relatively simple. It would be better implemented in a .Net language for efficiency and performance reasons, but there is a T-SQL implementation below.

These two T-SQL functions implement the RC4 key-scheduling algorithm and pseudorandom number generator, and were originally written by SQL Server MVP Peter Larsson. I have a made some minor modifications to improve performance a little, and allow LOB-length binaries to be encoded and decoded. This part of the process could be replaced by any standard RC4 implementation.

Using WITH ENCRYPTION is a gentleman’s agreement that you won’t look at the underlying code. In practice, it’s trivial to get around, and Paul shows exactly why.

I was working with a computed column the other day, and realized I had the wrong definition. In this case, I was performing some large calculation, and the result was larger than an int. However the first part of the formula was an int, which resulted in an implicit conversion to an int.

I needed to change the formula, and then realized that plenty of people might not work with computed columns much, and not realize how you alter a computed column.

You don’t.

In fact, you need to drop the column and add it back. In my case, this was what I did.

Fortunately, this tends to be a pretty quick operation, especially if the computed column is non-persisted.

In the M language a let expression consists of two sections. After the let comes a list of variables, each of which has a name and an expression associated with it. In the previous example there are three variables: step1, step2 and step3. Variables can refer to other variables; here, step3 refers to both step1 and step2. Variables can be used to store values of any type: numbers, text, dates, or even more complex types like records, lists or tables; here, all three variables return numbers. The Query Editor is usually clever enough to display these variables as steps in your query and so displays then in the Applied Steps pane on the right-hand side of the screen