Archive for: June 29th, 2018

The segmentation of an image into superpixels are an important step in generating explanations for image models. It is both important that the segmentation is correct and follows meaningful patterns in the picture, but also that the size/number of superpixels are appropriate. If the important features in the image are chopped into too many segments the permutations will probably damage the picture beyond recognition in almost all cases leading to a poor or failing explanation model. As the size of the object of interest is varying it is impossible to set up hard rules for the number of superpixels to segment into – the larger the object is relative to the size of the image, the fewer superpixels should be generated. Using plot_superpixels it is possible to evaluate the superpixel parameters before starting the time-consuming explanation function.

Fun stuff. I’m glad that there’s a lot of work going into explaining neural networks rather than hand-waving them off as magic.

If your interests lean more towards traditional statistical analysis and inference as used within industries like manufacturing, finance, and the life sciences, I’d lean towards R. If you’re more interested in machine learning and artificial intelligence applications, I’d lean towards Python. But even that’s not a hard-and-fast rule: R has excellent support for machine learning and deep learning frameworks, and Python is often used for traditional data science applications.

One thing I am quite sure of though: neither Python nor R is inherently better than the other, and arguments on that front are ultimately futile. (Trust me, I’ve been there.) Which is better for any given person depends on a wide variety of factors, and for some, it may even be worthwhile to learn both. Brian Ray recently posted a good overview of the factors that may lead you towards R or Python for data science: their history, the community, performance, third-party support, use cases, and even how to use them together. It’s great food for thought if you’re trying to decide which community to invest in.

Embrace the power of “and.” The whole R versus Python bit is fun for purposes of arguing with people, but they’re both powerful languages and we’re seeing more and more overlap—for example, the Keras package David mentions runs Python’s TensorFlow under the covers.

Our Hadoop HDP IaaS cluster on Azure uses Azure Data Lake Store (ADLS) for data repository and accesses it through an applicational user created on Azure Active Directory (AAD). Check this tutorial if you want to connect your own Hadoop to ADLS.

Our ADLS is getting bigger and we’re working on a backup strategy for it. ADLS provides locally-redundant storage (LRS), however, this does not prevent our application from corrupting data or accidentally deleting it. Since Microsoft hasn’t published a new version of ADLS with a clone feature we had to find a way to backup all the data stored in our data lake.

We’re going to show you How to do a full ADLS backup with Azure Data Factory (ADF). ADF does not preserve permissions. However, our Hadoop client can only access the AzureDataLakeStoreFilesystem (adl) through hive with a “hive” user and we can generate these permissions before the backup.

Read the whole thing if you’re thinking of using Azure Data Lake Store.

Years ago someone said “Hey – why not drop auto-created stats, since the stats you need will just get created again and you’ll end up getting rid of those you no longer need.” That *may* be a reasonable step on some systems. If the risk of bad plans on first execution of a query needed stats that have been dropped is too high, its a bad deal. If the potential concurrent cost of auto-creating dropped stats is too high, that’s a bad deal. What about analyzing query plans over some period of time to see which stats are actually used in those plans? Then auto-stats which aren’t used in that set of plans could be dropped.

That type of stats analysis could have other uses, too. Prioritizing stats manual stats updates in regular maintenance comes to mind. Or, determining what stats to create/update on an Always On Availability Group primary based on secondary activity. And troubleshooting problem queries or identifying suspicious “watchlist” stats based on highly variable queries/plans they are involved with.

So I created this blog post almost 4 years ago. And now I’ll plead with you to not use the query there… it’s awful. If you want to query trace flag 8666 style stats from plan XML, please start from the query in this post instead – its much more well behaved 🙂

The massive scale and capabilities of Azure Data Lake Store are regularly used by companies for big data storage. As the number of files, file types, and folders grow, things get harder to manage and staying compliant becomes a greater challenge for companies. Regulations such as GDPR (General Data Protection Regulation) have heightened requirements for control and supervision of files that contain sensitive data.

In this blog post, I’ll show you how to set up alerts in your Azure Data Lake Store to make managing your data easier. We will create a log analytics query and an alert that monitors a specific path and file type and sends a notification whenever the path or file is created, accessed, modified, or deleted.

Auditing access has historically been tricky, so it’s nice that they were able to get that in.

Now we all know that having SQL Server Agent jobs owned by ‘Real’ users isn’t a good idea. But I don’t keep that close an eye on some of our test instances, so wasn’t surprised when I spotted this showing up in the monitoring:

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.

If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.

The original question was because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Great information. And a good reminder that if you are using database snapshots in SQL Server, you generally don’t want to have more than one on the same database.

It’s certainly more difficult to come up with a demo that works without string columns, but consider how the page compression algorithm works. Data can be compressed on page basis, which includes both multiple rows and multiple columns. That means that page compression can achieve a higher compression ratio when a row has identical values in different columns. Columnstore is only able to compress on an individual column basis and you won’t directly see better compression with repeated values in different columns for a single row (as far as I know).

Interestingly, Joe also comes up with a scenario where row-level compression can beat columnstore even without string values. All this said, the normal case when dealing with non-string data is that columnstore tends to compress a lot better.

The only way to overcome the performance impact of sorting is to create an index specifically for the OVER clause. In his book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions, Itzik Ben-Gan recommends the POC index. POC stands for (P)ARTITION BY, (O)RDER BY, and (c)overing. He recommends adding any columns used for filtering before the PARTITION BY and ORDER BY columns in the key. Then add any additional columns needed to create a covering index as included columns. Just like anything else, you will need to test to see how such an index impacts your query and overall workload. Of course, you cannot add an index for every query that you write, but if the performance of a particular query that uses a window function is important, you can try out this advice.