Archives for December 2016

In an upcoming post I will explain how Isilon makes Hadoop so much easier to manage. First, I thought I'd cover the basics of Isilon in my Isilon Quick Tips series below.

Hadoop Career

Over a year ago I switched teams to join Dell EMC's Data Lake team. One of the platforms I work with is Isilon Scale-out NAS (Gartner's #1 in scale-out NAS). It's a really mind-blowing system that supports HDFS as a protocol, along with NFS, SMB, REST, Swift, HTTP, and FTP. Think of being able to move data into HDFS just by moving a file in your Windows environment. Oh, and by the way, it scales up to 90 PB of data (talk about BIG DATA).

What makes Isilon so awesome isn't just the hardware but the software that runs it. OneFS is the software that gives Isilon its power to store data at astronomical scale. One file system, or OneFS, is key to giving developers the ability to access Hadoop data through HDFS while also working with it over other protocols. Think about not having to land your data on your machine before ingesting it into HDFS. All of this is possible because OneFS treats HDFS as a protocol, not a storage system. Data can sit on Isilon but be read as HDFS.
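
To make that concrete, here is a minimal sketch of the workflow, assuming a hypothetical Isilon export mounted at /mnt/isilon and a SmartConnect zone named isilon-smartconnect (both placeholders): drop a file on the share with ordinary file I/O, then read the same file back over HDFS.

# Hedged sketch: the mount point, paths, and host name below are placeholders.
import subprocess

nfs_path = "/mnt/isilon/data/clicks.csv"                      # NFS/SMB view of the file
hdfs_uri = "hdfs://isilon-smartconnect:8020/data/clicks.csv"  # HDFS view of the same file

# 1. Land the file on the share with plain file I/O -- no HDFS client involved.
with open(nfs_path, "w") as f:
    f.write("timestamp,user,url\n2016-12-01T10:00:00,alice,/home\n")

# 2. Read the very same bytes back through the HDFS protocol.
print(subprocess.check_output(["hdfs", "dfs", "-cat", hdfs_uri]).decode())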

A huge benefit of using Isilon for HDFS storage comes when replicating data for data protection. I'll follow up with a blog post dedicated to data protection in Hadoop in the future; just know Isilon provides that missing piece in Hadoop for replication and data protection. Want to replicate or copy over 20 PB of data? No problem, just use SyncIQ in OneFS.

Share the Isilon Knowledge

Along the way on the Data Lake team I’ve acquired some knowledge about managing Isilon clusters and wanted to get it out to the community. All these demos can be done using the Isilon Simulator on your local machine. The demos are meant to be easily consumable and all should be around 5 minutes long with a few outliers that bump up to an hour.

Be sure to subscribe to my YouTube channel to ensure that you never miss an Isilon Quick Tip or other Hadoop related tutorials. As always leave a comment or drop me an email with any ideas you have about new topics or things I’ve missed in my posts.

Splunking on Hadoop with Hunk

Well, for starters, the course covers a ton about starting out in Splunk. Admins and developers will quickly set up a Splunk development environment and then fast-forward to using Splunkbase to expand use cases. However, the most popular portion of the course is the deep dive into Hunk.

Hunk is Splunk's plugin that allows data to be imported from Hadoop or exported into Hadoop. Both Splunk and Hadoop are huge in analytics (big understatement here), and with Hunk, users can visualize their Hadoop data in Splunk. One of the biggest complaints about Hadoop is the lack of good visualization tools to support this thriving community. Many admins are already using Splunk, so it's no wonder Splunk is filling that gap.

In my Analyzing Machine Data with Splunk course I dig into using Hunk in the Splunking on Hadoop with Hunk module. This module is close to 40 minutes of Hunk material, from setting up Hunk to moving stock data from HDFS into Hunk. I've worked with Pluralsight to set up a quick 8-minute preview video of the Splunking on Hadoop with Hunk module. Check it out, and be sure to watch on Pluralsight for the full Hunk deep dive.

Splunk on Hadoop with Hunk (Preview)


Recently I released my third Pluralsight course, "Analyzing Machine Data with Splunk", which covers the basics of setting up Splunk in your environment. One of the topics covered is the Search Processing Language (SPL).

Unstructured data is doubling every two years, which makes it a great time to be involved with data. One of the big challenges of working with data is getting applications talking to the database or data source. Most of the time this is accomplished by using a query language to send requests to the data source. Splunk's query language is the Search Processing Language. SPL is best thought of as a query language for Splunk, similar to the way SQL allows us to query, update, and transform data in our databases.

What is SPL

When I first started looking at SPL, I assumed it was just using Lucene. Boy, was I wrong! The Search Processing Language is Splunk's proprietary search language. My previous experience with Elasticsearch and the ELK stack, which is a competitor to Splunk in some respects, made me assume the Lucene query language would be used in Splunk. But that is not the case. Splunk is primarily for parsing log files, whereas Elasticsearch has a wider use case in documents and other unstructured data scenarios, like Stack Overflow. Let's take a look at some SPL commands.

Here is the example we will use to test out SPL commands in Splunk. Let's assume I have Apache log files in my Splunk environment. I want to query these log files to find out what IPs are hitting my site, look for errors, or track peaks in traffic.

Top SPL Commands in Splunk

Basic Search

source="apachelogs:*" host="Prod-Server"

Returns all log events from the apachelogs source on the Prod-Server host.

Chaining Queries

For our Splunk queries I will use the | pipe to chain queries together. Think of chaining as a way to filter down results in a procedural way: take all the results to the left of the pipe, find the ones that meet another criterion, and then repeat the process with another command.

source="apachelogs:*" host="Prod-Server" |   <-- chaining with the pipe

Filter

source="apachelogs:*" host="Prod-Server" | search IP=127.0.0.1

Returns all events for IP address 127.0.0.1.

Dedup

source="apachelogs:*" host="Prod-Server" | dedup IP

Returns only one instance per IP address.

Search

source="apachelogs:*" host="Prod-Server" | search IP=127.0.0.1

Returns all events for IP address 127.0.0.1.

Head

source="apachelogs:*" host="Prod-Server" | head 100

Returns the first 100 results for the query.

Tail

source="apachelogs:*" host="Prod-Server" | tail 100

Returns the last 100 results for the query.

Reverse

source="apachelogs:*" host="Prod-Server" | reverse

Returns all results in reverse timestamp order.

Sort

source="apachelogs:*" host="Prod-Server" | sort IP

Returns results sorted by IP in ascending order (use sort -IP for descending).
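
Putting a few of these together, a chained query against the same hypothetical apachelogs source might look like this:

source="apachelogs:*" host="Prod-Server" | dedup IP | sort IP | head 100

Returns the first 100 unique IP addresses, sorted in ascending order.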

Review

Splunk is a powerful tool for data analysis, particularly for machine data. SPL helps developers and admins write queries to analyze their data. Just as SQL can be used to develop applications that query data in a database, SPL can be used to query data in Splunk for application development. Check out my Analyzing Machine Data with Splunk course to learn more about SPL and Splunk application development.
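
Since application development is part of the story, here is a small hedged sketch of running one of these SPL searches from Python using Splunk's SDK (splunk-sdk). The host, port, credentials, and the apachelogs source are placeholders for your own environment.

import splunklib.client as client
import splunklib.results as results

# Placeholder connection details -- swap in your own Splunk instance.
service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

# The same style of chained SPL query used in the examples above.
query = 'search source="apachelogs:*" host="Prod-Server" | dedup IP | head 10'
stream = service.jobs.oneshot(query)

# Print the IP field from each result, skipping any server messages.
for item in results.ResultsReader(stream):
    if isinstance(item, dict):
        print(item.get("IP"))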

OneFS offers many options to customize replication policies using SyncIQ. In this episode of Isilon Quick Tips, we walk through those options in a deep dive into SyncIQ.

Transcript

Hello, and welcome back to another episode of Isilon Quick Tips. Today we're going to talk about SyncIQ, and we're going to go a little bit deeper than we have in the past. Before, it was all about setting up a one-time replication; now let's talk about some of the options and how we can really customize our SyncIQ jobs.

What we're going to do is swing over to Policies and look at a policy I've already got created. I'm just going to edit that policy. That policy is my home shares, so this is all my corporate home directories here in Huntsville, and it's something that I'm replicating to my secondary cluster.

The first thing I want to talk about is the difference between copy and synchronization. Copy is when you're specifically moving data from one directory to another directory and you don't care whether data has been deleted or merged. Synchronizing is different, because synchronizing is actually going to keep your primary cluster and your secondary cluster in sync. If a file has been moved to another directory, it will be replicated that way on your secondary cluster, and so will deletes: if a file has been deleted on your primary, once the sync job completes it will be deleted on your secondary cluster as well. Where you would use copy is when you want a backup of your data but don't want to sync it, or maybe there are certain directories you want to pull out and have copied. In most cases, though, you're going to use synchronization.

Next is the Run Job option, a drop-down that controls when this SyncIQ policy is going to kick off, and you have a couple of different options. The one we used before was Manual, which just says: I'm going to manually push the button, and every time I do, it's going to start that SyncIQ policy. You also have the option to run it on a schedule, which is the most common one used. This says two times a day, three times a day, however you want to set it up, you set a schedule so you know when the data is going to be replicated; say at six in the morning and six in the evening, or on a weekly, monthly, or yearly basis. Another common way to run these is whenever the source is modified. Think about moving a file, deleting a file, anything like that; any change to that directory modifies the source. You do have to set a time frame around that: if you modify a file, how fast do you want the sync to happen? You can set a delay of seconds, minutes, hours, or days, so you can say, every time something is modified, wait a few minutes and then go ahead and replicate it over. You can also have it set up so that whenever a snapshot of the source directory is taken, that runs the SyncIQ policy job.

Setting the source directory is very simple: what directory do you want to move? In this case /ifs is my source directory, so I'm moving all my data. The cool thing, and where you can really customize this job, is that you're not just setting the source directory; you can also include or exclude directories. You can come in and say, of all the directories under data, I'm only going to move over my Isilon support directory, or only my isi gather directory. Or I can exclude directories and say, move everything that's in data except these two directories here, the Isilon support and isi gather directories; those are administrative things that change a lot, there isn't any data in there that's not recoverable, so don't replicate them. That gives you a lot more control: you can set something at a high level in the tree and only replicate the items in that tree that you want to, without having to set up 15 different policies because you've got different datasets. You can come back and set one or two policies to replicate the data.

Then let's talk about some of the advanced settings. You can actually set a priority on the policy: you can leave it as a normal, default policy, or you can say this specific policy is always going to be high priority and make sure that priority is applied to this job. You can also set a limit on how long you keep the reports from these jobs, because depending on how often the jobs run, you start to accumulate a lot of reports, so you have that option there.

Let's cancel out of this, and I'll show you one more thing. We were talking about setting up jobs that run whenever the source is modified; depending on how often those files change, that's how much bandwidth is going to go over your network. So if you have performance concerns about how often or how much data is being pushed across, you can set a performance rule on these jobs. One of the cool things I really like is that you can set it on a schedule. You can say, I really want to replicate this any time my data is modified, but there are certain business hours, or certain days of the week, where I want to throttle back; the rest of the week, go ahead and leave it at full throttle. So you can set a schedule, and you can even set bandwidth rules around the limits, so you can throttle back and say, I always want to be replicating that data, but let's set a performance rule for how much bandwidth it can take up.

That's a deeper dive on SyncIQ. You can really see how you can customize and design those sync policies to fit whatever rules you want for replicating your data between your Isilon clusters. Thanks for taking the time to watch, and I hope you'll join me for another episode of Isilon Quick Tips.
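
If you would rather script that setup than click through the WebUI, here is a heavily hedged sketch of creating a SyncIQ policy through the OneFS Platform API with Python's requests library. The API version in the URL, the field names, and the schedule string are my assumptions based on the matching CLI and WebUI options, so check the API reference for your OneFS release; the hosts, paths, and credentials are placeholders.

import requests

# Placeholders: cluster management address and credentials.
cluster = "https://isilon-mgmt.example.com:8080"
auth = ("admin", "password")

# Assumed field names, mirroring the SyncIQ options shown in the video.
policy = {
    "name": "home-shares-dr",
    "action": "sync",                      # use "copy" for a copy policy
    "source_root_path": "/ifs/data/home",
    "target_host": "dr-cluster.example.com",
    "target_path": "/ifs/data/home-dr",
    "schedule": "every day at 06:00",      # assumed schedule syntax
}

resp = requests.post(cluster + "/platform/1/sync/policies",
                     json=policy, auth=auth, verify=False)
resp.raise_for_status()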

My Ultimate Agile Podcast blog post was such a hit I thought it only appropriate to do one for Big Data. Who doesn't need to geek out on data while in the car, on a plane, on a train, or on the treadmill? Listening to podcasts is one of the easiest ways to keep up or skill up. However, finding a curated list of podcasts on just Big Data is not easy.

The list is intended to be a resource for the Big Data/Hadoop/Data Analytics community, so I will continue to update it with new Big Data podcasts and episodes.

If you host one of the big data podcasts below, or a new podcast, and would like to interview me on your show, reach out via Twitter or the comments.

Let me know if you notice a missing podcast or broken links. Just add a comment or contact me and I will make the changes.

Since I created this list, I'm putting the episodes of the podcasts I was on first.

Big Data Podcast List by Category

Hadoop/Spark/MapReduce

Big Data Beard Podcast – Newly released podcast exploring the trends, technology, and talented people making Big Data a big deal. Hosts are Brett Roberts, Cory Minton, Kyle Prins, Robert Hout, Keith Quebodeaux, and myself. Join us as we talk about our Big Data journeys with others in the community.

All Things Hadoop – Favorite episode: Hadoop and Pig with Alan Gates of Yahoo. The title alone gives you an indication of how old it is, but it's still an awesome listen.

Puppet Podcast: Provisioning Hadoop Clusters with Puppet – Learn how to automate your CDH environment with Puppet. Mike Arnold, the creator of the Puppet module, talks about deploying CDH at large scale with Puppet. If you are virtualizing Hadoop (and you should be), then you'll want to take note of how this episode can speed up your deployment process. My prediction is that in the next year we will see more automation tools in the Hadoop ecosystem.

Roaring Elephant Podcast – Awesome insight from two guys working in the field in Europe. They talk through hot topics in the Hadoop ecosystem and share real-world stories from the customers they speak with. A great podcast if you are just starting out on your Hadoop journey.

Business of Big Data

The Hot Aisle with Bill Schmarzo – One of my favorite podcast episodes (full disclosure: I work with both the hosts of The Hot Aisle and Bill Schmarzo) on the topic of the business of big data. Bill's insight into what Big Data can mean for a business is something a lot of us as developers/admins lack when talking outside the walls of IT. One of the biggest reasons Hadoop projects fail is that they aren't tied to a business objective. In this episode, learn how to tie your Hadoop project to a business objective to generate more revenue for the company, which brings in more money to expand your Hadoop cluster (win-win-win).

Cloud of Data – Wow, talk about an all-star cast of interviews; it reads like a who's who of data CEOs. The first episode was with InfoChimps' CEO; I was actually working at CSC during the InfoChimps acquisition. Those were some really bright data scientists.

Data Analytics/ Machine Learning

Data Skeptic – Usually short-format episodes on specific topics in data analytics; the podcast is great. It's about data analytics, not just big data, though the two are often confused as the same thing. My favorite episodes are the algorithm explanations, because as someone who mostly stays on the software side, keeping up with how these algorithms are used helps when working with the data science team.

Partially Derivative – Another great podcast on data analytics. My favorite episode was done live from Stitch Fix, my wife's favorite product and mine too, but for a different reason. Stitch Fix is a monthly subscription company that matches a customer with their own personal stylist, but behind the dressing room curtain Stitch Fix is really a data company. Listen in to hear about all the experimentation that takes place on a daily basis at Stitch Fix, and how they are using machine learning to pick out clothes you'd like.

Data Crunch – A podcast devoted to highlighting how data analytics is changing the world. Released one to two times a month, with episodes coming in under 30 minutes.

Internet of Things (IoT)

Inquiring Minds: Understanding Heart Disease with Big Data – Not a podcast dedicated to IT or Big Data, but in this episode Greg Marcus talks about analyzing the heart with IoT. Think that smartwatch is just for tracking steps and sending text messages? That smartwatch could help advance the science behind heart disease by giving your doctor access to its data. A really great episode on how IoT is enabling lower-cost research in healthcare and providing more data than traditional studies.

Oh, and if you are looking for quick tips on Hadoop/Data Analytics/Big Data, subscribe to my YouTube channel, which is all about getting started in Big Data. Make sure to bookmark this page to check for frequent updates to the list. As Big Data gets more popular, this list is sure to grow.

Finding data for testing in your own Hadoop projects doesn't have to be hard!

There are many places to find free data sets for use in your development environments. Check out this video for my top 4 places to find Big Data. Spoiler alert: you can also find small data in these places.

YouTube Video

—

Transcript

Hi, and welcome back to thomashenson.com. Have you ever been working in your big data environment and thought: it would be great if I could have more data, even synthetic data, to test out this new open source tool, or maybe just a new function that you want to run? Today I'm going to talk about my four favorite places to find big data.

Number four on the list is Yahoo, specifically the Yahoo Finance section. You can go in here, look up your favorite stock or even your favorite mutual fund, and find historic information. What I like to do is come in and get the historic information that gives you daily values on the stock. You can take that data and insert it into HDFS or a database, however you want; there are a lot of different options, and this data actually exports to CSV. It's really accurate data, but it is a limited data set because you're only looking at stock values. Still, if you need a quick fix to get some data, this is where I come first.

Coming in at number three is weather data from NOAA. This data is very accurate, but one of the drawbacks to getting it, and the reason it's only number three on the list, is that you have to open an account and submit a request: for this geographic area, I would like to compile the weather data. If you're looking for accurate data this is a very good site, but if you're looking for something quick it's not what you want. Typically you'll receive the data in less than 24 hours, but just know it could take a lot longer, and that's why weather data is number three on the list.

Coming in at number two, and a really close favorite to number one, is Tableau's Public website and their sample data sets. This is relatively new to me, but they have a lot of different data sets across a lot of different categories, like government, lifestyle, health, and one of my favorites, sports. The formats come back as Excel or CSV, so it's really easy, and another cool thing is you don't have to log in. You can just come in, download these data sets, upload them into HDFS, and start playing away. That's why Tableau's public data sets are number two on my list.

And now for number one on my list of favorite places to find data: Kaggle's website. Kaggle started off on the scene as just a contest site for data scientists, or amateur data scientists, to test out and solve problems. One of the famous examples was Netflix: there was a contest to see if you could beat Netflix's data scientists at recommending better videos for people. It's really cool; I think they gave out something like a million dollars for that contest. But now the website is more than just a contest site. It actually has data sets, and it's really a one-stop shop for data scientists, so it's one of those websites you want to come in and check. For me, I really like the data sets. You do have to log in to access the data, but you have vast amounts of data sets, and if I were stuck on an island and could only have one of these, it would be the Kaggle website, because they're always updating a lot of different data sets; they have something small and something large. You can go through and search and see the latest data that's been updated, you can search by different features, and like I said, it's community driven, so there are always new data sets available. That's why it's number one on my list.

So just to recap my four favorite places: number four was Yahoo's finance section, number three was the weather data from NOAA, number two and a close favorite was Tableau's Public website with their sample data sets, and number one, the best place, was Kaggle's data sets. Thanks for tuning in, and be sure to subscribe.
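
If you want to script the Yahoo Finance part of that workflow, here is a rough sketch: download a CSV export of historic stock data and drop it into HDFS. The export URL is a placeholder (use whatever download link the data source gives you), and the HDFS directory is just an example.

import subprocess
import requests

csv_url = "https://example.com/exports/MSFT-daily.csv"   # placeholder export link
local_path = "/tmp/MSFT-daily.csv"
hdfs_dir = "/data/stocks"

# 1. Pull down the CSV export.
with open(local_path, "wb") as f:
    f.write(requests.get(csv_url, timeout=30).content)

# 2. Make sure the target directory exists, then copy the file into HDFS.
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir])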