Archives for January 2018

To say Streaming Analytics is popular is an understatement. Right now Streaming Engineering is a top skill Data Engineers must understand. There are a lot of options and development stacks when it comes to analyzing data in a streaming architecture. Today I sat down with Lewis Kaneshiro (CEO & Co-founder) and Karthik Ramasamy (Co-founder) of Streamlio to get their thoughts on Streaming Analytics and Data Engineering careers.

Streamlio Opensource Stack

Streamlio is a full-stack streaming solution that handles the messaging, processing, and stream storage in real-time applications. The Streamlio development stack is built primarily from Heron, Pulsar, and BookKeeper. Let’s discuss each of these opensource projects.

Heron

Heron is a real-time processing engine developed and used at Twitter. Currently Heron is going through the transition of moving into the Apache Software Foundation (learn more about this in the interview). Heron sits at the heart of real-time analytics, processing data before its time value expires.

Pulsar

Pulsar is an Apache-incubated project for distributed publish-subscribe messaging in real-time architectures. The origin of Pulsar is similar to that of many opensource big data projects in that it was first used at Yahoo.

BookKeeper

BookKeeper is the scalable, fault-tolerant, and low-latency storage service used in many development stacks. BookKeeper is under the Apache Software Foundation and popular in many opensource streaming architectures.

Interview Questions

Have we as a community accepted Hadoop related tools to be virtualized or containerized?

How do Data Engineers get started with Streamlio?

What are the biggest real-time Analytics use cases?

Is the Internet Of Things (IoT) the primary driver behind the explosion in Streaming Analytics?

What skills should new Data Engineers focus on to be amazing Data Engineers?

Why Use Cygwin?

I have been a long-time PuTTY user for logging in from my Windows machine to GNU/Linux environments. It works great, but sometimes I just want the feel of a native command line.

Enter Cygwin….

Cygwin is an open source tool that provides a POSIX-like environment natively on Windows. Cygwin runs on most Windows machines and ships a large distribution of GNU tools. Just like PuTTY, Cygwin allows for customization of the tool through themes. There are a few tricks to customizing the theme versus PuTTY that give developers/administrators more options. Let’s step through developing and changing the default theme in Cygwin.

Changing Themes in Cygwin

Step 1 Install Cygwin

Step 2 Install Mintty

Step 3 Open Cygwin Options

After installing, go to Cygwin Options – Looks. Go to the 4bit Theme Generator, or select ‘Color Scheme Designer’ to be directed to the 4bit Theme Generator. The theme generator allows you to use prebuilt schemes or create your own custom shell. For my environment I created a custom theme with a slightly dark purple background.

Here’s my custom Cygwin Theme

BackgroundColour=15,0,51
ForegroundColour=217,230,242
CursorColour=217,230,242
Black=0,0,0
BoldBlack=38,38,38
Red=203,103,123
BoldRed=229,179,189
Green=123,203,103
BoldGreen=189,229,179
Yellow=203,183,103
BoldYellow=229,219,179
Blue=103,123,203
BoldBlue=179,189,229
Magenta=183,103,203
BoldMagenta=219,179,229
Cyan=103,203,183
BoldCyan=179,229,219
White=217,217,217
BoldWhite=255,255,255

Step 4 Download Or Copy .minttyrc

Once you have selected or developed your custom Cygwin theme, select the ‘Get Scheme’ button in the top right. The file type you need to export is mintty (.minttyrc). If you have trouble getting the file to export, just right-click to open it in a new tab and copy the configuration.

Step 5 Overwrite Or Edit the .minttyrc file

This is an extremely important step and one that I always forget (which is the reason I created this post): where to put the theme file. If you download the .minttyrc file, move it to the Cygwin64/home/[username] directory and overwrite the file there. Alternatively, you can just edit the existing .minttyrc configuration file.
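From a Cygwin prompt the move is a one-liner. A minimal sketch, assuming the exported theme landed in your Windows Downloads folder (the paths are illustrative, so adjust them for your machine):

```shell
# Hypothetical paths: adjust for your Windows user name and download location.
# Inside Cygwin, ~ maps to Cygwin64/home/[username].
src="/cygdrive/c/Users/$USERNAME/Downloads/.minttyrc"
if [ -f "$src" ]; then
    cp "$src" ~/.minttyrc    # overwrite the existing theme file
else
    echo "theme file not found -- adjust src for your machine"
fi
```

Restart mintty after the copy so the new colors take effect.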

Awesome Themes

The ability to customize themes for your console is one of my favorite things versus staring at a generic white or black terminal. Use the 4bit Theme Generator to build your own custom theme or search around to find one that suits your style. Post screen shots of your terminal in the comments below. I’d love to see what other developers/administrators are using.

Want More Data Engineering Tips?

Sign up for my newsletter to make sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer questions from the community about Data Engineering.

The Hadoop Ecosystem is booming and so is the demand for Hadoop Developers/Administrators.

How do you choose between a Developer or Administrator path?

Is there more demand for Hadoop Developers or Administrators?

Finding the right career path is hard and creates a lot of anxiety about how to specialize in your field. To land your first job or move up in your current role, specializing will help. In this video I help Data Engineers choose a path between Hadoop Developer and Administrator. Watch the video to get a breakdown of the Hadoop Developer and Administrator roles.

Video – Better Career: Hadoop Developer or Administrator

Transcript

Maybe it’s my background as a Web Developer, but whenever I learn a new development stack, the first thing I want to build is a CRUD (Create, Read, Update, Delete) application. Step one in creating a CRUD application for a database is creating a table. In this post let’s see how quickly we can create a DynamoDB table from both the AWS Console and the CLI.

DynamoDB is a NoSQL database built by Amazon for both the AWS cloud and off-premise use. The non-relational database supports both Key-Value and Document models. In this walkthrough we will focus on the key-value model. If you are interested in learning more DynamoDB commands, check out 11 DynamoDB CLI Commands.

Table Example

In this tutorial let’s create a table for a global college team registry. We want to track name, shortname, mascot, colors, and location. For example, the University of Southern California would contain the following:

Name – University of Southern California

ShortName – USC

Mascot – Trojan

Colors – Cardinal, Gold

Location – Los Angeles, CA

Before loading our data into the table, we are going to create the table and assign data types to the items.

DynamoDB Create Table From Console

Let’s walk through the steps to creating a table in DynamoDB to track College Teams from the AWS Console. Make sure to log in and navigate to the DynamoDB service.

Create Table – Our table name is ‘collegeteam‘. The table name must be unique per AWS region.

Partition Key (Primary Key) – The Primary Key is a combination of the Partition Key (which we are entering here) and the Sort Key (if a sort key is used). Since we are using both a Partition Key and a Sort Key, DynamoDB will hash the keys across the AWS region. For our college teams table, our Partition Key is ‘name’ (string).

Add Sort Key – The sort key is an additional key to enrich queries associated with the Primary Key. All queries will be associated with the Primary Key. For the college teams table we have ‘name’ as our primary key; let’s add ‘location’ as our sort key.

Secondary Indexes – Remember the sort key we tied to our Primary Key? We can use Secondary Indexes when we want to set up indexes independent of the Primary Key. For the college teams table, let’s set a secondary index with ‘mascot’ (string) as the Partition Key and ‘colors’ (string) as the Sort Key. The mascot secondary index will have separate read/write capacity units from the primary index.

Creating DynamoDB Table From CLI

In the first example we used the AWS Console to create our college team table; now let’s use the same example to create a DynamoDB table from the AWS CLI. Before creating the table, make sure the AWS CLI is configured.

DynamoDB Create Table – All of the different options for the college team table are passed into the ‘aws dynamodb create-table’ command.
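As a sketch, the console walkthrough above can be expressed as a create-table JSON definition and passed in with the AWS CLI’s --cli-input-json flag. The filename, index name, and capacity units here are my own illustration, not prescribed values:

```shell
# Hypothetical sketch: the console walkthrough as a JSON table definition.
# Filename, index name, and capacity units are illustrative.
cat > collegeteam.json <<'EOF'
{
  "TableName": "collegeteam",
  "AttributeDefinitions": [
    {"AttributeName": "name", "AttributeType": "S"},
    {"AttributeName": "location", "AttributeType": "S"},
    {"AttributeName": "mascot", "AttributeType": "S"},
    {"AttributeName": "colors", "AttributeType": "S"}
  ],
  "KeySchema": [
    {"AttributeName": "name", "KeyType": "HASH"},
    {"AttributeName": "location", "KeyType": "RANGE"}
  ],
  "GlobalSecondaryIndexes": [
    {
      "IndexName": "mascot-index",
      "KeySchema": [
        {"AttributeName": "mascot", "KeyType": "HASH"},
        {"AttributeName": "colors", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "ALL"},
      "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}
    }
  ],
  "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}
}
EOF
```

The table is then created with `aws dynamodb create-table --cli-input-json file://collegeteam.json`. Note how the mascot secondary index carries its own read/write capacity units, separate from the primary index, just as in the console.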

Big Data Career Without Coding?

Do all career options in Big Data demand skills with coding or administration? Big Data projects are in high demand right now, but skill sets for these projects come from different backgrounds. If you are wanting to get involved with Big Data, but don’t have a technical background watch the video to learn your options.

Video – Non-Technical Careers in Big Data

Transcript

Hi, folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Welcome back to the new year. Our first thing that we’re going to tackle today in our first episode of Big Data Big Questions for 2018 is going to be non-technical jobs or career options inside of big data.

It’s submitted in from one of our YouTube users. You can find out more right after this.

Today’s question comes in from YouTube. Remember, if you have any questions around big data or anything that you want to ask and you want me to answer, you can submit those in our YouTube comments below on any of the videos, or you can go to my website at thomashenson.com/bigdataquestions. You can submit any questions there, and I’ll answer them as best I can on air, and give you my advice on the Hadoop community, or big data, or data engineers, or any questions that you have.

Today’s question comes in from YouTube, and it’s from Shahzad Khan. He says, “I work as a change manager, and I don’t know anything about Java or Hadoop, but I want to learn this technology. Is it all right for me to learn, since I’m not into coding? Also, I’ve never been involved in a development team, please suggest.”

Great question. Thanks for the comments and thanks for watching. Continue to watch. My first thing when I look at this is, we’ve talked about the ability, and I’ve had a couple other videos that you’ve seen where we’ve talked about, that you don’t have to know Java to be involved in Hadoop. If you have any questions around that, you can check into that. Really, I think this question, I want to frame it a little bit different, and think about, just because you want to be involved in big data, and you want to be involved in the community and all the things that are happening, you don’t necessarily have to have a technical role to be involved in that.

There’s three roles that I want to talk about that are non-technical from the aspect of coding and Hadoop administration that you can do to still be involved in data or even big data. I’m going to put them together. These aren’t just specifically for big data. This can be around data analytics.

The first one is around data governance. When we talk about data governance, we talk about, what’s the flow of data? Where did the data originate? Everybody’s probably heard of the adage or the example of garbage in, garbage out. Where’s your data coming from? Can you trust, and can you automate, and trust the data that’s coming in? Data governance is about where that data comes from, but it’s also about, how timely is that data? You’re really involved with the sourcing of the data. You’re also looking at things around… I remember one of my first career options. I remember sitting around, and we have a couple different applications, and the heads of each application were together, and we were all there to talk about the different ways that we name things in our own databases. If you think about it, we were trying to merge everything into an enterprise data warehouse. This is a little more old school, but it still happens in big data, when we have these different data sources.

You might have an instance where data from one data set is named or has a different key than data in a separate data set, but you want to be able to merge those. Data governance is around, you can help find and help be a part of that, where the data’s coming from, so that’s one option. I would look into data governance if you still wanted to be involved in big data but didn’t have the technical skills or didn’t have desire to have the technical skills.

Another one is project management. We always need good project managers. Project managers, they’re the ones, the workhorses that really help bring the developers, bring the data scientists, bring the front-end developers, bring everybody together, and really gets that project going. Makes sure that we’re communicating. If you’re interested in project management, you can do that from a non-technical perspective. One of the things, though. I’ve got some stuff on my website where I went through and did the scrum master training. Think of agile development. Just like you would in traditional application development, big data needs agile developers or agile project managers as well.

Then, also look at the scrum master training, but also look at DevOps, and see where that is, if there’s any DevOps certifications, or anything that you can provide in that background to be able to help and manage these teams. Project management is a second one, and then the big one, the next one, compliance and security. We always need compliance and we always need security, especially now with the maturity of the Hadoop community and how much Hadoop is taking over and being used in the enterprise. There’s always compliance around it. You think of HIPAA, you think of some of the SEC compliance here in America. Then, you can also think of GDPR. GDPR, the General Data Protection Regulation. I would look at that regulation.

That’s something that’s really interesting to me, and if I was somebody non-technical, and I was interested in compliance or security, that is one area I would start to look at, because I think there’s going to be a growing need. Anytime there’s any kind of regulation, and this isn’t a political statement in any way, but anytime there’s any kind of regulation or change in regulation, there’s a lot of things that go on behind the scenes as far as interpreting that and making sure that you’re in compliance with your enterprise, or if you’re working for some kind of public institution, you want to make sure you’re doing that. Anytime something like that, if you can become an expert and move to that, that would be huge as well.

For securing the data, too. It’s an ongoing, probably overused joke. How many data breaches have you heard about? There’s one every day. Big data is not, we’re not, immune to that. In fact, we’re larger, a larger target. Think about the three Vs.

Volume. How much data do we have in your Data Lake? Big data has big data, right? You need to be able to secure that. Those are the three areas I would look at for non-technical jobs if you still want to be involved in data. Data governance, project management, and compliance and security. That’s all for today. Thanks for tuning in. Make sure you subscribe, so you never miss an episode. I will see you again on Big Data Big Questions.

What are the fundamental DynamoDB CLI Commands every AWS Data Engineer should know?

What is DynamoDB?

DynamoDB is a fully managed cloud NoSQL database that supports both document and key-value store models. Amazon touts DynamoDB as the most popular cloud-based NoSQL database. DynamoDB is a NoSQL database, not a traditional relational database, and thus does not support joins. Below are a few more key characteristics of DynamoDB:

DynamoDB Characteristics

Fully Managed NoSQL Database

Key-value pair NoSQL Database

Durable across 3 availability zones

Hardware is a performance play with dedicated SSDs

Scales up or down without any down time

Heavy support for JSON

Ready to Use the DynamoDB CLI Commands?

All the commands below are executed using the AWS CLI (Link to AWS CLI) with permission to DynamoDB. Once the AWS CLI is installed, make sure to configure it for the DynamoDB region. For the examples below I will be working on a DynamoDB table for college teams with the table name college-teams. The table will have 5 items (college-id, colors, location, name, mascot). The college-id item will be the primary partition key with a data type of number. The remaining items will all be classified as data type string.

DynamoDB CLI Commands

Scan

aws dynamodb scan – Returns one or more items from a DynamoDB table. Different from query because scan is a brute-force read. Since scan will walk the entire table, make sure to use it sparingly.

$ aws dynamodb scan --table-name college-teams

Create Table

aws dynamodb create-table – Creates a new unique table in DynamoDB. When creating a table, follow AWS best practices: the table name must be unique in each region. Accepts parameters for table name, local secondary indexes, global secondary indexes, key schema, and others. Using --cli-input-json allows you to pass in the table configuration via JSON.

$ aws dynamodb create-table --table-name Aircraft

Create Table – Complex (passing in JSON file)

$ aws dynamodb create-table --table-name
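Passing the configuration in as a JSON file keeps the command itself short. A minimal sketch, assuming a hypothetical filename create-table.json that defines the college-teams table used throughout this post (the capacity units are illustrative):

```shell
# create-table.json is a hypothetical filename; capacity units are illustrative.
cat > create-table.json <<'EOF'
{
  "TableName": "college-teams",
  "AttributeDefinitions": [
    {"AttributeName": "college-id", "AttributeType": "N"}
  ],
  "KeySchema": [
    {"AttributeName": "college-id", "KeyType": "HASH"}
  ],
  "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}
}
EOF
```

The whole definition then goes in with a single flag: `aws dynamodb create-table --cli-input-json file://create-table.json`.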

List Tables

aws dynamodb list-tables – List all tables in region.

$ aws dynamodb list-tables

Put Item

aws dynamodb put-item – Adds a new item or updates an existing item in a DynamoDB table. Using JSON adds better structure for the parameters and saves developers/administrators from having to type everything on one line.

$ aws dynamodb put-item --table-name college-teams --item file://
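As a sketch, with item.json as a hypothetical filename, a full item for the college-teams table might look like the following (the attribute values reuse the USC example from earlier on this page):

```shell
# item.json is a hypothetical filename; the values reuse the USC example.
cat > item.json <<'EOF'
{
  "college-id": {"N": "1"},
  "name": {"S": "University of Southern California"},
  "shortname": {"S": "USC"},
  "mascot": {"S": "Trojan"},
  "colors": {"S": "Cardinal, Gold"},
  "location": {"S": "Los Angeles, CA"}
}
EOF
```

The item is then written with `aws dynamodb put-item --table-name college-teams --item file://item.json`.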

Describe Table

aws dynamodb describe-table – Returns metadata about the table, such as the key schema, indexes, item count, and provisioned throughput.

$ aws dynamodb describe-table --table-name college-teams

Get Item

aws dynamodb get-item – Returns the item with the passed-in key value from a DynamoDB table. It is best to pass the key in through a JSON file. In the example below we want to return the item with college-id = 1. The “N” specifies the data type, which in this example is a number.

$ aws dynamodb get-item --table-name college-teams --key file://get.json

//source of get.json
{
    "college-id": {"N": "1"}
}
//end of source

Query

aws dynamodb query – Returns one or more items based on primary key values. Works against a table or a secondary index with a composite key. Use JSON to ease query input.

$ aws dynamodb query --table-name college-teams \
    --key-condition-expression "#cid = :cid" \
    --expression-attribute-names '{"#cid": "college-id"}' \
    --expression-attribute-values file://query.json

//source of query.json
{
    ":cid": {"N": "1"}
}
//end of source

Batch Get Item

aws dynamodb batch-get-item – Queries multiple items from a DynamoDB table. The key to using the command is to pass the request keys in a JSON file. Used primarily for large queries; DynamoDB places an upper limit of 100 items on batch-get-item. Make sure to take into account the read capacity units for each query.
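A minimal sketch of the request file, assuming a hypothetical filename batch-get.json that asks for two items by their college-id keys:

```shell
# batch-get.json is a hypothetical filename; it requests two items by key.
cat > batch-get.json <<'EOF'
{
  "college-teams": {
    "Keys": [
      {"college-id": {"N": "1"}},
      {"college-id": {"N": "2"}}
    ]
  }
}
EOF
```

The batch read is then issued with `aws dynamodb batch-get-item --request-items file://batch-get.json`.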

Batch Write Item

aws dynamodb batch-write-item – Allows administrators/developers to add or delete multiple items in a DynamoDB table. Just like with Batch Get Item, the request is passed in a JSON file. For adding items we use a “PutRequest” and for deleting items a “DeleteRequest”.
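A minimal sketch of the request file, assuming a hypothetical filename batch-write.json; the item values here are made up purely for illustration. It adds one item and deletes another in a single call:

```shell
# batch-write.json is a hypothetical filename; item values are illustrative.
cat > batch-write.json <<'EOF'
{
  "college-teams": [
    {"PutRequest": {"Item": {"college-id": {"N": "3"}, "name": {"S": "Auburn University"}}}},
    {"DeleteRequest": {"Key": {"college-id": {"N": "2"}}}}
  ]
}
EOF
```

The batch is then submitted with `aws dynamodb batch-write-item --request-items file://batch-write.json`.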