Once you’ve identified a big data issue to analyze, how do you collect, store, and organize your data using big data solutions? In this course, you will experience various data genres and the management tools appropriate for each. You will be able to describe the reasons behind the evolving plethora of new big data platforms from the perspective of big data management systems and analytical tools. Through guided hands-on tutorials, you will become familiar with techniques using real-time and semi-structured data examples. Systems and tools discussed include: AsterixDB, HP Vertica, Impala, Neo4j, Redis, and SparkSQL. This course provides techniques for extracting value from existing untapped data sources and for discovering new data sources.
At the end of this course, you will be able to:
* Recognize different data elements in your own work and in everyday life problems
* Explain why your team needs a Big Data Infrastructure Plan and an Information System Design
* Identify the frequent data operations required for various types of data
* Select a data model to suit the characteristics of your data
* Apply techniques to handle streaming data
* Differentiate between a traditional Database Management System and a Big Data Management System
* Appreciate why there are so many data management systems
* Design a big data information system for an online game company
This course is for those new to data science. Completion of Intro to Big Data is recommended. No prior programming experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the hands-on assignments. Refer to the specialization technical requirements for complete hardware and software specifications.
Hardware Requirements:
(A) Quad-core processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB free disk space. How to find your hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking Properties; (Mac): Open Overview by clicking the Apple menu and clicking “About This Mac.” Most computers with 8 GB RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection because you will be downloading files up to 4 GB in size.
Software Requirements:
This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded and installed free of charge (except for data charges from your internet provider). Software requirements include: Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+, or CentOS 6+; VirtualBox 5+.

JW

I feel as though the assessment questions could have been more specific and the assessment criteria when marking could have been more precise. But other than that it was a great course.

RL

Apr 10, 2019

★★★★★ (5/5)

As an undergraduate data analytics student, I found this course an enlightening experience that complemented my more theoretical, less applied on-campus courses very well.

In the lesson

Big Data Management: The "M" in DBMS

Managing big data requires a different approach to database management because the wide variation in data structure does not lend itself to traditional DBMSs. There are many applications available to help with big data management. In these lessons we introduce you to some of these applications and provide insight into how and when each might be appropriate for your own big data management challenges.

Taught by

Ilkay Altintas

Chief Data Science Officer

Amarnath Gupta

Director, Advanced Query Processing Lab

Transcript

Most of you have heard of MongoDB as a dominant store for JSON-style semi-structured data. MongoDB is very popular, and there are a number of excellent tutorials on it on the web. In this module we would like to discuss a relatively new big data management system for semi-structured data that is currently being incubated by Apache. It's called AsterixDB. AsterixDB was originally conceived at the University of California, Irvine. Since it is a full-fledged DBMS, it provides ACID guarantees.

To understand the basic design of AsterixDB, let's consider this incomplete JSON snippet taken from an actual tweet. We have seen the structure of JSON before. Here we point out that entities and user, the two parts in blue, are nested, that is, embedded, within the structure of the tweet. If we represent a part of the schema of this abbreviated structure in AsterixDB, it will look like this. Here a dataverse is like a namespace for data. Data is declared in terms of data types. The top type, which looks like a standard database table declaration, represents the user portion of the JSON object that we highlighted before. The type below it represents the message. Now, instead of nesting it as in JSON, the user attribute highlighted in blue is declared to have the type TwitterUserType; thus it captures the hierarchical structure of JSON. We should also notice that the first type is declared as open, which means that the actual data can have more attributes than specified here. In contrast, the TweetMessage type is declared as closed, meaning that a data instance must have exactly the attributes in the schema. AsterixDB can also handle spatial data, as shown by the point data type in green. The question mark at the end of the point type says that this attribute is optional; not all instances need to have it. Finally, the create dataset statement asks the system to create a dataset called TweetMessages whose type is the just-declared TweetMessageType.

AsterixDB, which runs on HDFS, provides several options for query support. First, it has its own query language, called the Asterix Query Language (AQL), which resembles the XML query language XQuery. The details of this query language are not important right now; we are illustrating the structure of a query just to show what it looks like. This particular query asks for all user objects from the dataset TwitterUsers, in descending order of their follower count and in alphabetical order of the user's preferred language.

What is more interesting and distinctive is that AsterixDB has a query processing engine that can process queries in multiple languages. For each supported language, the developers have worked out how to translate a query into a set of low-level operations, like select and join, which the query engine can execute. Further, they have determined how a record described in one of these languages can be transformed into an AsterixDB record. In this manner, AsterixDB supports Hive queries, XQuery, and Hadoop MapReduce, as well as a new language called SQL++, which extends SQL for JSON.

Like a typical big data management system, AsterixDB is designed to operate on a cluster of machines. The basic idea, not surprisingly, is to use partitioned data parallelism. Each dataset is divided into partitions that can be distributed to different machines by either range partitioning or hash partitioning, as we discussed earlier. A runtime distributed execution engine called Hyracks is used for partitioned-parallel execution of query plans.
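To make the type declarations described above concrete, here is a minimal sketch in AsterixDB's data definition language, patterned on the publicly documented AsterixDB TinySocial tutorial. The dataverse name and the exact field lists are illustrative assumptions, not the declarations from the lecture slides:

```
// Illustrative sketch (TinySocial-style names, not the lecture's slides).
drop dataverse TinySocial if exists;
create dataverse TinySocial;
use dataverse TinySocial;

// Open type: instances may carry extra attributes beyond these.
create type TwitterUserType as open {
    screen-name: string,
    lang: string,
    name: string,
    followers_count: int32
};

// Closed type: instances must match this schema exactly.
// The user attribute nests TwitterUserType, capturing JSON's
// hierarchy; "point?" marks an optional spatial attribute.
create type TweetMessageType as closed {
    tweetid: string,
    user: TwitterUserType,
    sender-location: point?,
    send-time: datetime,
    message-text: string
};

// Create a dataset whose instances are of the just-declared type.
create dataset TweetMessages(TweetMessageType)
    primary key tweetid;
```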
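Likewise, the AQL query sketched in the lecture might look roughly like this, assuming a TwitterUsers dataset of TwitterUserType exists in the same dataverse:

```
use dataverse TinySocial;

// All users, most-followed first; ties broken alphabetically
// by the user's preferred language.
for $user in dataset TwitterUsers
order by $user.followers_count desc, $user.lang asc
return $user;
```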
For example, let's assume we have two relations, customers and orders, as you can see here. Our query is: find the number of orders for every market segment that the customers belong to. This query needs a join operation between the two relations, using O_CUSTKEY in orders as a foreign key into customers. It also needs a grouping operation, which for each market segment will pull together all the orders, which will then be counted. You don't have to understand the details of this diagram at this point; we just want to point out the different parts of the query that are marked. The customers file here has two partitions, which reside on two nodes, NC1 and NC2 respectively. The orders file also has two partitions, but each partition is dually replicated: one can be accessed on either node NC3 or NC2, and the other on NC1 or NC5. Hyracks will break up the query into a number of jobs and then figure out which tasks can be performed in parallel and which ones must be executed stage by stage. This whole process is managed by the cluster controller, which is also responsible for replanning and re-executing a job if there is a failure.

AsterixDB also has a provision to accept real-time data from external data sources at multiple rates. One way is from files in a directory path. Consider the example of tweets. As you have seen in the hands-on demo, people usually acquire tweets by accessing data through an API that Twitter provides. Typically, a certain volume of tweets, let's say every 5 minutes' worth, is accumulated into a .json file in a specific directory; the next 5 minutes go into another .json file, and so forth. The way to get this data into AsterixDB is to first create an empty dataset, called Tweets here. The next task is to create a feed, that is, an external data source. One has to specify that the data is coming from the local file system, called localfs here, along with the location of the directory, the format, and the data type of the records it will contain. Next, the feed is connected to the dataset, and the system starts reading unread files from the directory.

Another way for AsterixDB to access external data is directly from an API, such as the Twitter API. To do this, one would create a dataset as before, but this time the data feed is not on the local file system. Instead, it uses the push_twitter adapter, which invokes the Twitter client with the four authentication parameters required by the API. Once the feed is defined, it is connected to the dataset as before.
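Returning to the customers/orders example above, the join-plus-grouping query could be sketched in AQL as follows. The dataset and attribute names (Customers, Orders, c_custkey, o_custkey, c_mktsegment) are assumptions patterned on the TPC-H schema, not taken from the lecture's diagram:

```
// For each market segment, join customers to their orders
// (o_custkey is the foreign key) and count the orders.
for $c in dataset Customers
for $o in dataset Orders
where $o.o_custkey = $c.c_custkey
group by $segment := $c.c_mktsegment with $o
return { "segment": $segment, "order_count": count($o) };
```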
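A sketch of the file-based feed setup just described, patterned on the AsterixDB feeds documentation; the directory path, format, and type name are placeholder assumptions:

```
// Empty dataset the feed will populate.
create type TweetType as open {
    id: string
};
create dataset Tweets(TweetType)
    primary key id;

// Feed over a local directory (localfs); AsterixDB picks up
// unread files that accumulate under the given path.
create feed TweetFileFeed using localfs
    (("path"="localhost:///home/user/tweets"),
     ("format"="adm"),
     ("type-name"="TweetType"));

// Connect the feed; ingestion into Tweets begins.
connect feed TweetFileFeed to dataset Tweets;
```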
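And the API-based variant, again a sketch following the AsterixDB documentation for the push_twitter adapter; the four OAuth credential values are placeholders you would obtain from Twitter:

```
// Feed that pulls tweets directly from the Twitter API via the
// push_twitter adapter and its four OAuth parameters.
create feed TwitterFeed using push_twitter
    (("type-name"="TweetType"),
     ("format"="twitter-status"),
     ("consumer.key"="************"),
     ("consumer.secret"="************"),
     ("access.token"="************"),
     ("access.token.secret"="************"));

connect feed TwitterFeed to dataset Tweets;
```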