A collection of interesting things I have faced in my big data journey

Friday, 23 December 2016

This post details how to replace a dead Cassandra node. I recently faced this situation and had to struggle with a few issues, so I would like to describe all of them and point to the right resolution for each one.

1. First, the normal replace procedure. This should be your first try; it works only in ideal situations. Nevertheless, try it.

a. Check the status of the nodes using "nodetool status". Any node that is down will appear with status "DN".

Example: assuming you have 6 nodes, here is how it looks. I am giving masked details, with host IDs and IPs renamed.

Here, in the above status, the node with IP 1.2.3.8 is down. Assuming we have to replace it with a new machine, here are the steps:

1) Install Cassandra on the new node and do not start it.
2) Make sure the seed details and everything else in the Cassandra configuration are correct.
3) Start Cassandra with the following command (assuming your Cassandra installation directory is /usr/lib/cassandra):

/usr/lib/cassandra/bin/cassandra -Dcassandra.replace_address_first_boot=1.2.3.8

Now the story begins.

Case 1: If it starts without any problems, you are lucky. Go to all the nodes and run a repair on every one. That should be it.

Case 2: If you get a warning saying it is unsafe to replace and to use cassandra.allow_unsafe_replace, then run:

/usr/lib/cassandra/bin/cassandra -Dcassandra.replace_address_first_boot=1.2.3.8 -Dcassandra.allow_unsafe_replace=true

If it starts after that, you can still consider yourself lucky. Go ahead with a repair on each node and you will be done.

Case 3: If it screams with the error

java.lang.RuntimeException: Host ID collision between active endpoint

This means the cluster information coming from the seed still has the dead machine in its gossip or system information. If you get this situation, proceed as follows.

i) Remove the dead node, then rerun the Cassandra command with the replace option as in case 1/case 2:

nodetool removenode <host-id-of-dead-node>

(The host ID of the dead node is shown by "nodetool status".)

ii) If it still screams at you, go to the data folder of Cassandra. Its location is configured in cassandra.yaml; by default it is <cassandra_installation_directory>/data. Check the system directory inside the data directory. This holds system information collected from all the machines, and once it has been created with the old machine's details, you get into this situation.

Run the following command on the new/fresh node that you want to use as the replacement. P.S.: Do not run this command on any existing machine. It will destroy the complete cluster information if misused.

rm -r <data_directory>/system/*

What it means: remove all system-table data from the new Cassandra node.

Now run the command with the replace_address option above. If you encounter case 2, run with allow_unsafe_replace set to true. The node should now join the cluster without any issues. When you check nodetool status, you should see only the new node, but with the same host ID as the old machine, like below.

Friday, 22 July 2016

I recently faced this question on Stack Overflow and resolved it. Though many people have already done far more advanced things with Scala and SBT, I felt this would be a good one to share.

The Problem:
The user wants to change a dependency in the SBT build file based on a condition. Users might be using different versions of APIs in different environments, like Dev, Test, etc. The question asked how to use a different dependency depending on the environment in use.

Solution:
In my example I perform a dynamic build with two different Spark versions; I have to use one or the other based on a specific condition.

You can do this in two ways. Either way you need to provide the input somehow, so you need to use command-line parameters.

1) Using build.sbt itself.

a) Define a parameter with the name "sparkVersion".

b) Read that parameter in build.sbt. (You can write Scala code in build.sbt; it gets compiled as Scala anyway at build time.)
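The two steps above can be sketched in build.sbt like this. This is a minimal sketch, not the exact build from the original question; the property name sparkVersion, the fallback version, and the dependency line are all illustrative:

    // build.sbt (sketch). Pass the version on the command line, e.g.:
    //   sbt -DsparkVersion=1.6.2 compile
    // Both the property name and the fallback version are assumptions.
    val sparkVersion = sys.props.getOrElse("sparkVersion", "1.6.2")

    libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"

Because build.sbt is compiled as Scala, the sys.props lookup runs at build time, so each environment can select its own Spark version without editing the build file.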

Tuesday, 3 May 2016

What is schema evolution?

Schema evolution is the term used for how the store behaves when the schema is changed after data has been written to the store using an older version of that schema. The modifications one can safely perform to a schema without any concerns are:

> A field with a default value is added.
> A field that was previously defined with a default value is removed.
> A field's doc attribute is changed, added or removed.
> A field's order attribute is changed, added or removed.
> A field's default value is added, or changed.
> Field or type aliases are added, or removed.

Rules for changing a schema:

1. For best results, always provide a default value for the fields in your schema. This makes it possible to delete fields later on if you decide it is necessary. If you do not provide a default value for a field, you cannot delete that field from your schema.
2. You cannot change a field's data type. If you have decided that a field should be some data type other than what it was originally created with, then add a whole new field to your schema that uses the appropriate data type.
3. You cannot rename an existing field. However, if you want to access the field by some name other than what it was originally created with, add and use aliases for the field.
4. A non-union type may be changed to a union that contains only the original type, or vice versa.

How do you handle schema evolution with Avro?

Schema evolution is the automatic transformation of the Avro schema. This transformation is between the version of the schema that the client is using (its local copy) and what is currently contained in the store. When the local copy of the schema is not identical to the schema used to write the value (that is, when the reader schema is different from the writer schema), this data transformation is performed. When the reader schema matches the schema used to write the value, no transformation is necessary.

Schema evolution is applied only during deserialization.
If the reader schema is different from the value's writer schema, the value is automatically modified during deserialization to conform to the reader schema. To do this, default values are used. There are two cases to consider when using schema evolution: when you add a field and when you delete a field. Schema evolution takes care of both scenarios, so long as you originally assigned default values to the fields that were deleted, and assigned default values to the fields that were added.

Avro schemas can be written in two ways. Either in a JSON format:

    {
      "type": "record",
      "name": "Person",
      "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests", "type": {"type": "array", "items": "string"}}
      ]
    }

…or in an IDL:

    record Person {
      string userName;
      union { null, long } favouriteNumber;
      array<string> interests;
    }

The following are the key advantages of Avro:

* Schema evolution – Avro requires schemas when data is written or read. Most interesting is that you can use different schemas for serialization and deserialization, and Avro will handle the missing/extra/modified fields.
* Untagged data – Providing a schema with binary data allows each datum to be written without overhead. The result is more compact data encoding and faster data processing.
* Dynamic typing – This refers to serialization and deserialization without code generation.
It complements the code generation, which is available in Avro for statically typed languages as an optional optimization.

Example code to handle the schema evolution:

    -- Create an external table with an Avro schema
    -- (this can be tried with an external or a managed table)
    CREATE EXTERNAL TABLE avro_external_table
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION 'hdfs://quickstart.cloudera:8020/user/cloudera/avro_data/'
    TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/user/cloudera/old_schema.avsc');

    -- Select query to check the data
    SELECT * FROM avro_external_table;

    -- Alter table statement to change the schema file, to check the schema evolution
    ALTER TABLE avro_external_table
    SET TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/user/cloudera/new_schema.avsc');

    -- Select query
    SELECT * FROM avro_external_table;

Old_schema:
=========

    {
      "type": "record",
      "name": "Meetup",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "meetup_date", "type": "string"},
        {"name": "going", "type": "int"},
        {"name": "organizer", "type": "string", "default": "unknown"},
        {"name": "topics", "type": {"type": "array", "items": "string"}}
      ]
    }

New_schema (renames "going" to "attendance" and adds a new field "location"):
=================================================

    {
      "type": "record",
      "name": "Meetup",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "meetup_date", "type": "string", "java-class": "java.util.Date"},
        {"name": "attendance", "type": "int", "aliases": ["going"]},
        {"name": "location", "type": "string", "default": "unknown"}
      ]
    }

Wednesday, 20 April 2016

You are given a 6∗6 2D array. An hourglass in an array is a portion shaped like this:

a b c
  d
e f g

For example, if we create an hourglass using the number 1 within an array full of zeros, it may look like this:

1 1 1 0 0 0
0 1 0 0 0 0
1 1 1 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0

Actually, there are many hourglasses in the array above. The three leftmost hourglasses are the following:

1 1 1   1 1 0   1 0 0
  1       0       0
1 1 1   1 1 0   1 0 0

The sum of an hourglass is the sum of all the numbers within it. The sums for the hourglasses above are 7, 4, and 2, respectively.

In this problem you have to print the largest sum among all the hourglasses in the array.

Solution:

My solution uses an iterative approach as of now. I am preparing a recursive solution as well.
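The iterative approach can be sketched in Scala as follows. Function and variable names are my own; a 6x6 grid has 4x4 = 16 top-left positions for an hourglass, and we simply take the maximum of the 16 sums:

```scala
// Iterative max hourglass sum for a 6x6 array.
// Each hourglass is 3 rows tall and 3 columns wide, so the
// top-left corner (i, j) can only range over 0..3 in each dimension.
def maxHourglassSum(a: Array[Array[Int]]): Int = {
  var best = Int.MinValue
  for (i <- 0 to 3; j <- 0 to 3) {
    val top = a(i)(j) + a(i)(j + 1) + a(i)(j + 2)
    val mid = a(i + 1)(j + 1)
    val bot = a(i + 2)(j) + a(i + 2)(j + 1) + a(i + 2)(j + 2)
    best = math.max(best, top + mid + bot)
  }
  best
}

// The sample grid from the problem statement above
val grid = Array(
  Array(1, 1, 1, 0, 0, 0),
  Array(0, 1, 0, 0, 0, 0),
  Array(1, 1, 1, 0, 0, 0),
  Array(0, 0, 0, 0, 0, 0),
  Array(0, 0, 0, 0, 0, 0),
  Array(0, 0, 0, 0, 0, 0))

println(maxHourglassSum(grid)) // prints 7, the sum of the leftmost hourglass
```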

Tuesday, 9 February 2016

I had to join two HBase tables to get the result for one of my projects, and I could not find a concrete solution for this, so I resolved it on my own. Here is how the solution works. I am taking the classic case of a User and Dept join, as I cannot present my project work due to security reasons.

Let's say the User table has the structure below:

ColumnFamily   Qualifier
------------------------------------
user           userid
user           username
user           deptId

The Dept table has the structure below:

ColumnFamily   Qualifier
------------------------------------
department     deptid
department     departmentname
department     departmentdescription

Consider that you have to join based on deptId, which is available in both tables. Here is how the code works.
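The join logic itself can be sketched like this. This is a minimal sketch, not the HBase client code: plain Scala case classes and Maps stand in for the rows that the two table scans would return (in the real code these would come from HBase Scan/Result calls), and all the sample values are illustrative:

```scala
// In-memory stand-ins for rows scanned from the two HBase tables
case class User(userId: String, userName: String, deptId: String)
case class Dept(deptId: String, name: String, description: String)

val users = Seq(
  User("u1", "alice", "d1"),
  User("u2", "bob",   "d2"))

val depts = Seq(
  Dept("d1", "engineering", "builds things"),
  Dept("d2", "sales",       "sells things"))

// Build a lookup keyed by deptId from the smaller table,
// then join each user to its department in one pass
val deptById = depts.map(d => d.deptId -> d).toMap
val joined = users.flatMap { u =>
  deptById.get(u.deptId).map(d => (u.userName, d.name))
}
```

Building the lookup on the smaller table is the same idea as a map-side (hash) join: one full scan of each table instead of a nested scan per row.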

Sunday, 31 January 2016

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

In short: processing huge data in real time, both in terms of processing and results.

Much like Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called DStreams, or discretized streams. A DStream is a sequence of data arriving over time.

Spark Streaming uses a “micro-batch” architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval. The batch interval is typically between 500 milliseconds and several seconds, as configured by the application developer. Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.

DStream: Core Concept of Spark Streaming

Internally, each DStream is represented as a sequence of RDDs arriving at each time step (hence the name "discretized"). DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.

DStreams provide many of the same operations available on RDDs, plus new operations related to time, such as sliding windows.

Socket Listening Example:

Socket listening means the application listens on a specific TCP port continuously; when a message/event arrives at that port, it gets picked up and processed.

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.Seconds

    object SocketListening {
      def main(args: Array[String]) {
        // The conf must be passed to the SparkContext
        val conf = new SparkConf().setAppName("SocketListening")
        val sc = new SparkContext(conf)
        val streamingContext = new StreamingContext(sc, Seconds(1))

        // Listen on TCP port 9999 of localhost
        val dStream = streamingContext.socketTextStream("localhost", 9999)

        // Keep only the lines containing the word "error"
        val errorLines = dStream.filter(_.contains("error"))
        errorLines.print()

        streamingContext.start()
        streamingContext.awaitTermination()
      }
    }

To start receiving data, we must explicitly call start() on the StreamingContext. Then, Spark Streaming will start to schedule Spark jobs on the underlying SparkContext. This will occur in a separate thread, so to keep our application from exiting, we also need to call awaitTermination to wait for the streaming computation to finish.

Note that a streaming context can be started only once, and must be started after we set up all the DStreams and output operations we want.

Transformations on DStreams

They can be grouped into either stateless or stateful:

• Stateless: In stateless transformations the processing of each batch does not depend on the data of its previous batches. These include the common RDD transformations, like map(), filter(), and reduceByKey().

• Stateful: These transformations, in contrast, use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.

Transform: This is not like the regular transformations. The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API; however, you can easily use transform to do this. This enables very powerful possibilities. For example, one can do real-time data cleaning by joining the input data stream with precomputed spam information (maybe generated with Spark as well) and then filtering based on it.

Map           – works on each record
MapPartitions – works on each partition
Transform     – works on each RDD


Answer:

To clarify further:

A yourDStream.map(record => yourFunction(record)) will do something to every record in every RDD in the DStream, which essentially means every record in the DStream. But yourDStream.transform(rdd => anotherFunction(rdd)) allows you to do arbitrary stuff on every RDD in the DStream.

For example, yourDStream.transform(rdd => rdd.map(record => yourFunction(record))) is exactly the same as the map in the first line. Only a map function.

However, you can also do

yourDStream.transform(rdd => rdd.map(...).reduceByKey(...).filter(...).flatMap(...).sortByKey(...)), which obviously involves multiple stages of shuffles by key. So transform is a far more general operation than map, one that allows arbitrary computations on each RDD of a DStream. For example, say you want to sort every batch of data by a key. Currently, there is no DStream.sortByKey() to do that. However, you can easily use transform: DStream.transform(rdd => rdd.sortByKey()).
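The map-versus-transform contrast above can be sketched with plain Scala collections, where a Seq of Seqs stands in for a DStream's sequence of RDD batches. This is only an analogy to show the two levels the functions operate at, not real Spark code:

```scala
// A stand-in for a DStream: each inner Seq is one "batch" (one RDD)
val batches: Seq[Seq[Int]] = Seq(Seq(3, 1, 2), Seq(5, 4))

// dstream.map(f): f is applied to every record in every batch
val mapped = batches.map(batch => batch.map(_ * 10))

// dstream.transform(rdd => ...): an arbitrary batch-level operation,
// e.g. sorting each batch, which a per-record map cannot express
val transformed = batches.map(batch => batch.sorted)
```

map only ever sees one record at a time; transform sees the whole batch, which is why batch-level operations like sorting or joining are expressed through it.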