
At this point, the number of blogs and articles about Big Data probably surpasses the amount of data collected by a typical organization. For every company trying to solve the “data problem”, the issue isn’t just sheer size. That is just the starting point, and granted, it is a major hurdle to jump, but that […]

Guest author Nagapriya Tiruthani, Offering Manager, IBM Big SQL, IBM Why Big SQL? Enterprise Data Warehousing (EDW) emerged as a logical home for all enterprise data that captures the essence of all enterprise systems. But in recent years, there’s been an explosion of data being captured from social media, sensors, etc. This rapid growth has put […]

Recent versions of Hortonworks Data Platform (HDP) introduced several innovations in the areas of security, data governance, business cataloging, query optimization, visualization, and backup-and-restore. To help us keep pace with the rapid adoption of HDP, we are actively partnering with innovative companies within the big data ecosystem. Expanding the Ecosystem One such partner is Denodo, […]

Without the right knowledge at the right time in the healthcare industry, there can be dire consequences for many people. This means no one person or organization can monopolize knowledge, and the only way to ensure that information gets transferred is accessible data. Nothing is Solved in a Silo On Monday, we discussed how predictive […]

At Hortonworks we are constantly striving to achieve high quality releases. HDP/HDF releases are deployed by thousands of enterprises and are used in business critical environments to crunch several petabytes of data every single day. So maintaining the highest standards of quality and investing in an infrastructure to support the repeatable standards of quality is […]

Healthcare organizations face distinct challenges in managing the quantity of their data along with being able to utilize data from numerous different sources. Through the use of predictive analytics and data virtualization, healthcare organizations are able to better serve their members. One example of utilizing predictive analytics to provide better solutions and cost savings is […]

Big Data is rapidly changing our world, and the insurance industry is no exception. Leveraging and analyzing data leads to actionable insights that insurance companies use to help mitigate risks to prevent disasters. The availability of new data sources has created new opportunities to optimize risk, reduce exposure and create behavior based products. Insurance companies […]

The Enterprise Data Warehouse (EDW) has been a great custodian of enterprise data, and the foundation for business critical analytics for many years. The growing volumes of business data combined with recent advances in Big Data technology are introducing a market inflection point where enterprise legacy EDWs can no longer effectively address the scale, diversity, […]

Originally published by DataQuest It would not be an overstatement to say that widespread cyber attacks crippling global businesses have become the new normal. The speed and scale of the recent ransomware attacks and cyber-security breaches have taught us one important lesson. Threat detection and mitigation will be the key to SOC (security operations […]

Bad data in any organization has damaging consequences, and in the housing market, the harm to affected families only inflames concerns. Without the proper data at the right time, and the ability to effectively govern the data coming in, the process of ensuring the right candidates are able to live in the home of their […]

With today’s new rapid pace, speed to market is a huge factor for any business. The faster a company can gain insights from their data, the better they can serve their customers. If changes aren’t made quickly enough, there’s a significant risk of losing customers and market share. One example of gaining faster insights from […]

Economic progress can seem like a two-edged sword – we relish the opportunities for career and lifestyle choices offered by our expanding cities, but urban transportation woes can sometimes make us wonder if it's all worth it. The Association of Southeast Asian Nations (ASEAN) celebrated its 50th anniversary this year – and there is […]

The rate of change in data management is astonishing. In just a few years we have seen the emergence of big data turn into a data lake and then we pushed these concepts to the edge where we capture our data. Along the way, the traditional paradigm of building a monolithic store to drive analytics […]

This is the second post in the Engineering @ Hortonworks blog series that explores how we in Hortonworks Engineering build, test and release new versions of our platforms. In this post, we deep dive into something that we are extremely excited about – Running a container cloud on YARN! We have been using this next-generation […]

One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their […]

Suppose I have the following piece of code:

val a = sc.textFile("path/to/file")
val b = a.filter().groupBy()
val c = b.filter().groupBy()
val d = c.
val e = d.
val sum1 = e.reduce()
val sum2 = b.reduce()

Note that I have not used any cache/persist command. Since the RDD b is being used again in the […]
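
The recomputation this question is about can be illustrated without a cluster. In this plain-Python analogy (no Spark APIs are used; the function stands in for b's lineage), each action re-runs the whole pipeline unless the intermediate result is materialized once, which is roughly what cache()/persist() would do for the reused RDD b:

```python
# Toy analogy for RDD lineage: each "action" re-runs the pipeline
# unless an intermediate result is explicitly kept (like rdd.cache()).
compute_count = 0

def build_b(data):
    """Stands in for a.filter(...).groupBy(...), i.e. the lineage up to b."""
    global compute_count
    compute_count += 1
    return [x for x in data if x % 2 == 0]

data = list(range(10))

# Without caching: every action re-evaluates b's lineage.
sum1 = sum(build_b(data))   # first action
sum2 = sum(build_b(data))   # second action recomputes b
assert compute_count == 2

# With "caching": compute once, reuse the materialized result.
compute_count = 0
b_cached = build_b(data)    # analogous to b.cache() plus a first action
sum1 = sum(b_cached)
sum2 = sum(b_cached)
assert compute_count == 1
```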

In my Spark stderr I see only "Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend" when I try to increase my "yarn.nodemanager.resource.memory-mb" beyond 3512. However, I need it to be about 12GB or so, as I need to allocate memory for containers that are being spawned by the job. This is on Sandbox docker HDP 2.6.1.0-129, spark2. Update: Directory […]

Hi All, I am struggling to change the number of cores allocated to the Spark driver process. As I am quite a newbie to Spark, I have no more ideas about what could be wrong; I hope somebody can advise me on this. It seems that the driver's core count remains at the default value, which is 1 I guess, regardless of what value is […]

Hi Team, I am working on counting the number of occurrences in a file:

RDD1 = RDD.flatMap(lambda x: x.split('|'))
RDD2 = RDD1.countByValue()

I want to save the output of RDD2 to a text file. I am able to see the output with

for x, y in RDD2.items():
    print(x, y)

but when I tried to save it to a text file using RDD2.saveAsTextFile(\path), it is not working; it was throwing as […]
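
One likely culprit, consistent with PySpark's documented behavior: countByValue() returns a plain local dict to the driver, not an RDD, so it has no saveAsTextFile method. The counting step can be mirrored with the stdlib to see the shape of the result; the input lines here are invented:

```python
from collections import Counter

# The RDD pipeline in the question, mirrored on a plain list:
#   RDD1 = RDD.flatMap(lambda x: x.split('|')); RDD2 = RDD1.countByValue()
lines = ["a|b|a", "b|c"]
tokens = [t for line in lines for t in line.split('|')]   # flatMap
counts = Counter(tokens)                                  # countByValue

# Like Spark's countByValue(), this is a local dict, not a distributed
# collection, so there is no saveAsTextFile() on it.  In Spark the usual
# fix is to turn the result back into an RDD first, e.g.
#   sc.parallelize(counts.items()).saveAsTextFile(path)
assert counts == {"a": 2, "b": 2, "c": 1}
```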

I am trying to add multiple Spark Thrift Server (STS) instances in my cluster. However, only the instance on the same host as HiveServer2 is working; the rest are not. I suspect some problem with how I have added the multiple instances. I could not find a document for this scenario.

Scenario 1: Only one instance of Spark Thrift Server is needed. Approach: If you are installing the Spark Thrift Server on a Kerberos-secured cluster, the following instructions apply: The Spark Thrift Server must run on the same host as HiveServer2, so that it can access the hiveserver2 keytab. Edit permissions in /var/run/spark and /var/log/spark to specify read/write permissions […]

Hi, I have a daily job: pick up a file, do some processing, append the output to the already-processed file, and add one more id column that is dynamically incremented by 1 from max(last generated output id). See the problem below; I have been facing it from day 2 onward (the issue does not appear when running 10 to 20 records). Day 1 i/p col1, […]

While running the spark-bench command ./examples/multi-submit-sparkpi/multi-submit-example.sh from the spark-bench distribution, I am getting this error:

Exception in thread "main" java.lang.Exception: spark-submit failed to complete properly given these arguments: --class com.ibm.sparktc.sparkbench.cli.CLIKickoff --master yarn-client /home/dialdev/spark-bench_2.1.1_0.2.2-RELEASE/lib/spark-bench-2.1.1_0.2.2-RELEASE.jar /tmp/spark-bench-3184271760086441425.conf
    at com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch$.launch(SparkLaunch.scala:47)
    at com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch$anonfun$run$2.apply(SparkLaunch.scala:39)
    at com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch$anonfun$run$2.apply(SparkLaunch.scala:39)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch$.run(SparkLaunch.scala:39)
    at com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch$.main(SparkLaunch.scala:16)
    at com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch.main(SparkLaunch.scala)

In spark-bench-env.sh, the environment variables are set as follows:

export SPARK_HOME=/usr/hdp/2.5.3.0-37/spark […]

I am using EMR and running a Spark streaming job with YARN as the resource manager and Hadoop 2.7.3-amzn-0. I want to clean up datanode files after completion of the Spark job: /mnt/hdfs/current/BP-2030300665-192.168.0.1-1495611838265/current/finalized/subdir0/subdir230/blk_1073800835 and blk_1073800835_60012.meta. They increase my storage use and I am facing a disk-full issue. Is there any way to achieve this, or any impact on my […]

Below is a query that produces a result when executed in the Hive CLI, but not through Spark using HiveContext.

val s = """select cast(date as String) as date_key, field1, field2, field3, field4,
COUNT(CASE WHEN m.field5 in ('Core','A') THEN a.strRXNUM END) AS result1,
COUNT(CASE WHEN m.field6 in ('Advanced','D') THEN a.strRXNUM END) AS result2,
COUNT(CASE WHEN m.field7 in ('Custom Core','g') THEN a.strRXNUM […]
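
Whatever the Hive-vs-HiveContext discrepancy turns out to be, the conditional-count pattern the query relies on (COUNT over a CASE WHEN counts only the rows where the CASE yields a non-NULL value) can be verified locally with stdlib sqlite3. The table and values below are invented stand-ins for the fields in the query:

```python
import sqlite3

# COUNT(CASE WHEN ... THEN x END) counts rows where the CASE yields
# non-NULL, i.e. a filtered count per column.  The demo table is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (field5 TEXT, strRXNUM TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("Core", "rx1"), ("A", "rx2"), ("Other", "rx3")])

(result1,) = con.execute(
    "SELECT COUNT(CASE WHEN field5 IN ('Core','A') "
    "THEN strRXNUM END) FROM t").fetchone()
assert result1 == 2   # only the 'Core' and 'A' rows are counted
```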

I am planning to enable a full-fledged development environment for Spark practice, with IntelliJ / Eclipse installed. I am trying to enable desktop mode for the HDP 2.6 VM, and it keeps failing while trying to add the VNC Server; is it possible to achieve this? The error I am getting is shown in the attached screenshot.

Hi Team, I was trying to load a dataframe into a Hive table, bucketed by one of the columns, and I am facing an error:

File "", line 1, in
AttributeError: 'DataFrameWriter' object has no attribute 'bucketBy'

Here is the statement I am trying to run:

rs.write.bucketBy(4, "Column1").sortBy("column2").saveAsTable("database.table")

Can you please help me out with this?

I was getting a zero-length error on /usr/hdp/apps/spark2/spark2-hdp-yarn-archive.tar.gz, which is documented as an issue after some upgrades. So I created and uploaded the file to HDFS using the following commands:

tar -zcvf spark2-hdp-yarn-archive.tar.gz /usr/hdp/current/spark2-client/jars/*
hadoop fs -put spark2-hdp-yarn-archive.tar.gz /hdp/apps/2.5.3.0-37/spark2/

Now when running any Spark job in YARN (say the example pi app), I get the following […]
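
One hedged guess at why the rebuilt archive fails: tarring with the absolute /usr/hdp/... path stores the jars nested under those directories inside the archive, while the Spark-on-YARN archive is normally expected to contain the jars at the archive root (the effect of running tar from inside the jars directory). Python's stdlib tarfile shows the difference; the file names here are illustrative:

```python
import os, tarfile, tempfile

# Archives built as `tar -zcvf out.tgz /abs/path/*` keep the directory
# prefix inside the archive; an archive with entries at the root is what
# `cd dir && tar -zcvf out.tgz *` (or arcname=, below) produces.
with tempfile.TemporaryDirectory() as d:
    jar = os.path.join(d, "example.jar")
    open(jar, "w").close()

    nested_path = os.path.join(d, "nested.tar.gz")
    with tarfile.open(nested_path, "w:gz") as t:
        t.add(jar)                                  # keeps the full path
    flat_path = os.path.join(d, "flat.tar.gz")
    with tarfile.open(flat_path, "w:gz") as t:
        t.add(jar, arcname=os.path.basename(jar))   # entry at the root

    with tarfile.open(nested_path) as t:
        nested_names = t.getnames()
    with tarfile.open(flat_path) as t:
        flat_names = t.getnames()

assert flat_names == ["example.jar"]   # jar sits at the archive root
assert "/" in nested_names[0]          # jar buried under directories
```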

On September 7th, we held our monthly Bay Area Apache Spark Meetup (BASM) at HPE/Aruba Networks in Santa Clara. We had two Apache Spark related talks: one from Aruba Networks’ Data Engineering team and the other from Databricks’ Machine Learning team. For those who missed the meetup, below is the video and link to each presentation […]

Since Apache Spark 1.6, as part of the Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. This effort culminated in Apache Spark 2.0 with Catalyst optimizer and whole-stage code generation. Because Spark […]

First I’ll start with the sad truth. The technology industry at large has taken many hits over the years for discriminatory practices and underrepresentation of both women and minorities. Ageism, too, is a beast that lurks in the Valley. So as an employee, I’m happy to announce that Databricks has formed a Diversity Committee to address […]

We are very excited today as we announce a partnership between Databricks and Looker. We have seen customers using these products together to provide an easy and intuitive way for business users to visualize and discover the powerful analytics results of Apache Spark. Using Looker and Databricks, you can experience the following benefits: Easy to […]

Since Apache Spark 1.3, Spark and its APIs have evolved to make them easier, faster, and smarter. The goal has been to unify concepts in Spark 2.0 and beyond so that big data developers are productive and have fewer concepts to grapple with. Built atop the Spark SQL engine, with Catalyst optimizer and whole-stage code […]

At the Spark Summit in San Francisco in June, we announced an open-source project Deep Learning Pipelines. Deep Learning Pipelines provides high-level APIs for scalable deep learning in Python with Apache Spark, and the library leverages Spark for its two strongest facets: In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that […]

This summer, I worked at Databricks as a software engineering intern on the Growth team. By introducing two new features, user groups and API tokens, I simplified the user management experience and improved security for API authentication. In this blog, I briefly discuss their use and merits and share my personal experience as an intern […]

At the Spark Summit in San Francisco in June, we announced that Apache Spark’s Structured Streaming is marked as production-ready and shared benchmarks to demonstrate its performance compared to other streaming engines. Structured Streaming is a novel way to process streams. Not only does this new way make it easy to build end-to-end streaming applications, […]

Developing custom Machine Learning (ML) algorithms in PySpark—the Python API for Apache Spark—can be challenging and laborious. In this blog post, we describe our work to improve PySpark APIs to simplify the development of custom algorithms. Our key improvement reduces hundreds of lines of boilerplate code for persistence (saving and loading models) to a single […]

GRAKN.AI is the database for AI. It is a distributed knowledge base designed specifically to handle complex data in a knowledge-oriented system — a task for which traditional database technologies are not the best fit. To ensure that their internal knowledge is the most up-to-date and relevant, AI systems are always hungry for newly updated […]

I've written a lot recently about the rise in genomic data and the applications being developed on top of this — for instance, a recent project featuring IBM and the New York Genome Center (NYGC), The Rockefeller University, and other NYGC member institutions. The work compared a number of techniques that are commonly used to […]

If, in your mind, "Apache Flink" and "stream programming" do not have a strong link, then you probably haven't been following the news recently. Apache Flink has taken the world of big data by storm. Now is the perfect opportunity for a tool like this to thrive: stream processing is becoming more and more prevalent in data […]

We've been tracking the growth of data created on the internet for several years, and have updated the information for 2017 to show you how much data is being created on the internet — every day! The Amount of Data Created Each Day on the Internet in 2017 In 2014, there were 2.4 billion […]

In this article, I will do market basket analysis with Oracle data mining. Data science and machine learning are very popular today. But these subjects require extensive knowledge and application expertise. We can solve these problems with various products and software that have been developed by various companies. In Oracle, methods and algorithms for solving […]
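
Independent of the Oracle tooling, the core market-basket measures are simple counts: the support of a rule A -> B is the fraction of baskets containing both item sets, and the confidence is the fraction of A-baskets that also contain B. A stdlib sketch with invented transactions:

```python
# Support(A -> B) = P(A and B); Confidence(A -> B) = P(B | A).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, fraction also holding the consequent."""
    return support(antecedent | consequent) / support(antecedent)

assert support({"bread", "milk"}) == 0.5          # 2 of 4 baskets
assert confidence({"bread"}, {"milk"}) == 2 / 3   # 2 of 3 bread baskets
```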

Bayesian nonparametrics is a class of models with a potentially infinite number of parameters. High flexibility and expressive power of this approach enable better data modeling compared to parametric methods. Bayesian nonparametrics is used in problems where a dimension of interest grows with data, for example, in problems where the number of features is not […]
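
A standard building block in this family is the Chinese Restaurant Process prior, under which the number of clusters is not fixed in advance but grows with the data. A seeded sketch (the sampling scheme is the textbook one; function and variable names are mine):

```python
import random

def crp(n, alpha, seed=0):
    """Draw cluster assignments for n points from a Chinese Restaurant
    Process with concentration alpha: point i opens a new cluster with
    probability alpha/(i+alpha), else joins cluster j w.p. counts[j]/(i+alpha)."""
    rng = random.Random(seed)
    counts = []        # number of points per existing cluster
    assignment = []
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        if r < alpha:                    # open a new cluster
            assignment.append(len(counts))
            counts.append(1)
        else:                            # walk the cumulative weights
            acc, j = alpha, 0
            while acc + counts[j] < r:
                acc += counts[j]
                j += 1
            assignment.append(j)
            counts[j] += 1
    return assignment

labels = crp(100, alpha=1.0)
# The number of distinct clusters is data-driven, not fixed a priori.
assert 1 <= len(set(labels)) <= 100
```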

If you've been following software development news recently, you probably heard about a new project called Apache Flink. I've already written about it a bit here and here, but if you are not familiar with it, Apache Flink is a new-generation big data processing tool that can process either finite sets of data (this is […]

This is the first post in a blog series dedicated to Apache Kafka and its usage for solving problems in the big data domain. Using a hands-on approach and exploring the performance characteristics and limits of Kafka-based big data solutions, the series will make parallels with road racing. The reason for this is twofold. First, […]

Financial services firms spent $6.4 billion on data-related programs in 2015, according to Accenture, which predicts this amount will grow at approximately 26% each year for the next two years. So, what are these firms spending their budget on? In a word: Transformation. Today, financial services companies not only need access to vast amounts […]

"Doing more with less" is the main philosophy of credit intelligence, and credit risk models are the means to achieve this goal. Using an automated process and focussing on the key information, credit decisions can be made in seconds and can eventually reduce operational cost by making the decision process much faster. Fewer questions and […]

It is no secret that data has been exploding in virtually every facet of every industry, from financial services, manufacturing, retail, and biotechnology to telecom, healthcare, transportation, and energy. In fact, more data was created in the previous two years than in the previous 5,000 years of humanity, and 2017 will create even more data in one year […]

When LinkedIn started growing its member base, the site’s functionalities got more complex day by day. In 2010, they decided to invest in redesigning the infrastructure to facilitate the blooming need of scaling their multiple data pipelines without much hassle. As a result, Kafka — a single, distributed pub-sub — was born to handle real-time […]
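
In miniature, the pub-sub shape described above reduces to append-only logs per topic with independent per-consumer read offsets. This toy in-memory sketch (not the Kafka API; all names are invented) shows why two consumers can read the same stream independently:

```python
from collections import defaultdict

class ToyLog:
    """Append-only log per topic with per-consumer offsets,
    mimicking the pub-sub shape (this is not the Kafka API)."""
    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)   # (topic, consumer) -> offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, consumer):
        """Return messages since this consumer's offset, then advance it."""
        key = (topic, consumer)
        msgs = self.topics[topic][self.offsets[key]:]
        self.offsets[key] = len(self.topics[topic])
        return msgs

log = ToyLog()
log.publish("clicks", {"user": 1})
log.publish("clicks", {"user": 2})
assert log.poll("clicks", "analytics") == [{"user": 1}, {"user": 2}]
assert log.poll("clicks", "analytics") == []   # offset has advanced
assert log.poll("clicks", "billing") == [{"user": 1}, {"user": 2}]
```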

At Return Path, I work on a data science team that uses machine learning and natural language processing to produce features that augment data feeds that we sell directly to clients. Return Path’s core business is in email marketing optimization. In a nutshell, we use data and analytics to help marketers optimize how and when […]

Amid the challenge of analyzing voluminous data, there are concerns about whether the conventional process of extract, transform, and load (ETL) is still applicable. ETL tools are quickly spreading across mobile apps and web applications, as they can access data very efficiently. Eventually, ETL applications will adopt industry standards and gain power.

Linear models — better known as linear regression — are one of the most common and flexible analysis frameworks to identify relationships between two or more variables. The widely used linear model is represented by drawing the best-fit line through a series of data points represented on a scatter plot. For any budding business analyst, […]
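
For a single predictor, the best-fit line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A stdlib sketch:

```python
# Ordinary least squares for y = slope*x + intercept (one predictor).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # line passes through the mean point

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                   # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
assert abs(slope - 2.0) < 1e-9 and abs(intercept - 1.0) < 1e-9
```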

Blockchain is having a huge impact on the global financial market and is transforming the financial services of many businesses and industries. In short, it changes the financial aspect of business sectors by providing a clear record book that can't be altered between two parties.

Thanks to Eric Mizell, V.P. Global Solutions Engineering at Kinetica, for taking me through how their GPU-accelerated database delivers new geospatial capabilities and enterprise-grade features for better performance and user experience (UX), as well as improved efficiency and use to meet the demands for real-time location-based analytics and analytic processing. “Kinetica’s GPU-accelerated database delivers at least 100x […]

I’m neck deep in research around data and APIs right now, and after looking at 37 of the Apache data projects, it is pretty clear that web APIs are not a priority in this world. Some of the projects have web APIs, and there are a couple of projects that look to bridge several […]

Indexing is a commonly used strategy among programmers. Without fully grasping the idea behind the technique, however, programmers are always eager to take advantage of it whenever they encounter a query performance problem, only to be disappointed by the result. By analyzing the principle of indexing, the article tries to show programmers the appropriate time […]
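
The trade-off behind that advice can be seen with a hash index over a list of records: the index costs one O(n) build pass, after which equality lookups on the indexed column skip the full scan. A small sketch with invented rows:

```python
# A query on an unindexed table is a full scan; an index trades one
# O(n) build for O(1)-average lookups on the indexed column.
rows = [{"id": i, "name": f"user{i}"} for i in range(10_000)]

def scan(rows, key):
    """No index: check every row."""
    return [r for r in rows if r["id"] == key]

index = {}                     # build once, reuse for many lookups
for r in rows:
    index.setdefault(r["id"], []).append(r)

# Both paths return the same rows; only the cost per query differs.
assert scan(rows, 4242) == index[4242] == [{"id": 4242, "name": "user4242"}]
```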

While developing scripts for a big data workflow, I was required to develop a master and child script execution scenario. In the master as well as the child script, we were passing parameters. Given my Java background and the fact that I use a self-developed command-line argument parser, I was feeling the pain of using […]