Tagged Questions

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

I am trying to load an HDFS file into my Spark program by providing the location as a string, but the string "hdfs://location/file" resolves to "hdfs:/location/file". I have tried a couple of different ...

I have a Hive table partitioned by date (e.g. 20150730).
Furthermore, I created a Hive query which consumes today's partition date and the most recent previous partition date which is not necessary to ...

I am using pandas to process information. The current workflow is to read the last 30 days of data from an HDF5 file, then add the latest data to it and perform some analysis.
I then need to append ...

I wrote an Apache Spark job that uses a configuration file. When I run this job locally, it works fine. But when I submit this job to a YARN cluster, it fails with a java.io.FileNotFoundException: ...

As I understand it, all files in HDFS are replicated. However, our jobs write certain log files that we do not want replicated, since that would maintain unnecessary copies. Is ...

We have a spooling directory source where we receive files named in the format
filename.yyyy-MM-dd.id.gz. We use HDFSSink to push them to HDFS as is.
We want to adjust the agent...sink.hdfs.path value in such ...

I have an 8-node Hadoop cluster (CDH 5.2.0), and I'm using all the nodes as datanodes.
I observed this rather peculiar behavior where no writes are happening on one of the data nodes and the disk ...

I am trying to load our data into Hadoop HDFS. After some test runs, when I check the Hadoop web UI, I see that a lot of space is consumed under the heading "Non-DFS used". In fact, "Non-DFS used" is ...

I have set up Sqoop to fetch data from a remote database, and it worked fine; I can see the data split into blocks in my HDFS.
Now I want to explore this data with RHadoop. I have my algorithm when ...

I'm loading a 28 GB file into Hadoop HDFS using WebHDFS, and it takes ~25 minutes to load.
I tried loading the same file using hdfs put, and it took ~6 minutes. Why is there so much difference in performance?
What ...

I am very new to HBase and started working with it very recently. On my Ubuntu server, standalone HBase works fine with ZooKeeper. However, while I am trying to work with Pseudo-Distributed Local, it's ...

Spark TaskLocations are further categorized into Executor cache location and HDFS cache location. From the name, I understand that the former refers to the cache held by the executor on a host, and ...

I have two logical data tables. The first contains raw financial data with about 500 million rows per day. The second is a reference data table containing about 10,000 rows per day, with some ...