What do you like best?

The big problem that Hadoop HDFS solves is storing and processing big data. It processes big data using MapReduce. Big data is a huge issue now, and Hadoop can solve it! So, this is what I like about Hadoop HDFS. I believe it's open source, so it doesn't really cost anything to install and start using it.

What do you dislike?

Honestly, there is nothing to dislike about Hadoop. Hadoop is a fairly new technology that is benefiting a lot of companies. But if you want to install it on your own PC, you need a high-spec machine: 16 GB of RAM and a good processor.

Recommendations to others considering the product

As I mentioned earlier, it greatly benefits organizations that need to store their data and process it whenever they want. Hadoop is already benefiting a lot of organizations with their data.

What business problems are you solving with the product? What benefits have you realized?

As I mentioned, Hadoop was introduced to solve the big data problem. My company and many other companies want to store data in HDFS, and when they want to process some of that data, Hadoop MapReduce comes into the picture!


What do you like best?

It is the Hadoop Distributed File System, written entirely in Java. It provides high scalability and redundancy. It stores both structured and unstructured data, and it also provides really fast data retrieval. Hadoop has a large community behind it, so many problems get solved quickly.

What do you dislike?

It is not a useful tool for beginners who want to make a career in big data analytics. MongoDB is still easier to use, and there are many tutorials available for learning MongoDB. Sometimes, because the tool is hosted locally, it falls short. The UI is also not up to the mark.

Recommendations to others considering the product

From startups to enterprises, for modern software or data analysis, use Hadoop HDFS. It lets you easily store structured as well as unstructured data and retrieve it easily for further use. Hadoop has a large number of software projects that support various functions in big data analytics, such as Flume, Hive, Pig Latin, Mahout, etc. Machine learning can also be done on big data in the Hadoop ecosystem.

What business problems are you solving with the product? What benefits have you realized?

Hadoop HDFS, as the name suggests (Hadoop Distributed File System), allows users to store any large amount of data, which can then easily be used for analysis. My business analytics are all done through Hadoop, which helps the business grow.

What do you like best?

1. Storing files in sequence file format, using key-value pairs - Stores the file name as the key and the content as the value, and encrypts them.

2. 128 MB block size - Previously it was 64 MB, which was too small. But some people still like the old block size.

3. Storing multiple copies of data - The same data is stored on multiple nodes, which helps if a single node fails. As we all know, the whole purpose of HDFS is to use commodity hardware and still keep the service available.

4. Horizontal scaling and distributed architecture - This lets the system grow without worrying about existing data. You can increase the number of nodes if the amount of data changes drastically.
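The block size and replication points above can be illustrated with a little arithmetic. This is only a sketch: the 128 MB block size comes from the review, while the replication factor of 3 and the function name `block_layout` are assumptions made for the example.

```python
import math

BLOCK_SIZE_MB = 128      # block size mentioned in the review
REPLICATION_FACTOR = 3   # assumed for this example (a commonly used value)

def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (number of blocks, raw MB stored across the cluster)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block is not padded, so raw usage is simply size * replicas.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

print(block_layout(1000))  # a 1000 MB file -> (8, 3000)
```

With the older 64 MB block size the same file would need 16 blocks, which means twice as many block entries to track for the same data.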

What do you dislike?

1. MapReduce jobs take a long time even for small amounts of data, because map and reduce tasks are created for every job - It's recommended to use a relational database for small amounts of data in the GB range.

2. The immutable nature - Files in HDFS cannot be altered; more specifically, existing content cannot be modified. Append is allowed, though.

Recommendations to others considering the product

Yes, it is recommended for bulk data.

What business problems are you solving with the product? What benefits have you realized?

We have developed a module to push .eml files to HDFS and store them in a sequence file.

Data retrieval is much faster compared to a traditional file system.

What do you like best?

HDFS is fault tolerant because it keeps multiple replicas of the data stored in it, which makes it more reliable. Compared with traditional file systems, it is also much more robust and efficient when working on bulk data, as the data is processed via MapReduce in the back end. Another major advantage is that HDFS can easily run on commodity hardware, making the initial setup cost very low. The commands used to manage files in HDFS are almost the same as those used in a shell, which makes them easy to pick up.

What do you dislike?

One of the major drawbacks of the Hadoop file system is that if the amount of data to be dealt with is small, it won't be efficient, as the time for processing small data is equivalent to the time taken for bulk data.

By default, the security measures in Hadoop File system are disabled, making it insecure for data storage.

Recommendations to others considering the product

Those who are dealing with bulk data in offline applications can surely go for the Hadoop file system, but for those dealing with small amounts of real-time data, it is not recommended.

What business problems are you solving with the product? What benefits have you realized?

We are moving data from a traditional file system to Hadoop storage. As the amount of data is increasing day by day, we require a platform to manage this bulk data in a better, more robust and cost-efficient manner.

What do you like best?

Hadoop can take loads of data from disparate sources quickly and performs well under demanding conditions with multi-server configurations.

Hadoop is customizable, so nearly all of our business objectives can be met with the right combination of data and reports.

Very scalable product for a practically unlimited number of rows and a large number of parallel processors through dynamic clustering. The product is also very economical in comparison to SAS.

What do you dislike?

Less organizational support. Bugs that need outside help take a long time to get fixed and pushed as an update.

It does not come with much business knowledge built in, so the container needs a lot of programming to make it usable for a specific use case.

Recommendations to others considering the product

Use it if you have a heavy dataset (most other solutions will not work in this case). It is economical in the medium to long term, although the implementation cost is higher than for most other cloud-based solutions.

What business problems are you solving with the product? What benefits have you realized?

What do you like best?

The Hadoop Distributed File System is a distributed, scalable, fault-tolerant and very efficient data storage platform. It is used to store data and can support data processing frameworks like MapReduce and Spark. The best thing about HDFS is that it can be combined with many different tools to create a solution. Another great thing about HDFS and the Hadoop framework is that for training purposes we can even create a single-node cluster on a laptop.

What do you dislike?

There is not much to dislike, but speed drops when we deal with small files. If there are a lot of small files to save, the NameNode comes under pressure because it has to keep an entry for each of those files. The metadata grows, and hence performance decreases.
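The small-files effect described above can be put into rough numbers. The sketch below assumes the often-quoted rule of thumb of about 150 bytes of NameNode memory per file or block entry; that figure, the 128 MB block size and the function names are assumptions for illustration, not measurements.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb per file/block entry

def namenode_objects(total_mb, file_size_mb, block_size_mb=128):
    """Count file entries plus block entries for total_mb of data
    stored as equally sized files of file_size_mb each."""
    files = math.ceil(total_mb / file_size_mb)
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return files + files * blocks_per_file

def namenode_heap_mb(total_mb, file_size_mb, block_size_mb=128):
    """Very rough NameNode memory estimate in MB."""
    objects = namenode_objects(total_mb, file_size_mb, block_size_mb)
    return objects * BYTES_PER_OBJECT / 1e6

one_tb = 1_000_000  # MB
print(namenode_objects(one_tb, 1))     # 1 MB files: 2000000 entries
print(namenode_objects(one_tb, 1024))  # 1 GB files: 8793 entries
```

The same terabyte needs a couple of hundred times more metadata entries when it arrives as 1 MB files, which is the pressure on the NameNode that the review mentions.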

Recommendations to others considering the product

HDFS is a must-use solution, as it gives us a complete storage solution, and so far we have not found anything that can replace it. It can be used with Spark as well as MapReduce, for real-time analysis as well as batch processing. We can also store data using different compression techniques, which is a very good thing.

What business problems are you solving with the product? What benefits have you realized?

We are using HDFS as a storage solution for large data that comes from a legacy system. We use it with the MapReduce and Spark frameworks to do some analysis of the data. We use HBase on top of it, and also Apache Phoenix. It also serves as storage for the intermediate and final outputs of our MapReduce programs.

What do you like best?

HDFS, or Hadoop Distributed File System, is the storage component of Hadoop, where all the data resides at the end of the day. It is to Hadoop what a hard disk is to a computer, but it is actually a type of file system that allows users to store data.

HDFS is very cost efficient. It is also fault tolerant, as it keeps replicas of the data stored in it, thus maintaining a backup of all the files on commodity hardware.

What do you dislike?

The major drawback of Hadoop is the lack of security measures for sensitive information. This may not apply purely to HDFS, but HDFS, being a component of Hadoop, falls under this category.

Also, it takes time to process small amounts of data, which makes it less suitable for small data than for bulk data.

Recommendations to others considering the product

For storing huge data and managing it in a robust and cost-efficient manner, I would surely recommend HDFS. But if the amount of data to be dealt with is small, then the Hadoop file system is not recommended, as it takes some time to process.

What business problems are you solving with the product? What benefits have you realized?

In our business, we are moving from a traditional file system to the Hadoop file system, as the amount of data is growing day by day. To handle this situation, we are moving to Big Data, to manage this bulk data in a robust and efficient manner. Also, the cost of installation is very low, as Hadoop works on commodity hardware, and keeping replicas of files makes it fault tolerant.

What do you like best?

HDFS is inexpensive for two reasons. First, the filesystem relies on commodity storage disks that are much less expensive than the storage media used for enterprise-grade storage. Second, the filesystem shares the hardware with the computation framework, in this case MapReduce. Also, HDFS is open source and does not levy a licensing fee on the user.

HDFS has been around for more than 7 years and is considered mature technology. There is a large community behind it and a broad range of organizations that are storing petabytes of data on HDFS.

HDFS is optimized for MapReduce workloads. It provides very high performance for sequential reads and writes, which is the typical access pattern in MapReduce jobs.

What do you dislike?

The main drawback of HDFS is that it is not POSIX compliant. In particular, files are effectively immutable: once written, they cannot be modified in place.

What business problems are you solving with the product? What benefits have you realized?

We have around 50 GB of data generated per hour per colo (we operate from 4 colos). Hadoop is used at InMobi to make sense of this data.

What do you like best?

The name itself says it: Hadoop Distributed File System. It's a file system of its own, like FAT, FAT32, NTFS, ext4 or whatever other file system we have seen. It is a file system to store data.

It is better than any other storage system because of the simple fact that it does not rely on any other file system to store the data.

The other thing that I like is that it can be scaled to any extent and is capable of handling petabytes of data. It has the capability to store data and make it available to any large resource.

And the Ambari UI is awesome to work with.

What do you dislike?

The only issue is that it is built in Java, so it's very complex to start working with this solution; it requires a lot of experience to start working with Hadoop.

Recommendations to others considering the product

Yes. The name itself tells you that you should go with this solution; this is the future of data analytics.

What business problems are you solving with the product? What benefits have you realized?

We had a huge database and wanted to store the digital data as well as the data stored in tables and CSV files.

What do you like best?

HDFS benefits from a vibrant community of passionate open-source software contributors who have made it the filesystem of choice for users trying to get fault-tolerance and performance without vendor lock-in. It also has a number of easy-to-use access points (the HDFS shell, Java API, Thrift, and REST being the most popular), which means you can reach your data through whatever means you'd like.

What do you dislike?

HDFS is not the easiest distributed filesystem to use and a number of design decisions made have led to some believing that it's not as performant as it could be since it tries to be everything for everyone. Look elsewhere if you have a very specific use-case as far as availability is concerned, for example.

What business problems are you solving with the product? What benefits have you realized?

My company integrates a variety of data sources through workflows that include HDFS as both a source and a destination. The vast ecosystem of access points has made it among the smoothest parts of our architecture to incorporate.

What do you like best?

Well... HDFS files are write-once files. I think of HDFS files as write-once, read-many files. There is no concept of random writes. It is optimized for streaming access to large files. I would typically store files that are hundreds of MB or larger on HDFS and access them through MapReduce to process them in batch mode.

What do you dislike?

HDFS doesn't do random reads very well. A caveat to remember: HDFS is a distributed file system that Hadoop abstracts on top of the local file system, suitable for storing huge files; however, it does not provide tabular storage as such.

Recommendations to others considering the product

MUST UNDERSTAND!

HDFS is not a NoSQL database. Both may serve the purpose of storing huge data in a distributed manner.

The key difference is:

In HDFS, it's easier to store the data, and it retrieves entire rows, i.e. there is no specific key-based data access.

What business problems are you solving with the product? What benefits have you realized?

Trying to understand how to merge files without copying them down locally, using the built-in Hadoop commands. I am currently writing a MapReduce tool that uses the IdentityMapper and IdentityReducer to re-partition the files. Maybe I will merge all files into a single file on HDFS by running the job with just 1 reducer. If, on the other hand, I want to partition the files into more parts, I will possibly run the job with more reducers.
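The re-partitioning idea above (an identity map and identity reduce, with the number of reducers deciding the number of output files) can be simulated in a few lines. This is a plain-Python sketch, not Hadoop code: `repartition` is a hypothetical helper, keys are assumed to be integers, and `key % num_reducers` stands in for a hash partitioner.

```python
from collections import defaultdict

def repartition(input_files, num_reducers):
    """Simulate an identity map/reduce job.

    input_files: list of files, each a list of (int_key, value) records.
    Returns one output "file" (a sorted list of records) per reducer.
    """
    outputs = defaultdict(list)
    for records in input_files:
        for key, value in records:        # identity map: emit the record as-is
            part = key % num_reducers     # assumed stand-in for a hash partitioner
            outputs[part].append((key, value))
    # Each reducer sees its keys in sorted order and writes exactly one file.
    return [sorted(outputs[p]) for p in range(num_reducers)]

files = [[(1, "a"), (2, "b")], [(3, "c")], [(4, "d")]]
print(len(repartition(files, 1)))  # 1 reducer  -> 1 merged output file
print(len(repartition(files, 2)))  # 2 reducers -> 2 output files
```

Running with one reducer merges everything into a single output, while more reducers split the same records into more parts, which is the trade-off the reviewer is weighing.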

What do you like best?

All kinds of tools run on top of HDFS: Pig/Hive (which reduce the time spent writing MR jobs), Sqoop (to transfer data between an RDBMS and HDFS), NoSQL databases that can use it as their storage FS (e.g. HBase), and many more are available currently. Plus, many big organizations (Cloudera, MapR, Hortonworks, to name a few) are actively contributing to the HDFS ecosystem.

What do you dislike?

For a beginner or first-timer, it is a bit difficult to set up HDFS in cluster mode. Currently Hadoop uses YARN as its resource manager, which was developed with very narrow thinking; it could be enhanced by looking at Mesos. The data from every stage (intermediate results) is stored on disk, so for stream processing HDFS is the worst choice.

Recommendations to others considering the product

1. For stream processing, don't even look at this.

2. Setup and maintenance are quite difficult.

3. The ecosystem is great, and the community is actively adding good features on top of HDFS.

What business problems are you solving with the product? What benefits have you realized?

I have to develop a rich analytical dashboard for our business client.

The benefits:

1. For batch processing with fault tolerance, it's simply the best.

2. Spring integration with Hadoop is no problem at all.

3. We have to use the column-based NoSQL database HBase to support the dashboard analytics, and HBase on top of HDFS works like a charm.

What do you like best?

Automatic replication, stable, compatible with/required for the rest of the Hadoop ecosystem. It's pretty easy to manage. Rack Awareness ensures that losing a single rack doesn't result in the loss of data. Overall, it's kind of an awkward thing to write a review about, since it is really an enabling technology. However, when combined with an analytical tool that can take advantage of HDFS (like Map/Reduce, Hive, Pig, etc.), HDFS shows its value. I'm also not aware of any alternatives to using HDFS with these tools.
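To see why rack awareness protects against losing a whole rack, here is a simplified sketch of three-replica placement: the first copy on the writer's node, the other two on different nodes of one other rack. It loosely follows the placement policy described in the HDFS documentation, but the function name, the topology format and the requirement that some other rack has at least two nodes are assumptions of this example.

```python
import random

def place_replicas(topology, writer_node, rng=random):
    """Pick three nodes for one block.

    topology: dict mapping rack name -> list of node names.
    Assumes at least one rack other than the writer's has two or more nodes.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer_node                                   # replica 1: local node
    local_rack = rack_of[first]
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    remote_rack = rng.choice(remote_racks)                # replicas 2 and 3: one other rack
    second, third = rng.sample(topology[remote_rack], 2)  # two distinct nodes there
    return [first, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(topology, "n1"))
```

Because the replicas always span two racks, taking any single rack offline still leaves at least one copy of the block reachable.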

What do you dislike?

It's purpose-built for Hadoop and large-scale data processing. That said, it doesn't really work well as a general-purpose filesystem. You can mount it with NFS, but realize that as bad as NFS is, NFS on HDFS is worse. Don't get caught in this trap. For the most part, stick to interacting with it using Hadoop's tools, not generic filesystem tools over NFS.

Recommendations to others considering the product

It's basically a requirement for Hadoop Map/Reduce, Hive, Pig, and many other tools in the Hadoop ecosystem. Don't replace your SAN with HDFS, however, as there are key features which are missing for that purpose.

What business problems are you solving with the product? What benefits have you realized?

Large scale data processing. HDFS enables Hadoop to bring the code to the data as opposed to shipping the data to the code (like a massive DB server that has storage arrays). Redundancy and Scale.
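"Bringing the code to the data" can be pictured as a scheduler that prefers to run each task on a node already holding the block it needs. The sketch below is a deliberately naive greedy version with hypothetical names; real Hadoop schedulers are far more elaborate.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring nodes that hold a replica.

    block_locations: dict block -> list of nodes holding a replica.
    free_slots: dict node -> number of free task slots.
    Returns dict block -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    plan = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True    # run where the data already is
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            node, is_local = candidates[0], False  # fall back to a remote read
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan

blocks = {"b1": ["n1"], "b2": ["n2"], "b3": ["n1"]}
print(schedule(blocks, {"n1": 1, "n2": 1, "n3": 1}))
```

In this toy run the first two tasks read their blocks locally, and only the third has to pull its block over the network because the node holding it is already busy.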

What do you like best?

After working with Hadoop for 6 years, I like the direction in which it has evolved. It started as a tightly coupled offering of a distributed data platform (HDFS) and an analytics processing framework (MapReduce); however, it has since expanded its scope tremendously. After the initial success of MapReduce, it became quite clear that it has many limitations as a processing model. Other frameworks such as Apache Spark have far surpassed its capabilities. Seeing that trend, the Hadoop team instead chose to focus on Hadoop as a base platform for dozens of data and analytics offerings. This led to the strengthening of the already robust HDFS and the creation of YARN as an applications framework. This gives the Hadoop platform widespread appeal and makes it a great basis for any big data processing platform. Many tools exist for standing up entire Hadoop clusters in very little time, and it treats scaling and fault tolerance as first-order priorities.

What do you dislike?

There are two major areas where Hadoop could use improvement; both have existed since the beginning and continue to be a problem for implementing Hadoop in real-world settings. The first is poor documentation on performance and tuning. Hadoop works fairly well out of the box, but once you start to encounter problems there are very few resources for troubleshooting them. Hadoop has been around long enough as an open source project that common configuration strategies and troubleshooting techniques should be built into the documentation. Second, and more important for many users, is the lack of security options for Hadoop. There are few if any built-in options, and any pluggable solutions are fairly difficult to implement.

Recommendations to others considering the product

Hiring experienced DevOps talent is more essential than analytic developers. The skills for analytic developers can be trained faster than those of skilled DevOps team members.

What business problems are you solving with the product? What benefits have you realized?

I have mostly been focused on large-scale analytic workflows for our clients. We have utilized Hadoop and its ecosystem components like Hive, Pig, Spark, Flume, Oozie, and others to implement scalable workflows for ETL and analytic applications. Hadoop enables the scale and reliability that are an absolute requirement of our customers. We have also used Hadoop as a basis for streaming analytics using YARN and Spark for real-time data analysis.

What do you like best?

It's currently the best distributed file system for implementing big data projects. The main reason is that HDFS is fully integrated with many parallel computing platforms for doing big data analysis: Map/Reduce, Spark, Impala, Drill.

It's very mature, stable and really robust, optimised for streaming data to the application layer.

It now provides full support for security: POSIX-like permissions, POSIX ACLs and directory-based encryption.

What do you dislike?

It works very badly with many small files. HDFS is optimised for dealing with a relatively small number of very big files. This is an annoying limitation that forces many architectural decisions in order to support data ingestion effectively.

What business problems are you solving with the product? What benefits have you realized?

We solve data science problems on huge amounts of data. HDFS, as part of the bigger Hadoop ecosystem, provides all the pieces for ingesting, transforming and analysing vast amounts of data using advanced parallel platforms.

What do you like best?

Hadoop HDFS is proven to be scalable and stable enough for big data processing. I have over 6 years of experience in product development and operations on HDFS, storing and processing terabyte-scale data. Hadoop HDFS handles scale problems well, and most problems can be solved.

What do you dislike?

HDFS is optimized for big files and batch-oriented data processing. Users really need to pay attention to the "small files problem" and avoid producing large numbers of small files. This will eventually kill HDFS.

What business problems are you solving with the product? What benefits have you realized?

We use HDFS to store web logs for a recommendation system, call detail records for a carrier and device logs for a high-tech manufacturer. The benefits of HDFS are its scalability, the relatively low cost of storing data and the ability to leverage MapReduce, Hive, Impala and Spark for further data analysis.

What do you like best?

I like MapReduce code a lot: mappers and reducers and how the overall hierarchy fits together. Hadoop is a platform that I chose a year ago, and I am still in love with it because of its simplicity for solving complex problems involving very large databases. I also worked with Apache Giraph, which does graph processing, and it was a great experience learning a whole new Hadoop product. I like solving real-life challenges using Hadoop, like predicting earthquakes so that the results could be less devastating. This is one of my proposed ideas, but there are many more like it that I like a lot.

What do you dislike?

I dislike the numerous products being developed around the Hadoop architecture, because a developer can never learn all the products based on Hadoop. He or she can learn only the few that are used very extensively. So why not integrate the less-used products into the most-used ones, so that those gain the added functionality?

Recommendations to others considering the product

If you are considering switching to the Hadoop platform, don't set up Hadoop from scratch during the learning phase. Instead, concentrate on learning and download a pre-installed Hadoop VirtualBox image from Cloudera, Hortonworks, MapR, etc. It took me around a month to fully configure single-node and multi-node clusters because I implemented them from scratch. Had I used the virtual images mentioned above, it would have helped me a lot and saved me time as well.

What business problems are you solving with the product? What benefits have you realized?

Actually, I was just learning Big Data and Hadoop, but soon I fell in love with it. Recently I did a project on it, reviewing the big sales data of NY stores and processing Udacity's discussion forums, which is a great way to analyze data and produce all the statistics.

What do you like best?

HDFS supports features such as partitioning and replication that are mandatory in a distributed environment. Of course, many optimizations should be done over the next years, but the main concept will always be the same: move the code to the data and keep your data safe regardless of failures. What I like best in HDFS is the user interface, which is pretty similar to a common local Linux filesystem. Moreover, HDFS is very compatible and can be integrated with the majority of the frameworks that are used today, like Hadoop and Spark. Last but not least, HDFS is open source and a huge community supports it.

What do you dislike?

HDFS has some disadvantages as well. First of all, I would really like it to be more customizable and to provide more features in the user interface. That way, users would be free to play and experiment with new ideas that could be integrated with HDFS. I have also observed that you have to be an expert to use it securely in your application, and there is not much documentation about how to achieve this specifically in HDFS.

Recommendations to others considering the product

There are a lot of systems with which someone can do the job, but HDFS will always be the most open one. It is also a very good choice for someone who is completely inexperienced with distributed programming, because there is a lot of documentation on the Internet.

What business problems are you solving with the product? What benefits have you realized?

The most common problem that I am trying to solve is big data management and the application of plenty of algorithms to this kind of data. I am currently also trying to integrate it with a new architecture that I am working on.

What do you like best?

HDFS usually runs on top of commodity hardware, where failures can be common. I really like the fault-handling feature of HDFS because it can accommodate failures and still run MapReduce jobs in parallel with lightning speed.

What do you dislike?

Setting up HDFS is extremely painful, especially dealing with all the permissions and ownerships. In addition, there are so many products being developed nowadays that it's extremely hard to keep up with them.

What business problems are you solving with the product? What benefits have you realized?

We have developed an application which shows data map and data flow from source to target. We use HDFS to store enterprise big data and then use proprietary software on top of HDFS and Titan to achieve that.

What do you like best?

Distributed file storage is made easy with HDFS. I don't need to know where the files are physically stored on the servers, because HDFS exposes all the files as if it were a single storage with multiple backups (depending on your replication factor). The HDFS API is straightforward to use.

What do you dislike?

Configuration. Getting HDFS running might be easy or complicated depending on your experience. We are using Hadoop together with Cloudera, so it was really easy for us to get started. However, as with any other Hadoop component, fine-tuning HDFS can be tricky. Debugging HDFS can also be tricky, like when HDFS suddenly doesn't allow writes because it is in safe mode. At the time I was using Hadoop, getting Hadoop to work with HA was also challenging; the NameNode was a single point of failure. HDFS also doesn't work well with lots of small files. For the average user, it can be daunting to access HDFS (I think HDFS has a web app with limited functionality); for developers it would be no issue.

Recommendations to others considering the product

HDFS is a great tool if you're looking for a proven solution for file storage that offers distributed storage and file backups. However, HDFS is just a file system and nothing more than that. I've got clients who think HDFS is like magic: put files into HDFS and analytics come out. Hadoop is prone to failure, so having someone who knows Hadoop inside and out is a great plus.

What business problems are you solving with the product? What benefits have you realized?

I was building an internal tool for managing and analyzing logs for business intelligence. We used logs as our source to train a machine learning algorithm to detect system failures.

What do you like best?

Hadoop is a collection of software that provides a distributed file system (HDFS) and a distributed processing mechanism on top of it (MapReduce). It is highly scalable and reliable. With Hadoop, users can specify their processing requirements on large datasets without worrying about the details of the underlying communication and data distribution. Hadoop can scale up easily to adapt to workload increases. The automatic data replication mechanism in HDFS guarantees its reliability.

What do you dislike?

Hadoop is written in Java and it is not fast. It cannot handle data processing requests in real time. Its processing layer, MapReduce, simplifies the processing logic by supporting only a Map and a Reduce function, but this also makes it inconvenient to express complicated processing logic.

Hadoop adopts a master-slave architecture, but the master is designed in single-node mode: when the master node is down, it is difficult to recover. Users have to purchase high-end hardware to prevent master-node failures.
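For readers who have not seen the two-function model, here is a minimal word count written in plain Python rather than against the Hadoop API. The names `map_fn`, `reduce_fn` and `run_job` are made up for this sketch, and the in-memory grouping stands in for the shuffle that Hadoop performs across machines.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    groups = defaultdict(list)
    for line in lines:                     # map phase
        for key, value in map_fn(line):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one call per distinct key, in sorted key order
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

print(run_job(["big data", "Big files"]))  # {'big': 2, 'data': 1, 'files': 1}
```

Anything that does not fit this map-then-reduce shape, such as iterative algorithms, has to be expressed as a chain of such jobs, which is the inconvenience the review points out.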

Recommendations to others considering the product

If you have data that are large in size, use Hadoop. The initial setup and trial is simple; and you can figure out easily whether it is a good solution to your data processing requirements. Why not give it a try?

But Hadoop is not a solution for all big data problems. It cannot handle interactive, iterative, and real-time processing well.

What business problems are you solving with the product? What benefits have you realized?

By using Hadoop, we can explore much larger datasets and find the hidden essence in them in order to provide better service.

Developing, debugging, and deploying the tools has become easier than before, and the scale of processing has expanded significantly.

What do you like best?

Hadoop is a no-brainer for big data. The main killer feature is HDFS. Having a redundant WORM file system is amazingly useful. There's a reason Google invented it and Yahoo made it open source. 3 copies of your file just works. Cheap commodity servers, but still fast and stable. Never lose data, store everything. Access and use in multiple use cases. So many tools and other projects around Hadoop make it a must-have for all enterprises and startups. You add Spark, which most distributions include, and you can pretty much do everything you need. Ambari and Hue make it easy to set up now.

What do you dislike?

There's a lot of stuff in Hadoop; there are always 10 ways to do something and it's hard to know which is best. Do you use Storm or one of 20 other frameworks? Should I store in Parquet, ORCFile, Avro, CSV or something else? Do you compress with Snappy or nothing? What level of encryption? Is Kerberos good enough for my security? Security is a bit lax and there are definitely a lot of things to configure.

Recommendations to others considering the product

Try it out in one of the sandboxes. It's very easy to install with Ambari. The sandboxes are all set up and running with all the basic tools. Try the HDFS CLI and copy a few files into HDFS. Then try to access them through the CLI and through some basic HiveQL. It's easy to load, transform and query your data. Easy to pull it out of SQL and drop it in HDFS. The only hard part is to figure out what tools to use for BI and for imports.
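A first CLI session along the lines suggested above might look like the following, wrapped in a small Python helper so the commands can be listed before they are executed. The `hdfs dfs` subcommands used here (`-mkdir -p`, `-put`, `-ls`, `-cat`) are standard; the helper names and the example paths are hypothetical.

```python
import posixpath
import shutil
import subprocess

def hdfs_smoke_test_commands(local_file, hdfs_dir):
    """Build the commands: make a directory, upload a file, list it, read it."""
    target = posixpath.join(hdfs_dir, posixpath.basename(local_file))
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_file, hdfs_dir],
        ["hdfs", "dfs", "-ls", hdfs_dir],
        ["hdfs", "dfs", "-cat", target],
    ]

def run(commands):
    """Run the commands if the hdfs CLI is on PATH, otherwise just print them."""
    if shutil.which("hdfs") is None:
        print("hdfs CLI not found; commands that would run:")
        for cmd in commands:
            print(" ".join(cmd))
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first failing command
    return True

commands = hdfs_smoke_test_commands("/tmp/sample.csv", "/user/demo")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run(commands)` inside a sandbox executes the four steps in order; on a machine without Hadoop it only prints them.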

What business problems are you solving with the product? What benefits have you realized?

Storing everything, accessing everything, not losing data and rapid access to big and fast data. It's great for BI and for applications.

What do you like best?

HDFS is highly available and scalable; I can expand the storage just by adding several DataNodes. And with HDFS, genome data can be easily analyzed by MapReduce.

What do you dislike?

HDFS is not so good for small files, and the NFS gateway is also not very good.

Recommendations to others considering the product

I think Hadoop has a very good community. Although HDFS still has some bugs (I think Hadoop YARN may have more bugs, especially in the docker-container-executor), I think it will get better.

What business problems are you solving with the product? What benefits have you realized?

When using Java, it is not so easy to manipulate files in HDFS with the Hadoop API. I found an open-source project, jsr203-hadoop (https://github.com/damiencarol/jsr203-hadoop), that makes things simple. One can read and write HDFS files via the NIO API in JDK 1.7. At the time I found a small bug in the project when I was trying to move a file. I fixed the bug and the author (damiencarol) kindly merged my code.

What do you like best?

HDFS is the Hadoop Distributed File System. The thing I like best about HDFS is the reliability I get with Hadoop: its file replication is great and there is very little chance of your data being lost. To get the most out of Hadoop, keep the file size big, at least 100 MB per file. Then you will realize the power of Hadoop: a fault-tolerant file system and so on.
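
A rough Python sketch of why big files help: the NameNode tracks every file and every block in memory, so many small files cost far more metadata than a few large ones holding the same data. The 128 MB block size is the default in Hadoop 2.x and later; the 150 bytes per object is a commonly quoted rule of thumb, not a measured value, and the function names and file counts are invented for this example.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: default dfs.blocksize in Hadoop 2.x and later
BYTES_PER_OBJECT = 150   # assumption: rule-of-thumb NameNode memory per file or block


def namenode_objects(num_files, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count namespace objects: one per file plus one per block of that file."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return num_files * (1 + blocks_per_file)


def namenode_heap_mb(num_objects, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap (MB) used by the given number of objects."""
    return num_objects * bytes_per_object / (1024 * 1024)


if __name__ == "__main__":
    # Roughly the same 1 TB of data, stored two different ways.
    small = namenode_objects(num_files=1_000_000, file_size_mb=1)
    large = namenode_objects(num_files=1_000, file_size_mb=1024)
    print(small, round(namenode_heap_mb(small), 1))
    print(large, round(namenode_heap_mb(large), 1))
```

Under these assumptions the million small files need a couple of hundred times more NameNode objects than the thousand large ones.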

What do you dislike?

It's a little bit slow, but then what is not slow when it comes to big file systems that work with terabytes of data? Even Amazon S3 is extremely slow when reading data from it. Other than that, it might be a little tricky to find documentation for new users and new features.

Recommendations to others considering the product

Get it from some free Apache distributor. Don't try to get it from Apache directly, as you might face some trivial issues.

What business problems are you solving with the product? What benefits have you realized?

Mainly focusing on large-scale data storage with data duplication, file redundancy, scalability, etc. Used in conjunction with other big data components like Hive and Pig for ETL and analytics applications.

What do you like best?

Well, I am using Hadoop HDFS as the filesystem for HBase. I found it's really easy to deploy. I use Cloudera Manager as the Hadoop package, which makes it even easier. If you have a lot of nodes, then you will truly get power from Hadoop HDFS.

What do you dislike?

It's quite troublesome to tune HBase and HDFS. At first, when we had few nodes, it didn't look great, but as we added more nodes, performance improved. Still, there is a lot of tinkering to do.

Recommendations to others considering the product

Use a good package; don't use a bare install.

What business problems are you solving with the product? What benefits have you realized?

What do you like best?

It distributes data and computation. Keeping the computation local to the data prevents network overload.

Partial failure is easy to handle: entire nodes can fail and restart. It avoids the crawling horrors of failure-tolerant synchronous distributed systems. Speculative execution works around stragglers.

What do you dislike?

1) Rough edges: Hadoop MapReduce and HDFS are rough around the edges because the software is under active development.

2) The programming model is very restrictive: the lack of central data can be limiting.

3) Joins of multiple datasets are tricky and slow: no indices! Often the entire dataset gets copied in the process.

What business problems are you solving with the product? What benefits have you realized?

One advantage of using Hadoop in contrast to other distributed systems is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly stellar performance, as the overhead involved in starting Hadoop programs is relatively high.

What do you like best?

Hadoop is a very popular big data framework. Hadoop is based on MapReduce, which makes it useful for big datasets. Hadoop can be used for almost any requirement involving huge data, and also when data is unstructured. The open source community has built tons of tools around it and evolved it into an ecosystem.

What do you dislike?

I don't see any dislikes here; the only thing people get confused about is thinking it's the right thing for solving every problem. Well, it's not.

Recommendations to others considering the product

If you say yes to most of these questions, then Hadoop is recommended:

1. Data size - do you have terabytes to petabytes of data?

2. How much time can you wait? Hadoop is not an instant querying tool.

3. What data growth is expected?

4. Can I manage without any real-time operations?

5. What percentage of your data is structured? The lower the better.

What business problems are you solving with the product? What benefits have you realized?

Massive data collection, storage and analytics. It is extremely cheap to get this up and running. It does not need fancy hardware and it's open source. If you are worried that it's open source and are looking for support, there are enterprise Hadoop flavors from Cloudera, Hortonworks and MapR.

What do you like best?

HDFS is reliable and solid, and in my experience with it there are very few problems using it. If you have your own data centre and you use Hadoop, it's the obvious choice for reliably storing your data.

What do you dislike?

If your NameNodes all go down, then HDFS is pretty much useless as you won't know which file blocks are where and which files they belong to -- and I've read it's difficult to recover (or impossible) if you completely lose your NameNode file mappings. Fortunately I've never personally seen this occur.

Recommendations to others considering the product

Again, you get it for free if you have your own Hadoop installation and run your own datacentre, so you might as well use it for archiving/storage/input to various ETL. Even if you're "in the Cloud" you usually have access to HDFS, even ephemerally, and it's quicker to do work on it directly than some systems such as Amazon S3 (of course you still need to persist your data back off of HDFS when you're done in such a situation).

What business problems are you solving with the product? What benefits have you realized?

HDFS is used for MapReduce processes, Hive tables, Spark job input, for backing up data... The list goes on. You get replication for free, which is also very useful.

What do you like best?

Well, I like the basic idea: it is a distributed filesystem used to store and transform large datasets. Since we faced the problem of storing and processing multi-terabyte datasets, it is only natural to use HDFS.

What do you dislike?

Well, it is trivial to fool HDFS security, and it was completely ineffective. You have to rely on additional tools, such as Kerberos, which adds complexity for your company.

Recommendations to others considering the product

Hire a person whose only responsibility is to manage this zoo. You will never use HDFS in isolation (see the security concerns), and at some point the cost of managing all the Apache big data projects will be a significant burden on your talent pool. It is also wise to take advantage of existing bundles (Cloudera, Hortonworks).

Otherwise the risk of not going beyond experimentation phase is quite significant.

What business problems are you solving with the product? What benefits have you realized?

The main business problem was to crawl and store Polish websites. We used the CommonCrawl dataset as a source, the Akka framework for highly parallel data processing (it was a "trivially parallelizable" problem), and HDFS with Cassandra and Apache Spark for storage and processing.

What do you like best?

What makes a successful platform is scalability and reliability. These are two traits Hadoop handles perfectly. When you work with large datasets, Hadoop helps you process them without worry.

HDFS is the best distributed file system for large datasets. It comes complete with integrations with many parallel computing platforms. It is the tool you need to scale with huge datasets, and it does magic for us.

What do you dislike?

As with any tool, Hadoop is not a silver bullet for all data-related tasks. In cases where the dataset is small or involves transactions, one can feel that Hadoop is not up to the task.

But no complaints here, as we have to realise that Hadoop is not built for small datasets or transactional data.

What business problems are you solving with the product? What benefits have you realized?

What do you like best?

It is extremely flexible and able to handle the largest data sets, while the Map/Reduce pattern makes it easy to reason about program behavior.

What do you dislike?

Although it's getting better, it would be nice to improve the Streaming API.

Recommendations to others considering the product

Others have mentioned a lower limit of 1 TB for data, which I generally agree with, although I might say that you should try to stay within conventional systems up to 3-5 TB if possible. Smart indexing and sharding can even take you past that.

What business problems are you solving with the product? What benefits have you realized?

The business problem was extracting and refining data from a large unstructured corpus. Hadoop allowed us to scale this process and be able to iterate using the entire dataset.

What do you like best?

The Hadoop platform is essentially an open industry standard in cloud computing; several essential tools for modern production-quality machine learning applications support scaling via Hadoop/Spark.

What do you dislike?

The setup and configuration of Hadoop and Spark is its greatest weakness. It often takes a nontrivial amount of engineering time to set up and tune. Fortunately, services such as AWS allow you to hit the ground running without as much setup.

Recommendations to others considering the product

Use AWS if possible, but setting up your own cluster isn't as scary as it appears.

What business problems are you solving with the product? What benefits have you realized?

We are producing fault detection and prognostics for industrial machines and vehicles. We use several nonparametric statistics and a good deal of machine learning to get the job done. Hadoop has drastically lowered turnaround time on results and even in development (after the initial setup and growing pains subsided).

What do you like best?

What do you dislike?

There are a few UIs to access it without a terminal, but we could use one that is bug-free and has all the features.

Recommendations to others considering the product

You should definitely try this out and think about switching from your current file management system to this one. While it may take some time to get used to, once you do, it will be super easy to use and to improve current implementations.

What business problems are you solving with the product? What benefits have you realized?

What do you like best?

Is this a review of the HDFS file system? If so, the performance is great compared to, say, S3 or other ways to access file systems on Hadoop. It's also tried and tested. However, for the survey to make more sense, I will answer on Hadoop in general too.

What do you dislike?

It is inflexible: data needs to be copied to HDFS from other places, and one cannot do real-time access from HDFS.

This survey is not well written if it's primarily HDFS that you need feedback on.

Recommendations to others considering the product

It needs a better filesystem. We should be able to import data from other sources faster. There should also be real-time (in-memory) capabilities built around HDFS.

What business problems are you solving with the product? What benefits have you realized?

Analytics, business intelligence. Typically, I want fast results for jobs and ended up using Impala on top of HDFS.

* We monitor all Hadoop HDFS reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. Validated reviews require the user to submit a screenshot of the product containing their user ID, in order to verify a user is an actual user of the product.