Like this:

Use Case: We have 1 million files to process, and we must provide an option to download the originals.

Hadoop is meant to bring processing to the data. We can store processed file content or metadata in HBase to support easy search. Upon a successful search, the user wants to see the original document, which we can then download from the NAS easily.
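The search-then-fetch flow above can be sketched as follows. This is a minimal illustration only: a plain dict stands in for the HBase metadata table, and the row keys, column names, and NAS mount path (`/mnt/nas/docs`) are all hypothetical assumptions, not from a real deployment.

```python
# Sketch of "index in HBase, originals on NAS": a dict stands in for
# the HBase table; row key = document id, values = extracted metadata
# plus the NAS path of the original file. All names are illustrative.

doc_index = {
    "doc-000001": {"title": "Annual Report",
                   "nas_path": "/mnt/nas/docs/doc-000001.pdf"},
    "doc-000002": {"title": "Quarterly Summary",
                   "nas_path": "/mnt/nas/docs/doc-000002.pdf"},
}

def search(keyword):
    """Search the metadata index (stands in for an HBase scan + filter)."""
    return [row for row, meta in doc_index.items()
            if keyword.lower() in meta["title"].lower()]

def original_path(row_key):
    """After a hit, resolve the NAS path so the original can be served."""
    return doc_index[row_key]["nas_path"]

hits = search("report")
print(hits)                    # matching row keys
print(original_path(hits[0]))  # NAS path of the original document
```

The point of the split is that HBase only ever holds small, searchable metadata, while the bulky originals stay on the NAS and are fetched on demand.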

HDFS: This is meant for large files, not for millions of small ones. The default block size is 128 MB (64 MB in older releases). We can configure it to store small files, but we are not supposed to: the NameNode tracks every file and block in memory, so huge numbers of small files waste its heap.
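A quick back-of-envelope calculation shows why 1 million small files is a problem for the NameNode. It uses the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the exact figure varies by version, so treat the numbers as rough assumptions.

```python
# Rough NameNode heap cost for 1 million small files, assuming the
# rule-of-thumb ~150 bytes of heap per file/directory/block object.

BYTES_PER_OBJECT = 150      # rule-of-thumb heap cost per namespace object
num_files = 1_000_000

# A small file (smaller than one block) still costs one file object
# plus one block object in NameNode memory.
objects = num_files * 2
heap_bytes = objects * BYTES_PER_OBJECT
print(f"~{heap_bytes / 1024 / 1024:.0f} MB of NameNode heap")  # → ~286 MB
```

The same million documents, packed into a few thousand large files (or kept on the NAS with only metadata in HBase, as above), would cost the NameNode almost nothing.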

HBase: The default HFile block size is 64 KB. We can tweak it, but HBase is not meant to store large binaries in proprietary data formats.
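For reference, the block size is set per column family. A hypothetical HBase shell session (table and family names are illustrative):

```
# Create a table whose 'meta' family uses the 64 KB default explicitly
create 'docs', {NAME => 'meta', BLOCKSIZE => '65536'}

# Tweak it later, e.g. to 128 KB
alter 'docs', {NAME => 'meta', BLOCKSIZE => '131072'}
```

Larger blocks favor sequential scans; smaller blocks favor random point reads, which fits the metadata-lookup use case here.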

NAS: Network Attached Storage makes it easy to store and retrieve the original files when the jobs are not map-reduce in nature.

What is Hive? Hive is a data warehousing infrastructure built on top of Hadoop.
What is HBase? It's a distributed, versioned, column-oriented NoSQL data store, modeled after Google's Bigtable, used to host very large tables — billions of rows *times* millions of columns.
What is Hadoop? Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware, using the map-reduce programming paradigm.
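The map-reduce paradigm mentioned above can be illustrated in a few lines of plain Python (no Hadoop involved): map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each group — the classic word count.

```python
# Minimal word count in the map-reduce style, in plain Python.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop brings process to data", "data data everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # → 3
```

Hadoop runs the same three phases, but distributes the map and reduce tasks across the cluster and moves the computation to where the data blocks live.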

http://twill.apache.org/
Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus instead on their application logic. Apache Twill allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.
————-
https://tika.apache.org/
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

https://flume.apache.org/
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
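A Flume data flow is wired up in a properties file. A hypothetical single-node agent `a1` that tails an application log into a memory channel and writes events to HDFS might look like this (agent name, paths, and capacities are illustrative assumptions):

```
# Name the source, channel, and sink of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The source/channel/sink split is what gives Flume its tunable reliability: swapping the memory channel for a file channel trades throughput for durability without touching the source or sink.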
————-

PLOS articles can be accessed programmatically through our API, via PubMed Central, or using Europe PMC’s RESTful Web Service and SOAP Web Service. Detailed information about our Search API, including examples, is available at http://api.plos.org/solr/faq/. If you have any questions or require assistance with our API, please contact webmaster@plos.org.