Testing Services provides a platform for QA professionals to discuss and gain insights into the business value delivered by testing, the best practices and processes that drive it, and the emergence of new technologies that will shape the future of the profession.

Testing BIG Data Implementations - How is this different from Testing DWH Implementations?

Whether it is a Data Warehouse (DWH) or a BIG Data storage system, the basic component of interest to us, the testers, is the data. At the fundamental level, data validation in both these storage systems involves validating data against the source systems for the defined business rules. It is easy to think that if we know how to test a DWH, we know how to test a BIG Data storage system. Unfortunately, that is not the case! In this blog, I'll shed light on some of the differences between these storage systems and suggest an approach to BIG Data testing.

Let us look at these differences from the following 3 perspectives:

Data

Infrastructure

Validation tools

Data

Four fundamental characteristics by which the data in DWH and BIG Data storage systems differ are Data Volume, Data Variety, Data Velocity and Data Value.

Typical data volumes that current DWH systems are capable of storing are in the order of gigabytes, whereas BIG Data storage systems can store and process data sets of petabytes and beyond.

When it comes to data variety, there are no constraints on the type of data that can be stored and processed within a BIG Data storage system. Web logs, RFID and sensor-network feeds, social network data, Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, biological, genomics, biochemical and medical records, scientific research, military surveillance, photography and video archives, and large-scale eCommerce: any data, irrespective of whether it is 'Structured' or 'Unstructured', can be stored and efficiently processed within a tolerable elapsed time in a BIG Data storage system. DWHs, on the other hand, can store and process only 'Structured' data.

While data gets into a DWH through 'Batch Processing', BIG Data implementations support 'Streaming' data too.

Because of their ability to capture, manage and process data sets whose size is beyond the reach of a DWH, the Data Value / Business Value of the information that can be derived from BIG Data implementations is exponentially larger than that of DWH systems.

What does this mean to the tester?

A DWH tester has the advantage of working with 'Structured' data (data with a static schema). A BIG Data tester, however, may have to work with 'Unstructured or Semi-Structured' data (data with a dynamic schema) most of the time, and needs to seek additional inputs from the business/development teams on how to derive the structure dynamically from the given data sources.
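As an illustration, consider a web-server access log, a typical semi-structured source: each line follows a loose pattern, but no schema is declared anywhere. A minimal Python sketch of how a tester might apply an agreed-upon structure at read time (the log format, regex and field names here are assumptions for illustration, not from any real project):

```python
import re

# Hypothetical access-log layout agreed with the business/development teams.
# The regex IS the dynamically derived schema: named groups become 'columns'.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def derive_record(line):
    """Apply the agreed structure to one raw line; None flags unparseable rows."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw = '10.0.0.1 - - [21/Feb/2012:10:15:32 +0000] "GET /cart HTTP/1.1" 200 5123'
record = derive_record(raw)
print(record["status"], record["path"])  # fields are now addressable, like DWH columns
```

Lines that fail to parse surface as `None`, which is itself useful test evidence: the rate of unparseable rows tells the tester how well the derived structure fits the real data.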

When it comes to the actual validation of data in a DWH, the testing approach is well-defined and time-tested. The tester has the option of applying a 'Sampling' strategy manually or an 'Exhaustive verification' strategy through automation tools like Infosys Perfaware (a proprietary DWH testing solution). But, given the huge data sets to be validated, even a 'Sampling' strategy is a challenge in the context of BIG Data validation. With both the service industry and the automation solution industry still in their infancy here, the best approach for testing BIG Data can be determined only through focused R&D efforts. This provides a tremendous opportunity for testers who are innovative and willing to go the extra mile to build utilities that increase the test coverage of BIG Data while improving test productivity as well.
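One way to make sampling tractable at BIG Data scale is deterministic, key-based sampling: hash the business key so that the source and target sides independently select the same rows, regardless of row order or storage layout. A minimal Python sketch of the idea (the `id` key and row shape are assumptions for illustration):

```python
import hashlib

def in_sample(key, rate):
    """Deterministically decide sample membership by hashing the business key,
    so source and target pick the SAME rows without coordinating."""
    digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (digest % 10_000) < rate * 10_000

def validate_sample(source_rows, target_rows, rate=0.2):
    """Compare only the sampled slice of both data sets, keyed by 'id'."""
    src = {r["id"]: r for r in source_rows if in_sample(r["id"], rate)}
    tgt = {r["id"]: r for r in target_rows if in_sample(r["id"], rate)}
    missing = [k for k in src if k not in tgt]          # rows lost in load
    mismatched = [k for k in src if k in tgt and tgt[k] != src[k]]
    return missing, mismatched

source = [{"id": i, "amount": i * 10} for i in range(1000)]
target = [dict(row) for row in source]
print(validate_sample(source, target))  # identical data -> ([], [])
```

Because the sampling decision depends only on the key, the same utility can be run where each data set lives and the sampled slices compared centrally, which matters when the full sets are too large to move.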

Infrastructure

DWH systems are based on an RDBMS, whereas BIG Data storage systems are based on a file system. While DWH systems have limitations on linear data growth, BIG Data implementations such as those based on Apache Hadoop have no such limitations, as they store the data across the many nodes of a cluster. The storage is provided by HDFS (Hadoop Distributed File System), a reliable shared storage system whose data can be analyzed using the MapReduce programming model.

What does this mean to the tester?

As HDFS gives customers the power to store huge volumes of varied data, run queries on the entire data set and get results in a reasonable time, they are no longer constrained in the amount of information they can derive from the data. Applying complex transformations and business rules to the data becomes easier, and this power leads to new ways of data exploration. For a tester, this means an exponential increase in the number of requirements to be tested. If the testing process is not strengthened around reuse and optimization of test case sets, the test suite will grow enormously and lead to maintenance disasters.
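One defence against that growth is to express requirements as data-driven rules that share a single validation engine, so each new requirement adds a table row rather than a hand-written test case. A minimal Python sketch (the rule names and columns are illustrative assumptions):

```python
# Each business rule is data: (name, column, predicate). New requirements
# extend this table instead of adding new hand-coded test cases.
RULES = [
    ("amount_non_negative", "amount", lambda v: v >= 0),
    ("status_in_domain", "status", lambda v: v in {"NEW", "SHIPPED", "CLOSED"}),
]

def run_rules(rows, rules=RULES):
    """Run every rule over every row; return (rule_name, failure_count) pairs."""
    failures = []
    for name, column, predicate in rules:
        bad = sum(1 for row in rows if not predicate(row[column]))
        if bad:
            failures.append((name, bad))
    return failures

rows = [
    {"amount": 120, "status": "NEW"},
    {"amount": -5, "status": "SHIPPED"},
    {"amount": 40, "status": "LOST"},
]
print(run_rules(rows))  # -> [('amount_non_negative', 1), ('status_in_domain', 1)]
```

The engine is written and reviewed once; the rule table is what grows with the requirements, which keeps maintenance linear in the number of rules rather than in the number of scripts.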

RDBMS-based databases (Oracle, SQL Server, etc.) are installed on an ordinary file system, so testing a DWH does not require any special test environment; it can be done from within the file system in which the DWH is installed. When it comes to testing BIG Data in HDFS, the tester requires a test environment that is based on HDFS itself. Testers need to learn how to work with HDFS, as it is different from working with an ordinary file system.

Validation tools

Validation tools for DWH testing are based on SQL (Structured Query Language). For comparison purposes, DWH testers use either Excel-based macros or full-fledged UI-based automation tools. For BIG Data, there are no established tools yet. Tools presently available in the Hadoop ecosystem range from pure programming frameworks like MapReduce (which supports coding in Java, Perl, Ruby, Python, etc.) to wrappers built on top of MapReduce, like HiveQL or Pig Latin.

What does this mean to the tester?

HiveQL and SQL are not the same, though HiveQL is easy to learn for anyone with basic SQL skills. HiveQL is also in its infancy and does not yet have all the constructs needed to access data in HDFS for validation purposes: it is suitable for flat data structures only and cannot handle complex nested data structures. To deal with these, the tester can use Pig Latin, which is statement-based and does not require complex coding. But since both Hive and Pig Latin are still evolving, the need to write MapReduce programs to make the testing comprehensive cannot be ruled out. This poses a tremendous challenge to testers: either they add programming skills to their profiles, or they wait for their internal solution teams or external vendors to come out with powerful automation tools that provide easy interfaces to query the data in HDFS and compare it with external sources.
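To give a feel for what 'writing MapReduce programs' involves, here is a minimal in-memory sketch of the programming model in Python; no Hadoop is involved, and a real job would submit the map and reduce functions to the cluster (for example via the Java API or Hadoop Streaming), but the shape of the code the tester must write is the same:

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs for each input record."""
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)

def run_job(records):
    """Simulate the framework: map, then shuffle/sort by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            shuffled[key].append(value)
    return dict(reduce_phase(k, v) for k, v in shuffled.items())

counts = run_job(["big data testing", "testing big data is different"])
print(counts["testing"])  # -> 2
```

Even this toy word count shows the shift in mindset: the tester expresses a validation as a pair of functions plus an aggregation, rather than as a single declarative SQL query.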

Conclusion

Experience in DWH can, at best, shorten the BIG Data tester's learning curve in understanding, at a conceptual level, the extraction, loading and transformation of data from source systems to HDFS. It does not provide any other advantage.

BIG Data testers have to learn the components of the BIG Data ecosystem from scratch. Until the market evolves and fully automated testing tools become available for BIG Data validation, the tester has no option but to acquire the same skill set as the BIG Data developer in leveraging BIG Data technologies like Hadoop. This requires a tremendous mindset shift for both the testers and the testing units within their organizations.

To be competitive, organizations should invest in BIG Data specific training for their testing community in the short term and, in the long term, in developing automation solutions for BIG Data validation.