Interview with Andy Watson, CTO at WekaIO

By Ellie Lucy, Global Summit Creator, RE•WORK
November 23, 2018

At the upcoming Deep Learning Summit in San Francisco on 24 & 25 January 2019, WekaIO will be exhibiting their latest file system for AI, machine learning and technical computing workloads. The solution accelerates compute-intensive applications so that data scientists, researchers, and engineers get to the answer faster.

I had the chance to speak with Andy Watson, CTO at WekaIO, to get a sneak peek at what they do and what we can expect at the event. Andy is responsible for maintaining and guiding vertical market technology development, as well as monitoring emerging trends and communicating them to key business stakeholders. He also oversees organizational leadership on technology issues, ensures the alignment of the technology vision with business strategy, and drives technology innovation throughout the organization.

Give us an overview of WekaIO.

WekaIO helps companies manage, scale and future-proof their data centers so they can solve real problems that impact the world. Data is intrinsically valuable to businesses today, but to extract that value the data has to be available to multiple application servers simultaneously. Only WekaIO Matrix™ keeps data-hungry applications saturated with data. Matrix, WekaIO’s flagship product, is the world’s fastest shared parallel file system, and it leapfrogs legacy storage infrastructures by delivering simplicity, scale, and faster performance for a fraction of the cost. In the cloud or on-premises, WekaIO’s NVMe-native, high-performance, software-defined storage solution removes the barriers between the data and the compute layer, thus accelerating artificial intelligence and machine learning workloads. Matrix is a fully parallel and distributed file system: both data and metadata are distributed across the entire storage infrastructure to ensure massively parallel access. The software has an optimized network stack that runs on InfiniBand or Ethernet (10 Gbit/s and above), so data locality is no longer a necessary factor for performance. The result is a solution that can handle the most demanding data- and metadata-intensive operations.

How are you helping to reduce the AI development cycle?

Machine learning platforms are GPU-intensive and require large data sets to deliver the highest levels of accuracy to the training systems. They also require a high-bandwidth, low-latency storage infrastructure to ensure that a GPU cluster is fully saturated with as much data as the application needs. Typical data sets can span from terabytes to tens of petabytes, and the data access pattern for each training epoch is unique and unpredictable. This calls for a data infrastructure that can consistently feed large amounts of randomly accessed data to multiple GPU nodes in real time, all from a single shared data pool. WekaIO Matrix is the world’s fastest and most scalable file system for these data-intensive applications, whether hosted on-premises or in the public cloud. It has demonstrated over 10 GB per second of bandwidth to a single GPU node, delivering 10x more data than NFS and 3x more than a local NVMe SSD. With WekaIO Matrix, more training epochs can be run in a shorter amount of time, thus accelerating the AI development cycle.

That’s not the only use case, but it’s a great example. We are also brought in to turn things around in situations where customers have crazy numbers of files per directory (something that happens with IoT or other workloads where files are generated automatically). Go ahead, put millions of files into the same directory and see if any other file system can handle it. We can, because everything on WekaIO’s Matrix file system is parallelized, including the metadata.
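To illustrate the point about each training epoch having its own access pattern, here is a minimal Python sketch. It assumes a training loop that reshuffles the sample order every epoch, which is a common practice but is not described in the interview. The function name, the seed scheme and the file count are hypothetical and chosen only for illustration.

```python
import random


def epoch_read_order(num_files, epoch, seed=0):
    """Return the order in which file indices are read during one epoch.

    Assumption: the loader reshuffles the whole data set each epoch,
    so storage sees a different, effectively random sequence every time.
    """
    rng = random.Random(seed + epoch)  # a different random stream per epoch
    order = list(range(num_files))
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    # Every epoch touches the same files, but in a different order.
    for epoch in range(3):
        print(epoch, epoch_read_order(8, epoch))
```

Because every file is read once per epoch in a fresh order, caching or prefetching based on the previous epoch's sequence helps very little, which is why sustained random-read bandwidth from the shared pool matters.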

Who is the ideal user for WekaIO?

We target enterprises and organizations whose use cases include artificial intelligence, machine learning, deep learning, and analytics workloads that require high IOPS and throughput with low latency. These use cases cut horizontally across markets, including manufacturing, life sciences, pharmaceutical, media, oil and gas, academia, research, and finance. It’s a very broad set of use cases today.

What are some of the main bottlenecks in productivity gains from AI?

Two dimensions of pain are felt by many AI applications. They need more data and they need it faster. The most acute pain is currently experienced by the large GPU farms being amassed for machine learning (ML) at scale. When you only have a handful of GPUs you will be compute-bound, but when you have many of them you will be IO-bound. If you have a huge number of them you will be crippled by the IO constraint, regardless of how much networking bandwidth you make available. Nobody but WekaIO can keep up, and we are happy to prove that; it is something we do routinely in bake-offs around the world. We’ve seen a few competitors try to feed NVIDIA’s DGX-1, but it’s a disappointing struggle which WekaIO’s Matrix software takes in stride. And nobody else (I mean, literally nobody) can even attempt to feed NVIDIA’s DGX-2 beast.

GPUs have condensed the processing power of tens of CPU servers into a single GPU server, delivering massively parallel processing and dramatically improving machine learning cycles. However, the shared storage systems being leveraged to support AI workloads use technology developed in the 1980s, when networks were slow. If your data set does not fit inside the local storage on a single GPU server, then scaling the AI workload is a nightmare. NFS, the predominant protocol for data sharing, is limited to about 1.5 GB/second in bandwidth, while a single GPU server can easily consume 10x that throughput. GPU workloads demand a low-latency, highly parallel I/O pattern to ensure that the AI workloads are operating at full bandwidth.
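The bandwidth gap described above can be put into rough numbers. The following Python sketch is a back-of-the-envelope calculation, not a benchmark: the 1.5 GB/second NFS figure comes from the interview, the 15 GB/second demand is simply "10x that throughput" taken literally, and the data set size is a hypothetical value. The function names are also made up for this example.

```python
NFS_GB_PER_S = 1.5          # NFS limit quoted in the interview
GPU_DEMAND_GB_PER_S = 15.0  # "10x that throughput", read literally


def supply_fraction(storage_gb_per_s, demand_gb_per_s):
    """Fraction of the GPU server's data demand that storage can meet (capped at 1.0)."""
    return min(1.0, storage_gb_per_s / demand_gb_per_s)


def seconds_to_stream(dataset_gb, storage_gb_per_s):
    """Time to read the whole data set once (one epoch) at a given bandwidth."""
    return dataset_gb / storage_gb_per_s


if __name__ == "__main__":
    dataset_gb = 15_000  # hypothetical 15 TB data set
    print("fraction of demand met over NFS:",
          supply_fraction(NFS_GB_PER_S, GPU_DEMAND_GB_PER_S))
    print("seconds per epoch over NFS:",
          seconds_to_stream(dataset_gb, NFS_GB_PER_S))
    print("seconds per epoch at full demand rate:",
          seconds_to_stream(dataset_gb, GPU_DEMAND_GB_PER_S))
```

Under these assumptions the GPU server would sit idle for most of each epoch when fed over NFS, which is the IO-bound situation the answer describes.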

What developments of AI are you most excited for, and which industries do you think will be most impacted?

AI is still in its infancy, but numerous studies have shown that companies adopting AI are reducing costs, improving efficiency and delivering bottom-line profit. AI can help with problems ranging from setting a maintenance schedule for a factory floor to targeting the right product to potential buyers and improving sales closure rates. Look at a company like gong.io, which is helping salespeople use the right language to improve their sales closure rates. No business is too small to use readily available AI-powered tools or to develop its own AI strategies.

Would you advise a career in AI, and what are the key skills that you think are needed for such roles?

Many companies have AI initiatives today, and I don’t foresee the investment in AI falling off in the near term. In fact, we are seeing job titles such as Chief Data Officer (CDO) at many of the customers we work with. Data is the currency of business today, and companies are just beginning to wrap their heads around how to extract value from their data. As more companies implement IT infrastructures that enable monetization of that data, we’ll see more opportunities for new careers in AI emerge.

Early Bird tickets for the Deep Learning Summit, San Francisco end on 7 December, so make sure to get your discounted tickets now for a chance to discuss the World's Fastest Parallel File System.