Big Data Technologies Narrow the Gap between HPC and the Enterprise
The intersection of big data and HPC will be a common theme at the upcoming ISC Big Data conference in Heidelberg, Germany, on September 25-26. On the second day of the event, Dr. Flavio Villanustre will deliver a keynote on this topic, offering his perspective on the merging of big data in HPC and the commercial realm. Villanustre, who is the vice president of Infrastructure & Products for the HPCC Systems group and vice president of Information Security at LexisNexis, gave us a preview of his talk, expounding on some of the technologies driving this trend.
How would you define "big data," aka data-intensive computing? What is the most important thing that distinguishes big data from older data-centric applications -- data warehousing, business intelligence, and so on?
Villanustre: Big data is loosely defined by its main attributes, as Gartner and other market analysts are eager to indicate: volume, velocity, variety and complexity. But this statement doesn’t tell the whole story, and there is much more in big data than meets the eye. Just to point out one of the most significant differences, traditional data-oriented systems and software architectures made strong assumptions that no longer hold true, such as that “data is structured and the semantics of the data are defined by that structure,” “data is always truthful, complete and clean,” “there are always unique identifiers that can be used to provide referential integrity across multiple datasets,” etc.
When it comes to big data, not one of those assumptions remains valid. Big data is composed of structured and unstructured data elements, and while structured data can account for a large number of records, unstructured, and for the most part human-generated, data normally makes up a much larger percentage of the total size. Big data, particularly when it originates from external sources -- for example, social media and micro-blogging -- tends to contain errors, wrong content and missing parts. And big data does not usually come with unique identifiers, which creates significant challenges for entity resolution and entity disambiguation.
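To illustrate why missing unique identifiers make entity resolution hard, here is a minimal sketch in Python. It is not the method used by LexisNexis or any particular product; the record fields (`name`, `city`), the similarity threshold and the greedy clustering strategy are all assumptions chosen for illustration.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def same_entity(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Assumed rule: names must be similar and cities must agree after normalization."""
    name_score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return name_score >= threshold and normalize(a["city"]) == normalize(b["city"])


def resolve(records: list[dict]) -> list[list[dict]]:
    """Greedy clustering: a record joins the first cluster whose first member matches it."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if same_entity(cluster[0], rec):
                cluster.append(rec)
                break
        else:
            # No existing cluster matched, so this record starts a new entity.
            clusters.append([rec])
    return clusters


# Hypothetical records with no shared key, only noisy human-entered fields.
records = [
    {"name": "Jon Smith", "city": "Atlanta"},
    {"name": "John Smith", "city": "atlanta"},
    {"name": "Joan Smythe", "city": "Boston"},
]
clusters = resolve(records)
```

With these sample records, the first two are grouped as one entity and the third stays separate. A production system would also need blocking or indexing to avoid comparing every record against every cluster, which this sketch leaves out.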
In my opinion, there is a place for big data in traditional data-centric applications, such as data warehousing, master data management and business intelligence and, as a matter of fact, we are already seeing convergence. But this means that traditional architectures used for these applications are inefficient and will become obsolete, as they are replaced by systems that are specifically tailored to dealing with data in a “big data way” -- distributed systems, shared-nothing architectures, in-place processing, etc.
In your estimation, what are the most important enabling technologies for big data?
Villanustre: Unlike other technologies appearing over the last few decades, where academic and research institutions were the place where the initial design and development happened, big data technologies were born out of necessity in the industry. Data processing companies such as Google and LexisNexis faced the big data challenge in the second half of the last decade, and came up with very pragmatic approaches to the problem.
In the case of Google, the MapReduce paradigm uses the scatter-gather approach to distributed processing of crawler logs. In the case of LexisNexis, a high level declarative programming language, ECL, is used to abstract the underlying complexities of distributed architectures, parallelism and concurrency. Eventually Google’s MapReduce approach trickled down into the Hadoop framework.
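The scatter-gather idea behind MapReduce can be sketched in a few lines of Python. This is a single-process illustration of the paradigm, not Google's or Hadoop's implementation; the log line format (`status url`) and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (host, 1) for one crawler log line of the assumed form 'status url'."""
    _status, url = line.split()
    host = url.split("/")[2]  # 'http://example.org/a' -> 'example.org'
    yield host, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def map_reduce(partitions):
    """Scatter the map over partitions, then gather per-key results.

    On a real cluster each partition would be mapped on the node that stores it;
    here the partitions are simply processed one after another.
    """
    mapped = [pair for part in partitions for line in part for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical crawler logs, already split into two partitions.
partitions = [
    ["200 http://example.org/a", "404 http://example.org/b"],
    ["200 http://example.net/index"],
]
pages_per_host = map_reduce(partitions)
```

The programmer writes only the map and reduce functions; partitioning, shuffling and fault handling are left to the framework, which is what makes the model practical on large clusters.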
In general, big data benefits from faster, larger and denser distributed storage, fast network interconnects and faster processing. The data must be loaded into RAM in order to be processed, and moving the data from the hard drives into memory, and from memory back to the hard drive tends to be a significant hurdle to performance, particularly when the ratio between the size of the data and the available memory across the system is unfavorable. Multiple cycles of data reading and spilling are required in order to complete the process.
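The read-and-spill cycle described above is easiest to see in an external sort. The following Python sketch assumes integer records and a hypothetical `memory_limit` counted in records rather than bytes; it illustrates the general technique and is not taken from any specific big data platform.

```python
import heapq
import os
import tempfile


def spill_sorted_run(chunk, directory):
    """Sort one in-memory chunk and spill it to disk as a sorted run file."""
    chunk.sort()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{value}\n" for value in chunk)
    return path


def read_run(path):
    """Stream a sorted run back from disk one record at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values, memory_limit):
    """Sort integers while buffering at most memory_limit of them before each spill."""
    with tempfile.TemporaryDirectory() as directory:
        runs, chunk = [], []
        for value in values:
            chunk.append(value)
            if len(chunk) >= memory_limit:
                runs.append(spill_sorted_run(chunk, directory))
                chunk = []
        if chunk:
            runs.append(spill_sorted_run(chunk, directory))
        # Merge phase: only one record per run needs to be held in memory.
        yield from heapq.merge(*(read_run(path) for path in runs))


sorted_values = list(external_sort([5, 3, 9, 1, 7, 2, 8], memory_limit=3))
```

The smaller the memory limit relative to the data, the more runs are written and re-read, which is the unfavorable ratio the answer refers to.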
While non-volatile solid state storage is not at a point where it can compete with hard drives in capacity/density and cost, it offers unmatched random access performance and is quickly becoming more affordable. In addition to this, having the ability to equip the nodes with larger amounts of RAM -- for example, from upcoming technologies such as RRAM -- can minimize the number of trips to the persistent storage, increasing the overall performance significantly.
High speed network interconnects are also becoming more commonplace, with FDR InfiniBand and 40Gbps Ethernet providing the low latency and high bandwidth needed for fast inter-node data transfers.
But big data is not just about storing massive amounts of data and performing aggregates and rollups. In order to make sense of most unstructured data, natural language processing is required; statistical methods such as clustering and segmentation analysis are needed to extract semantic meaning; predictions make use of regression analysis and classifiers; high volume, high velocity stream oriented processing requires special handling; and graph oriented problems may require approaches that specifically deal with this type of data structure. So we have seen a number of novel constructs employed to tackle each one of these problems and the emergence of things like LSM-trees, array storage, key-value stores and in-memory graph databases.
More important, a number of big data applications require a significant amount of numeric analysis, such as optimization methods and linear algebra in general, for example as part of machine learning. This is creating a resurgence of analytical environments that can efficiently deal with these types of problems, but this time in distributed environments, due to the size of the data involved.
Is a real convergence going on between HPC and big data? If so, why is this occurring now?
Villanustre: Indeed, HPC and data intensive computing are converging fast, for some of the reasons I just mentioned. HPC, on one hand, has been perfecting itself over the years to provide highly efficient numeric processing in distributed environments, mostly in shared memory architectures -- and Parallel Block BLAS is an example of this. Data intensive computing, on the other hand, is specialized in dealing effectively with vast quantities of data in distributed environments, exploiting data locality.
The intersection of these two domains is mainly driven by the use of machine learning methodologies to extract knowledge from big data, and we see an increasing number of platforms that are combining these capabilities to provide hybrid environments that can take advantage of data locality to keep the data exchanges over the network at a manageable level while they offer high performance distributed linear algebra libraries.
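A small Python sketch shows how data locality keeps network traffic manageable in a numeric workload. It fits a straight line by least squares: each partition is reduced to five sums where it is stored, and only those sums would cross the network. The function names and sample data are assumptions for illustration, and the partitions are processed in a single process rather than on separate nodes.

```python
def local_stats(rows):
    """Runs next to the data: reduce a partition of (x, y) rows to five sums."""
    n = len(rows)
    sum_x = sum(x for x, _ in rows)
    sum_y = sum(y for _, y in rows)
    sum_xx = sum(x * x for x, _ in rows)
    sum_xy = sum(x * y for x, y in rows)
    return n, sum_x, sum_y, sum_xx, sum_xy


def combine(stats):
    """Runs on the coordinator: add the per-partition sums component by component."""
    return tuple(sum(column) for column in zip(*stats))


def fit_line(partitions):
    """Ordinary least squares for y = slope * x + intercept from combined sums."""
    n, sum_x, sum_y, sum_xx, sum_xy = combine(local_stats(part) for part in partitions)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1, split across two partitions.
partitions = [
    [(0, 1), (1, 3)],
    [(2, 5), (3, 7), (4, 9)],
]
slope, intercept = fit_line(partitions)
```

However many rows a partition holds, it contributes only five numbers to the exchange. Distributed linear algebra libraries apply the same principle to matrix blocks instead of scalar sums.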
Is there a case to be made for buying different systems for traditional compute-intensive HPC applications and big data applications? If so, how would those systems be different?
Villanustre: Traditionally, the main distinction used to be the performance of iterative algorithms, such as certain numeric optimization methods, and the size of the data, which pushed HPC towards shared memory models, and big data systems towards disk storage local to the nodes and local data processing for the most part. Your traditional HPC system would have plenty of memory, few or no hard drives, InfiniBand interconnects, possibly GPUs and FPGAs, and plenty of CPU cores, while your big data system would have local hard-drive storage, Ethernet/IP interconnects and a few CPU cores.
But as the analytical load increases, big data systems can benefit from many CPU cores and GPUs, the faster networks and RDMA that InfiniBand offers, and a larger memory footprint. As you can see, the boundaries between HPC and big data systems are starting to blur.
Is it reasonable to think a single platform can be constructed to cost-effectively handle both types of applications?
Villanustre: Absolutely! Certain ideas need to be tweaked to work well in this dual world, but there is no inherent obstacle to a system that can handle both types of workload effectively and at a compelling price/performance ratio, particularly at a time when technologies like InfiniBand, CUDA, OpenCL and FPGAs are becoming mainstream commodities.
The inaugural ISC Big Data'13 Conference promises great keynotes, insightful discussions and networking opportunities. The same holds true for our ISC Cloud’13 Conference, which will be held at the same venue, two days before the big data conference. If you register by September 12, you'll be able to save 25 percent, plus a combo ticket will provide you additional savings!
ISC Cloud’13 is the fourth in the series, and this year digital engineering has been added as a topic. Most manufacturers, especially small and medium enterprises, use desktop workstations for their daily R&D work. Because of the sheer size of the simulation jobs, they often do the preparation work during the day and production runs overnight, resulting in one simulation job per day. A viable but expensive alternative is buying an HPC cluster. But an often far better solution is "Cloud Computing". Find out at our conference how you can improve your ROI, development time and time to market.