We live in a data-rich world, in which masses of data, known as big data, are generated every day.
An online search or booking, a credit card purchase, an ID scan, a grocery shopping list, a social media post, a medical record, a health monitor – all contain data that can be gathered and analyzed to gain useful insights and improve products and services.
Artificial intelligence (AI) technologies, including algorithms and machine learning, are able to make sense of big data in real time. These technologies are evolving rapidly and being used in a growing number of industries, from finance and healthcare to smart manufacturing, intelligent transport systems and self-driving vehicles, as decisions that were once made on the basis of human experience are gradually being made by machine analysis of huge amounts of big data.
The Big Data programme of work is part of the portfolio of emerging technologies that ISO/IEC JTC 1/SC 42 is working on. SC 42 is the joint IEC and ISO committee on artificial intelligence, which develops international standards in areas including foundational AI standards, Big Data, AI trustworthiness, use cases and applications, governance implications of AI, computational approaches to AI, and ethical and societal concerns.
“Big Data is an important part of the SC 42 programme of work, which has recently published three standards,” said Wael William Diab, Chair of SC 42. “Big Data and AI-enabled systems are helping accelerate the digital transformation, where they are often deployed hand in hand to deliver key insights by enabling advanced data analytics.”
Interview with Wo Chang, who leads the IEC and ISO standards development for Big Data
Wo Chang is a digital data advisor for the National Institute of Standards and Technology (NIST) and Convenor of SC 42/WG 2 on Big Data. In this interview, we find out more about the recently published SC 42 standard ISO/IEC 20546, Information technology – Big data – Overview and vocabulary.
Q. Why do we need an international standard which covers Big Data vocabulary?
A. In any standards development, an overview and terminology are elemental building blocks. This is especially true for Big Data, which both technical and non-technical communities commonly describe with “V” terms – volume, velocity, variety, volatility, veracity, visualization and value – each capturing a dimension of “Big”. By providing a proper overview and vocabulary, the standard establishes a common understanding of what constitutes Big Data.
Q. Is there a misunderstanding of what Big Data is?
A. Big Data does not refer to data that is merely “bigger” than what has previously been handled; rather, it represents the paradigm shift to distributed data processing, where data is spread across a number of nodes. This shift has driven the development of a range of robust techniques: file systems adapted to distribute large blocks of data; non-relational database models (known as NoSQL); parallelized algorithms, so that models can run separately against data on disparate nodes; and the ability to scale by adding or removing data storage or analytic resources (such as adding data servers in a cloud system).
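As a rough illustration of that last point – parallelized algorithms running separately against data held on disparate nodes – the short Python sketch below splits a word count across partitions and then merges the partial results. It is a minimal, hypothetical example of the map/reduce pattern, using local processes as stand-ins for nodes rather than the API of any real Big Data framework.

```python
# Minimal sketch of data-parallel processing: the same "map" function
# runs independently against each partition (as it would on separate
# nodes), and the partial results are merged in a "reduce" step.
# Partitions and data are invented for illustration.
from collections import Counter
from multiprocessing import Pool

def count_words(partition):
    """Map step: operates on one partition with no view of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # Stand-ins for blocks of data distributed across nodes.
    partitions = [
        ["big data is distributed", "data is processed in parallel"],
        ["nodes hold blocks of data", "results are merged at the end"],
    ]
    with Pool(processes=len(partitions)) as pool:
        partial = pool.map(count_words, partitions)
    total = sum(partial, Counter())  # Reduce step: merge per-partition counts.
    print(total.most_common(3))
```

Because each map call touches only its own partition, adding nodes (or processes) scales the work out horizontally – the property the distributed-processing paradigm relies on.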
Q. What does the standard include in the overview?
A. In addition to the paradigm shift mentioned above, two dataset characteristics are of particular concern in the data science analysis of Big Data. The first is concurrency – whether data is updated consistently across nodes, so that query results are deterministic. The second is complexity – where the data is not independent across the separate data nodes, but there are in fact relationships between data elements stored in different nodes. Complexity implies that techniques such as MapReduce, which operate separately on data nodes, are not appropriate.
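To make the complexity point concrete, here is a hedged, hypothetical Python sketch of a query that no single partition can answer on its own, because the records it needs are related across partitions; the data and names are invented for illustration.

```python
# Hypothetical example: related records live in different partitions,
# so an independent per-partition pass (as in MapReduce) cannot answer
# the query by itself.
partition_a = {"alice": "bob"}    # on node A: alice's manager is bob
partition_b = {"bob": "carol"}    # on node B: bob's manager is carol

def manager_of(partition, person):
    """Per-partition lookup: sees only its own partition's data."""
    return partition.get(person)

# "Who is alice's manager's manager?" needs a hop across partitions.
first_hop = manager_of(partition_a, "alice")     # "bob"
stuck = manager_of(partition_a, first_hop)       # None: bob is on node B
second_hop = manager_of(partition_b, first_hop)  # "carol"
print(stuck, second_hop)
```

Answering such a query means moving data between nodes (a shuffle or join), which is precisely what independent per-node passes do not provide.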
Q. Why is it important to differentiate between concepts that are new and not new to Big Data?
A. For Big Data standards development, it is critical to highlight what is new in Big Data and to avoid reworking areas that have not changed with Big Data. What is new centres on the paradigm shift mentioned above: for instance, how to make data handling more efficient in distributed file systems, how to distribute computation effectively and how to parallelize algorithms optimally for better data analytics. The areas which have not changed comprise: (a) data types for individual data elements, (b) data formats to encapsulate data, and (c) metadata to describe data headers.
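As a small illustration of those three unchanged areas, the sketch below encodes a single record whose layout is invented for this example: typed data elements (a), a JSON format encapsulating them (b), and a metadata header describing the payload (c). None of this changes when the data becomes “big”; only where and how it is stored and processed does.

```python
# Hypothetical record showing the three areas that Big Data leaves
# unchanged: element data types (a), an encapsulating format (b), and
# descriptive metadata (c). Field names are invented for illustration.
import json

record = {
    "metadata": {                        # (c) metadata describing the payload
        "schema_version": "1.0",
        "source": "sensor-42",           # hypothetical data source
        "timestamp": "2020-01-01T00:00:00Z",
    },
    "payload": {                         # (b) JSON as the encapsulating format
        "temperature_c": 21.5,           # (a) a floating-point element
        "status": "ok",                  # (a) a string element
        "readings": [21.4, 21.5, 21.6],  # (a) an array of floats
    },
}

encoded = json.dumps(record)   # serialization works the same at any scale
decoded = json.loads(encoded)
print(decoded["metadata"]["source"], decoded["payload"]["temperature_c"])
```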