Generic Big Data Analytics Blueprint

The DataBench Framework follows three main dimensions: Pipeline steps, Processing types and Data types. Further technical dimensions cover the computing continuum (edge, on-premise, fog, cloud, HPC, etc.), while business dimensions, such as the application domain, complete the picture.

In this nugget we present the details of the generic blueprint devised in the DataBench project. The blueprint shown in the figure is generic in that it indicates the building blocks of a big data analytics (BDA) architecture with no reference to specific technologies.

For example, it refers to NoSQL technologies in general, without naming specific NoSQL databases such as MongoDB, Cassandra or HBase. Companies can use benchmarks of NoSQL databases to support software selection based on technical performance parameters. Since different technologies deliver different performance, this software selection can be critical to ensure the overall performance of the architecture and to obtain the best cost/performance ratio. Our goal here is to assess the impact of IT benchmarking on the outcome of software selection, and thus to evaluate the benefits of benchmarking.
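As an illustration of benchmark-driven selection, the sketch below ranks candidate NoSQL databases by cost per unit of measured throughput. All throughput and cost figures, and the idea of sourcing them from a YCSB-style workload, are hypothetical assumptions for this example, not DataBench results:

```python
# Illustrative sketch: ranking candidate NoSQL databases by cost/performance.
# All throughput and cost figures are hypothetical placeholders; in practice
# they would come from a benchmark run (e.g. a YCSB-style workload) and from
# the vendor's or cloud provider's pricing.

candidates = {
    "MongoDB":   {"throughput_ops_s": 42_000, "monthly_cost_eur": 1_800},
    "Cassandra": {"throughput_ops_s": 55_000, "monthly_cost_eur": 2_400},
    "HBase":     {"throughput_ops_s": 48_000, "monthly_cost_eur": 2_100},
}

def cost_per_kops(metrics: dict) -> float:
    """Monthly cost per 1,000 operations/second of sustained throughput."""
    return metrics["monthly_cost_eur"] / (metrics["throughput_ops_s"] / 1_000)

# Rank candidates: lower cost per unit of performance is better.
for name, metrics in sorted(candidates.items(), key=lambda kv: cost_per_kops(kv[1])):
    print(f"{name}: {cost_per_kops(metrics):.2f} EUR per 1k ops/s per month")
```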

Different technologies can be applied to implement the different components of the technology blueprint. Data- and/or processing-intensive components are clearly critical, as they play a key role in determining the performance, cost and scalability of the overall architecture. Different use cases have different critical components: for example, real-time AI is a critical component of the predictive maintenance use case in manufacturing, while data preparation and data management are critical in the targeting use case. It is therefore intuitive that the most data- and/or processing-intensive components should be the focus of a careful software selection based on technical benchmarking. If the selected software does not perform well enough, the component is likely to become a bottleneck, or will require additional processing/storage capacity at additional cost.
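The bottleneck argument can be made concrete with a minimal sketch: in a linear pipeline, end-to-end throughput is capped by the slowest component. The component names below mirror the four DataBench pipeline steps; the throughput numbers are invented for illustration:

```python
# Illustrative sketch: finding the bottleneck of a linear pipeline from
# per-component throughput measurements (hypothetical benchmark results).

pipeline_throughput_mb_s = {
    "data_acquisition": 950,
    "data_preparation": 310,   # data-intensive step
    "analytics_ml":     420,   # processing-intensive step
    "visualisation":    800,
}

# In a linear pipeline the end-to-end throughput is capped by the slowest
# component, so that is where benchmark-driven selection pays off most.
bottleneck = min(pipeline_throughput_mb_s, key=pipeline_throughput_mb_s.get)
print(f"Bottleneck: {bottleneck} "
      f"({pipeline_throughput_mb_s[bottleneck]} MB/s end-to-end ceiling)")
```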

It is important to understand the mappings from the generic blueprint to different standardization efforts:

    • In DataBench we have devised a simple data pipeline for big data and AI based on existing efforts in the community, such as the data value chain from the NIST Big Data Reference Architecture (NBDRA) and the steps in the BDV Reference Model. The four steps of the data pipeline, namely data acquisition/collection, data storage/preparation, data analytics/ML and data visualisation/interaction, are explained here.
    • Each of the four main steps of the data pipeline used in DataBench for classification purposes can be mapped to the different elements of the generic blueprint, as shown in this figure.
    • The generic blueprint is also at the core of the Search by pipeline/blueprint functionality of the Toolbox. In that search you can see how the different elements of the blueprint are mapped to specific benchmarks or knowledge gathered in the Toolbox knowledge base, and to the four steps of the data pipeline.
    • There is also a mapping to the AI & Robotics Framework, as shown in this figure.
    • The data pipeline is mapped to the BDV Reference Model as shown in this figure and explained in the table below:
Steps of the DataBench Data Pipeline | Corresponding Steps of the BDV Reference Model
Data Acquisition/Collection | Data Management + Data Processing + Things/Assets
Data Storage/Preparation | Data Management
Data Analytics/ML | Data Analytics
Data Visualisation/User interaction | Data Visualisation
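For readers who prefer it in machine-readable form, the table above can be restated as a simple lookup structure. This is only a sketch of the mapping; the structure itself is not part of the DataBench tooling:

```python
# The DataBench pipeline -> BDV Reference Model mapping from the table above,
# expressed as a simple lookup structure.
PIPELINE_TO_BDV = {
    "Data Acquisition/Collection": ["Data Management", "Data Processing", "Things/Assets"],
    "Data Storage/Preparation": ["Data Management"],
    "Data Analytics/ML": ["Data Analytics"],
    "Data Visualisation/User interaction": ["Data Visualisation"],
}

for step, bdv_elements in PIPELINE_TO_BDV.items():
    print(f"{step} -> {' + '.join(bdv_elements)}")
```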


In DataBench we have developed a set of example specialisations of this generic blueprint for different use cases. All of them can be accessed as knowledge nuggets in the Toolbox, for instance by searching in the Search box for the tag “blueprint”. An example of a specific blueprint tailored to the telecom industry, with hints on how it was developed taking into account not only technical but also business goals, can be accessed here.
