DataBench Generic Data Pipeline (4-step data value chain)
The DataBench Framework is organised along three main dimensions: pipeline steps, processing types and data types. Further dimensions include the computing continuum (edge, on-premise, fog, cloud, HPC, etc.) and the business dimensions, such as application domains.
This nugget presents the DataBench top-level Data Pipeline. It builds on many other community efforts, such as the Data Value Chain proposed in the NIST Big Data Reference Architecture (NBDRA) and the steps in the BDV Reference Model. The top-level generic pipeline was introduced to give an overall usage perspective on Big Data and AI systems, showing how the different parts of such a system connect in the context of an application flow.
The resulting top-level pipeline, shown in the figure, has four steps (Data Ingestion, Data Storage, Data Analytics/ML and Data Visualisation/Action/Interaction) that follow the Big Data and AI value chain:
- Data Acquisition/Collection: In general, this step handles the interface with the data providers and includes the transportation of data from various sources to a storage medium where it can be accessed, used and analysed by an organization. Depending on the application implementation, tasks in this step include accepting or performing specific collections of data, pulling data or receiving pushes of data from data providers, and storing or buffering data. The Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) cycle can also be included in this step. At the initial collection stage, sets of data are collected and combined. Initial metadata can also be created to facilitate subsequent aggregation or look-up methods. Security and privacy considerations may be included as well, since authentication and authorization, as well as recording and maintaining data provenance, are usually performed during data collection. Finally, the tasks in this step may vary depending on the type of data collected.
- Data Storage/Preparation: Tasks performed in this step include data validation (for example, checking formats); data cleansing (such as removing outliers or bad fields); extraction of useful information; organization and integration of data collected from various sources; leveraging metadata keys to create an expanded and enhanced dataset; annotation, publication and presentation of the data so that it is available for discovery, reuse and preservation; and standardization, reformatting or encapsulation. In this step, source data are also frequently persisted to archive storage, and provenance data are verified or associated. The transformation part of the ETL/ELT cycle could be performed here too, although advanced transformation is usually part of the next step, which deals with data analytics. Manipulations such as data deduplication and indexing could also be included here in order to optimize the analytics process.
- Analytics/AI/Machine Learning: In this step, new patterns and relationships, which might otherwise remain hidden, are discovered so as to provide new insights. The extraction of knowledge from the data is driven by the requirements of the vertical application, which determine the data processing algorithms. This step can be considered the most important one, as it extracts meaningful value from the data and is thus the basis for making suggestions and decisions. Hashing, indexing and parallel computing are some of the methods used for Big Data analysis. Machine learning techniques and Artificial Intelligence are also used here, depending on the application requirements.
- Action/Interaction, Visualisation/Access: Data has no value unless it is interpreted. Visualization assists in the interpretation of data by creating graphical representations of the information conveyed, thus adding more value to the data. This is because the human brain processes information much better when it is presented in charts or graphs rather than in spreadsheets or reports. Visualization is therefore an essential step, as it helps users to comprehend large amounts of complex data, interact with them, and make decisions according to the results. It is worth noting that effective data visualization needs to strike a balance between the visuals it provides and the way it provides them, so that it attracts users' attention and conveys the right messages.
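The four steps above can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions, not DataBench code: the function names (`acquire`, `prepare`, `analyse`, `present`), the in-memory sample data, and the choice of a per-source mean as the "analytics" are all hypothetical.

```python
from statistics import mean


def acquire(sources):
    """Data Acquisition/Collection: gather records and attach initial metadata."""
    records = []
    for source_name, readings in sources.items():
        for record_id, value in readings:
            # The source name acts as minimal provenance metadata.
            records.append({"source": source_name, "id": record_id, "value": value})
    return records


def prepare(records):
    """Data Storage/Preparation: validate formats, drop bad fields, deduplicate."""
    seen = set()
    cleaned = []
    for rec in records:
        value = rec["value"]
        # Validation: keep numeric values only (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Deduplication on (source, id).
        key = (rec["source"], rec["id"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def analyse(records):
    """Analytics/AI/ML: a stand-in analysis, here the mean value per source."""
    by_source = {}
    for rec in records:
        by_source.setdefault(rec["source"], []).append(rec["value"])
    return {source: mean(values) for source, values in by_source.items()}


def present(summary):
    """Visualisation/Access: render the result for a human reader."""
    return "\n".join(f"{source}: {avg:.1f}" for source, avg in sorted(summary.items()))


def run_pipeline(sources):
    return present(analyse(prepare(acquire(sources))))


# Hypothetical sample input: (record id, value) pairs per source.
sources = {
    "sensor_a": [(1, 1.0), (2, 2.0), (2, 2.0), (3, "bad")],
    "sensor_b": [(1, 10), (2, None), (3, 20)],
}
report = run_pipeline(sources)
print(report)
```

Each function consumes only the output of the previous one, which mirrors the idea that a benchmark can target a single step or several consecutive steps.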
From a benchmarking point of view, typical streaming and ingestion benchmarks fall under Data Acquisition/Collection, database benchmarks under Data Storage/Preparation, AI/ML benchmarks under Data Analytics/AI/ML, and the (few) visualisation/interaction benchmarks under Action/Interaction, Visualisation/Access, which also covers action/interaction at the outer boundary of the system (e.g. for embedded systems).
Some larger benchmark suites cover benchmarks in more than one of these areas.
The pipeline steps map onto the BDV Reference Model, explained in the Generic Data Analytics Blueprint, as follows:
| Steps of the DataBench Data Pipeline | Corresponding Steps of the BDV Reference Model |
|---|---|
| Data Acquisition/Collection | Data Management + Data Processing + Things/Assets |
| Data Storage/Preparation | Data Management |
| Data Analytics/AI/ML | Data Analytics |
| Action/Interaction, Visualisation/Access | Data Visualisation |
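The mapping in the table can be held as a simple lookup structure, which also makes the reverse question easy to answer: which pipeline steps touch a given BDV Reference Model area? The sketch below is illustrative only; the names `PIPELINE_TO_BDV` and `pipeline_steps_for` are hypothetical and not part of the DataBench Toolbox.

```python
# Pipeline step -> BDV Reference Model areas, copied from the table above.
PIPELINE_TO_BDV = {
    "Data Acquisition/Collection": ["Data Management", "Data Processing", "Things/Assets"],
    "Data Storage/Preparation": ["Data Management"],
    "Data Analytics/AI/ML": ["Data Analytics"],
    "Action/Interaction, Visualisation/Access": ["Data Visualisation"],
}


def pipeline_steps_for(bdv_area):
    """Reverse lookup: pipeline steps that map onto the given BDV area."""
    return [step for step, areas in PIPELINE_TO_BDV.items() if bdv_area in areas]


print(pipeline_steps_for("Data Management"))
```

The reverse lookup shows that the mapping is not one-to-one: Data Management is shared by the first two pipeline steps.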
For each benchmark, DataBench classifies which pipeline steps it covers (one or more), which processing approach it uses (batch, stream, interactive), and which Big Data types it covers.
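One way to picture this classification is as a small record per benchmark plus a filter over a catalogue. The sketch below is an assumption-laden illustration: the class and function names, the catalogue entries (`example-ingest-bench`, `example-db-bench`, `example-suite`) and their classifications are invented for the example and do not describe real benchmarks or the actual Toolbox data model.

```python
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    ACQUISITION = "Data Acquisition/Collection"
    STORAGE = "Data Storage/Preparation"
    ANALYTICS = "Data Analytics/AI/ML"
    VISUALISATION = "Action/Interaction, Visualisation/Access"


class Processing(Enum):
    BATCH = "batch"
    STREAM = "stream"
    INTERACTIVE = "interactive"


@dataclass(frozen=True)
class BenchmarkProfile:
    name: str
    steps: frozenset        # one or more Step values
    processing: frozenset   # one or more Processing values
    data_types: frozenset   # free-text data type labels


def search(catalogue, step=None, processing=None, data_type=None):
    """Return names of benchmarks matching every criterion that is given."""
    return [
        b.name
        for b in catalogue
        if (step is None or step in b.steps)
        and (processing is None or processing in b.processing)
        and (data_type is None or data_type in b.data_types)
    ]


# Hypothetical catalogue entries, for illustration only.
CATALOGUE = [
    BenchmarkProfile(
        "example-ingest-bench",
        frozenset({Step.ACQUISITION}),
        frozenset({Processing.STREAM}),
        frozenset({"time series"}),
    ),
    BenchmarkProfile(
        "example-db-bench",
        frozenset({Step.STORAGE}),
        frozenset({Processing.BATCH, Processing.INTERACTIVE}),
        frozenset({"structured"}),
    ),
    BenchmarkProfile(
        "example-suite",  # a larger suite covering several steps
        frozenset({Step.ACQUISITION, Step.STORAGE, Step.ANALYTICS}),
        frozenset({Processing.BATCH, Processing.STREAM}),
        frozenset({"structured", "text"}),
    ),
]

by_step = search(CATALOGUE, step=Step.ACQUISITION)
by_step_and_mode = search(CATALOGUE, step=Step.STORAGE, processing=Processing.BATCH)
print(by_step)
print(by_step_and_mode)
```

Using sets for each dimension lets a single entry, such as a larger suite, appear under several pipeline steps and processing approaches at once.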
This pipeline is quite high level. It can therefore easily be specialised to describe more specific pipelines, depending on the type of data and the type of processing (e.g. IoT data and real-time processing). The 3D cube depicts the steps of this pipeline in relation to the type of data processing and the type of data being processed. As the figure shows, the type of data processing, which has been identified as a separate topic area in the BDV Reference Model, is orthogonal to the pipeline steps and the data types. This is because different processing types, such as batch/data-at-rest, real-time/data-in-motion and interactive, can span different pipeline steps and can handle different data types, such as those identified in the BDV Reference Model, within each of the pipeline steps. The data types include structured data; time series data; geospatial data; media, image, video and audio data; text data, including natural language data; and graph data, network/web data and metadata. Each of these can imply differences in storage and analytics techniques.
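Because the three dimensions are orthogonal, the cube can be thought of as the Cartesian product of pipeline steps, processing types and data types, and a specialised pipeline as a slice through it. The following sketch makes two assumptions that go beyond the text: the data types are grouped into six categories, and "IoT data" is represented by the time series category. The names `CUBE` and `specialise` are hypothetical.

```python
from itertools import product

PIPELINE_STEPS = [
    "Data Acquisition/Collection",
    "Data Storage/Preparation",
    "Data Analytics/AI/ML",
    "Action/Interaction, Visualisation/Access",
]
PROCESSING_TYPES = [
    "batch (data-at-rest)",
    "real-time (data-in-motion)",
    "interactive",
]
# Assumed grouping of the data types listed above into six categories.
DATA_TYPES = [
    "structured",
    "time series",
    "geospatial",
    "media (image, video, audio)",
    "text (incl. natural language)",
    "graph, network/web, metadata",
]

# Every cell of the cube is one (step, processing type, data type) combination.
CUBE = list(product(PIPELINE_STEPS, PROCESSING_TYPES, DATA_TYPES))


def specialise(processing, data_type):
    """Slice the cube: the pipeline steps for one processing type and data type."""
    return [step for step, p, d in CUBE if p == processing and d == data_type]


# Assumption: IoT data is treated here as time series data.
iot_pipeline = specialise("real-time (data-in-motion)", "time series")
print(len(CUBE))
print(iot_pipeline)
```

Fixing the processing type and the data type leaves the four generic steps intact, which is the sense in which the generic pipeline is specialised rather than replaced.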
If you would like to search for these elements (pipelines and BDV Reference Model), please use the Search by Pipeline/Blueprint or Search by BDV Reference Model options available under the search menu of the DataBench Toolbox.