DataBench Generic Data Pipeline
The DataBench Framework is organised along three main dimensions: pipeline steps, processing types and data types. Other dimensions also exist, such as the computing continuum (edge, on-premise, fog, cloud, HPC, etc.). In addition, there are business dimensions such as application domains.
This nugget presents the DataBench top-level Data Pipeline. It is based on many other efforts in the community, such as the Data Value Chain proposed by the NIST Big Data Reference Architecture (NBDRA) and the steps in the BDV Reference Model.
The top-level pipeline, as seen in the figure, now has four steps (Data Ingestion, Data Storage, Data Analytics/ML and Data Visualisation/Action/Interaction):
- Data Acquisition/Collection: In general, this step handles the interface with the data providers and includes the transportation of data from various sources to a storage medium where it can be accessed, used, and analysed by an organization. Depending on the application implementation, tasks in this step include accepting or performing specific collections of data, pulling data or receiving pushes of data from data providers, and storing or buffering data. The Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) cycle can also be included in this step. At the initial collection stage, sets of data are collected and combined. Initial metadata can also be created to facilitate subsequent aggregation or look-up methods. Security and privacy considerations may also be included in this step, since authentication and authorization, as well as recording and maintaining data provenance, are usually performed during data collection. Finally, note that tasks in this step may vary depending on the type of data collected.
- Data Storage/Preparation: Tasks performed in this step include data validation (for example, checking formats); data cleansing (such as removing outliers or bad fields); extraction of useful information; organization and integration of data collected from various sources; leveraging metadata keys to create an expanded and enhanced dataset; annotation, publication and presentation of the data so that it is available for discovery, reuse and preservation; and standardization, reformatting or encapsulation. In this step, source data are also frequently persisted to archive storage, and provenance data are verified or associated. The transformation part of the ETL/ELT cycle could also be performed in this step, although advanced transformation is usually included in the next step, which is related to data analytics. Optimization of data through manipulations such as deduplication and indexing could also be included here in order to optimize the analytics process.
- Analytics/AI/Machine Learning: In this step, new patterns and relationships, which might otherwise be invisible, are discovered so as to provide new insights. The extraction of knowledge from the data is based on the requirements of the vertical application, which specify the data processing algorithms. This step can be considered the most important one, as it explores meaningful values and is thus the basis for giving suggestions and making decisions. Hashing, indexing and parallel computing are some of the methods used for Big Data analysis. Machine learning techniques and Artificial Intelligence are also used here, depending on the application requirements.
- Action/Interaction, Visualisation/Access: Data can have no value unless it is interpreted. Visualization assists in the interpretation of data by creating graphical representations of the information conveyed, thus adding more value to the data. This is because the human brain processes information much better when it is presented in charts or graphs rather than in spreadsheets or reports. Visualization is therefore an essential step, as it helps users comprehend large amounts of complex data, interact with them, and make decisions according to the results. It is worth noting that effective data visualization needs to keep a balance between the visuals it provides and the way it provides them, so that it attracts users’ attention and conveys the right messages.
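The four steps above can be sketched as a chain of small functions. This is only an illustrative toy, not part of the DataBench Toolbox: the function names (`acquire`, `prepare`, `analyse`, `visualise`), the record layout and the sample sensor data are all assumptions made for the example, and each function implements just one or two of the tasks listed for its step.

```python
from statistics import median


def acquire(sources):
    """Data Acquisition/Collection: pull rows from each provider and
    attach initial metadata (here only the source name)."""
    records = []
    for name, rows in sources.items():
        for row in rows:
            records.append({"source": name, "value": row})
    return records


def prepare(records):
    """Data Storage/Preparation: validate formats, drop bad fields
    and deduplicate."""
    seen = set()
    clean = []
    for rec in records:
        value = rec["value"]
        # Validation: keep numeric values only (bool is not a measurement).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Deduplication: keep the first occurrence per (source, value).
        key = (rec["source"], value)
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean


def analyse(records):
    """Analytics/AI/ML: a stand-in that computes per-source statistics."""
    by_source = {}
    for rec in records:
        by_source.setdefault(rec["source"], []).append(rec["value"])
    return {
        name: {"count": len(values), "median": median(values)}
        for name, values in by_source.items()
    }


def visualise(summary):
    """Action/Interaction, Visualisation/Access: render a text bar chart."""
    lines = []
    for name in sorted(summary):
        stats = summary[name]
        bar = "#" * stats["count"]
        lines.append(f"{name:<10} {bar} median={stats['median']}")
    return "\n".join(lines)


def run_pipeline(sources):
    """Run the four steps in order."""
    return visualise(analyse(prepare(acquire(sources))))


if __name__ == "__main__":
    # Hypothetical input: two providers, one with a duplicate and a bad field.
    demo = {"sensor_a": [1.0, 2.0, 2.0, "n/a"], "sensor_b": [10, 30]}
    print(run_pipeline(demo))
```

Keeping each step as a separate function mirrors the way benchmarks are assigned to individual steps: any one stage can be timed or swapped out without touching the others.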
From the benchmarking point of view, typical streaming and ingestion benchmarks fall under Data Acquisition/Collection, database benchmarks under Data Storage/Preparation, AI/ML benchmarks under Data Analytics/AI/ML, and the (few) visualisation/interaction benchmarks under Action/Interaction, Visualisation/Access (which also includes outer-boundary action/interaction, e.g. for embedded systems).
Some larger benchmark suites cover benchmarks in more than one of these areas.
The pipeline steps map onto the BDV Reference Model, explained in the Generic Data Analytics Blueprint, as follows:
| Steps of the DataBench Data Pipeline | Corresponding Steps of the BDV Reference Model |
|---|---|
| Data Acquisition/Collection | Data Management + Data Processing + Things/Assets |
| Data Storage/Preparation | Data Management |
| Data Analytics/AI/ML | Data Analytics |
| Action/Interaction, Visualisation/Access | Data Visualisation |
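For programmatic use, the mapping in the table can be held in a plain dictionary. The sketch below copies the table contents verbatim; the names `BDV_MAPPING` and `pipeline_steps_for` are hypothetical and do not refer to any DataBench API.

```python
# Pipeline step -> BDV Reference Model areas, copied from the table above.
BDV_MAPPING = {
    "Data Acquisition/Collection": [
        "Data Management",
        "Data Processing",
        "Things/Assets",
    ],
    "Data Storage/Preparation": ["Data Management"],
    "Data Analytics/AI/ML": ["Data Analytics"],
    "Action/Interaction, Visualisation/Access": ["Data Visualisation"],
}


def pipeline_steps_for(bdv_area):
    """Reverse lookup: which pipeline steps map onto a given BDV area?"""
    return [step for step, areas in BDV_MAPPING.items() if bdv_area in areas]
```

The reverse lookup shows that the mapping is not one-to-one: `pipeline_steps_for("Data Management")` returns both Data Acquisition/Collection and Data Storage/Preparation.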
For each benchmark, DataBench classifies which pipeline steps it covers (one or more), which processing approach it uses (batch, stream, interactive), and which big data types it covers.
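One way to picture this classification is as a small record per benchmark that can be filtered along each dimension. The following is a minimal sketch under stated assumptions: `BenchmarkProfile`, `search`, the two catalogue entries and the data type labels are invented for illustration and do not reflect the Toolbox's actual data model or contents.

```python
from dataclasses import dataclass

PIPELINE_STEPS = (
    "Data Acquisition/Collection",
    "Data Storage/Preparation",
    "Data Analytics/AI/ML",
    "Action/Interaction, Visualisation/Access",
)
PROCESSING_TYPES = ("batch", "stream", "interactive")


@dataclass(frozen=True)
class BenchmarkProfile:
    """Hypothetical record: one benchmark classified along three dimensions."""
    name: str
    steps: frozenset        # one or more pipeline steps
    processing: frozenset   # subset of PROCESSING_TYPES
    data_types: frozenset   # free-form labels (assumed, not an official list)

    def __post_init__(self):
        # Reject labels outside the known pipeline steps / processing types.
        bad = (self.steps - set(PIPELINE_STEPS)) | (
            self.processing - set(PROCESSING_TYPES)
        )
        if bad:
            raise ValueError(f"unknown classification labels: {sorted(bad)}")


def search(catalogue, step=None, processing=None, data_type=None):
    """Return names of benchmarks matching every criterion that is given."""
    hits = []
    for bench in catalogue:
        if step is not None and step not in bench.steps:
            continue
        if processing is not None and processing not in bench.processing:
            continue
        if data_type is not None and data_type not in bench.data_types:
            continue
        hits.append(bench.name)
    return hits


# Made-up entries; real benchmarks and their classifications will differ.
CATALOGUE = [
    BenchmarkProfile(
        name="example-stream-bench",
        steps=frozenset({"Data Acquisition/Collection"}),
        processing=frozenset({"stream"}),
        data_types=frozenset({"time series"}),
    ),
    BenchmarkProfile(
        name="example-suite",
        steps=frozenset({"Data Storage/Preparation", "Data Analytics/AI/ML"}),
        processing=frozenset({"batch", "interactive"}),
        data_types=frozenset({"structured", "text"}),
    ),
]
```

Because `steps` is a set, a larger suite that spans several pipeline steps is represented by a single entry and is found when searching for any of the steps it covers.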
To search by these elements (pipeline steps and the BDV Reference Model), use the Search by Pipeline/Blueprint or Search by BDV Reference Model options available under the search menu of the DataBench Toolbox.