Data ingestion as the basic building block of an industrial data platform – Part 2 

While the first part of the article series provided an overview of the basics of data ingestion and the various data sources, this article focuses on the specific processes of data ingestion, common challenges, and tried-and-tested solutions.

Architectures and patterns for data ingestion 

Batch vs. microbatch vs. streaming 

For data ingestion in industrial data platforms, there are various strategies that can be selected depending on the use case and requirements. In this article, we focus on batch ingestion while briefly introducing streaming ingestion and microbatch ingestion. However, we will look at these two topics in more detail in another part of the article series. 

Streaming ingestion refers to the continuous ingestion of data streams in real time. This method allows companies to process and react to data immediately, which is particularly beneficial for applications that require real-time analysis. The benefits of streaming ingestion lie in the ability to gain immediate insights and react quickly to changes in production processes. An example of this would be data from sensors that record the ambient temperature in production and are used as an influencing variable for process control. 

Batch ingestion, on the other hand, deals with the processing of large volumes of data at regular intervals. This method is particularly suitable for scenarios in which data is not required in real time but can be collected in large quantities and processed at a later point in time. Batch ingestion is often more cost-efficient and easier to implement, as it requires fewer resources and the processing takes place at scheduled intervals. This would include data that is required for reporting at set times, such as morning rounds or CIP. 
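
As an illustration, the following minimal sketch shows a scheduled batch job in Python that picks up files from a landing directory and appends them to a target table. The directory, database, and table names are hypothetical placeholders; in production, a scheduler such as cron or an orchestrator would trigger the run, for example once per shift or once per night.

```python
import sqlite3
from pathlib import Path

import pandas as pd

LANDING_DIR = Path("landing/production_counts")  # hypothetical file drop folder
TARGET_DB = "platform.db"                        # hypothetical target store

def run_batch_ingestion() -> None:
    """Load all files that arrived since the last run into the target table."""
    conn = sqlite3.connect(TARGET_DB)
    try:
        for csv_file in sorted(LANDING_DIR.glob("*.csv")):
            batch = pd.read_csv(csv_file)
            # Append the batch; validation and deduplication would go here.
            batch.to_sql("production_counts", conn, if_exists="append", index=False)
            # Mark the file as processed so the next run does not re-ingest it.
            csv_file.rename(csv_file.with_name(csv_file.name + ".done"))
    finally:
        conn.close()

if __name__ == "__main__":
    # Triggered by a scheduler in practice, e.g. once per shift or per night.
    run_batch_ingestion()
```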

Microbatch ingestion is an intermediate solution between batch and streaming ingestion. With this approach, data is collected and processed at very short, regular intervals, often in the range of minutes or seconds. Microbatching offers a balance between the efficiency of batch processing and the timeliness of streaming processing. It is well suited for applications that require near real-time processing but do not justify the complexity and infrastructure of full streaming. An example would be quality data that is not used for automatic process control but that should still be acted on promptly. 
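
For a concrete picture: microbatching is, for example, how Spark Structured Streaming processes data once a trigger interval is set. The sketch below is illustrative only; the broker, topic, and paths are placeholders, and running it additionally requires the Spark Kafka connector package.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality-microbatch").getOrCreate()

# Read quality measurements from a (hypothetical) Kafka topic.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "quality-measurements")       # placeholder topic
    .load()
)

# The trigger interval is what makes this a microbatch: Spark collects
# everything that arrived in the last 60 seconds and processes it as one
# small batch instead of record by record.
query = (
    raw.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("parquet")
    .option("path", "lake/quality/")                   # placeholder sink
    .option("checkpointLocation", "checkpoints/quality/")
    .trigger(processingTime="60 seconds")
    .start()
)

query.awaitTermination()
```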

Figure 1: Methods of data ingestion

The connection between these methods lies in the frequency and granularity of data processing: 

  • Batch ingestion is ideal for applications that process large amounts of data at less time-critical intervals. 
  • Microbatch ingestion offers a faster response time than batch, without the complexity of streaming, and is ideal for applications that require frequent but not continuous updates. 
  • Streaming ingestion is the best choice for applications that need to process continuous streams of data in real time. 

Selecting the ingestion method in this differentiated way makes it possible both to optimize costs and to ensure that each application is supplied with the data it needs. 

ETL vs. ELT 

Another important aspect of data ingestion is the ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) patterns. With ETL, data is first extracted, then transformed and finally loaded into the target system. This pattern is well suited to structured data, such as transaction data from ERP systems, and offers the advantage that the data is cleansed and optimized before loading. This ensures that only high-quality data reaches the target system, which increases data integrity. 

In contrast, ELT loads the data into the target system first and then transforms it there. This is particularly beneficial for large amounts of data and unstructured data, such as sensor data from IoT devices or log files, as it increases flexibility and improves processing speed. By using cloud-based databases that enable elastic scaling, companies can process large amounts of data more efficiently. In the manufacturing industry, the choice between ETL and ELT can vary depending on specific requirements and data types. For example, a company that primarily processes structured data might prefer ETL, while a company that analyzes large amounts of unstructured data would lean towards ELT. 
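
The contrast can be sketched in a few lines. In the following illustrative example (the file and column names are assumptions, with SQLite standing in for a data warehouse), the ETL path cleanses the data in Python before loading, while the ELT path loads the raw data as-is and defers the transformation to SQL in the target system.

```python
import sqlite3

import pandas as pd

df = pd.read_csv("erp_orders.csv")      # hypothetical extract from an ERP export
conn = sqlite3.connect("warehouse.db")  # SQLite as a stand-in for a warehouse

# ETL: transform first, then load only the cleansed, optimized data.
cleansed = df.dropna(subset=["order_id"]).assign(
    amount_eur=lambda d: d["amount_cents"] / 100  # unit conversion before loading
)
cleansed.to_sql("orders", conn, if_exists="replace", index=False)

# ELT: load the raw data unchanged, transform later inside the target system.
df.to_sql("orders_raw", conn, if_exists="replace", index=False)
conn.execute("""
    CREATE VIEW IF NOT EXISTS orders_clean AS
    SELECT order_id, amount_cents / 100.0 AS amount_eur
    FROM orders_raw
    WHERE order_id IS NOT NULL
""")
conn.commit()
```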

Data pipelines 

Data pipelines play a crucial role in automating the data ingestion process. In particular, ingestion pipelines are responsible for collecting data from various sources and transferring it to the central data platform. These pipelines can ingest both batch and streaming data and enable seamless integration of data into the data architecture. 

Common architectures for data pipelines include the following; a minimal sketch of such a pipeline follows the list: 

  • Batch pipelines: These pipelines collect data at set intervals and load it into the data platform. They are ideal for processing large volumes of data that are not required in real time. 
  • Streaming pipelines: These pipelines process continuous streams of data in real time and enable immediate analysis and reaction to changes. 
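
A minimal way to picture such a pipeline is as a chain of small, testable stages, as sketched below. The source and records are placeholders; real pipelines usually run on an orchestration framework (such as Airflow or Dagster) that adds scheduling, retries, and monitoring.

```python
from typing import Iterable

Record = dict

def extract() -> Iterable[Record]:
    # Placeholder source; in practice an OPC UA server, REST API, or file drop.
    yield {"machine": "press_01", "temp_c": 71.4}
    yield {"machine": "press_02", "temp_c": None}

def validate(records: Iterable[Record]) -> Iterable[Record]:
    # Drop incomplete readings here; flagging them would be the alternative.
    for record in records:
        if record["temp_c"] is not None:
            yield record

def load(records: Iterable[Record]) -> None:
    for record in records:
        print("loading", record)  # stand-in for a database or data lake write

def run_pipeline() -> None:
    # Each stage is independently testable; the pipeline is just composition.
    load(validate(extract()))

if __name__ == "__main__":
    run_pipeline()
```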

Use cases and scenarios in the manufacturing industry 

The choice of the right data ingestion method depends heavily on the specific use cases and scenarios in the manufacturing industry. Here are some examples: 

  1. Predictive maintenance: Data ingestion is used to continuously collect sensor data from machines. This data can be analyzed to identify patterns and make predictions about potential equipment failures. By identifying problems in good time, companies can minimize downtime and reduce maintenance costs. One example is monitoring temperatures, pressures, and performance data of machines to detect anomalies in the short term and predict wear and maintenance trends in the long term. 
  2. Process optimization: Real-time monitoring of production processes is crucial for efficiency. Data ingestion makes it possible to collect data on machine performance, production speed, and material consumption in real time. This information can be used to optimize processes and identify bottlenecks, resulting in higher productivity. In practice, this is the case with bottleneck analysis, for example. 
  3. Quality control: Data ingestion also plays an important role in quality control. By analyzing production data, companies can identify deviations from quality standards at an early stage and take appropriate action. This leads to an improvement in product quality and a reduction in rejects and rework. One example of this is trend detection, which can be used to predict and avoid potential errors based on continuous deviations within the tolerance limits (a code sketch of this idea follows Figure 2). 

Figure 2: Use case-specific choice of data ingestion method
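
To make the quality-control example concrete, the following sketch flags a drift toward a tolerance limit before the limit is actually violated. The nominal value, tolerance, warning fraction, and window size are illustrative assumptions.

```python
from collections import deque

NOMINAL = 10.00          # illustrative target dimension in mm
UPPER_TOLERANCE = 10.05  # illustrative upper spec limit
WARN_FRACTION = 0.6      # warn once 60 % of the tolerance band is used up

def detect_drift(measurements, window=5):
    """Warn when the moving average drifts toward the tolerance limit."""
    recent = deque(maxlen=window)
    warn_level = NOMINAL + WARN_FRACTION * (UPPER_TOLERANCE - NOMINAL)
    for value in measurements:
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > warn_level:  # still inside tolerance, but trending outward
                yield f"drift warning: moving average {avg:.3f} approaches {UPPER_TOLERANCE}"

# A slow upward drift that never actually leaves the tolerance band:
readings = [10.000 + i * 0.004 for i in range(12)]
for alert in detect_drift(readings):
    print(alert)
```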

Decision matrix 

To select the appropriate data ingestion method for specific use cases, companies can use a decision matrix that takes into account factors such as data volume, processing frequency, and real-time requirements. This matrix helps determine the best data ingestion strategy to achieve the desired results in the manufacturing industry. 
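
Such a matrix can also be captured as a simple rule, for example as part of a team's architecture guidelines. The thresholds in this toy function are illustrative assumptions, not recommendations:

```python
def choose_ingestion_method(max_latency_seconds: float) -> str:
    """Toy decision rule mirroring a decision matrix; thresholds are examples."""
    if max_latency_seconds < 1:
        return "streaming"      # hard real-time requirements dominate
    if max_latency_seconds < 300:
        return "microbatch"     # near real time, e.g. minute-level triggers
    return "batch"              # latency-tolerant, often large volumes

print(choose_ingestion_method(0.1))    # sensor loop for process control -> streaming
print(choose_ingestion_method(60))     # quality data -> microbatch
print(choose_ingestion_method(86400))  # nightly reporting -> batch
```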

Figure 3: Example of a decision matrix

Overall, choosing the right architecture and pattern for data ingestion is critical to the success of an industrial data platform. By considering the specific requirements and use cases, companies can ensure that they have the right data at the right time to make informed decisions and optimize their processes.  

Challenges and best practices 

Implementing an effective data ingestion process in an industrial data platform comes with a number of challenges. To overcome these challenges and maximize the efficiency of data ingestion, companies should follow best practices. 

Data quality and validation 

Ensuring data quality is critical to the success of any data strategy. During the ingestion process, data must be continuously validated to ensure that it is accurate, consistent, and complete. Insufficient data quality can lead to incorrect analyses and decisions. Best practices for ensuring data quality include: 

  • Automated validation rules: Implement rules to check data integrity, such as format checks, range checks, and detection of duplicates. Depending on the scenario at hand, a decision must be made as to whether incorrect or incomplete data should be discarded during data ingestion or whether it should still be saved with appropriate flagging (a minimal sketch of this follows the list). 
  • Data cleansing: Perform regular data cleansing to identify and correct inconsistent or incorrect data. 
  • Monitoring: Set up monitoring tools to continuously monitor data quality and detect problems in real time. 
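
A minimal sketch of such validation rules, assuming sensor readings arrive as Python dictionaries (the field names and plausible ranges are illustrative), implementing the flag-instead-of-discard option mentioned above:

```python
from typing import Iterable

def validate(records: Iterable[dict]) -> Iterable[dict]:
    """Attach a quality flag instead of silently discarding bad records."""
    seen_ids = set()
    for record in records:
        issues = []
        temp = record.get("temp_c")
        if not isinstance(temp, (int, float)):
            issues.append("missing_or_invalid_temp")   # format check
        elif not -40 <= temp <= 200:
            issues.append("temp_out_of_range")         # plausibility range check
        if record.get("id") in seen_ids:
            issues.append("duplicate")                 # duplicate detection
        seen_ids.add(record.get("id"))
        yield {**record, "quality_issues": issues, "is_valid": not issues}

records = [
    {"id": 1, "temp_c": 72.5},
    {"id": 1, "temp_c": 72.5},   # duplicate
    {"id": 2, "temp_c": 999.0},  # outside the plausible range
]
for result in validate(records):
    print(result)
```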

Schema management 

Schema management is another challenge, especially when integrating data from different sources. Different data sources may have different data formats and structures, making it difficult to harmonize data. Best practices for schema management include: 

  • Schema evolution: Develop strategies for handling schema changes to ensure that new data formats can be easily integrated without disrupting existing processes (see the sketch after this list). 
  • Centralized metadata management: Use metadata catalogs to manage information about the structure and content of the data. This makes it easier to integrate and understand the data. 
  • Standardization: Implement standards for data formats and structures to simplify integration and ensure consistency. 
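
As one small building block, the sketch below normalizes incoming records against a target schema so that a newly added field does not break older producers, a basic form of backward-compatible schema evolution. The schema and field names are assumptions for illustration.

```python
TARGET_SCHEMA = {
    "machine_id": None,     # required field, no default
    "temp_c": None,         # required field, no default
    "line_id": "unknown",   # added in schema v2; default keeps old producers valid
}

def normalize(record: dict) -> dict:
    """Project a record onto the target schema, filling defaults for new fields."""
    normalized = {field: record.get(field, default)
                  for field, default in TARGET_SCHEMA.items()}
    # Fields the target schema does not know yet are preserved for later review
    # instead of being dropped silently.
    extras = {k: v for k, v in record.items() if k not in TARGET_SCHEMA}
    if extras:
        normalized["_unmapped"] = extras
    return normalized

print(normalize({"machine_id": "m1", "temp_c": 71.2}))
# -> {'machine_id': 'm1', 'temp_c': 71.2, 'line_id': 'unknown'}
```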

Security, data protection, governance 

Security and data protection are critical aspects of the data ingestion process. Organizations must ensure that sensitive data is protected and that they comply with all relevant data protection policies. Best practices in this area include: 

  • Access controls: Implement strict access controls to ensure that only authorized users can access sensitive data. 
  • Data encryption: Encrypt data both in transit and at rest to protect it from unauthorized access (an at-rest example follows this list). 
  • Compliance management: Ensure that all data processes comply with applicable data protection laws and policies, such as the GDPR or CCPA. 
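
As an illustration of encryption at rest, here is a sketch using the cryptography library's Fernet primitive. Key handling is deliberately simplified: in production, the key would come from a key management service and never sit next to the data, and encryption in transit would typically be handled at the connection level via TLS.

```python
from cryptography.fernet import Fernet

# In production the key would come from a key management service (KMS/HSM),
# never be generated ad hoc or stored next to the data.
key = Fernet.generate_key()
fernet = Fernet(key)

payload = b'{"machine_id": "m1", "operator": "jane.doe"}'  # sensitive record
token = fernet.encrypt(payload)    # this ciphertext is what gets written to disk
restored = fernet.decrypt(token)   # readable again only with the key
assert restored == payload
```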

Scalability and performance 

Scalability and performance optimization of data ingestion systems are critical to keep pace with growing data volumes and increasing data processing requirements. Best practices for optimizing scalability and performance include: 

  • Distributed architectures: Utilize distributed systems that are horizontally scalable to enable the processing of large amounts of data. 
  • Optimized queries: Reduce the load on data sources by avoiding repetitive queries from the same sources. Use existing data streams and topics where possible. 
  • Caching mechanisms: Implement caching mechanisms to serve frequently accessed data quickly and reduce the load on data sources (a small sketch follows this list). 
  • Load balancing: Use load balancing techniques to distribute data processing evenly across multiple resources and avoid bottlenecks. 
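
A minimal caching sketch with a time-to-live, assuming an expensive master-data lookup (for example against an ERP or MES system) that many pipeline runs would otherwise hit repeatedly; the function names and the TTL are illustrative:

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Cache results for a limited time to spare the underlying source."""
    def decorator(fn):
        cache: dict = {}
        @wraps(fn)
        def wrapper(*args):
            hit = cache.get(args)
            if hit and time.monotonic() - hit[1] < ttl_seconds:
                return hit[0]                      # fresh enough: serve from cache
            value = fn(*args)
            cache[args] = (value, time.monotonic())
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=300)
def machine_master_data(machine_id: str) -> dict:
    # Placeholder for an expensive ERP/MES lookup.
    print(f"querying source system for {machine_id} ...")
    return {"machine_id": machine_id, "line": "L1"}

machine_master_data("m1")   # hits the source system
machine_master_data("m1")   # served from cache for the next 5 minutes
```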

Essential issues from an architectural perspective 

In order to overcome the challenges in the data ingestion process, companies should ask themselves fundamental questions that can serve as a starting point for finding solutions: 

  • How do we ensure data quality during the ingestion process? 
  • What strategies have we implemented for schema management? 
  • How do we protect sensitive data and ensure compliance with data protection guidelines? 
  • Are our data ingestion systems scalable and powerful enough to handle the growth of our data? 
  • What technologies and tools can we use to optimize our data ingestion processes? 
  • How can redundancy and reliability of the ingestion system be ensured? 

By answering these questions and implementing these best practices, companies can successfully overcome the challenges of the data ingestion process and build a robust data infrastructure that meets the needs of the modern manufacturing industry. 

Conclusion 

Data ingestion is an essential part of any industrial data platform and plays a critical role in its success. In the manufacturing industry, effective data ingestion enables the collection, integration, and processing of large amounts of data from various sources, such as sensor data from machines, ERP systems, MES and SCADA systems, and quality control systems. 

The choice of the right ingestion method – whether batch, microbatch or streaming – depends on the specific requirements and use cases. While batch ingestion is suitable for processing large amounts of data at regular intervals, microbatch and streaming ingestion offer advantages for applications that require near real-time or real-time data processing. 

Challenges such as data quality and validation, schema management, security, privacy and governance, as well as scalability and performance need to be addressed to ensure a robust and efficient data architecture. By implementing best practices and answering essential questions, organizations can overcome these challenges and successfully operate their data platforms. 

Overall, effective data ingestion is a critical building block for optimizing production processes, improving quality control and predicting equipment failure in the manufacturing industry. It forms the basis for well-founded decisions and makes a significant contribution to increasing efficiency and competitiveness. 

This post was written by:

Christian Heinemann

Christian Heinemann is a graduate computer scientist and works as a solution architect at ZEISS Digital Innovation in Leipzig. His work focuses on the areas of distributed systems, cloud technologies and digitalization in manufacturing. Christian has more than 20 years of project experience in software development. He works together with various ZEISS units and external customers to develop and implement innovative solutions. 
