These notes cover the difference between batch and stream processing flows, the importance of monitoring and detecting data anomalies, OLTP vs. OLAP, moving data from source to target with a data pipeline, message systems and their types, and a use case scenario.
Typology: Cheat Sheet
Batch and stream processing flow:
- Windowing in streaming: sliding, fixed, and session windows (select based on the use case at hand).
- Monitoring: check whether the running application has problems; if there is a problem, we have to deal with it. Detect data anomaly errors.
- Streaming lets us immediately process and analyze data as it appears (to make decisions). Time-critical cases require a streaming data pipeline.
- Non-analytical (transactional) data = OLTP. If the storage load is heavy, large companies avoid running analytics on OLTP because it impacts application performance.
- Analytical data = OLAP. So data from OLTP is moved to OLAP promptly, so that 1 million users are not disturbed by a slow application.
- Streaming is expensive because the service has to run 24 hours a day. This differs from a batch job, for example a monthly one, where the system runs once a month.
- If streaming processing time rises during an e-commerce promo, you need to increase the pipeline's specifications.
- Data pipeline = data flow; the basic task of a data pipeline is moving data from source to target, so there are processing steps inside the pipeline.
- A batch job is driven by a script that we create: we pull data from the source, rather than the source pushing it to us.
- Message system = lets two systems communicate; it enables two-way, asynchronous communication between applications (like chatting via WA: once the other party has read it, it has been delivered). Synchronous is like calling via WA: if they don't pick up, you wait, then call again. There are 2 types =
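The window types mentioned above (sliding, fixed, session) can be illustrated with a minimal sketch of the simplest case, a fixed (tumbling) window. The function name and the event format here are hypothetical, not from any specific engine:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group timestamped events into fixed (tumbling) windows and count them.

    events: iterable of (timestamp_seconds, payload) pairs.
    window_size: window length in seconds.
    Returns {window_start: event_count}.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Every event falls into exactly one non-overlapping window;
        # a sliding window would instead assign it to several.
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Example: click events with second-resolution timestamps, 10 s windows.
events = [(1, "a"), (4, "b"), (12, "c"), (14, "d"), (27, "e")]
print(tumbling_window_counts(events, 10))  # {0: 2, 10: 2, 20: 1}
```

A sliding window would overlap consecutive windows, and a session window would close only after a gap of inactivity; both build on the same grouping idea.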
Processing engines:
- Apache Kafka (Kafka Streams, which provides an API), Apache Flink (native streaming; it does not micro-batch), Apache Spark Streaming (not native streaming; rather low level, meaning it doesn't truly stream but micro-batches).
- Flink and Spark must have their own cluster; Kafka Streams needs no additional cluster. Messaging systems need clusters, so with Flink or Spark you end up with two clusters: one for Kafka and one for the engine itself.
- Delivery-guarantee cases differ: at-least-once vs. exactly-once.
- The streaming process involves windowing too.
- Data pushed = for example, you build an API that does CRUD against a database, say the transaction table.
- When the data becomes more complex, teams start to use event-driven architecture: events are sent to the messaging system. For example, a shopping event continues into the goods-delivery process, handled according to the event flow.
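The event-driven flow described above can be sketched with Python's standard library; the in-process queue stands in for a real messaging system (such as a Kafka topic), and the event shape and names are hypothetical:

```python
import queue
import threading

# In-process stand-in for a messaging system (e.g. a Kafka topic).
bus = queue.Queue()
processed = []  # records which events the consumer handled

def delivery_worker():
    """Consumer: reacts to purchase events by starting the delivery process."""
    while True:
        event = bus.get()
        if event is None:        # shutdown signal
            break
        processed.append(event)  # a real service would ship the goods here
        bus.task_done()

consumer = threading.Thread(target=delivery_worker)
consumer.start()

# Producer: publish a shopping event and move on immediately --
# it never waits for the consumer (asynchronous communication).
bus.put({"type": "purchase", "order_id": 42})

bus.join()       # block until the event has been consumed
bus.put(None)    # tell the consumer to stop
consumer.join()
print(processed)
```

The producer and consumer are decoupled: the producer returns as soon as the event is on the queue, which is what makes the communication asynchronous rather than request/response.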