
The concept of Data Pipeline and Data Ingestion
- Cost in a data pipeline is not always about money; it also covers storage space and time.
- Scalability means the pipeline stays capable as the data volume grows.
- Error handling: no process succeeds 100% of the time, so we have to handle errors when they occur (see the first sketch below).
- Logging = the information generated by the application. If the process is running but the logs stop appearing, we must look at our data pipeline again.
- Monitoring: for example, a job that normally finishes in 2 hours still has not finished; check whether the process is actually still running.
- Optimization: a faster pipeline saves more time.
- Consistency: a fast process is no good if the resulting data quality is poor.
- Tools that easily manage batch and streaming data pipelines: Google Cloud Dataflow, AWS Glue, Azure Data Factory.
- Apache Beam = executes the data pipeline; it is the programming model behind Google Cloud Dataflow (sketch below).
- Kafka = streaming (sketch below).
- Glue, Spark = batch.
- Airflow = workflow orchestration (sketch below).
- The SLA promises how long the service will be available; a good one is quoted to four nines (99.99%). The allowed downtime is counted in minutes, not hours (see the calculation below).
- Multi-region storage replicates the data in our cloud storage across servers in several regions; with a single region the data sits on a smaller set of hosts.
- BigQuery is more of a data warehouse; it is queried with SQL and has typed columns such as INTEGER and STRING (sketch below).
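A minimal sketch of the error-handling and logging points above, in plain Python; `load_batch` and the retry settings are hypothetical stand-ins for a real pipeline step:

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def load_batch(batch):
    """Hypothetical loading step; replace with a real sink."""
    if not batch:
        raise ValueError("empty batch")
    log.info("loaded %d records", len(batch))

def run(batches, max_retries=3):
    for i, batch in enumerate(batches):
        for attempt in range(1, max_retries + 1):
            try:
                load_batch(batch)
                break                        # success: move to the next batch
            except Exception:
                # Log the failure with its traceback instead of silently dying.
                log.exception("batch %d failed (attempt %d/%d)",
                              i, attempt, max_retries)
                time.sleep(2 ** attempt)     # back off before retrying
        else:
            log.error("batch %d gave up after %d retries", i, max_retries)

run([[1, 2, 3], [], [4, 5]])
```

If the log lines stop while the process is supposedly still running, that silence itself is the signal to inspect the pipeline, which is exactly the monitoring point in the notes.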
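Apache Beam pipelines are written once and handed to a runner; the same code runs locally on the DirectRunner or on Google Cloud Dataflow. A minimal local sketch, assuming the apache-beam package is installed:

```python
# pip install apache-beam
import apache_beam as beam

# A tiny batch pipeline: create three records, transform them, print them.
with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(["alpha", "beta", "gamma"])
     | "Upper"  >> beam.Map(str.upper)
     | "Print"  >> beam.Map(print))
```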
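To illustrate Kafka as the streaming piece, a sketch using the kafka-python client; the broker address and the `clicks` topic are assumptions for the example:

```python
# pip install kafka-python
from kafka import KafkaProducer, KafkaConsumer

# Producer: push an event onto the (hypothetical) "clicks" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"page": "/home"}')
producer.flush()

# Consumer: read the stream as messages arrive (this loop blocks).
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
```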
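Airflow orchestrates the steps rather than moving the data itself. A minimal DAG sketch in the Airflow 2.x style; the DAG id and task bodies are hypothetical, and older versions spell the `schedule` argument `schedule_interval`:

```python
# pip install apache-airflow
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def load():
    print("write data to the warehouse")

# A daily two-step pipeline: extract must succeed before load runs.
with DAG(dag_id="daily_ingest",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",        # Airflow 2.4+; earlier: schedule_interval
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load
```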
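The downtime arithmetic behind "four nines" shows why the allowed outage is counted in minutes:

```python
# Allowed downtime per year implied by an availability SLA.
for sla in (0.999, 0.9999, 0.99999):
    minutes_per_year = 365 * 24 * 60          # 525,600 minutes
    downtime = minutes_per_year * (1 - sla)
    print(f"{sla:.3%} uptime -> {downtime:,.1f} min of downtime/year")
```

At 99.99% availability that works out to about 52.6 minutes of permitted downtime per year.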
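BigQuery behaves like a SQL data warehouse with typed columns. A sketch using the official Python client against a public dataset (assumes GCP credentials are configured in the environment); `name` is a STRING column and `number` an INTEGER column:

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()  # picks up your default GCP credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)
```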