Data Pipeline - Data Ingestion, Cheat Sheet of Data Warehousing

DATA PIPELINE INGESTION
Cost in a data pipeline is not only about money; it also includes storage space and time
Scalability means the pipeline can still cope when the data volume gets bigger
Error handling: no pipeline succeeds 100% of the time, so errors must be caught and handled
Logging = information generated by the application. If a process is running but no log output appears, that is itself a warning sign: re-examine the pipeline
Monitoring = watching running jobs. For example, if a job that normally finishes in 2 hours still has not finished, check whether the process is still running or stuck. A minimal sketch of all three ideas follows
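Taken together, the error-handling, logging, and monitoring notes amount to a pattern like this minimal Python sketch (the retry budget, runtime budget, and function names are illustrative assumptions, not from the notes):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

MAX_RETRIES = 3                    # assumed retry budget
EXPECTED_RUNTIME_S = 2 * 60 * 60   # "normally finishes in 2 hours"

def run_step(step, records):
    """Run one pipeline step with retries, logging, and a runtime check."""
    start = time.time()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            log.info("step=%s attempt=%d starting", step.__name__, attempt)
            result = step(records)
            elapsed = time.time() - start
            log.info("step=%s finished in %.1fs", step.__name__, elapsed)
            if elapsed > EXPECTED_RUNTIME_S:
                # Monitoring: flag runs that took far longer than usual
                log.warning("step=%s ran %.0fs over its usual budget",
                            step.__name__, elapsed - EXPECTED_RUNTIME_S)
            return result
        except Exception:
            # Error handling: failures are expected; log them and retry
            log.exception("step=%s attempt=%d failed", step.__name__, attempt)
    raise RuntimeError(f"{step.__name__} failed after {MAX_RETRIES} attempts")
```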
Optimize = a faster pipeline saves more time (and often cost)
Consistency = speed alone is not enough; if the process is fast but the data is not good, the pipeline still fails its purpose
Tools that easily manage batch and streaming data pipelines: Google Cloud Dataflow, AWS Glue, Azure Data Factory
Apache Beam = a model and SDK for defining and executing data pipelines; Google Cloud Dataflow is a managed runner for Beam pipelines (see the sketch after this list)
Kafka = streaming (see the sketch after this list)
Glue, Spark = batch
Airflow = orchestration, i.e. scheduling and coordinating pipeline tasks (see the DAG sketch after this list)
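A minimal Apache Beam pipeline in Python, run locally with the default DirectRunner; the data and transforms are toy examples, and on Google Cloud Dataflow you would select the Dataflow runner via pipeline options instead:

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; Cloud Dataflow is just
# another runner selected via pipeline options.
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3, 4, 5])
        | "Square" >> beam.Map(lambda x: x * x)
        | "KeepEven" >> beam.Filter(lambda x: x % 2 == 0)
        | "Print" >> beam.Map(print)
    )
```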
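For the streaming side, a tiny Kafka round trip using the kafka-python package; the broker address and topic name are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce one event to a hypothetical "page_views" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page_views", b'{"user": "a", "url": "/home"}')
producer.flush()

# Consume events from the same topic, starting at the beginning
consumer = KafkaConsumer("page_views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)  # raw bytes of each event
```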
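And for orchestration, a minimal Airflow DAG that runs an extract task before a load task once a day; the DAG id, schedule, and task bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")      # placeholder task body

def load():
    print("write data to the warehouse")    # placeholder task body

with DAG(
    dag_id="daily_ingest",                  # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task               # extract must finish before load
```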
The SLA promises how long the service will be available; a good one is "four nines" (0.9999, i.e. 99.99%). At that level the allowed downtime is counted in minutes per year, not hours
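A quick check on the "minutes, not hours" claim:

```python
# Downtime allowed per year at each availability level
for label, availability in [("99.9%", 0.999), ("99.99%", 0.9999)]:
    minutes = (1 - availability) * 365 * 24 * 60
    print(f"{label}: {minutes:,.1f} minutes of downtime per year")
# 99.9%  ->  525.6 minutes (~8.8 hours)
# 99.99% ->   52.6 minutes -- four nines keeps downtime in minutes
```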
Multi-region storage replicates the data in our cloud storage across servers in several regions; a single-region bucket sits on hosts in just one, smaller location
BigQuery is more of a data warehouse; it is queried with SQL and its columns are typed (INTEGER, STRING, etc.)
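As an illustration, querying BigQuery with the google-cloud-bigquery Python client; the project, dataset, and table names are made up:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

query = """
    SELECT url, COUNT(*) AS views        -- typed columns: STRING, INTEGER
    FROM `my_project.analytics.page_views`
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.url, row.views)
```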
