Data Pipeline - Data Ingestion, Cheat Sheet of Data Warehousing

DATA PIPELINE INGESTION
Cost in a data pipeline is not only about money; it also includes storage space and time
Scalability means the pipeline can still cope when the data volume gets bigger
Error handling: no pipeline succeeds 100% of the time, so errors must be caught and handled
Logging = information generated by the application. If a process is running but no log output appears, that is itself a warning sign: re-examine the pipeline
Monitoring = watching running jobs. For example, if a job that normally finishes in 2 hours still has not finished, check whether the process is still running or stuck. A minimal sketch of all three ideas follows
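Taken together, the error-handling, logging, and monitoring notes amount to a pattern like this minimal Python sketch (the retry budget, runtime budget, and function names are illustrative assumptions, not from the notes):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

MAX_RETRIES = 3                    # assumed retry budget
EXPECTED_RUNTIME_S = 2 * 60 * 60   # "normally finishes in 2 hours"

def run_step(step, records):
    """Run one pipeline step with retries, logging, and a runtime check."""
    start = time.time()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            log.info("step=%s attempt=%d starting", step.__name__, attempt)
            result = step(records)
            elapsed = time.time() - start
            log.info("step=%s finished in %.1fs", step.__name__, elapsed)
            if elapsed > EXPECTED_RUNTIME_S:
                # Monitoring: flag runs that took far longer than usual
                log.warning("step=%s ran %.0fs over its usual budget",
                            step.__name__, elapsed - EXPECTED_RUNTIME_S)
            return result
        except Exception:
            # Error handling: failures are expected; log them and retry
            log.exception("step=%s attempt=%d failed", step.__name__, attempt)
    raise RuntimeError(f"{step.__name__} failed after {MAX_RETRIES} attempts")
```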
Optimize = a faster pipeline saves more time (and often cost)
Consistency = speed alone is not enough; if the process is fast but the data is not good, the pipeline still fails its purpose
Tools that easily manage batch and streaming data pipelines: Google Cloud Dataflow, AWS Glue, Azure Data Factory
Apache Beam = a model and SDK for defining and executing data pipelines; Google Cloud Dataflow is a managed runner for Beam pipelines (see the sketch after this list)
Kafka = streaming (see the sketch after this list)
Glue, Spark = batch
Airflow = orchestration, i.e. scheduling and coordinating pipeline tasks (see the DAG sketch after this list)
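A minimal Apache Beam pipeline in Python, run locally with the default DirectRunner; the data and transforms are toy examples, and on Google Cloud Dataflow you would select the Dataflow runner via pipeline options instead:

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; Cloud Dataflow is just
# another runner selected via pipeline options.
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3, 4, 5])
        | "Square" >> beam.Map(lambda x: x * x)
        | "KeepEven" >> beam.Filter(lambda x: x % 2 == 0)
        | "Print" >> beam.Map(print)
    )
```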
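For the streaming side, a tiny Kafka round trip using the kafka-python package; the broker address and topic name are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce one event to a hypothetical "page_views" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page_views", b'{"user": "a", "url": "/home"}')
producer.flush()

# Consume events from the same topic, starting at the beginning
consumer = KafkaConsumer("page_views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)  # raw bytes of each event
```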
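And for orchestration, a minimal Airflow DAG that runs an extract task before a load task once a day; the DAG id, schedule, and task bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")      # placeholder task body

def load():
    print("write data to the warehouse")    # placeholder task body

with DAG(
    dag_id="daily_ingest",                  # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task               # extract must finish before load
```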
The SLA promises how long the service will be available; a good one is "four nines" (0.9999, i.e. 99.99%). At that level the allowed downtime is counted in minutes per year, not hours
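A quick check on the "minutes, not hours" claim:

```python
# Downtime allowed per year at each availability level
for label, availability in [("99.9%", 0.999), ("99.99%", 0.9999)]:
    minutes = (1 - availability) * 365 * 24 * 60
    print(f"{label}: {minutes:,.1f} minutes of downtime per year")
# 99.9%  ->  525.6 minutes (~8.8 hours)
# 99.99% ->   52.6 minutes -- four nines keeps downtime in minutes
```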
Multi-region storage replicates the data in our cloud storage across servers in several regions; a single-region bucket sits on hosts in just one, smaller location
BigQuery is more of a data warehouse; it is queried with SQL and its columns are typed (INTEGER, STRING, etc.)
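As an illustration, querying BigQuery with the google-cloud-bigquery Python client; the project, dataset, and table names are made up:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

query = """
    SELECT url, COUNT(*) AS views        -- typed columns: STRING, INTEGER
    FROM `my_project.analytics.page_views`
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.url, row.views)
```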
