A case study of rewriting a simple data pipeline involving Python, a pinch of Go, Git workflows, Airflow, Postgres and the cloud. It investigates some common assumptions and principles of designing data pipelines, the benefits and issues with the tools, and how these may be handled. I hope this case study of a pipeline rewrite will give you insights applicable to Python use in your own data pipelines, and into cloud pricing.
Python and some exposure to the cloud. Some basics of databases and data pipelines may be useful.
Get answers on these options and more, to find the cheapest and most maintainable solution for this kind of data pipeline.
The following areas will be covered.
The approach to the architecture and reasons for the rewrite.
Python code for scraping data, along with Soda testing to verify each step's data (see the sketch after this list).
The architecture for client consumption of the data from our Go cloud service.
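As a taste of the Soda testing mentioned above, here is a minimal sketch of a scrape-load-verify step. It is not the talk's actual code: the URL, table name, connection details and checks are all hypothetical, and it assumes the soda-core Postgres package is installed.

```python
# Hypothetical scrape -> load -> verify step. Names and URLs are
# illustrative, not the pipeline from the talk.
import requests
import psycopg2
from soda.scan import Scan

# 1. Scrape: fetch a JSON price list.
resp = requests.get("https://example.com/prices.json", timeout=30)
resp.raise_for_status()
prices = resp.json()

# 2. Load: insert the rows into Postgres.
conn = psycopg2.connect("dbname=pipeline user=etl")
with conn, conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO raw_prices (sku, region, usd) VALUES (%s, %s, %s)",
        [(p["sku"], p["region"], p["usd"]) for p in prices],
    )

# 3. Verify: run a Soda Core scan and fail the step on bad data.
scan = Scan()
scan.set_data_source_name("pipeline_db")
scan.add_configuration_yaml_file(file_path="configuration.yml")  # Postgres connection
scan.add_sodacl_yaml_str(
    """
checks for raw_prices:
  - row_count > 0
  - missing_count(usd) = 0
"""
)
scan.execute()
scan.assert_no_checks_fail()  # raises if any check failed
```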
It is complicated! For example, hard disk storage cost should be simple, right? But it varies by hardware type, size, throughput and IOPS, with different prices depending on the zone (us-east2 vs. ap-south1, etc.) and charges for traffic between regions vs. within a region. Then there are the backup costs, based on retention size, retention time and schedule.
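To make those dimensions concrete, here is a back-of-the-envelope cost function. Every unit price in it is invented purely for illustration; real rates differ by provider, hardware type and zone.

```python
# Illustrative only: the unit prices below are made up to show the
# dimensions of disk pricing, not real AWS/GCP/Azure rates.
def monthly_disk_cost(size_gb, iops, cross_region_gb,
                      gb_month_price=0.08,        # varies by hardware type and zone
                      iops_month_price=0.005,     # provisioned IOPS, where charged
                      cross_region_gb_price=0.02):  # inter-region traffic
    return (size_gb * gb_month_price
            + iops * iops_month_price
            + cross_region_gb * cross_region_gb_price)

# 500 GB at 3000 provisioned IOPS with 100 GB of cross-region traffic:
# 500*0.08 + 3000*0.005 + 100*0.02 = 40 + 15 + 2 = 57.0 per month
print(monthly_disk_cost(size_gb=500, iops=3000, cross_region_gb=100))
```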
Cloud price lists are not small: the full combined price lists for AWS, Google and Azure run to 5 million prices.
We will review possible data sources and related tools, such as Kubecost, Infracost.io and the cloud providers' direct sources.
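As one example of a direct source, here is a hedged sketch of querying the AWS Price List API via boto3. The API itself is only served from a few regions (us-east-1 among them), and the filter fields shown are assumptions suitable for EBS storage products:

```python
# Sketch: pull EBS storage prices from the AWS Price List API.
# Filter fields are service-specific; these are for the AmazonEC2
# service code, and the attribute names are assumptions.
import json
import boto3

pricing = boto3.client("pricing", region_name="us-east-1")
resp = pricing.get_products(
    ServiceCode="AmazonEC2",
    Filters=[
        {"Type": "TERM_MATCH", "Field": "productFamily", "Value": "Storage"},
        {"Type": "TERM_MATCH", "Field": "location", "Value": "US East (Ohio)"},
    ],
    MaxResults=10,
)
for item in resp["PriceList"]:  # each entry is a JSON string
    product = json.loads(item)
    print(product["product"]["attributes"].get("volumeApiName"))
```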
Often data scientists put together a PoC pipeline without considering cost; it becomes production and starts eating money!
Charges for cloud SaaS are more concealed than those for raw cloud PaaS, but they must be assessed too, and they have some interesting pricing. For example, if your data pipeline is open source and public, GitHub workflows are free! If it is private, GitHub charges twice as much for compute as using Azure directly.
Similarly, managed pipeline providers are big in this space, but how does Astronomer's SaaS pricing stack up vs. running Airflow yourself on cloud PaaS?
Cloud developer at EDB, the Postgres company. Bristol, United Kingdom
I have been a cloud engineer for the last 9 years, working in Golang for the last 6 and in Python for the last 20. Before that I was a web developer. I have spoken at KubeCon, DjangoCon Europe and EuroPython, as well as many times at local techie groups, including regularly at my local Golang meetup.
Currently I spend my time developing microservices for EDB's cross-cloud Postgres AI product, built on our CNCF Kubernetes Postgres operator, cloudnative-pg, plus the occasional piece of Python automation. I am interested in improving DevX for the full lifecycle of cloud development, and in the cost and sustainability of cloud use.