
Python Data Engineering - Learn Data Engineering with Python: visualize, manage and manipulate data at scale

Intermediate · 18h 44min · Last updated 08/2025
Embark on the Python Data Engineering skill path, tailored for software engineers aiming to master data management. This course guides you from the fundamentals of Python to advanced techniques in data manipulation and management. You'll learn to efficiently process and analyze large datasets using pandas and NumPy, and visualize your insights with Matplotlib. The curriculum covers all phases of data treatment—from initial data wrangling to the final presentation of your findings. Ideal for software engineers seeking to enhance their data engineering skills and work effectively with data at scale.
Content (39)
What is Data Engineering
In this chapter we take a closer look at what data engineering actually involves. You will learn the tasks and core competencies of data engineers, and we will introduce a few key concepts such as ETL.
Data engineering skills
To get started with this path, we need to talk about what data engineering is in the first place: as a data engineer you will need to extract data from a variety of sources, organize that data, and finally export it so other people and applications can use it.
Link
Intro to ETL tools
20minBasic
ETL is the foundation of data engineering. It stands for extract, transform, and load, which describes fairly well what you will be doing as a data engineer.
How an experienced Data Engineer would learn the field.
It is easy to confuse data engineering with data analysis, as the fields are connected and overlap in places. In the most basic sense, however, a data engineer extracts and prepares the data, while a data analyst uses that data to get deeper insights.
As a data engineer you will work with a broad range of sources and will thus be required to learn various tools. The primary ones are usually Python and SQL, but, depending on the project, you might have to learn other languages as well.
Link
Data Engineer wiki
5minBasic
Data engineers have created a wiki that contains a lot of useful learning tools and a glossary of technical terms; it also contains events and communities. Definitely a page you want to bookmark.
Python Data Tools: how to manage data at scale with Python
This chapter is fully focused on the tools you can use with Python to manage large datasets. You will learn how to use Jupyter notebooks, NumPy, and Pandas.
Jupyter Notebook
NumPy
Pandas
Python
Video
Jupyter notebooks
25minBasic
When working with data you usually don't use a traditional code editor like VS Code but rather Jupyter notebooks: these let you split your code into cells and run each cell separately, and they include nice visual features for displaying graphs and tables.
The single most important tool in Python for data analysis is the array, which comes from NumPy. An array is essentially a list that is much faster and more powerful.
While NumPy arrays are fundamentally simple, you can do a lot with them. Reading through the documentation will give you a good idea of what is possible.
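As a minimal sketch of what makes arrays powerful: NumPy applies an operation to every element at once (vectorization), with no explicit Python loop. The values here are invented for illustration.

```python
import numpy as np

# A NumPy array supports whole-array arithmetic without a Python loop.
prices = np.array([9.99, 14.50, 3.25, 20.00])

# Apply a 10% discount to every element at once (vectorized).
discounted = prices * 0.9

# Boolean masks select the elements that satisfy a condition.
cheap = prices[prices < 10]

print(discounted)  # element-wise result
print(cheap)       # only the elements under 10
```

The same pattern (one operation over the whole array) is what makes NumPy so much faster than looping over a plain Python list.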
NumPy arrays hold a single set of values, but usually you are working with lots of related data that you need to organize. For that you have Pandas, which gives you the DataFrame: it combines multiple NumPy arrays into a larger table that lets you organize things efficiently.
As with NumPy, Pandas DataFrames are fundamentally not too complicated, but you can do a lot with them. Hence, it is really important to read through the documentation and user guide.
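A quick sketch of a DataFrame in action; the column names and values are made up for illustration.

```python
import pandas as pd

# A DataFrame is a table whose columns behave like NumPy arrays.
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Rome"],
    "population_m": [3.6, 2.1, 2.8],
})

# Columns support vectorized operations, just like NumPy arrays.
df["population"] = df["population_m"] * 1_000_000

# Label-based filtering reads almost like a query.
big = df[df["population_m"] > 2.5]
print(big[["city", "population"]])
```

Notice how the filter expression returns a new DataFrame, so operations can be chained step by step.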
To really get good at Pandas you need to practice. The linked website contains 101 exercises; if you get through all of them, you should have a solid understanding of the library.
Extracting data with Python and other tools
In this chapter, we dive deep into data extraction using various techniques. We will use Python and SQL in a variety of environments and use cases.
Data Mapping and Extraction
Microsoft Excel
Selenium
SQL
The first step in working with data is to extract it. This is not a standardized process, as the data can be very diverse. Ideally you simply import an SQL database or an Excel file, but there can be many other, more unconventional data sources you might have to work with.
Video
Web scraping with Python
21minIntermediate
Web scraping is a very common source of data. All it takes is downloading an HTML file and then extracting specific bits from it, and Python has a few libraries for exactly that.
Video
BeautifulSoup
1hAdvanced
To expand on web scraping: the main Python library for it is BeautifulSoup4, or BS4, which lets you parse and analyze HTML files very efficiently (the download itself is usually handled by a library such as requests). You can find the full playlist here: https://www.youtube.com/playlist?list=PLzMcBGfZo4-lSq2IDrA6vpZEV92AmQfJK
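A minimal BS4 sketch. A real scraper would first fetch the page (for example with requests); here we parse a small inline HTML snippet instead, so the example is self-contained.

```python
from bs4 import BeautifulSoup

# In a real scraper you would first download the page, e.g. with
# requests.get(url).text; here we parse an inline document instead.
html = """
<html><body>
  <h1>Quotes</h1>
  <ul>
    <li class="quote">Simple is better than complex.</li>
    <li class="quote">Readability counts.</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every tag matching the given name and attributes.
quotes = [li.get_text(strip=True) for li in soup.find_all("li", class_="quote")]
print(quotes)
```

The same `find_all` / `get_text` pattern scales to real pages; the only extra work is figuring out which tags and classes hold the data you want.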
Link
Getting started with Selenium
25minIntermediate
With tools like BeautifulSoup you can parse downloaded HTML files, but quite often that isn't enough: occasionally you will need to create a bot that can itself click through websites and store the relevant data. Selenium is the Python library best suited for this purpose.
Video
SQL
24minIntermediate
SQL, or Structured Query Language, is the standard language for working with databases and extracting data from them. Python has a number of modules for working with SQL and importing parts of a database.
While Python has dedicated SQL modules, a common shortcut is to use Pandas for this. It might not be as powerful as a dedicated module, but very often you just need a basic import. Pandas can even do basic web scraping as well!
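A small sketch of that Pandas shortcut, using an in-memory SQLite database as a stand-in for a real server; the table name and values are invented. With a different database you would only swap the connection object.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory SQLite database to stand in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 19.99), (2, 5.00), (3, 42.50)])

# pandas turns any SQL query result straight into a DataFrame.
df = pd.read_sql("SELECT * FROM orders WHERE amount > 10", conn)
print(df)
conn.close()
```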
Video
Import Excel data
20minIntermediate
Excel is another data source you will need to work with quite often. Once again, Python imports this data well, either via a dedicated module like OpenPyXL or via Pandas.
Video
Python CSV tutorial
16minAdvanced
CSV, or comma-separated values, is a plain-text format in which values are separated by commas. For homogeneous datasets CSV files are quite common, as they are easy to create and very small. You can import them via Pandas, NumPy, or even standard Python.
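A short sketch of two of those routes, parsing the same inline CSV with the standard-library csv module and with Pandas; the data is invented.

```python
import csv
import io

import pandas as pd

# A small CSV document as a string; in practice you would open a file.
data = "name,score\nada,92\ngrace,88\n"

# The standard-library csv module parses each row into a dictionary.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["name"], rows[0]["score"])

# pandas reads the same data directly into a DataFrame
# (read_csv accepts file paths, URLs, or file-like objects).
df = pd.read_csv(io.StringIO(data))
print(df["score"].mean())
```

The csv module gives you strings you convert yourself; Pandas infers the column types for you, which is usually what you want for analysis.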
Exploratory Data Analysis: understand your data and their characteristics
Exploratory Data Analysis is an approach to analyzing datasets that involves summarizing their main characteristics, often using visual methods. It’s a critical first step in data analysis, allowing analysts to discover patterns, spot anomalies, and form hypotheses before deeper work.
Data Mapping
Exploratory Analysis
An introduction to exploratory data analysis to get started.
Just having a basic understanding of data exploration isn't too helpful; you actually need to do practical work in a DataFrame. This video is a really good start: it walks through an example and shows how Pandas can help you.
To get into more detail on data exploration, read through this article for more ideas, such as how to manipulate columns in a DataFrame.
A really important thing to understand is that NumPy and Pandas are just tools that make your life easier; they do not prevent you from making statistical errors. This article does a very good job discussing various ways to approach these kinds of problems.
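As a small sketch of what exploring a DataFrame typically looks like (the dataset is invented): describe, value_counts, and groupby answer the first questions you would ask of any new dataset.

```python
import pandas as pd

# A toy dataset; the column names and values are invented.
df = pd.DataFrame({
    "product": ["A", "B", "A", "C", "B", "A"],
    "price": [10.0, 25.0, 10.5, 7.0, 24.0, 9.5],
})

# describe() summarizes a numeric column: count, mean, std, quartiles.
print(df["price"].describe())

# value_counts() shows how often each category occurs.
print(df["product"].value_counts())

# groupby() answers questions like "average price per product".
print(df.groupby("product")["price"].mean())
```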
Data organization and cleaning
In this chapter, you will learn how to clean up data before working on it. This is usually related to the removal of columns, information, and other data that is not needed.
Data Cleaning and Preprocessing
A really important task in data organization is data cleaning: removing unnecessary data and errors, hiding columns that aren't required, ensuring you have the right datatypes, naming columns appropriately, and so on.
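A minimal cleaning sketch on an invented, deliberately messy table: normalizing column names, dropping rows with missing values, and fixing datatypes.

```python
import pandas as pd

# Messy input: an untidy column name, a missing value,
# and numbers stored as strings. All values are invented.
df = pd.DataFrame({
    "Cust Name ": ["Alice", "Bob", None],
    "amount": ["10.5", "7.0", "3.2"],
})

# 1. Normalize the column names.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# 2. Drop rows with missing values (or use fillna to impute instead).
df = df.dropna()

# 3. Fix the datatype: amounts should be floats, not strings.
df["amount"] = df["amount"].astype(float)

print(df.dtypes)
print(df)
```

Each step here is a judgment call: whether to drop or impute missing rows, for instance, depends on what the data will be used for.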
Reorganizing the data can be a step in exploratory data analysis or in the data organization part. To go into depth, this article outlines key methods to reorganize your DataFrames to suit your data.
Just organizing the data itself isn't enough; you also need to make sure the whole project is organized well. This includes the folder setup, naming conventions, file types, and so on.
Link
Pandas Data Science by example
30minIntermediate
With the basics covered you need to practice reorganizing data. The linked article contains a lot of examples and recipes that should help you to master handling basically any kind of data.
Data visualization MatPlotLib: a complete Python library for data visualization
Now that you have cleaned your data, it's time to visualize it. This chapter is all about learning how to use a few Python libraries to easily visualize data and build any type of chart.
Data Visualization
Python
Link
Pyplot tutorial
20minIntermediate
An introduction to creating graphs in Python using Pyplot, a submodule of MatPlotLib.
An introduction to creating graphs using Pandas and Pyplot together. Python's data tools have been developed in tandem, meaning that NumPy and Pandas integrate very easily with Matplotlib.
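A small sketch of that integration, plotting an invented DataFrame straight to an image file (the Agg backend renders without a display, which is handy on servers).

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display window needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly sales figures, purely for illustration.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [120, 150, 90, 180],
})

# DataFrame.plot delegates to Matplotlib under the hood.
ax = df.plot(x="month", y="sales", kind="bar", legend=False)
ax.set_ylabel("Units sold")
plt.tight_layout()
plt.savefig("sales.png")       # write the chart to an image file
```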
Link
MatPlotLib examples
20minAdvanced
Matplotlib can create basically any kind of graphic, and on this page you will find an example for each of the available options. Don't read all of it in one go; instead, look through it when you need a specific graph and adapt the example.
Video
MatPlotLib animations
12minAdvanced
MatPlotLib is also able to animate graphs, which can give a graph a very professional look.
Building a data pipeline
A data pipeline automates the flow of data from sources to storage and analysis. It is the final result of everything we have seen in this path: It involves extraction, transformation, and loading (ETL) processes to prepare and move data efficiently.
Data engineering skills
Data Manipulation
Data Mapping
Data Modeling
Data Pipelines
An introduction to what data pipelines are and how to approach them. A data pipeline is simply the whole path from gathering data to transforming it and then loading it into other tools.
An article that goes into a lot of detail on what data pipelines can look like: the various possible sources of data, whether the pipeline handles real-time or "static" data, which tools will be used, and where the data can end up.
A very detailed and practical guide on how a data pipeline can be built in Python. It nicely brings together the concepts we have covered so far in the path. While the resulting pipeline is simple, it outlines the basic approaches you need to understand the process.
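To make the ETL shape concrete, here is a deliberately tiny pipeline sketch: extract from inline CSV text, transform with Pandas, and load into an in-memory SQLite database. All names and values are invented.

```python
import io
import sqlite3

import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for a real source: inline CSV text instead of a file or API.
    raw = "user,amount\nada,10\nbob,-5\nada,7\n"
    return pd.read_csv(io.StringIO(raw))

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop invalid rows, then aggregate per user.
    df = df[df["amount"] > 0]
    return df.groupby("user", as_index=False)["amount"].sum()

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Write the result into a database table other tools can query.
    df.to_sql("totals", conn, index=False, if_exists="replace")

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(pd.read_sql("SELECT * FROM totals", conn))
```

Keeping extract, transform, and load as separate functions is the key idea: each stage can then be swapped, tested, or scheduled independently.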
This is a practical project for getting started with extracting JSON data and working with AWS. It would be an excellent starting point for importing data into a data pipeline.
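As a sketch of the JSON half of such a project (the AWS download step is omitted, and the fields are invented): the standard-library json module parses the payload, and pd.json_normalize flattens nested objects into ordinary columns.

```python
import json

import pandas as pd

# A small JSON payload like one you might pull from an API or an S3 object.
payload = """
[
  {"id": 1, "user": {"name": "ada"},   "amount": 10.5},
  {"id": 2, "user": {"name": "grace"}, "amount": 7.0}
]
"""

records = json.loads(payload)

# json_normalize flattens nested objects into dotted column names.
df = pd.json_normalize(records)
print(df[["id", "user.name", "amount"]])
```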
Another example to get started with a data pipeline. This video imports data from an SQL server and organizes it using a DataFrame.
To finish, it is important to look at examples and references of what other people are doing. This article contains a few ideas that should come in handy.
There is no better way to test your new skills than to try our final AI Simulation for Data Engineers: you will be presented with a real-life scenario where you need to take the lead and help your company analyze a large amount of maritime data.
Introduction
This path has been curated by the Anthropos team in collaboration with Chris Koch
Chris Koch
Instructional Designer, Instructor, and Sr. Software Engineer
Skill objectives