What skills you need to become a data engineer

BW L.
Data Engineering Insight
3 min read · Oct 28, 2020

You might have heard from Harvard Business Review that data scientist is one of the hottest jobs of the 21st century. But have you heard about data engineering (DE)? If you work as a data scientist (DS), you probably know a data engineer or two.

As this is my first article on data engineering, the lists in this article also serve as a roadmap for what I’ll be publishing over time. If there are particular topics you’re interested in, please leave me a comment and I can prioritize new articles based on reader interest.

What does a data engineer do?

Although different companies might define the role of a data engineer differently, below are some common tasks.

  1. Data ingestion. To get statistical insight into a business, you’ll need as much data as you can get. The first step is data acquisition. If your business already has tons of data, then congratulations: you may have saved a lot of time that would otherwise go into looking for data that can help improve your business. For most companies, the existing data is enough initially, but over time they realize that they need more. So a data acquisition team, and some processes to bring the data into the company, may become inevitable.
  2. ETL pipeline building. ETL stands for extract, transform, load. It is one of the most important steps in getting data ready for machine learning modeling.
  3. ETL pipeline productionization. Often, a DS does the initial data exploration and has already built an ETL pipeline. The DE team can then productionize the pipeline by improving its execution efficiency and scheduling it to run at a fixed time.
  4. Model production. Another type of productionization that a DE team does is putting a model into production. The same requirements as for ETL production apply to model production. There are two main ways to serve a model: real-time scoring and batch scoring. Real-time scoring is usually exposed as a REST API. Batch scoring can be a program that runs on a schedule, scores a large amount (a batch) of new data, and writes the prediction results to a location that downstream processes can consume.
  5. Dashboard. If an engineer has dashboard skills, building a dashboard to present insights from data or prediction results is a big plus, because it helps the business understand the value of data science.
  6. Software development. In addition to data processing and model production, regular software development, such as building a web UI or implementing a model, may also be required.
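
To make the ETL and batch scoring tasks above more concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the sample CSV, the `purchases` and `predictions` tables, and `toy_model`, which stands in for a real trained model. A production pipeline would read from real sources and run under a scheduler.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, an API, or a database.
RAW_CSV = """user_id,amount,country
1,10.50,us
2,,de
3,7.25,US
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount, then normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((int(row["user_id"]), float(row["amount"]), row["country"].upper()))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records to a table that later steps can read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", records)
    conn.commit()


def toy_model(amount):
    """Stand-in for a trained model (an assumption): flags purchases of 10 or more."""
    return 1 if amount >= 10 else 0


def score_batch(conn):
    """Batch scoring: score every stored row and write the predictions to a new table."""
    rows = conn.execute(
        "SELECT user_id, amount FROM purchases ORDER BY user_id"
    ).fetchall()
    predictions = [(user_id, toy_model(amount)) for user_id, amount in rows]
    conn.execute("CREATE TABLE IF NOT EXISTS predictions (user_id INTEGER, label INTEGER)")
    conn.executemany("INSERT INTO predictions VALUES (?, ?)", predictions)
    conn.commit()
    return predictions


def run_pipeline():
    """Run extract, transform, load, and batch scoring against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_CSV)))
    return score_batch(conn)


if __name__ == "__main__":
    print(run_pipeline())
```

Productionizing a pipeline like this mostly means keeping the same steps while making them efficient and running `run_pipeline` on a schedule instead of by hand.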

What skills does a data engineer need?

Based on the above tasks, a data engineer should at least be very familiar with the following skills.

  1. General knowledge of Linux. Most data processing requires far more CPU and memory than a laptop (Windows or Mac) can offer. Therefore, a typical workday will most likely be spent on Linux servers.
  2. Know one of these programming languages: Python, R, Java, Scala. Python and R are the most popular languages for data science projects. Big data platforms like Hadoop also rely heavily on Java and Scala.
  3. Understand data analysis basics, starting with Python.
  4. Know how Hadoop works. For projects that need to process large amounts of data (>10 GB), chances are a Hadoop cluster is needed.
  5. Cloud skills. Depending on where the data lives, a cloud platform (AWS, Azure, GCP) could be required to run day-to-day jobs.
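
As a small taste of the data analysis basics in point 3, the sketch below groups records by a key and averages a value, which is the kind of summary most analyses start with. The sample `ORDERS` data and the helper name `average_by_key` are made up for illustration. In practice a library such as pandas would usually do this, but the standard library is enough to show the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records, invented for this example.
ORDERS = [
    {"country": "US", "amount": 10.5},
    {"country": "DE", "amount": 4.0},
    {"country": "US", "amount": 7.5},
]


def average_by_key(records, key, value):
    """Group records by `key` and return the mean of `value` for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {group: mean(values) for group, values in groups.items()}


if __name__ == "__main__":
    print(average_by_key(ORDERS, "country", "amount"))
```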

To be continued with more details on each point…
