9.1. Analyze Data Science Market

The data in this section was collected from 160k+ data scientists, 570k data analysts, 100k+ data engineers, and 19k+ machine learning engineers from all over the world using Diffbot, the world’s largest knowledge graph. Find more instructions on how to use Diffbot here.

I uploaded the dataset used in this tutorial to this repository so that you can try out the dataset yourself.

9.1.1. Find the Most Popular Languages

!pip install observable_jupyter

What are the top languages used by data scientists, data analysts, data engineers, and machine learning engineers? To answer this question, I used a dataset of the 100 most popular skills among people with these job titles.

from observable_jupyter import embed
import pandas as pd 

Start by loading the dataset:

skill_count = pd.read_csv(
    "https://media.githubusercontent.com/media/khuyentran1401/dataset/master/data_science_market/all_skills.csv",
    index_col=0,
)
skill_count.head(10)

	count	skill	Title
0	131292	teaching	Data Scientist
1	113898	economics	Data Scientist
2	106630	programming language	Data Scientist
3	105294	mathematics	Data Scientist
4	79871	machine learning	Data Scientist
5	79735	python	Data Scientist
6	77810	robotics	Data Scientist
7	70540	software development	Data Scientist
8	69262	phython	Data Scientist
9	62084	data analysis	Data Scientist

Next, we will visualize the dataset using a bubble matrix created on Observable.

A bubble matrix uses sizes and colors to represent two-dimensional information. The rows represent the job titles and the columns represent the languages. The bigger a bubble is, the more frequently the language is used in a certain job category.

The bubbles are highlighted if they are above a certain number of occurrences. You can use the slider to choose the threshold above which the bubbles are highlighted. For example, if you choose the threshold to be 100k, only the bubbles with a count above 100k are colored dark purple.
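The two ideas above (a job-title-by-language grid, and a threshold that decides which bubbles are highlighted) can be sketched in pandas. This is only an illustration of the data behind such a chart, not the code of the Observable notebook. The `toy` table reuses the column names of `skill_count`, but its counts, the `languages` list, and the `threshold` value are made up for the example.

```python
import pandas as pd

# Toy stand-in for skill_count: same columns, made-up counts.
toy = pd.DataFrame(
    {
        "count": [50, 30, 20, 40, 60, 10],
        "skill": ["python", "r", "sql", "python", "sql", "teaching"],
        "Title": [
            "Data Scientist", "Data Scientist", "Data Scientist",
            "Data Analyst", "Data Analyst", "Data Analyst",
        ],
    }
)

# Hypothetical list of skills that count as languages.
languages = ["python", "r", "sql"]

# Rows are job titles, columns are languages, values are counts.
# Missing title/language pairs are filled with 0.
matrix = toy[toy["skill"].isin(languages)].pivot_table(
    index="Title", columns="skill", values="count", aggfunc="sum", fill_value=0
)

# A bubble is highlighted when its count is above the chosen threshold.
threshold = 35
highlighted = matrix > threshold

print(matrix)
print(highlighted)
```

Each cell of `matrix` corresponds to the size of one bubble, and each `True` in `highlighted` corresponds to a bubble that would be colored dark purple at that threshold.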

embed("@khuyentran1401/languages-between-jobs", cells=["chart", "viewof options"])

To sort the bubbles by a specific job title, click that job title.

Based on this plot, we can see that:

The top 3 languages of data analysts, in descending order, are SQL, Python, and R.
The top 3 languages of data engineers, in descending order, are SQL, Python, and Java.
The top 3 languages of data scientists, in descending order, are Python, R, and SQL.
The top 3 languages of machine learning engineers, in descending order, are Python, Java, and C++.
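A ranking like the one above can also be computed directly from a table of counts instead of being read off the chart. Here is a minimal sketch, assuming the table has already been filtered down to languages. The `toy` data below is invented, so its output does not reproduce the findings listed above.

```python
import pandas as pd

# Toy table with the same columns as skill_count; counts are made up.
toy = pd.DataFrame(
    {
        "count": [50, 30, 20, 5, 40, 60, 10, 15],
        "skill": ["python", "r", "sql", "java", "python", "sql", "r", "java"],
        "Title": ["Data Scientist"] * 4 + ["Data Analyst"] * 4,
    }
)

# Sort by count, then keep the first 3 rows of each job title.
top3 = (
    toy.sort_values("count", ascending=False)
    .groupby("Title")
    .head(3)
    .sort_values(["Title", "count"], ascending=[True, False])
)

# Collect the languages of each job title into a list, most frequent first.
ranking = top3.groupby("Title")["skill"].agg(list)
print(ranking)
```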

9.1.2. 10 Most Popular Majors of Data Professionals

What are the most popular majors of data professionals? To answer this question, we will download a dataset that shows how frequently each of the top majors appears for a given position.

!mkdir -p data
!wget -c -O 'data/majors_df.pkl' 'https://github.com/khuyentran1401/dataset/raw/master/data_science_market/majors_df.pkl'

import pandas as pd  
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt 

figsize(12, 8)

df = pd.read_pickle("data/majors_df.pkl")

Top majors of data scientists:

df["data scientist"].plot.bar(x="name", y="value", legend=False, fontsize=15)
plt.title("Majors of Data Scientists")
plt.show()

Top majors of data engineers:

df["data engineer"].plot.bar(x="name", y="value", legend=False, fontsize=15)
plt.title("Majors of Data Engineers")
plt.show()

Top majors of machine learning engineers:

df["machine learning engineer"].plot.bar(x="name", y="value", legend=False, fontsize=15)
plt.title("Majors of Machine Learning Engineers")
plt.show()
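If you would rather compare the roles in one table than in separate charts, the top majors of each role can be stacked together. The sketch below assumes the loaded object behaves like a mapping from job title to a table with `name` and `value` columns. That is inferred from the plotting calls above, not confirmed. The `majors` data and the major names are placeholders.

```python
import pandas as pd

# Toy stand-in for the loaded data: job title -> table of majors.
# The structure is an assumption; names and values are placeholders.
majors = {
    "data scientist": pd.DataFrame(
        {"name": ["Major A", "Major B", "Major C"], "value": [3, 9, 6]}
    ),
    "data engineer": pd.DataFrame(
        {"name": ["Major A", "Major C"], "value": [8, 2]}
    ),
}

n = 2  # use 10 for the ten most popular majors

# Keep the n largest rows per role and tag each row with its role.
top = pd.concat(
    [
        table.nlargest(n, "value").assign(title=title)
        for title, table in majors.items()
    ],
    ignore_index=True,
)
print(top)
```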

9.1.3. Top Data Science Bootcamps for Data Professionals

Have you ever been frustrated by the hundreds of data science bootcamps that show up on Google? Instead of choosing a random one, wouldn’t it be nice if you could find out which bootcamps most data scientists attended?

I analyzed the profiles of over 5k data professionals to find out. The chart below shows the most popular bootcamps by role.

!wget -c -O 'data/bootcamps.pkl' 'https://github.com/khuyentran1401/dataset/blob/master/data_science_market/bootcamps.pkl?raw=true'

import pandas as pd  
import plotly.express as px  

df = pd.read_pickle("data/bootcamps.pkl")
fig = px.bar(
    df,
    x="education",
    y="value",
    color="position",
    title="Top Bootcamps for Data Professionals by Roles",
).update_xaxes(categoryorder="total descending")
fig
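The `categoryorder` setting of `total descending` orders the bars by the sum of their values across all roles. To see that ordering as numbers rather than as a chart, a rough pandas equivalent is shown below. The column names match the ones passed to `px.bar`, but the `toy` rows and bootcamp names are placeholders, not the real data.

```python
import pandas as pd

# Toy stand-in for the bootcamps table; names and values are placeholders.
toy = pd.DataFrame(
    {
        "education": ["Bootcamp A", "Bootcamp A", "Bootcamp B", "Bootcamp C", "Bootcamp C"],
        "position": ["data scientist", "data analyst", "data scientist", "data analyst", "data engineer"],
        "value": [4, 3, 10, 1, 2],
    }
)

# Sum the values of each bootcamp across roles, largest total first.
totals = toy.groupby("education")["value"].sum().sort_values(ascending=False)
order = list(totals.index)
print(totals)
```

Here `order` lists the bootcamps in the left-to-right order a chart sorted by total would use.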

Full article on how I scraped and processed the data.
