Discover gender- and job-related patterns in college majors (a data science approach)

Choosing a college major is stressful, and making the right decision can feel like a daunting task. In this project we examine students who graduated in the US between 2010 and 2012. The thought process behind choosing a major differs from person to person, but we will explore whether there are tendencies in choosing a major based on gender. We will also check whether having a college major is associated with better job outcomes such as employment and income.

Conclusion highlights:
  1. Men typically choose majors such as science, technology, engineering or mathematics.
  2. Women tend to choose majors such as health, education or social work.
  3. The median unemployment rate is 0.067 (6.7%), below the US unemployment rate (8%-10%).
  4. The average salary is around $40K, within the national average salary range ($39K-$42K).
  5. 52.2% of the students with college majors are women while 47.7% are men.

Exploring the dataset

The original dataset is provided by the American Community Survey. FiveThirtyEight cleaned the dataset and released it on their GitHub repo. Each row in the dataset represents a different college major and contains information on gender diversity, employment rates, median salaries, and more.

Let's begin by examining the columns of the dataset:

In [24]:
import pandas as pd
import numpy as np

# read dataset
recent_grads = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv')

# view first 5 rows
recent_grads.head()
Out[24]:
Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING 2339.0 2057.0 282.0 Engineering 0.120564 36 1976 ... 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING 756.0 679.0 77.0 Engineering 0.101852 7 640 ... 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING 856.0 725.0 131.0 Engineering 0.153037 3 648 ... 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING 1258.0 1123.0 135.0 Engineering 0.107313 16 758 ... 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING 32260.0 21239.0 11021.0 Engineering 0.341631 289 25694 ... 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972

5 rows × 21 columns

Columns              Description
Rank                 Rank by median earnings (the dataset is ordered by this column)
Major_code           Major code
Major                Major description
Major_category       Category of major
Total                Total number of people with major
Sample_size          Sample size (unweighted) of full-time
Men                  Male graduates
Women                Female graduates
ShareWomen           Women as share of total
Employed             Number employed
Unemployment_rate    Unemployed / (Unemployed + Employed)
Median               Median salary of full-time, year-round workers
Low_wage_jobs        Number in low-wage service jobs
Full_time            Number employed 35 hours or more
Part_time            Number employed less than 35 hours
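
As a quick sanity check on these definitions, the two derived columns can be recomputed from the counts. The sketch below is only an illustration: it retypes the first three rows of the output above into a small frame (named `sample` here, a name that is not part of the original notebook) instead of downloading the full dataset, and compares the recomputed values with the published ones.

```python
import numpy as np
import pandas as pd

# First three rows of the output above, limited to the columns needed here.
sample = pd.DataFrame({
    "Major": [
        "PETROLEUM ENGINEERING",
        "MINING AND MINERAL ENGINEERING",
        "METALLURGICAL ENGINEERING",
    ],
    "Total": [2339.0, 756.0, 856.0],
    "Women": [282.0, 77.0, 131.0],
    "Employed": [1976, 640, 648],
    "Unemployed": [37, 85, 16],
    "ShareWomen": [0.120564, 0.101852, 0.153037],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096],
})

# Recompute the two derived columns from their definitions.
share_women = sample["Women"] / sample["Total"]
unemployment_rate = sample["Unemployed"] / (sample["Unemployed"] + sample["Employed"])

# Compare with the published values (shown with 6 decimals in the output).
share_ok = np.allclose(share_women, sample["ShareWomen"], atol=1e-6)
rate_ok = np.allclose(unemployment_rate, sample["Unemployment_rate"], atol=1e-6)
print(share_ok, rate_ok)
```

If both checks pass, the column descriptions match how the values were actually computed, at least for these rows.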

Data cleaning

Let's now look for null values. If we find any, we will drop those records, because incomplete rows would make the analysis less accurate.

In [25]:
recent_grads.isnull().sum()
Out[25]:
Rank                    0
Major_code              0
Major                   0
Total                   1
Men                     1
Women                   1
Major_category          0
ShareWomen              1
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64

Drop missing values

In [26]:
# drop rows with missing values
recent_grads = recent_grads.dropna()

# confirm that no null values remain
recent_grads.isnull().sum()
Out[26]:
Rank                    0
Major_code              0
Major                   0
Total                   0
Men                     0
Women                   0
Major_category          0
ShareWomen              0
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64
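
Before dropping anything, it can be useful to see which rows are affected and how many are removed. The following is a minimal sketch on a hypothetical toy frame (`toy`, with made-up majors and values) that mimics the pattern in the output above, where one row is missing Total, Men, Women and ShareWomen. The same two expressions should work on `recent_grads`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row is missing the same four columns
# that showed a null count of 1 in the output above.
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [100.0, np.nan, 300.0],
    "Men": [60.0, np.nan, 120.0],
    "Women": [40.0, np.nan, 180.0],
    "ShareWomen": [0.4, np.nan, 0.6],
    "Median": [50000, 40000, 35000],
})

# Rows with at least one missing value, inspected before anything is dropped.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# Drop them and record how many rows were removed.
cleaned = toy.dropna()
n_dropped = len(toy) - len(cleaned)
print(rows_with_nulls["Major"].tolist(), n_dropped)
```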

Check if there is any correlation between college major, median salary, employment and gender percentage

We will examine if there is a correlation between:

  • Total and Employed
  • Sample_size and Unemployment_rate
  • Full_time and Median
  • ShareWomen and Unemployment_rate
  • Men and Median
  • Women and Median
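
One way to check all of these pairs at once is sketched below. The helper `pairwise_correlations` and the frame `head` are illustrative names, not part of the original notebook. `head` retypes the five rows shown earlier, so the numbers it produces are far too few to support any conclusion. The Full_time column is hidden in that truncated output, so the helper simply skips pairs whose columns are missing.

```python
import pandas as pd

# The pairs listed above, as (x, y) column names.
PAIRS = [
    ("Total", "Employed"),
    ("Sample_size", "Unemployment_rate"),
    ("Full_time", "Median"),
    ("ShareWomen", "Unemployment_rate"),
    ("Men", "Median"),
    ("Women", "Median"),
]


def pairwise_correlations(df, pairs):
    """Return {(x, y): r} for every pair whose columns exist in df."""
    result = {}
    for x, y in pairs:
        if x in df.columns and y in df.columns:
            # Series.corr computes the Pearson coefficient by default.
            result[(x, y)] = df[x].corr(df[y])
    return result


# The five rows shown earlier (Full_time is not visible there).
head = pd.DataFrame({
    "Total": [2339.0, 756.0, 856.0, 1258.0, 32260.0],
    "Men": [2057.0, 679.0, 725.0, 1123.0, 21239.0],
    "Women": [282.0, 77.0, 131.0, 135.0, 11021.0],
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631],
    "Sample_size": [36, 7, 3, 16, 289],
    "Employed": [1976, 640, 648, 758, 25694],
    "Unemployment_rate": [0.018381, 0.117241, 0.024096, 0.050125, 0.061098],
    "Median": [110000, 75000, 73000, 70000, 65000],
})

correlations = pairwise_correlations(head, PAIRS)
for (x, y), r in correlations.items():
    print(f"{x} vs {y}: r = {r:.3f}")
```

Calling `pairwise_correlations(recent_grads, PAIRS)` on the cleaned dataset would cover all six pairs, including Full_time and Median.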

The Pearson correlation coefficient measures the strength of the linear relationship between two variables. A correlation has two characteristics:

  • strength (strong, medium, or no correlation)
  • direction (positive or negative)

We measure the correlation coefficient using DataFrame.corr.

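Two variables are correlated when a change in X is accompanied by a roughly linear change in Y. The coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), and a value near 0 means there is no linear relationship.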
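
To make the coefficient concrete, here is a small sketch that computes it directly from its definition (covariance divided by the product of the standard deviations) and labels the result by strength and direction. The function names are illustrative, and the 0.7 and 0.3 cut-offs in `describe` are assumptions chosen for this example, not thresholds taken from the dataset or from pandas.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson coefficient: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


def describe(r, strong=0.7, medium=0.3):
    """Label r by strength and direction (cut-offs are illustrative assumptions)."""
    size = abs(r)
    if size >= strong:
        strength = "strong"
    elif size >= medium:
        strength = "medium"
    else:
        strength = "no correlation"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return strength, direction


x = [1, 2, 3, 4, 5]
print(describe(pearson_r(x, [2, 4, 6, 8, 10])))  # perfectly linear, increasing
print(describe(pearson_r(x, [10, 8, 6, 4, 2])))  # perfectly linear, decreasing
```

For the same inputs, `pearson_r` should agree with `np.corrcoef(x, y)[0, 1]` and with the values returned by DataFrame.corr.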

Represented below are some typical correlations:

[Figure: correlation.png]

In [27]:
# import plotting modules
# I used plotly for its improved aesthetics
# if you get a ModuleNotFoundError, try: pip install cufflinks==0.8.2

from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
import plotly.graph_objs as go
init_notebook_mode()
import cufflinks as cf
cf.set_config_file(offline=True) #don't forget to set cufflinks offline