Choosing a college major is stressful, and making sure you make the right decision can feel daunting. In this project we examine students who graduated in the US between 2010 and 2012. The thought process behind choosing a major differs from person to person, but we will explore whether there are tendencies in choosing a major based on gender. We will also check whether the choice of major can improve job outcomes such as employment and income.
The original dataset is provided by the American Community Survey. FiveThirtyEight cleaned the dataset and released it on their GitHub repo. Each row in the dataset represents a different college major and contains information on gender diversity, employment rates, median salaries, and more.
Let's begin by examining the columns of the dataset.
import pandas as pd
import numpy as np

# read dataset
recent_grads = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv')

# view first 5 rows
recent_grads.head()
|1||2||2416||MINING AND MINERAL ENGINEERING||756.0||679.0||77.0||Engineering||0.101852||7||640||...||170||388||85||0.117241||75000||55000||90000||350||257||50|
|3||4||2417||NAVAL ARCHITECTURE AND MARINE ENGINEERING||1258.0||1123.0||135.0||Engineering||0.107313||16||758||...||150||692||40||0.050125||70000||43000||80000||529||102||0|
5 rows × 21 columns
|Rank||Rank by median earnings (the dataset is ordered by this column)|
|Major_category||Category of major|
|Total||Total number of people with major|
|Sample_size||Sample size (unweighted) of full-time|
|ShareWomen||Women as share of total|
|Unemployment_rate||Unemployed / (Unemployed + Employed)|
|Median||Median salary of full-time, year-round workers|
|Low_wage_jobs||Number in low-wage service jobs|
|Full_time||Number employed 35 hours or more|
|Part_time||Number employed less than 35 hours|
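The two ratio columns described above can be recomputed from the raw counts. The following is a minimal sketch, assuming the column names shown in the table; the helper `derived_rates` and the toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def derived_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute ShareWomen and Unemployment_rate from the raw counts."""
    out = pd.DataFrame(index=df.index)
    # ShareWomen: women as share of total
    out["ShareWomen"] = df["Women"] / df["Total"]
    # Unemployment_rate: Unemployed / (Unemployed + Employed)
    out["Unemployment_rate"] = df["Unemployed"] / (df["Unemployed"] + df["Employed"])
    return out


# Hypothetical toy rows, used only to illustrate the arithmetic
toy = pd.DataFrame({
    "Total": [100, 200],
    "Women": [25, 150],
    "Employed": [90, 160],
    "Unemployed": [10, 40],
})

rates = derived_rates(toy)
print(rates)
```

On the real data, `derived_rates(recent_grads)` could be compared against the published `ShareWomen` and `Unemployment_rate` columns as a sanity check.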
Let's now look for null values. If we find any, we will drop those records so that our analysis stays accurate.
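A null check of this kind can be sketched with `isnull().sum()`, which counts missing values per column. The helper name `null_counts` and the toy frame are hypothetical; on the real data the equivalent call would be `recent_grads.isnull().sum()`.

```python
import numpy as np
import pandas as pd


def null_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isnull().sum()


# Hypothetical toy frame with one incomplete record
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [50000, 40000, 30000],
})

counts = null_counts(toy)
print(counts)
```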
Rank                    0
Major_code              0
Major                   0
Total                   1
Men                     1
Women                   1
Major_category          0
ShareWomen              1
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64
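One way to drop the incomplete records is `DataFrame.dropna`, followed by a second null count to confirm that nothing is missing. This is a sketch under the assumption that any row with a missing value should be removed; the helper `drop_incomplete` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_incomplete(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without rows that contain any missing value."""
    return df.dropna(axis=0, how="any")


# Hypothetical toy frame with one incomplete record
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [50000, 40000, 30000],
})

clean = drop_incomplete(toy)
# Re-run the null count to verify the cleanup
print(clean.isnull().sum())
```

On the real data this would be `recent_grads = drop_incomplete(recent_grads)`, followed by the same null count.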
Rank                    0
Major_code              0
Major                   0
Total                   0
Men                     0
Women                   0
Major_category          0
ShareWomen              0
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64
We will examine if there is a correlation between:
The Pearson correlation coefficient expresses how strongly two variables are linearly related. Correlation has two characteristics:
We measure the correlation coefficient using DataFrame.corr.
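As a minimal sketch of that call, `DataFrame.corr` returns a matrix of pairwise coefficients, and `method="pearson"` is its default. The helper `pearson_matrix` is hypothetical, and the toy values are invented for illustration; they do not reflect the dataset or any result of this analysis.

```python
import pandas as pd


def pearson_matrix(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Pairwise Pearson correlation coefficients for the selected columns."""
    return df[columns].corr(method="pearson")


# Hypothetical toy values; ShareWomen and Median are made exactly linear here
toy = pd.DataFrame({
    "ShareWomen": [0.1, 0.3, 0.5, 0.7, 0.9],
    "Median": [70000, 60000, 50000, 40000, 30000],
    "Unemployment_rate": [0.05, 0.06, 0.04, 0.07, 0.05],
})

corr = pearson_matrix(toy, ["ShareWomen", "Median", "Unemployment_rate"])
print(corr.round(3))
```

The result is symmetric with ones on the diagonal, so a single pair can be read off with `corr.loc["ShareWomen", "Median"]`.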
Two variables are correlated if Y depends on X in such a way that any change in X produces a linear change in Y.
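To make the linear-change idea concrete, the coefficient can be computed directly from its standard definition: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The function `pearson_r` and the synthetic inputs below are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


x = np.arange(10)
u = x - 4.5  # values centered symmetrically around zero

r_up = pearson_r(x, 2 * x + 1)    # Y rises linearly with X
r_down = pearson_r(x, -3 * x + 5)  # Y falls linearly with X
r_curve = pearson_r(u, u ** 2)     # Y depends on X, but not linearly

print(r_up, r_down, r_curve)
```

The first two values come out at approximately 1 and -1, while the symmetric parabola gives a value of approximately 0: the coefficient captures only the linear part of a relationship.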
Represented below are some typical correlations:
# import plotting modules
# I used plotly for its improved aesthetics
# if you get a ModuleNotFoundError, try: pip install cufflinks==0.8.2
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
import plotly.graph_objs as go
init_notebook_mode()

import cufflinks as cf
cf.set_config_file(offline=True)  # don't forget to set cufflinks offline