exploratory data analysis: basics Python part 2

Python part 2

Bruno Carlin

Jan 25, 2020 21 min read 0 Comments R and Python

exploratory data analysis: basics Python part 2

Python part 2

Bruno Carlin

Jan 25, 2020 21 min read 0 Comments R and Python

Libraries

Let’s see what version of python this env is running.

reticulate::py_config()

## python:         C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate/python.exe
## libpython:      C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate/python36.dll
## pythonhome:     C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate
## version:        3.6.12 (default, Dec  9 2020, 00:11:44) [MSC v.1916 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
## numpy_version:  1.19.5

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import scipy.stats as ss
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
from statsmodels.graphics.gofplots import ProbPlot

Second Post

Objectives

Define the variables used in the conclusion

In our case, we initially choose to use salary ~ sex,region region was added to test whether Simpson’s paradox was at play.

But then I augmented our analysis with a simple linear regression.

Using masks or other methods to filter the data

We used it once.

Visualizing the hypothesis

We were advised to use two histograms combined to get a preview of our answer.

Conclusion

Comment on our findings.

Before we start

Reservations

This is an exercise where we were supposed to ask a relevant question using the data from the IBGE(Brazil’s main data collector) database of 1970.

Our group decided to ask whether women received less than man, we expanded the analysis hoping to avoid the Simpson’s paradox.

This is just an basic inference, and it’s results are therefore only used for studying purposes I don’t believe any finding would be relevant using just this approach but some basic operations can be used in a more impact full work.

Data Dictionary

We got a Data Dictionary that will be very useful for our Analysis, it contains all the required information about the encoding of the columns and the intended format that the folks at STATA desired.

Portuguese

Descrição do Registro de Indivíduos nos EUA.

Dataset do software STATA (pago), vamos abri-lo com o pandas e transforma-lo em DataFrame.

Variável 1 – CHAVE DO INDIVÍDUO ? Formato N - Numérico ? Tamanho 11 dígitos (11 bytes) ? Descrição Sumária Identifica unicamente o indivíduo na amostra.

Variável 2 - IDADE CALCULADA EM ANOS ? Formato N - Numérico ? Tamanho 3 dígitos (3 bytes) ? Descrição Sumária Identifica a idade do morador em anos completos.

Variável 3 – SEXO ? Formato N - Numérico ? Tamanho 1 dígito (1 byte) ? Quantidade de Categorias 3 ? Descrição Sumária Identifica o sexo do morador. Categorias (1) homem, (2) mulher e (3) gestante.

Variável 4 – ANOS DE ESTUDO ? Formato N - Numérico ? Tamanho 2 dígitos (2 bytes) ? Quantidade de Categorias 11 ? Descrição Sumária Identifica o número de anos de estudo do morador. Categorias (05) Cinco ou menos, (06) Seis, (07) Sete, (08) Oito, (09) Nove, (10) Dez, (11) Onze, (12) Doze, (13) Treze, (14) Quatorze, (15) Quinze ou mais.

Variável 5 – COR OU RAÇA ? Formato N - Numérico ? Tamanho 2 dígitos (2 bytes) ? Quantidade de Categorias 6 ? Descrição Sumária Identifica a Cor ou Raça declarada pelo morador. Categorias (01) Branca, (02) Preta, (03) Amarela, (04) Parda, (05) Indígena e (09) Não Sabe.

Variável 6 – VALOR DO SALÁRIO (ANUALIZADO) ? Formato N - Numérico ? Tamanho 8 dígitos (8 bytes) ? Quantidade de Decimais 2 ? Descrição Sumária Identifica o valor resultante do salário anual do indivíduo. Categorias especiais (-1) indivíduo ausente na data da pesquisa e (999999) indivíduo não quis responder.

Variável 7 – ESTADO CIVIL ? Formato N - Numérico ? Tamanho 1 dígito (1 byte) ? Quantidade de Categorias 2 ? Descrição Sumária Dummy que identifica o estado civil declarado pelo morador. Categorias (1) Casado, (0) não casado.

Variável 8 – REGIÃO GEOGRÁFICA ? Formato N - Numérico ? Tamanho 1 dígito (1 byte) ? Quantidade de Categorias 5 ? Descrição Sumária Identifica a região geográfica do morador. Categorias (1) Norte, (2) Nordeste, (3) Sudeste, (4) Sul e (5) Centro-oeste.

English

Description of the US Individual Registry.

Dataset of the STATA software (paid), we will open it with pandas and turn it into DataFrame.

Variable 1 - KEY OF THE INDIVIDUAL? Format N - Numeric? Size 11 digits (11 bytes)? Summary Description Uniquely identifies the individual in the sample.

Variable 2 - AGE CALCULATED IN YEARS? Format N - Numeric? Size 3 digits (3 bytes)? Summary Description Identifies the age of the resident in full years.

Variable 3 - SEX? Format N - Numeric? Size 1 digit (1 byte)? Number of Categories 3? Summary Description Identifies the gender of the resident. Categories (1) men, (2) women and (3) pregnant women.

Variable 4 - YEARS OF STUDY? Format N - Numeric? Size 2 digits (2 bytes)? Number of Categories 11? Summary Description Identifies the number of years of study of the resident. Categories (05) Five or less, (06) Six, (07) Seven, (08) Eight, (09) Nine, (10) Dec, (11) Eleven, (12) Twelve, (13) Thirteen, (14 ) Fourteen, (15) Fifteen or more.

Variable 5 - COLOR OR RACE? Format N - Numeric? Size 2 digits (2 bytes)? Number of Categories 6? Summary Description Identifies the Color or Race declared by the resident. Categories (01) White, (02) Black, (03) Yellow, (04) Brown, (05) Indigenous and (09) Don’t know.

Variable 6 - WAGE VALUE (ANNUALIZED)? Format N - Numeric? Size 8 digits (8 bytes)? Number of decimals 2? Summary Description Identifies the amount resulting from the individual’s annual salary. Special categories (-1) individual absent on the survey date and (999999) individual did not want to answer.

Variable 7 - CIVIL STATE? Format N - Numeric? Size 1 digit (1 byte)? Number of Categories 2? Summary Description Dummy that identifies the marital status declared by the resident. Categories (1) Married, (0) Not married.

Variable 8 - GEOGRAPHICAL REGION? Format N - Numeric? Size 1 digit (1 byte)? Number of Categories 5? Summary Description Identifies the resident’s geographic region. Categories (1) North, (2) Northeast, (3) Southeast, (4) South and (5) Midwest.

Python

Importing the dataset from part 1

You can also dowload it from the github page from this blog

df_sex_thesis =pd.read_feather(r.file_path_linux + '/sex_thesis_assignment.feather')

df_sex_thesis.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 65795 entries, 0 to 65794
## Data columns (total 9 columns):
##  #   Column        Non-Null Count  Dtype   
## ---  ------        --------------  -----   
##  0   index         65795 non-null  int64   
##  1   age           65795 non-null  int64   
##  2   sex           65795 non-null  object  
##  3   years_study   65795 non-null  category
##  4   color_race    65795 non-null  object  
##  5   salary        65795 non-null  float64 
##  6   civil_status  65795 non-null  object  
##  7   region        65795 non-null  object  
##  8   log_salary    65795 non-null  float64 
## dtypes: category(1), float64(2), int64(2), object(4)
## memory usage: 4.1+ MB

Let’s get going first define which variables to add to the hypothesis, to isolate the factor of salary ~ sex, if we consider that our sample of individuals is random in nature comparing the means of the individuals given their sex and seeing if there is a significant difference in their means.

A good graphic to get an idea if these effects would be significant was the bar plots used in part 1.

When working with Categorical variable it is possible to use a groupby approach to glimpse at the difference in means.

Difference in means

Using the log salary feature from post 1.

df_agg1 = df_sex_thesis.groupby('sex').mean().log_salary
df_agg1

## sex
## man      9.026607
## woman    8.607023
## Name: log_salary, dtype: float64

Remember that in order to transform back our log variables you can do e^variable like this e ^ 9.03 is 8321.57 but the log of the mean is not the same as the mean of the log.

df_agg2 = df_sex_thesis.groupby('sex').mean().salary
df_agg2

## sex
## man      14302.491879
## woman    10642.502734
## Name: salary, dtype: float64

9.03 is not the same as 9.57

Therefore which one should be done first log or mean?

The most common order is log then mean, because it is the order that reduces variance the most, you can read more about this here

Group by explanation

Group by in pandas is a method that accepts a list of elements in this case just ‘sex’ and applies consequent operation in each group, in this case the mean method from a pandas DataFrame, .salary returns just the mean for the salary variable.

df_sex_thesis.groupby(['sex','region']).mean().log_salary

## sex    region   
## man    midwest      9.155421
##        north        8.678263
##        south        9.123554
##        southeast    9.113084
## woman  midwest      8.847172
##        north        8.291580
##        northeast    9.462870
##        south        8.345867
##        southeast    8.766985
## Name: log_salary, dtype: float64

Adding standard deviations

df_sex_thesis.groupby(['sex']).std().log_salary

## sex
## man      1.397496
## woman    2.009225
## Name: log_salary, dtype: float64

These are really big Standard deviations! Remembering from stats that +2 SD’s gives about a 95% confidence interval we are not even close.

Combining mean and std using pandas agg method.

df_agg =df_sex_thesis.groupby(['sex']).agg(['mean','std']).log_salary

Calculating boundaries

df_agg['lower_bound'] = df_agg['mean'] - df_agg['std'] * 2
df_agg['upper_bound'] = df_agg['mean'] + df_agg['std'] * 2

df_agg

##            mean       std  lower_bound  upper_bound
## sex                                                
## man    9.026607  1.397496     6.231615    11.821599
## woman  8.607023  2.009225     4.588573    12.625473

Cool, but to verbose to be repeated multiple times, it is better to convert this series of operations into a function.

def groupby_bound(df,groupby_variables,value_variables):
  df_agg =  df.groupby(groupby_variables).agg(['mean','std'])[value_variables]
  df_agg['lower_bound'] = df_agg['mean'] - df_agg['std'] * 2
  df_agg['upper_bound'] = df_agg['mean'] + df_agg['std'] * 2
  return df_agg

groupby_bound(df=df_sex_thesis,groupby_variables='sex',value_variables='log_salary')

##            mean       std  lower_bound  upper_bound
## sex                                                
## man    9.026607  1.397496     6.231615    11.821599
## woman  8.607023  2.009225     4.588573    12.625473

Let’s try to find the difference in salary on some strata of the population.

groupby_bound(df=df_sex_thesis,groupby_variables=['sex','region'],value_variables='log_salary')

##                      mean       std  lower_bound  upper_bound
## sex   region                                                 
## man   midwest    9.155421  1.192631     6.770160    11.540683
##       north      8.678263  1.733378     5.211507    12.145019
##       south      9.123554  1.426591     6.270371    11.976736
##       southeast  9.113084  1.226845     6.659395    11.566773
## woman midwest    8.847172  1.575228     5.696717    11.997627
##       north      8.291580  2.470543     3.350495    13.232665
##       northeast  9.462870  1.172049     7.118773    11.806968
##       south      8.345867  2.461307     3.423253    13.268482
##       southeast  8.766985  1.639338     5.488309    12.045660

No.

groupby_bound(df=df_sex_thesis,groupby_variables=['sex','civil_status'],value_variables='log_salary')

##                         mean       std  lower_bound  upper_bound
## sex   civil_status                                              
## man   married       9.173335  1.247871     6.677593    11.669077
##       not_married   8.810261  1.567870     5.674521    11.946001
## woman married       8.487709  2.263061     3.961586    13.013831
##       not_married   8.771966  1.578347     5.615272    11.928659

No.

groupby_bound(df=df_sex_thesis,groupby_variables=['sex','civil_status'],value_variables='log_salary')

##                         mean       std  lower_bound  upper_bound
## sex   civil_status                                              
## man   married       9.173335  1.247871     6.677593    11.669077
##       not_married   8.810261  1.567870     5.674521    11.946001
## woman married       8.487709  2.263061     3.961586    13.013831
##       not_married   8.771966  1.578347     5.615272    11.928659

No.

groupby_bound(df=df_sex_thesis,groupby_variables=['sex','color_race'],value_variables='log_salary')

##                       mean       std  lower_bound  upper_bound
## sex   color_race                                              
## man   black       8.885541  1.215289     6.454962    11.316119
##       brown       8.878127  1.348974     6.180179    11.576076
##       indigenous  7.480596  3.328801     0.822994    14.138197
##       white       9.215328  1.369700     6.475927    11.954728
##       yellow      9.503222  1.398690     6.705843    12.300601
## woman black       8.617966  1.683712     5.250543    11.985389
##       brown       8.518060  2.025707     4.466647    12.569473
##       indigenous  7.623917  3.342682     0.938553    14.309282
##       white       8.696991  2.001864     4.693263    12.700718
##       yellow      8.904886  1.877233     5.150420    12.659351

No.

groupby_bound(df=df_sex_thesis,groupby_variables=['sex','years_study'],value_variables='log_salary')

##                         mean       std  lower_bound  upper_bound
## sex   years_study                                               
## man   5.0           8.702990  1.495661     5.711668    11.694312
##       6.0           8.836372  1.285517     6.265338    11.407407
##       7.0           8.883292  1.302857     6.277578    11.489006
##       8.0           8.939945  1.342110     6.255724    11.624166
##       9.0           8.873640  1.292672     6.288296    11.458984
##       10.0          9.064804  1.122187     6.820430    11.309177
##       11.0          9.182335  1.163925     6.854486    11.510184
##       12.0          9.274507  1.187977     6.898553    11.650462
##       13.0          9.308213  1.651300     6.005613    12.610812
##       14.0          9.563542  1.295638     6.972267    12.154817
##       15.0         10.083954  1.334792     7.414369    12.753538
## woman 5.0           8.150536  2.574328     3.001881    13.299191
##       6.0           8.577851  1.854171     4.869508    12.286194
##       7.0           8.572043  1.743439     5.085165    12.058921
##       8.0           8.480459  1.952266     4.575927    12.384991
##       9.0           8.609875  1.583736     5.442403    11.777346
##       10.0          8.735525  1.403048     5.929428    11.541622
##       11.0          8.755282  1.518318     5.718647    11.791918
##       12.0          8.906622  1.505541     5.895540    11.917705
##       13.0          8.963868  1.555327     5.853213    12.074523
##       14.0          9.146458  1.263748     6.618962    11.673954
##       15.0          9.572653  1.228241     7.116171    12.029134

Also no.

Does that mean that there were no Gender pay differences in Brazil in 1970?

No, it just means that there were no signs of this difference when looking at the whole population combined with one extra factor, but what if we combine all factors and isolate each influence in the salary? This would be a way to analyse the Ceteris Paribus(all else equal) effect of each feature in the salary, here is where Linear Regression comes in.

Linear Regression

But what is Linear Regression? you might ask, wasn’t it just one method for prediction? Not really, Linear Regression coefficients are really useful for hypothesis testing, meaning that tossing everything at it and then interpreting the results that come out without having to individually compare each feature pair, while also capturing the effect that all features have simultaneously.

Is Linear Regression always perfect? No. In fact most of the time the results are a little biased or a underestimate the variance or are just flat out wrong.

To understand the kinds of errors we might face when doing a linear regression we can use the Gauss Markov Theorem.

Terminology:

Predictor/Independent Variable: Theses are the features e.g sex,years_study, region we can have p predictors where p = n -1 and n is the numbers of rows our dataset possesses in this case 65795 rows are present.

Predicted/Dependent variable: This is the single “column” also called ‘target’ that we are modeling in this case we can use either log_salary or salary.

for a more in depth read this great blog post and for a more in depth usage in R and Python

Linearity

To get good results using Linear Regression the relationship of the Predictors and the Predicted variable has to be a linear relationship, to check for Linearity it is possible to use a simple line plot, and look for patterns like a parabola that would indicate that the Predictor has a quadratic relationship with the Dependent variable, there are ways of fixing non-linear relationships like we did with log_salary or by taking the power of the Predictor.

Linearity can easily be tested for numerical Predictors, categorical predictors are harder to test, so in our case we only checked the age feature.

Now it is time to flex these matplotlib graphs…

x = df_sex_thesis['age']
y = df_sex_thesis['log_salary']
plt.scatter(x, y)

z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.show()

It seems age is not a great fit let’s try to also log age as well.

x = np.log(df_sex_thesis['age'])
y = df_sex_thesis['log_salary']
plt.scatter(x, y)

z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.show()

Better but still close to no impact, maybe if we filter our sample to just the earning population, we can improve on it, we can call theses filters ‘masks’ in pandas.

df_filter = df_sex_thesis[df_sex_thesis['log_salary']> 2]

df_filter

##        index  age    sex  ... civil_status     region  log_salary
## 0          0   53    man  ...      married      north   11.060384
## 1          1   49  woman  ...      married      north    9.427336
## 2          2   22  woman  ...  not_married  northeast    8.378713
## 3          3   55    man  ...      married      north   11.478344
## 4          4   56  woman  ...      married      north   11.969090
## ...      ...  ...    ...  ...          ...        ...         ...
## 65790  66465   34  woman  ...      married    midwest    9.427336
## 65791  66466   40    man  ...      married    midwest    7.793999
## 65792  66467   36  woman  ...      married    midwest    7.793999
## 65793  66468   27  woman  ...      married    midwest    8.617075
## 65794  66469   37    man  ...      married    midwest    6.134157
## 
## [63973 rows x 9 columns]

x = df_filter['age']
y = df_filter['log_salary']
plt.scatter(x, y)

z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.show()

There is some slight improvement, it is a good question whether to filter otherwise sane values, the point is that the entire analysis would change, changing to salary ~ sex in the earning population in 1970 in Brazil instead of the salary ~ sex for the whole population in 1970 in Brazil.

I think in both cases analyzing the Gender Pay Gap would be interesting it is even possible to split the hypothesis in two, analysing if Men and Women earn the same, and if Men and Women are have the same employment rate.

So for here on out We are analyzing just the Gender Pay Difference of employed people.

Random

This is a vital hypotheses it means that the observations(rows) were chosen at random for the entire Brazilian population, in this case We choose to trust that IBGE did a good job, if IBGE failed to correctly sample the population or if we mess to much with our filters we risk invalidating the whole process, yes that is right, if you don’t respect this hypothesis everything you have analysed is worthless.

Non-Collinearity

The effect of each Predictors is reduced when you introduce Colinear predictors, you are spliting the effect between the Predictors whenever a new predictor is added, meaning that you are in the worst case only calculating half of the coefficient, Collinearity always happens, Women live more so age is related to Sex, therefore age ‘steals’ part of the calculated effect from Sex, the more variables you introduce to your Linear Regression model the more that Collinearity plagues your estimations, everything is correlated.

So be careful when doing Linear Regression for estimating Ceteris Paribus effects so that you don’t introduce too many features or features that are too correlated with you hypothesis, remember that our hypothesis is Salary ~ Sex.

A good way too know if you are introducing too much Collinearity is looking at the heatmap.

corr = pd.get_dummies(df_sex_thesis[['sex','age']]).corr()
sns.heatmap(corr)

This is quite cloudy let’s get rid of the Sex interaction with itself.

We are using a really cool pandas operation inspired this stack overflow answer, and combining it with the negate operator ‘~’ effectively selecting just the columns that don’t start with sex.

new_corr = corr.loc[:,~corr.columns.str.startswith('sex')]
sns.heatmap(new_corr)

Very little correlation, we are fine.

Once again we don’t usually calculate the correlation between Categorical Variables.

Exogeneity

If violated this hypothesis blasts your study into oblivion, Exogeneity is a one way road, your Independent Variables influence your Dependent Variable, and that is it.

Discussion on whether we are violating this assumption creates really cool intellectual pursuits, Nobel’s were won discovering if there was some violation to this assumption see Trygve_Haavelmo.

In our case let’s hypothesize for all variables

Sex ~ Salary - Maybe people that get richer/poorer change Sex, probably not.
Age ~ Salary - You can’t buy year with money . Years Study ~ Salary - Possible but this probably only happen to the latter years of education, still worth considering. Color/Race ~ Salary - No. Civil Status ~ Salary - Yes I can see that, taking this feature out.
Region ~ Salary - Do richer people migrate to richer regions? I think so, taking this feature out.

It is also nice to notice that this may be reason why we call the Predicted Variable the Independent Variable.

Homoscedasticity / Homogeneity of Variance/ Assumption of Equal Variance

Assumption of Equal Variance of predicted values means that for any value for the whole distribution of the Dependent Variable the estimated values remain equally distributted, meaning that we are as sure on our predictions for 1000 moneys as for 100000 moneys this assumption is really hard to adhere.

If broken the variance of the coefficients may be under or over estimated, meaning that we may fail to consider relevant features or consider wrongly irrelevant features, there are many formal statistical tests for this assumption let’s use scipy’s Bartlett’s test for homogeneity of variances where Ho is Homoscedasticity confirmation meaning we hope for p-values < 0.05.

We used a significance level of 5% for this assignment.

ss.bartlett(df_filter['log_salary'],df_filter['age'])

## BartlettResult(statistic=232468.57113474328, pvalue=0.0)

Don’t reject H0 -> ok

Another way to check this assumption is using tests called Breusch-Pagan and Goldfeld-Quandt post fitting the linear model.

Fitting the linear regression

Fitting the linear regression using yet another library called statsmodels.

mod = smf.ols(formula='log_salary ~ sex + age + years_study + color_race', data=df_filter)
model_fit = mod.fit()
print(model_fit.summary())

##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:             log_salary   R-squared:                       0.133
## Model:                            OLS   Adj. R-squared:                  0.133
## Method:                 Least Squares   F-statistic:                     613.6
## Date:                Thu, 21 Jan 2021   Prob (F-statistic):               0.00
## Time:                        01:05:49   Log-Likelihood:                -81546.
## No. Observations:               63973   AIC:                         1.631e+05
## Df Residuals:                   63956   BIC:                         1.633e+05
## Df Model:                          16                                         
## Covariance Type:            nonrobust                                         
## ============================================================================================
##                                coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------------------
## Intercept                    8.3200      0.019    440.777      0.000       8.283       8.357
## sex[T.woman]                -0.2125      0.007    -30.970      0.000      -0.226      -0.199
## years_study[T.6.0]           0.1520      0.020      7.756      0.000       0.114       0.190
## years_study[T.7.0]           0.1728      0.018      9.441      0.000       0.137       0.209
## years_study[T.8.0]           0.1721      0.014     12.391      0.000       0.145       0.199
## years_study[T.9.0]           0.1395      0.019      7.449      0.000       0.103       0.176
## years_study[T.10.0]          0.2414      0.018     13.444      0.000       0.206       0.277
## years_study[T.11.0]          0.3314      0.009     35.414      0.000       0.313       0.350
## years_study[T.12.0]          0.3826      0.018     21.100      0.000       0.347       0.418
## years_study[T.13.0]          0.5989      0.025     24.032      0.000       0.550       0.648
## years_study[T.14.0]          0.6619      0.025     25.988      0.000       0.612       0.712
## years_study[T.15.0]          1.0396      0.013     78.438      0.000       1.014       1.066
## color_race[T.brown]          0.0487      0.013      3.696      0.000       0.023       0.075
## color_race[T.indigenous]     0.0587      0.040      1.451      0.147      -0.021       0.138
## color_race[T.white]          0.1805      0.013     13.736      0.000       0.155       0.206
## color_race[T.yellow]         0.2401      0.050      4.776      0.000       0.142       0.339
## age                          0.0129      0.000     40.197      0.000       0.012       0.014
## ==============================================================================
## Omnibus:                    14771.140   Durbin-Watson:                   1.806
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):            45603.025
## Skew:                          -1.188   Prob(JB):                         0.00
## Kurtosis:                       6.386   Cond. No.                         584.
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Looking at the results, it is possible that there isGender Pay Gap, calculating the difference in estimated salaries done by

plus intercept + e ^ beta_variable = 5077.12
minus intercept 4105.16
equals 971.96.

There is a 971.96 difference between men and women salaries in the earning population of Brazil in 1970 quite significant at 7.59% of the mean salary at the time.

Linear Regression plots

Using the code from this excellent post and combining it with the understanding from this post.

# fitted values (need a constant term for intercept)
model_fitted_y = model_fit.fittedvalues

# model residuals
model_residuals = model_fit.resid

# normalized residuals
model_norm_residuals = model_fit.get_influence().resid_studentized_internal

# absolute squared normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# absolute residuals
model_abs_resid = np.abs(model_residuals)

# leverage, from statsmodels internals
model_leverage = model_fit.get_influence().hat_matrix_diag

# cook's distance, from statsmodels internals
model_cooks = model_fit.get_influence().cooks_distance[0]

Initializing some variables. ### Residual plot

plot_lm_1 = plt.figure(1)

plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'log_salary', data=df_filter,
                                  lowess=True,
                                  scatter_kws={'alpha': 0.5},
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

## C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\seaborn\_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
##   FutureWarning

plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals')


# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]

for i in abs_resid_top_3.index:
    plot_lm_1.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_residuals[i]));

Here we are looking for the red line to get as close to the doted black line meaning that our Predictors would have a perfectly linear relationship with our Dependent variable following the assumption of linearity.

I think we are close enough.

QQ plot

QQ = ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)

plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');

# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]

for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i, 
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));

Here we are looking for the circles to get as close to the red line as possible meaning that our variables follow a normal distribution and therefore our p-values are not biased.

I think we have two problems the extremes may be a bit too distant and there are three concerning outliers.

Scale-Location Plot

plot_lm_3 = plt.figure(3)

plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(model_fitted_y, model_norm_residuals_abs_sqrt, 
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_3.axes[0].set_title('Scale-Location')
plot_lm_3.axes[0].set_xlabel('Fitted values')
plot_lm_3.axes[0].set_ylabel('$\sqrt{|Standardized Residuals|}$');

# annotations
abs_sq_norm_resid = np.flip(np.argsort(model_norm_residuals_abs_sqrt), 0)
abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]

for i in abs_norm_resid_top_3:
    plot_lm_3.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_norm_residuals_abs_sqrt[i]));

This is the graph where we check the homoscedasticity assumption, we want the red line to be as straight as possible meaning that our Predictor variance is constant among the Dependent Variable values.

I think it is fine.

Leverage plot

plot_lm_4 = plt.figure(4)

plt.scatter(model_leverage, model_norm_residuals, alpha=0.5)
sns.regplot(model_leverage, model_norm_residuals, 
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_4.axes[0].set_xlim(0, 0.005)

## (0.0, 0.005)

plot_lm_4.axes[0].set_ylim(-3, 5)

## (-3.0, 5.0)

plot_lm_4.axes[0].set_title('Residuals vs Leverage')
plot_lm_4.axes[0].set_xlabel('Leverage')
plot_lm_4.axes[0].set_ylabel('Standardized Residuals')

# annotations
leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]

for i in leverage_top_3:
    plot_lm_4.axes[0].annotate(i, 
                               xy=(model_leverage[i], 
                                   model_norm_residuals[i]))
    
# shenanigans for cook's distance contours
def graph(formula, x_range, label=None):
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls='--', color='red')

p = len(model_fit.params) # number of model parameters

graph(lambda x: np.sqrt((0.5 * p * (1 - x)) / x), 
      np.linspace(0.000, 0.005, 50), 
      'Cook\'s distance') # 0.5 line

## C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate/python.exe:1: RuntimeWarning: divide by zero encountered in true_divide

graph(lambda x: np.sqrt((1 * p * (1 - x)) / x), 
      np.linspace(0.000, 0.005, 50)) # 1 line

## C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate/python.exe:1: RuntimeWarning: divide by zero encountered in true_divide

plt.legend(loc='upper right');

Finally in this plot we are looking for outliers, it failed on the Python version, but it should show if the outliers plague the betas enough to the point where it may be worth studying removing them.

We want the Red line to be as close as possible to the dotted line.

Looking at the R plot We can say it is fine.

At the end I am comfortable not denying our Hypothesis that Salary ~ Sex in 1970 Brazil working population.

And that is it, Statistical analysis with almost no R! .

Final Remarks

I guess my opinion is important in this post, this was really hard, Python may be an excellent Prediction based language but it lacks so much on my normal Economist features that I have easily available even when using Stata/E-Views/SAS, like look at how much code for a simple linear regression plot!

I don’t have much hope that this will improve with time, normal statistics just doesn’t get as much hype as Deep Learning and stuff I feel sorry for whoever has to learn stats alongside Python, you guys deserve a Medal! Also I applaud the guys that Developed statsmodels.formula.api it really helps!

Whoever develops with matplotlib deserves two medals, you guys make me feel dumber than when I read my first Time Series paper and that was a really low point in my self esteem, the graphs turned out great in my honest opinion.

If you liked it please share it.

In the next part we repeat everything from part 1 with a few twists in R using the tidyverse!

R Markdown reticulate pandas scipy statsmodels

Bruno Carlin

Data Scientist - Specialist

Data Scientist

exploratory data analysis: basics Python part 2

exploratory data analysis: basics Python part 2

Libraries

Second Post

Objectives

Define the variables used in the conclusion

Using masks or other methods to filter the data

Visualizing the hypothesis

Conclusion

Before we start

Reservations

Data Dictionary

Python

Importing the dataset from part 1

Difference in means

Linear Regression

Linearity

Random

Non-Collinearity

Exogeneity

Homoscedasticity / Homogeneity of Variance/ Assumption of Equal Variance

Fitting the linear regression

Linear Regression plots

QQ plot

Scale-Location Plot

Leverage plot

Final Remarks

Next post

Bruno Carlin

Data Scientist - Specialist

Related