I am currently working through exercises from Digital House Brasil.
Libraries
Let's check which version of Python this environment is running.
import platform
print(platform.python_version())
## 3.6.12
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
file_path_linux = r.file_path_linux
The Exercise
Before we get into it
Objectives
Open and read a DataFrame using pandas
Basic analysis of each column using value_counts.
Creating a hypothesis that we care about
In our case the hypothesis is simple: do women earn, on average, less than men?
Data preprocessing
To do in the second post
Define the variables used in the conclusion
In our case, we chose to use salary ~ sex + region; region was added to test whether Simpson's paradox was at play.
Using masks or other methods to filter the data
This objective was mostly done using the groupby function.
Visualizing the hypothesis
We were advised to use two histograms combined to get a preview of our answer (see the sketch just after these objectives).
Conclusion
Comment on our findings.
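As a preview of that combined-histogram idea, here is a minimal sketch; it assumes the cleaned frame with the English column names sex and salary that we build later in this post, so it only runs after the preprocessing below.
# overlay one salary histogram per sex on shared axes
for group, frame in df.groupby("sex"):
    plt.hist(frame["salary"].dropna(), bins=30, alpha=0.5, label=group)
plt.legend()
plt.xlabel("salary")
plt.show()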
Reservations
This is an exercise where we were supposed to ask a relevant question using data from the 1970 database of IBGE (Brazil's main statistics agency).
Our group decided to ask whether women earned less than men; we expanded the analysis hoping to avoid Simpson's paradox.
This is just a basic inference, and its results are therefore only for study purposes. I don't believe any finding would be relevant using just this approach, but some of the basic operations can be reused in more impactful work.
Data Dictionary
We got a data dictionary that will be very useful for our analysis; it contains all the required information about the encoding of the columns and the intended format of the STATA file.
Portuguese
Descrição do Registro de Indivíduos nos EUA.
Dataset do software STATA (pago), vamos abri-lo com o pandas e transformá-lo em DataFrame.
Variável 1 – CHAVE DO INDIVÍDUO. Formato: N - Numérico. Tamanho: 11 dígitos (11 bytes). Descrição Sumária: Identifica unicamente o indivíduo na amostra.
Variável 2 – IDADE CALCULADA EM ANOS. Formato: N - Numérico. Tamanho: 3 dígitos (3 bytes). Descrição Sumária: Identifica a idade do morador em anos completos.
Variável 3 – SEXO. Formato: N - Numérico. Tamanho: 1 dígito (1 byte). Quantidade de Categorias: 3. Descrição Sumária: Identifica o sexo do morador. Categorias: (1) homem, (2) mulher e (3) gestante.
Variável 4 – ANOS DE ESTUDO. Formato: N - Numérico. Tamanho: 2 dígitos (2 bytes). Quantidade de Categorias: 11. Descrição Sumária: Identifica o número de anos de estudo do morador. Categorias: (05) Cinco ou menos, (06) Seis, (07) Sete, (08) Oito, (09) Nove, (10) Dez, (11) Onze, (12) Doze, (13) Treze, (14) Quatorze, (15) Quinze ou mais.
Variável 5 – COR OU RAÇA. Formato: N - Numérico. Tamanho: 2 dígitos (2 bytes). Quantidade de Categorias: 6. Descrição Sumária: Identifica a Cor ou Raça declarada pelo morador. Categorias: (01) Branca, (02) Preta, (03) Amarela, (04) Parda, (05) Indígena e (09) Não Sabe.
Variável 6 – VALOR DO SALÁRIO (ANUALIZADO). Formato: N - Numérico. Tamanho: 8 dígitos (8 bytes). Quantidade de Decimais: 2. Descrição Sumária: Identifica o valor resultante do salário anual do indivíduo. Categorias especiais: (-1) indivíduo ausente na data da pesquisa e (999999) indivíduo não quis responder.
Variável 7 – ESTADO CIVIL. Formato: N - Numérico. Tamanho: 1 dígito (1 byte). Quantidade de Categorias: 2. Descrição Sumária: Dummy que identifica o estado civil declarado pelo morador. Categorias: (1) Casado, (0) não casado.
Variável 8 – REGIÃO GEOGRÁFICA. Formato: N - Numérico. Tamanho: 1 dígito (1 byte). Quantidade de Categorias: 5. Descrição Sumária: Identifica a região geográfica do morador. Categorias: (1) Norte, (2) Nordeste, (3) Sudeste, (4) Sul e (5) Centro-oeste.
English
Description of the registry of individuals in the US.
The dataset comes from the STATA software (paid); we will open it with pandas and turn it into a DataFrame.
Variable 1 - INDIVIDUAL KEY. Format: N - Numeric. Size: 11 digits (11 bytes). Summary Description: Uniquely identifies the individual in the sample.
Variable 2 - AGE CALCULATED IN YEARS. Format: N - Numeric. Size: 3 digits (3 bytes). Summary Description: Identifies the age of the resident in full years.
Variable 3 - SEX. Format: N - Numeric. Size: 1 digit (1 byte). Number of Categories: 3. Summary Description: Identifies the sex of the resident. Categories: (1) man, (2) woman and (3) pregnant woman.
Variable 4 - YEARS OF STUDY. Format: N - Numeric. Size: 2 digits (2 bytes). Number of Categories: 11. Summary Description: Identifies the resident's number of years of study. Categories: (05) Five or less, (06) Six, (07) Seven, (08) Eight, (09) Nine, (10) Ten, (11) Eleven, (12) Twelve, (13) Thirteen, (14) Fourteen, (15) Fifteen or more.
Variable 5 - COLOR OR RACE. Format: N - Numeric. Size: 2 digits (2 bytes). Number of Categories: 6. Summary Description: Identifies the Color or Race declared by the resident. Categories: (01) White, (02) Black, (03) Yellow, (04) Brown, (05) Indigenous and (09) Don't know.
Variable 6 - WAGE VALUE (ANNUALIZED). Format: N - Numeric. Size: 8 digits (8 bytes). Number of Decimals: 2. Summary Description: Identifies the individual's resulting annual salary. Special categories: (-1) individual absent on the survey date and (999999) individual did not want to answer.
Variable 7 - MARITAL STATUS. Format: N - Numeric. Size: 1 digit (1 byte). Number of Categories: 2. Summary Description: Dummy that identifies the marital status declared by the resident. Categories: (1) Married, (0) Not married.
Variable 8 - GEOGRAPHIC REGION. Format: N - Numeric. Size: 1 digit (1 byte). Number of Categories: 5. Summary Description: Identifies the resident's geographic region. Categories: (1) North, (2) Northeast, (3) Southeast, (4) South and (5) Midwest.
Python
Pre-processing
Reading Data
The path is specific to my computer, but it is easy to adapt.
You can also download it from this blog's GitHub page.
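If you are not using this blog's R/reticulate setup (where file_path_linux comes from the R session), a plain-Python stand-in could look like this; the folder is hypothetical, adjust it to wherever you saved the CSV.
# hypothetical stand-in for the reticulate-provided path
file_path_linux = os.path.expanduser("~/datasets/digital_house")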
# Open and read the data into a pandas DataFrame
path = r.file_path_linux
df = pd.read_csv(path + '/stata_data_1970.csv')
Analyzing some basics of our DataFrame.
# Basic analysis of the contents of each column with value counts
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 66470 entries, 0 to 66469
## Data columns (total 9 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Unnamed: 0 66470 non-null int64
## 1 id 66470 non-null float64
## 2 idade 66470 non-null int64
## 3 sexo 66470 non-null object
## 4 anos_estudo 66036 non-null float64
## 5 cor/raca 66228 non-null object
## 6 salario 47878 non-null float64
## 7 estado_civil 66470 non-null float64
## 8 regiao 66470 non-null object
## dtypes: float64(4), int64(2), object(3)
## memory usage: 4.6+ MB
I do enjoy pandas' built-in value_counts, but when used in a loop it can create some ugly output. To fix that, I created a function that adds some flavor text to the printed output and also computes the cumulative percentage of the data being displayed.
Custom value_counts()
def pretty_value_counts(data_frame,
                        number_of_rows = 5,
                        cum_perc = True):
    # loop over the columns of the DataFrame
    for col in data_frame:
        counts = data_frame[col].value_counts(dropna=False)
        percentages = data_frame[col].value_counts(dropna=False, normalize=True)
        if cum_perc:
            # also show the running share of the data covered so far
            cum_percentages = percentages.cumsum()
            tb = pd.concat([counts, percentages, cum_percentages],
                           axis=1,
                           keys=['counts', 'percentages', 'cum_percentages']
                           ).head(number_of_rows)
        else:
            tb = pd.concat([counts, percentages],
                           axis=1,
                           keys=['counts', 'percentages']).head(number_of_rows)
        # flavor text followed by the assembled table
        print("Column %s with %s data type" % (col, data_frame[col].dtype),
              "\n",
              tb,
              "\n")
Now we can apply our new function.
Using a custom function
pretty_value_counts(df)
## Column Unnamed: 0 with int64 data type
## counts percentages cum_percentages
## 2047 1 0.000015 0.000015
## 41601 1 0.000015 0.000030
## 21151 1 0.000015 0.000045
## 23198 1 0.000015 0.000060
## 17053 1 0.000015 0.000075
##
## Column id with float64 data type
## counts percentages cum_percentages
## 1.100351e+10 2 0.000030 0.000030
## 3.132701e+10 1 0.000015 0.000045
## 1.501501e+10 1 0.000015 0.000060
## 3.230631e+10 1 0.000015 0.000075
## 5.003991e+10 1 0.000015 0.000090
##
## Column idade with int64 data type
## counts percentages cum_percentages
## 20 2104 0.031653 0.031653
## 28 2056 0.030931 0.062585
## 26 2040 0.030691 0.093275
## 22 2034 0.030600 0.123875
## 27 2017 0.030345 0.154220
##
## Column sexo with object data type
## counts percentages cum_percentages
## mulher 33607 0.505597 0.505597
## homem 32791 0.493320 0.998917
## gestante 72 0.001083 1.000000
##
## Column anos_estudo with float64 data type
## counts percentages cum_percentages
## 5.0 23349 0.351271 0.351271
## 11.0 16790 0.252595 0.603866
## 15.0 5636 0.084790 0.688657
## 8.0 5017 0.075478 0.764134
## 10.0 2704 0.040680 0.804814
##
## Column cor/raca with object data type
## counts percentages cum_percentages
## Branca 31689 0.476741 0.476741
## Parda 28370 0.426809 0.903550
## Preta 5249 0.078968 0.982518
## Indigena 597 0.008981 0.991500
## Amarela 323 0.004859 0.996359
##
## Column salario with float64 data type
## counts percentages cum_percentages
## NaN 18592 0.279705 0.279705
## 0.0 1841 0.027697 0.307402
## -1.0 1101 0.016564 0.323966
## 999999.0 367 0.005521 0.329487
## 5229.0 277 0.004167 0.333654
##
## Column estado_civil with float64 data type
## counts percentages cum_percentages
## 1.0 39066 0.587724 0.587724
## 0.0 27404 0.412276 1.000000
##
## Column regiao with object data type
## counts percentages cum_percentages
## sudeste 25220 0.379419 0.379419
## centro-oeste 14702 0.221182 0.600602
## norte 14653 0.220445 0.821047
## sul 11890 0.178878 0.999925
## nordeste 5 0.000075 1.000000
Just for comparison, let's look at how we could do the same thing without the function.
for col in df:
    df[col].value_counts(dropna=False).head(5)
## 2047 1
## 41601 1
## 21151 1
## 23198 1
## 17053 1
## Name: Unnamed: 0, dtype: int64
## 1.100351e+10 2
## 3.132701e+10 1
## 1.501501e+10 1
## 3.230631e+10 1
## 5.003991e+10 1
## Name: id, dtype: int64
## 20 2104
## 28 2056
## 26 2040
## 22 2034
## 27 2017
## Name: idade, dtype: int64
## mulher 33607
## homem 32791
## gestante 72
## Name: sexo, dtype: int64
## 5.0 23349
## 11.0 16790
## 15.0 5636
## 8.0 5017
## 10.0 2704
## Name: anos_estudo, dtype: int64
## Branca 31689
## Parda 28370
## Preta 5249
## Indigena 597
## Amarela 323
## Name: cor/raca, dtype: int64
## NaN 18592
## 0.0 1841
## -1.0 1101
## 999999.0 367
## 5229.0 277
## Name: salario, dtype: int64
## 1.0 39066
## 0.0 27404
## Name: estado_civil, dtype: int64
## sudeste 25220
## centro-oeste 14702
## norte 14653
## sul 11890
## nordeste 5
## Name: regiao, dtype: int64
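Note that the bare expression inside the loop only shows output because of the notebook-style environment this post is rendered in; in a plain Python script you would need an explicit print, a sketch:
for col in df:
    print(df[col].value_counts(dropna=False).head(5))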
Replacing column names
The columns are named in Portuguese; we can replace their names with English equivalents in many different ways.
df.columns
## Index(['Unnamed: 0', 'id', 'idade', 'sexo', 'anos_estudo', 'cor/raca',
## 'salario', 'estado_civil', 'regiao'],
## dtype='object')
My favorite way of doing this sort of swap is with a dictionary defined outside the rename method. The nice thing about rename is that if we liked some of the previously defined column names, we can simply omit them; for example, both "Unnamed: 0" and "id" are useless, but since their names are already in English I don't need to mess with them right now.
Translation discussion on race
There is some valid discussion on whether to translate "cor/raca" as ethnic_group or color_race, but I am personally of the opinion that the people building this data frame in 1970 were working under different naming conventions and standards, so I will keep their naming scheme. I apologize if anyone feels offended by the use of these terms.
dict_cols = {"idade" : "age",
"sexo" : "sex",
"anos_estudo" : "years_study",
"cor/raca" : "color_race",
"salario" : "salary",
"estado_civil" : "civil_status",
"regiao" : "region"
}
df.rename(columns = dict_cols, inplace = True)
Let’s see what changed
df.columns
## Index(['Unnamed: 0', 'id', 'age', 'sex', 'years_study', 'color_race', 'salary',
## 'civil_status', 'region'],
## dtype='object')
It looks fine; now we can translate some of our main features.
Cleaning categorical data
First we need to know the categories present in each of our columns. A simple loop would fail us when it reached a numeric variable; the simplest way to solve that is an if statement. Another alternative is conditional execution, but I personally don't know a simple way of doing that in Python, so I will show it in the R post.
To discover the numeric and "categorical" variables, know that sometimes you will have to change some elements of these lists, but looking at my outputs I think I got all the relevant ones.
Finding which columns are categorical
These are the numerical variables
df.select_dtypes(include=[np.number]).columns
## Index(['Unnamed: 0', 'id', 'age', 'years_study', 'salary', 'civil_status'], dtype='object')
And these are the Categorical variables
list_cat = df.select_dtypes(exclude=[np.number]).columns
Now we can run a simple loop
for col in list_cat:
    df[col].unique()
## array(['homem', 'mulher', 'gestante'], dtype=object)
## array(['Parda', 'Amarela', 'Indigena', 'Branca', 'Preta', nan],
## dtype=object)
## array(['norte', 'nordeste', 'sudeste', 'sul', 'centro-oeste'],
## dtype=object)
The simpler method is comparing each column's dtype to the desired one, but this would be harder if we needed np.number.
for col in df:
    if df[col].dtype == "O":
        df[col].unique()
## array(['homem', 'mulher', 'gestante'], dtype=object)
## array(['Parda', 'Amarela', 'Indigena', 'Branca', 'Preta', nan],
## dtype=object)
## array(['norte', 'nordeste', 'sudeste', 'sul', 'centro-oeste'],
## dtype=object)
The problem with the simpler approach is that sometimes you have columns that are categories rather than objects, so the simpler approach fails where the more complex one would not. Let's convert sex to a category to prove my point.
df.sex = df.sex.astype("category")
df.dtypes
## Unnamed: 0 int64
## id float64
## age int64
## sex category
## years_study float64
## color_race object
## salary float64
## civil_status float64
## region object
## dtype: object
for col in df:
    if df[col].dtype == "O":
        df[col].unique()
## array(['Parda', 'Amarela', 'Indigena', 'Branca', 'Preta', nan],
## dtype=object)
## array(['norte', 'nordeste', 'sudeste', 'sul', 'centro-oeste'],
## dtype=object)
It does not work anymore. Of course, you can still rescue the simpler approach by adding a second condition to your if statement, but at that point you might as well use the more extensible approach.
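For completeness, a sketch of that patched if statement, with the second condition catching category columns as well:
for col in df:
    # catch plain object columns and pandas category columns alike
    if df[col].dtype == "O" or str(df[col].dtype) == "category":
        print(df[col].unique())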
Replacing values with a dictionary: one column
After looking into the categories, I can create a dictionary for each column if I want to be safe about repeated terms, or I can pass one master dictionary for the whole data frame. I think the column-by-column approach is tidier, but to each their own.
dict_sex = {"mulher" : "woman",
"homem" : "man",
"gestante" : "woman"} # pregnant
This is one strange data frame; it probably made sense to split women into pregnant and not pregnant, but I think it would only complicate the otherwise simple analysis, so I will group both into "woman".
df.sex.replace(dict_sex,inplace = True)
Showing the new counts of women/men.
df.sex.value_counts()
## woman 33679
## man 32791
## Name: sex, dtype: int64
df.sex.unique()
## array(['man', 'woman'], dtype=object)
This fails
pretty_value_counts(df.sex)
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'man'
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "<string>", line 6, in pretty_value_counts
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\series.py", line 882, in __getitem__
## return self._get_value(key)
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\series.py", line 990, in _get_value
## loc = self.index.get_loc(label)
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\indexes\range.py", line 358, in get_loc
## raise KeyError(key)
Here is actually an example of why I don't personally enjoy pandas' data conversions. The pretty_value_counts function we created is not going to work here, because pandas converts a single column into a Series object. We would have to write a pretty_value_counts for Series as well, or mess with the pandas method, or convert the Series back into a DataFrame like this:
pretty_value_counts(pd.DataFrame(data= df.sex))
## Column sex with object data type
## counts percentages cum_percentages
## woman 33679 0.50668 0.50668
## man 32791 0.49332 1.00000
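If this conversion dance bothers you, a small workaround is a wrapper that promotes a Series to a one-column DataFrame before delegating; a sketch, where pretty_value_counts_any is my own hypothetical name:
def pretty_value_counts_any(data, **kwargs):
    # promote a bare Series to a one-column DataFrame, then delegate
    if isinstance(data, pd.Series):
        data = data.to_frame()
    pretty_value_counts(data, **kwargs)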
Replacing values with a dictionary: multiple columns
Translation discussion on race part 2
Again, there is relevant discussion on whether I should translate "Parda" as brown. Brazil's population sometimes answers that their skin color is "Parda" (brown) when asked, for many reasons; I will propose two. First, "Preta" (black) can be used as a racist term, so some people prefer to be called "brown". Second, most of the population is actually pretty well integrated, meaning there are many biracial couples; in that case we see something like a "Preta" parent + a "Branca" parent = a "Parda" child, in English "brown".
There is also the English sense of brown skin, for which we simply use "Indiano" = "Indian".
Curiously, the term "Negra" (roughly "N*gger") is often the preferred one in Brazil, which may cause some confusion between Portuguese and English speakers.
I will use brown, but do notice that there are multiple sensible approaches here.
This is a good opportunity to show a failure of the master-dictionary approach. Realize that if I were to replace NaN with no_answer or something like that, Python could throw an error because there are NaNs in some numerical columns such as salary; instead, I get a silent conversion of a numerical column into an object column, a dangerous feature.
for col in list_cat:
    df[col].unique()
## array(['man', 'woman'], dtype=object)
## array(['Parda', 'Amarela', 'Indigena', 'Branca', 'Preta', nan],
## dtype=object)
## array(['norte', 'nordeste', 'sudeste', 'sul', 'centro-oeste'],
## dtype=object)
dict_all = {"Parda" : "brown",
"Amarela" : "yellow",
"Indigena" : "indigenous",
"Branca" : "white",
"Preta" : "black",
np.nan : "no_answer"}
df.replace(dict_all).salary.dtype
## dtype('O')
dict_all = {"Parda" : "brown", #col color_race
"Amarela" : "yellow",
"Indigena" : "indigenous",
"Branca" : "white",
"Preta" : "black",
"norte" : "north", # col region
"nordeste" : "northeast",
"sudeste" : "southeast",
"sul" : "south",
"centro-oeste" : "midwest"}
Let's pray that we don't hit this problem, and use the shared dictionary.
df.replace(dict_all, inplace = True)
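If you want to rule out cross-column accidents entirely, pandas' replace also accepts a nested dictionary where each mapping is scoped to a single column; a sketch of the same translation in that form:
dict_scoped = {"color_race": {"Parda": "brown",
                              "Amarela": "yellow",
                              "Indigena": "indigenous",
                              "Branca": "white",
                              "Preta": "black"},
               "region": {"norte": "north",
                          "nordeste": "northeast",
                          "sudeste": "southeast",
                          "sul": "south",
                          "centro-oeste": "midwest"}}
df.replace(dict_scoped, inplace = True)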
Did we correctly clean the Categorical Variables?
Conversion of types
Well, not really. I would argue that years_study is a categorical variable as well, so let's convert it.
df.years_study = df.years_study.astype('category')
df.years_study.unique()
## [5.0, 8.0, 11.0, 15.0, 13.0, ..., 9.0, 10.0, 14.0, 12.0, NaN]
## Length: 12
## Categories (11, float64): [5.0, 8.0, 11.0, 15.0, ..., 9.0, 10.0, 14.0, 12.0]
Some NaNs, but otherwise this could be a useful feature. I will convert it back into a numerical column so that we can easily impute the NaNs with a mathematical method, such as the mean of the column.
df.years_study = df.years_study.astype('interger')
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: data type 'interger' not understood
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\generic.py", line 5548, in astype
## new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\internals\managers.py", line 604, in astype
## return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\internals\managers.py", line 409, in apply
## applied = getattr(b, f)(**kwargs)
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\internals\blocks.py", line 548, in astype
## dtype = pandas_dtype(dtype)
## File "C:\Users\bruno\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\dtypes\common.py", line 1763, in pandas_dtype
## npdtype = np.dtype(dtype)
That error was just a typo ("interger"), but even spelled correctly there is a NumPy quirk: you can't cast to a plain integer dtype while there are NaN values, so we use float instead.
df.years_study = df.years_study.astype('float')
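As an aside, newer pandas versions (0.24 and up) ship a nullable integer dtype spelled "Int64" (capital I) that does tolerate missing values; a non-destructive sketch:
years_int = df.years_study.astype("Int64")  # pandas' nullable integer keeps NaN as <NA>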
Decoding civil_status into readable labels.
df.civil_status.unique()
## array([1., 0.])
To know what 1 or 0 mean, we need to check the data dictionary.
dict_civil_status = { 0. : "not_married",
1. : "married"}
df.civil_status = df.civil_status.replace(dict_civil_status)
df.civil_status.head()
## 0 married
## 1 married
## 2 not_married
## 3 married
## 4 married
## Name: civil_status, dtype: object
Before we deal with numerical variables, I will get rid of the 'Unnamed: 0' and 'id' features because they are useless in this case.
df.drop(columns=['Unnamed: 0', 'id'],inplace=True)
Seeing the effects of categorical Variables
We can use a colored barplot to see the interaction of these Categorical Variables with our Hypothesis.
sns_plot = sns.catplot(x="sex", y="salary", hue="region", kind="bar", data=df)
plt.show(sns_plot)
sns_plot = sns.catplot(x="sex", y="salary", hue="civil_status", kind="bar", data=df)
plt.show(sns_plot)
sns_plot = sns.catplot(x="sex", y="salary", hue="color_race", kind="bar", data=df)
plt.show(sns_plot)
Cleaning numerical data
If we pull back the code we used earlier, here are the numerical features of this dataset:
df.select_dtypes(include=[np.number]).columns
## Index(['age', 'years_study', 'salary'], dtype='object')
It is very common to reuse these kinds of snippets in data science scripts, so you shouldn't feel as bad about repeating yourself as you would in other endeavors such as normal software engineering, and you can always clean your analysis later.
To know what to "clean" in numerical data, I like to use plots such as a histogram.
df.salary.hist(bins = 10)
plt.show()
Here we can see that the data may have a few outliers around 1000000, and that the salary data has a large positive skew, meaning most data points lie to the left of the mean. We can see that better using a density plot instead.
plot_density = df.salary.plot.kde()
plot_density.set_xlim(0,100000)
## (0.0, 100000.0)
plot_density
Replacing variables
If we go back to our custom function, we can see that the values -1 and 999999 are unusually common; after consulting the data dictionary, we decided to replace these values with the mean of the column.
This operation would be wrong for machine-learning purposes, since a mean computed over all rows would leak information from the test set into the training set, but here in exploratory data analysis it is mostly fine. Also, you need to replace the values with NumPy's nan, or else the operation doesn't work as expected.
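For the machine-learning case, the leak-free pattern would be to compute the mean on the training split alone and reuse it on the test split; a sketch, assuming scikit-learn is installed:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)
fill_value = train.salary.mean()              # statistic computed on the training split only
train = train.fillna({"salary": fill_value})
test = test.fillna({"salary": fill_value})    # reuse the training statistic, no leakage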
df_copy = df.copy()
df_copy.salary.replace({-1: "NaN",999999:'NaN'},inplace = True)
df_copy.salary.fillna(df.salary.mean(),inplace= True)
pretty_value_counts(pd.DataFrame(df_copy.salary))
## Column salary with object data type
## counts percentages cum_percentages
## 19706.790323432902 18592 0.279705 0.279705
## 0.0 1841 0.027697 0.307402
## NaN 1468 0.022085 0.329487
## 5229.0 277 0.004167 0.333654
## 7200.0 260 0.003912 0.337566
# Create the new NaN values
df.salary.replace({-1:np.nan,999999:np.nan},inplace = True)
df.salary.fillna(df.salary.mean(),inplace= True)
pretty_value_counts(pd.DataFrame(df.salary))
## Column salary with float64 data type
## counts percentages cum_percentages
## 12422.39119 20060 0.301790 0.301790
## 0.00000 1841 0.027697 0.329487
## 5229.00000 277 0.004167 0.333654
## 7200.00000 260 0.003912 0.337566
## 7560.00000 244 0.003671 0.341237
And that is the magic of mutable data structures: no extra assignments are required. Quite useful, but be careful, there is no going back if you haven't saved a copy of your data.
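If you prefer the safer, assignment-based style over in-place mutation, an equivalent sketch of the same cleanup using mask and assign:
cleaned = df.salary.mask(df.salary.isin([-1, 999999]))  # sentinel codes become NaN
df = df.assign(salary=cleaned.fillna(cleaned.mean()))   # fill with the mean of the rest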
Log of numerical data
There is also a statistical solution for the positive skew in our data: we can take the log of the salary column. We have to add one to every value first, since the log of 0 goes to -Inf.
df.log_salary = np.log1p(df.salary)
## C:/Users/bruno/AppData/Local/r-miniconda/envs/r-reticulate/python.exe:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
pretty_value_counts(pd.DataFrame(df.log_salary))
## Column salary with float64 data type
## counts percentages cum_percentages
## 9.427336 20060 0.301790 0.301790
## 0.000000 1841 0.027697 0.329487
## 8.562167 277 0.004167 0.333654
## 8.881975 260 0.003912 0.337566
## 8.930759 244 0.003671 0.341237
But then it is gone:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 66470 entries, 0 to 66469
## Data columns (total 7 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 age 66470 non-null int64
## 1 sex 66470 non-null object
## 2 years_study 66036 non-null float64
## 3 color_race 66228 non-null object
## 4 salary 66470 non-null float64
## 5 civil_status 66470 non-null object
## 6 region 66470 non-null object
## dtypes: float64(2), int64(1), object(4)
## memory usage: 3.6+ MB
You are better off using the [ ] notation.
df['log_salary'] = np.log1p(df.salary)
plot_density = df.log_salary.plot.kde(bw_method= 0.5)
plot_density.set_xlim(0,15)
## (0.0, 15.0)
plot_density
It is now a useful feature for most simple linear models.
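And whenever we need results back on the original scale, the inverse of log1p is expm1; a sketch, with salary_back as my own hypothetical column name:
df['salary_back'] = np.expm1(df.log_salary)  # expm1 exactly undoes log1p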
Other numerical columns
df.age.hist(bins = 20)
plt.show()
plot_density = df.age.plot.kde()
plot_density
pretty_value_counts(pd.DataFrame(df.age))
## Column age with int64 data type
## counts percentages cum_percentages
## 20 2104 0.031653 0.031653
## 28 2056 0.030931 0.062585
## 26 2040 0.030691 0.093275
## 22 2034 0.030600 0.123875
## 27 2017 0.030345 0.154220
Age seems fine
Remember that back in the categorical variables section we brought years_study here so that we could impute its missing values.
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 66470 entries, 0 to 66469
## Data columns (total 8 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 age 66470 non-null int64
## 1 sex 66470 non-null object
## 2 years_study 66036 non-null float64
## 3 color_race 66228 non-null object
## 4 salary 66470 non-null float64
## 5 civil_status 66470 non-null object
## 6 region 66470 non-null object
## 7 log_salary 66470 non-null float64
## dtypes: float64(3), int64(1), object(4)
## memory usage: 4.1+ MB
We are missing 66470 - 66036 = 434 observations; this is a small enough number that we decided to drop these rows.
While we are dropping missing values, let's drop the color_race missing observations as well.
df.dropna(subset = ["years_study","color_race"],inplace= True)
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 65795 entries, 0 to 66469
## Data columns (total 8 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 age 65795 non-null int64
## 1 sex 65795 non-null object
## 2 years_study 65795 non-null float64
## 3 color_race 65795 non-null object
## 4 salary 65795 non-null float64
## 5 civil_status 65795 non-null object
## 6 region 65795 non-null object
## 7 log_salary 65795 non-null float64
## dtypes: float64(3), int64(1), object(4)
## memory usage: 4.5+ MB
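Had we preferred imputation over dropping, a minimal sketch of that alternative (the median is chosen arbitrarily here):
# fill missing years_study with the column median instead of dropping rows
df.years_study = df.years_study.fillna(df.years_study.median())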
Checking on years_study
df.years_study.hist(bins = 20)
plt.show()
Let’s convert it back into a Category
df.years_study = df.years_study.astype('category')
Saving our work for later
Here we have many options: we can, for example, rerun this script later, or save the modified df as a CSV. Both options are okay, but I will promote the use of a data format that keeps the mindful encoding choices we made. There are many alternatives here as well, but I will use Feather.
It is also always a good idea to separate the data from the script if you want reproducible work; that is where Excel mostly fails for me.
So, showing our data types:
df.dtypes
## age int64
## sex object
## years_study category
## color_race object
## salary float64
## civil_status object
## region object
## log_salary float64
## dtype: object
Using CSV may lose some data types.
df.to_csv(file_path_linux + '/finished_work.csv')
pd.read_csv(file_path_linux + '/finished_work.csv').dtypes
## Unnamed: 0 int64
## age int64
## sex object
## years_study float64
## color_race object
## salary float64
## civil_status object
## region object
## log_salary float64
## dtype: object
We lost our encoding of years_study, and when writing the CSV we also created this Unnamed: 0 column, which is useless to us.
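The Unnamed: 0 part, at least, is easy to avoid by telling pandas not to write the index; a sketch:
df.to_csv(file_path_linux + '/finished_work.csv', index=False)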
A better way is the Feather file format; you need to pip install pyarrow beforehand.
df.reset_index().to_feather(file_path_linux + '/sex_thesis_assignment.feather')
pd.read_feather(file_path_linux + '/sex_thesis_assignment.feather').dtypes
## index int64
## age int64
## sex object
## years_study category
## color_race object
## salary float64
## civil_status object
## region object
## log_salary float64
## dtype: object
Feather does keep the years_study dtype, but Feather is still in an experimental phase, so be careful with it. Parquet unfortunately fails to keep the dtypes; I don't know why.
It is also a good idea to use descriptive file names so that you can easily identify your datasets and scripts.
If you then need to delete these files, you can do it from inside Python.
os.remove(file_path_linux +'/finished_work.csv')
#os.remove(file_path_linux + '/sex_thesis_assignment.feather')
Next post
In the next post I will show the end of the analysis and the “answer” to our hypothesis.