Classification - KNN

Bruno Carlin

Aug 15, 2019 12 min read r-project

Classification - KNN

Bruno Carlin

Aug 15, 2019 12 min read r-project

Breast Cancer problem.

This is a problem that I have trid to solve using just the old tidymodels package and got stuck so here is the new implementation using the amazing tune and workflows packages

Setting up Rmarkdown

root.dir =

Loading Libraries

library(tidyverse)

## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## <U+2713> ggplot2 3.2.1     <U+2713> purrr   0.3.3
## <U+2713> tibble  2.1.3     <U+2713> dplyr   0.8.3
## <U+2713> tidyr   1.0.0     <U+2713> stringr 1.4.0
## <U+2713> readr   1.3.1     <U+2713> forcats 0.4.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------------- tidymodels 0.0.3 --

## <U+2713> broom     0.5.3     <U+2713> recipes   0.1.8
## <U+2713> dials     0.0.4     <U+2713> rsample   0.0.5
## <U+2713> infer     0.5.1     <U+2713> yardstick 0.0.4
## <U+2713> parsnip   0.0.4

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------ tidymodels_conflicts() --
## x scales::discard()   masks purrr::discard()
## x dplyr::filter()     masks stats::filter()
## x recipes::fixed()    masks stringr::fixed()
## x dplyr::lag()        masks stats::lag()
## x dials::margin()     masks ggplot2::margin()
## x yardstick::spec()   masks readr::spec()
## x recipes::step()     masks stats::step()
## x recipes::yj_trans() masks scales::yj_trans()

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)
library(DataExplorer)

Loading Libraries: 3.65 sec elapsed

Getting Data

Got the dataset with headers on kaggle link, there is also a cool explanation about the problem there.

df <- read_csv("breast_cancer.csv")

## Warning: Missing column names filled in: 'X33' [33]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   diagnosis = col_character(),
##   X33 = col_character()
## )

## See spec(...) for full column specifications.

## Warning: 569 parsing failures.
## row col   expected     actual                file
##   1  -- 33 columns 32 columns 'breast_cancer.csv'
##   2  -- 33 columns 32 columns 'breast_cancer.csv'
##   3  -- 33 columns 32 columns 'breast_cancer.csv'
##   4  -- 33 columns 32 columns 'breast_cancer.csv'
##   5  -- 33 columns 32 columns 'breast_cancer.csv'
## ... ... .......... .......... ...................
## See problems(...) for more details.

Set Chunk: 0.16 sec elapsed

There is a strange extra column named X33 dealing with that using janitor package

df <- df %>% 
  janitor::remove_empty_cols()

## Warning: 'janitor::remove_empty_cols' is deprecated.
## Use 'remove_empty("cols")' instead.
## See help("Deprecated")

Cleaning Data: 0.01 sec elapsed

Visualizing the data using DataExplorer and Skimr

Skimr is a fast way to get info on your data even though the hist plot fails on my blog :(

df %>% 
  skimr::skim()

Table 1: Data summary
Name	Piped data
Number of rows	569
Number of columns	32
_______________________
Column type frequency:
character	1
numeric	31
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
diagnosis	0	1	1	1	0	2	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	30371831.43	125020585.61	8670.00	869218.00	906024.00	8813129.00	911320502.00	▇▁▁▁▁
radius_mean	1	14.13	3.52	6.98	11.70	13.37	15.78	28.11	▂▇▃▁▁
texture_mean	1	19.29	4.30	9.71	16.17	18.84	21.80	39.28	▃▇▃▁▁
perimeter_mean	1	91.97	24.30	43.79	75.17	86.24	104.10	188.50	▃▇▃▁▁
area_mean	1	654.89	351.91	143.50	420.30	551.10	782.70	2501.00	▇▃▂▁▁
smoothness_mean	1	0.10	0.01	0.05	0.09	0.10	0.11	0.16	▁▇▇▁▁
compactness_mean	1	0.10	0.05	0.02	0.06	0.09	0.13	0.35	▇▇▂▁▁
concavity_mean	1	0.09	0.08	0.00	0.03	0.06	0.13	0.43	▇▃▂▁▁
concave points_mean	1	0.05	0.04	0.00	0.02	0.03	0.07	0.20	▇▃▂▁▁
symmetry_mean	1	0.18	0.03	0.11	0.16	0.18	0.20	0.30	▁▇▅▁▁
fractal_dimension_mean	1	0.06	0.01	0.05	0.06	0.06	0.07	0.10	▆▇▂▁▁
radius_se	1	0.41	0.28	0.11	0.23	0.32	0.48	2.87	▇▁▁▁▁
texture_se	1	1.22	0.55	0.36	0.83	1.11	1.47	4.88	▇▅▁▁▁
perimeter_se	1	2.87	2.02	0.76	1.61	2.29	3.36	21.98	▇▁▁▁▁
area_se	1	40.34	45.49	6.80	17.85	24.53	45.19	542.20	▇▁▁▁▁
smoothness_se	1	0.01	0.00	0.00	0.01	0.01	0.01	0.03	▇▃▁▁▁
compactness_se	1	0.03	0.02	0.00	0.01	0.02	0.03	0.14	▇▃▁▁▁
concavity_se	1	0.03	0.03	0.00	0.02	0.03	0.04	0.40	▇▁▁▁▁
concave points_se	1	0.01	0.01	0.00	0.01	0.01	0.01	0.05	▇▇▁▁▁
symmetry_se	1	0.02	0.01	0.01	0.02	0.02	0.02	0.08	▇▃▁▁▁
fractal_dimension_se	1	0.00	0.00	0.00	0.00	0.00	0.00	0.03	▇▁▁▁▁
radius_worst	1	16.27	4.83	7.93	13.01	14.97	18.79	36.04	▆▇▃▁▁
texture_worst	1	25.68	6.15	12.02	21.08	25.41	29.72	49.54	▃▇▆▁▁
perimeter_worst	1	107.26	33.60	50.41	84.11	97.66	125.40	251.20	▇▇▃▁▁
area_worst	1	880.58	569.36	185.20	515.30	686.50	1084.00	4254.00	▇▂▁▁▁
smoothness_worst	1	0.13	0.02	0.07	0.12	0.13	0.15	0.22	▂▇▇▂▁
compactness_worst	1	0.25	0.16	0.03	0.15	0.21	0.34	1.06	▇▅▁▁▁
concavity_worst	1	0.27	0.21	0.00	0.11	0.23	0.38	1.25	▇▅▂▁▁
concave points_worst	1	0.11	0.07	0.00	0.06	0.10	0.16	0.29	▅▇▅▃▁
symmetry_worst	1	0.29	0.06	0.16	0.25	0.28	0.32	0.66	▅▇▁▁▁
fractal_dimension_worst	1	0.08	0.02	0.06	0.07	0.08	0.09	0.21	▇▃▁▁▁

Skimr: 0.15 sec elapsed

Data Explorer

Is a imho a prettier option with individual cool plots and a super powerfull(but slow) report creation tool when working outside of an Rmarkdownm document

df %>% 
  DataExplorer::plot_intro()

df %>% 
  DataExplorer::plot_bar()

df %>% 
  DataExplorer::plot_correlation()

Data Explorer individual plots: 1.05 sec elapsed

There are much more amaziong tools such as the ggforce package , but I hope you get the gist of the exploration stage.

Modeling

For now I am going to focus on the tools provided by the tidymodels packages and the KNN, in the future I may come back to add more models and probably to play around the DALEX package a little bit.

Just to remember M is Malignant and B is Benign, we are trying to correcly classify our patients, I am going to ignore the id Varible since it should not be reliaded upon to generate predictions(Even though it may capture some interesting effects such as better screening for patients on the latter id’s).

Train Test Split

Usually we split our data into training and test data to ensure a fair evaluation of the models or parameters being tested(hoping to avoid overfitting).

The workflow for the tidymodels is that we first split our data.

df_split <- df %>% 
  rsample::initial_split(prop = 0.8)

Initial Split: 0 sec elapsed

df_training <- df_split %>% training()
df_testing <- df_split %>% testing()

Train test split: 0.02 sec elapsed

Then we model on our Training Data

Recipes

Recipes are used to preprocess our data, the main mistake here is using the whole data set.

The recipe package helps us with this process.

For those not familiarized with the formula notation I am fitting the model on all variables except the id variable.

I am than Normalizing my data since the KNN alghoritm is sensible to the scale of the variables being used, I am also excluding variables with high absolute correlation amongst themselves.

Recipes are easy to read and can be quite complex

df_recipe <- training(df_split) %>% 
  recipe(diagnosis ~ .) %>%
  step_rm(id) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors(),all_numeric()) %>% 
  step_corr(all_predictors())

recipes: 0 sec elapsed

We could then create our train and test data frames by baking our recipe and juicing our recipe

# df_testing <- df_recipe %>% 
#   bake(testing(df_split))
# df_testing
#df_training <- juice(df_recipe)

unnamed-chunk-1: 0 sec elapsed

But I am going for a Bayes search approch

Cross Validation and Bayes Search

Cross Validation

We further divide our data frame into folds in order to improve our certainty that the ideal number of neighbours is right.

cv_splits <- df_testing %>% vfold_cv(v = 5)

cvfold split: 0.02 sec elapsed

Using the new tune package currently on github

library(tune)
knn_mod <- 
  nearest_neighbor(neighbors = tune(), weight_func = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

unnamed-chunk-2: 0.05 sec elapsed

Combining everything so far in the new package workflow

library(workflows)
knn_wflow <- 
  workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(df_recipe)

unnamed-chunk-3: 0.04 sec elapsed

Limiting our search

knn_param <- 
  knn_wflow %>% 
  parameters() %>% 
    update(
    neighbors = neighbors(c(3, 50)),
    weight_func = weight_func(values = c("rectangular", "inv", "gaussian", "triangular"))
  )

unnamed-chunk-4: 0.03 sec elapsed

Searching for the best model

I used 5 iterations as the limit for the process because of printing reasons.

Keep in mind that mtune will maximize just the first metric from the package yardstick

ctrl <- control_bayes(verbose = TRUE,no_improve = 5)
set.seed(42)
knn_search <- tune_bayes(knn_wflow,
                         resamples = cv_splits,
                         initial = 5,
                         iter = 20,
                         param_info = knn_param,
                         control = ctrl,
                         metrics = metric_set(roc_auc,accuracy))

##

## >  Generating a set of 5 initial parameter results

## v Initialization complete

##

## Optimizing roc_auc using the expected improvement

##

## -- Iteration 1 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##

## i Current best:      roc_auc=0.9706 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 139 candidates

## i Predicted candidates

## i neighbors=31, weight_func=inv

## i Estimating performance

## v Estimating performance

## <U+24E7> Newest results: roc_auc=0.969 (+/-0.013)

##

## -- Iteration 2 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##

## i Current best:      roc_auc=0.9706 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 138 candidates

## i Predicted candidates

## i neighbors=3, weight_func=rectangular

## i Estimating performance

## v Estimating performance

## <U+24E7> Newest results: roc_auc=0.9538 (+/-0.0208)

##

## -- Iteration 3 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##

## i Current best:      roc_auc=0.9706 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 137 candidates

## i Predicted candidates

## i neighbors=50, weight_func=rectangular

## i Estimating performance

## v Estimating performance

## <U+24E7> Newest results: roc_auc=0.9432 (+/-0.0214)

##

## -- Iteration 4 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##

## i Current best:      roc_auc=0.9706 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 136 candidates

## i Predicted candidates

## i neighbors=15, weight_func=rectangular

## i Estimating performance

## v Estimating performance

## <U+24E7> Newest results: roc_auc=0.9676 (+/-0.0184)

##

## -- Iteration 5 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##

## i Current best:      roc_auc=0.9706 (@iter 0)

## i Gaussian process model

## v Gaussian process model

## i Generating 135 candidates

## i Predicted candidates

## i neighbors=3, weight_func=gaussian

## i Estimating performance

## v Estimating performance

## <U+24E7> Newest results: roc_auc=0.9562 (+/-0.0205)

## ! No improvement for 5 iterations; returning current results.

Bayes Search: 26.38 sec elapsed

Visualizing our search

autoplot(knn_search, type = "performance", metric = "accuracy")

unnamed-chunk-5: 0.16 sec elapsed

autoplot(knn_search, type = "performance", metric = "roc_auc")

unnamed-chunk-6: 0.2 sec elapsed

Seing the best result

collect_metrics(knn_search) %>% 
  dplyr::filter(.metric == "accuracy") %>% 
  arrange(mean %>% desc)

## # A tibble: 10 x 8
##    neighbors weight_func .iter .metric  .estimator  mean     n std_err
##        <int> <chr>       <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
##  1         3 gaussian        5 accuracy binary     0.939     5  0.0261
##  2         3 rectangular     2 accuracy binary     0.939     5  0.0261
##  3         5 inv             0 accuracy binary     0.939     5  0.0295
##  4        13 inv             0 accuracy binary     0.939     5  0.0221
##  5        42 gaussian        0 accuracy binary     0.939     5  0.0294
##  6        15 rectangular     4 accuracy binary     0.938     5  0.0263
##  7        33 inv             0 accuracy binary     0.930     5  0.0259
##  8        31 inv             1 accuracy binary     0.929     5  0.0222
##  9        27 rectangular     0 accuracy binary     0.921     5  0.0253
## 10        50 rectangular     3 accuracy binary     0.859     5  0.0207

unnamed-chunk-7: 0.03 sec elapsed

collect_metrics(knn_search) %>% 
  dplyr::filter(.metric == "roc_auc") %>% 
  arrange(mean %>% desc)

## # A tibble: 10 x 8
##    neighbors weight_func .iter .metric .estimator  mean     n std_err
##        <int> <chr>       <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
##  1        33 inv             0 roc_auc binary     0.971     5  0.0165
##  2        31 inv             1 roc_auc binary     0.969     5  0.0130
##  3        42 gaussian        0 roc_auc binary     0.968     5  0.0182
##  4        15 rectangular     4 roc_auc binary     0.968     5  0.0184
##  5        13 inv             0 roc_auc binary     0.967     5  0.0203
##  6        27 rectangular     0 roc_auc binary     0.961     5  0.0168
##  7         5 inv             0 roc_auc binary     0.959     5  0.0182
##  8         3 gaussian        5 roc_auc binary     0.956     5  0.0205
##  9         3 rectangular     2 roc_auc binary     0.954     5  0.0208
## 10        50 rectangular     3 roc_auc binary     0.943     5  0.0214

unnamed-chunk-8: 0.03 sec elapsed

Extracting the best model

best_metrics <- collect_metrics(knn_search) %>% 
  dplyr::filter(.metric == "roc_auc") %>% 
  arrange(mean %>% desc) %>% 
  head(1) %>% 
  select(neighbors,weight_func) %>% 
  as.list()

unnamed-chunk-9: 0.01 sec elapsed

Creating production model

production_knn <-  
  nearest_neighbor(neighbors = best_metrics$neighbors,weight_func = best_metrics$weight_func) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

unnamed-chunk-10: 0 sec elapsed

Creating production wflow

production_wflow <-   workflow() %>% 
  add_model(production_knn) %>% 
  add_recipe(df_recipe)

unnamed-chunk-11: 0 sec elapsed

Finally applying testing production model

fit_prod <- fit(production_wflow,df_training)

Fitting production: 0.28 sec elapsed

Metrics

fit_prod %>% predict(df_testing) %>% 
  bind_cols(df_testing %>% transmute(diagnosis = diagnosis %>% as.factor())) %>% 
  yardstick::metrics(truth = diagnosis,estimate = .pred_class)

## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.965
## 2 kap      binary         0.926

Calculating metrics: 0.03 sec elapsed

predict(fit_prod,df_testing,type = 'prob') %>%
  bind_cols(df_testing %>% transmute(diagnosis = diagnosis %>% as.factor())) %>% 
  yardstick::roc_auc(truth = diagnosis,.pred_B)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.989

Calculating metrics auc: 0.04 sec elapsed

Another visualization

knn_naive <- 
  nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

unnamed-chunk-12: 0 sec elapsed

Combining everything so far in the new package workflow

knn_wflow_naive <- 
  workflow() %>% 
  add_model(knn_naive) %>% 
  add_recipe(df_recipe)

unnamed-chunk-13: 0 sec elapsed

knn_naive_param <- 
  knn_wflow %>% 
  parameters() %>% 
    update(
    neighbors = neighbors(c(10, 50))
  )

unnamed-chunk-14: 0.02 sec elapsed

One advantage of the naive search is that it is easy to parallelize

all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loading required package: iterators

## Loading required package: parallel

cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set up parallel: 1.29 sec elapsed

ctrl <- control_grid(verbose = FALSE)
set.seed(42)
naive_search <- tune_grid(knn_wflow_naive,
                         resamples = cv_splits,
                         param_info = knn_naive_param,
                         control = ctrl,
                         grid = 50,
                         metrics = metric_set(roc_auc,accuracy))

naive grid search: 6.31 sec elapsed

best_naive_metrics <- collect_metrics(naive_search) %>% 
  dplyr::filter(.metric == "roc_auc") %>% 
  arrange(mean %>% desc)
DT::datatable(best_naive_metrics,options = list(pageLength = 5, scrollX=T))

unnamed-chunk-15: 0.03 sec elapsed

p <- best_naive_metrics %>% 
  ggplot() +
  aes(neighbors,mean) +
  geom_point() +
  ylim(.95,1)
p

unnamed-chunk-16: 0.16 sec elapsed

Checking the timing table

(x <- tic.log() %>%
  as.character() %>% 
  tibble(log = .) %>% 
  separate(log,sep = ': ',into = c('name','time'))) %>% 
  separate(time, sep = ' ',c('measure','units')) %>%
  mutate(measure = measure %>% as.numeric()) %>% 
  arrange(measure %>% desc())

## Warning: Expected 2 pieces. Additional pieces discarded in 31 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

## # A tibble: 31 x 3
##    name                            measure units
##    <chr>                             <dbl> <chr>
##  1 Bayes Search                      26.4  sec  
##  2 naive grid search                  6.31 sec  
##  3 Loading Libraries                  3.65 sec  
##  4 set up parallel                    1.29 sec  
##  5 Data Explorer  individual plots    1.05 sec  
##  6 Fitting production                 0.28 sec  
##  7 unnamed-chunk-6                    0.2  sec  
##  8 Set Chunk                          0.16 sec  
##  9 unnamed-chunk-5                    0.16 sec  
## 10 unnamed-chunk-16                   0.16 sec  
## # … with 21 more rows

unnamed-chunk-17: 0.02 sec elapsed

tidyverse tidymodels

Bruno Carlin

Data Scientist - Specialist

Data Scientist

Classification - KNN

Classification - KNN

Setting up Rmarkdown

Loading Libraries

Getting Data

Visualizing the data using DataExplorer and Skimr

Skimr is a fast way to get info on your data even though the hist plot fails on my blog :(

Data Explorer

Modeling

Train Test Split

Recipes

Cross Validation and Bayes Search

Cross Validation

Using the new tune package currently on github

Combining everything so far in the new package workflow

Limiting our search

Searching for the best model

Visualizing our search

Seing the best result

Extracting the best model

Creating production model

Creating production wflow

Finally applying testing production model

Metrics

Another visualization

Combining everything so far in the new package workflow

Checking the timing table

Bruno Carlin

Data Scientist - Specialist

Related