small_things

A repo to stash small (single file) coding solutions.
git clone https://git.eamoncaddigan.net/small_things.git

scale_and_apply.Rmd (3656B)


---
title: "R Notebook"
output: html_notebook
---

# Scale and Apply

When fitting regression models to data (other models too, but regression especially), it's useful to center variables to a mean of 0 and (depending on the type of variable) scale them to a standard deviation of 1. I'm feeling lazy and imprecise, so I'll flip between calling this operation "normalizing" and "scaling".
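
Concretely, for a single numeric vector this just means subtracting the mean and dividing by the standard deviation. A minimal sketch (the vector here is arbitrary and just for illustration):

```{r normalize_sketch}
# Center to mean 0 and scale to sd 1 "by hand" for one vector.
x <- c(2, 4, 6, 8, 10)
x_normed <- (x - mean(x)) / sd(x)
c(mean = mean(x_normed), sd = sd(x_normed))
```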

When performing machine learning, it's best practice to withhold the testing set, perform the normalization only on the training set, and then apply the training set features' means and standard deviations to the corresponding features in the testing set. This is one of those things my grad advisor taught me and I've taken to heart, but I don't think everybody is so careful about it.
     11 
     12 Regardless, I feel like I write the same code again and again, so I'm going to stash an implementation in this notebook and put it on Git for future use. 
     13 
     14 There are two steps: `find_norm` is called on a data frame and returns a named list. `apply_norm` is called with the aforementioned named list and a data frame (which can be the same one used in `find_norm` or a new with the same column names) and returns a normalized data frame.
     15 
     16 * Numeric columns in the range [0, 1] (integer or double) are centered but not scaled per my recollection of a suggestion from Andrew Gelman on handling binary variables
     17 * All other numeric columns are centered and scaled
     18 * Non-numeric columns (including factors) are left alone; convert them to numeric first if you want this to do anything

`mtcars` is a good demo; we'll ignore row names, and we have a couple variables in the range [0, 1]. We'll convert the number of cylinders to a factor.

```{r update_cars}
mtcars$cyl <- as.factor(mtcars$cyl)
```

We'll split the data into a training and test set.

```{r}
set.seed(414726326)
all_id <- sample(seq(nrow(mtcars)))
test_cars <- mtcars[all_id[1:6], ]
train_cars <- mtcars[all_id[7:length(all_id)], ]
```

Define `find_norm` and run it on our "training data".

```{r find_norm}
find_norm <- function(dat) {
  center_and_scale <- function(x) {
    # Treat logicals as 0/1 numerics
    if (is.logical(x))
      x <- as.numeric(x)
    if (is.numeric(x)) {
      x_mean <- mean(x, na.rm = TRUE)
      # Columns bounded in [0, 1] are centered only; everything else is
      # centered and scaled
      if (min(x, na.rm = TRUE) >= 0 && max(x, na.rm = TRUE) <= 1)
        c(center = x_mean)
      else
        c(center = x_mean, scale = sd(x, na.rm = TRUE))
    } else {
      # Non-numeric columns are skipped
      NULL
    }
  }
  
  lapply(dat, center_and_scale)
}

norm_list <- find_norm(train_cars)
```
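
Since non-numeric columns come back as `NULL` and columns bounded in [0, 1] only get a `center` entry, a quick peek at the structure of `norm_list` shows which treatment each column will get (purely an inspection step; nothing downstream depends on it):

```{r inspect_norm_list}
# NULL entries (e.g. the factor cyl) are skipped; entries with only a
# `center` element are centered but not scaled; entries with both are
# centered and scaled.
str(norm_list)
```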

Now we'll apply it to the same data we used to find the normalization values. For numeric data outside the range [0, 1], this is the same as applying `base::scale(x, center = TRUE, scale = TRUE)` (we'll sanity-check that below).

```{r apply_norm}
# Any columns present in `dat` and missing in `norm_list` will be ignored. Any
# columns missing from `dat` but present in `norm_list` will throw an error;
# I'll leave it to downstream me to deal with that.
apply_norm <- function(dat, norm_list) {
  for (i in seq_along(norm_list)) {
    col_name <- names(norm_list)[i]
    if (!is.null(norm_list[[i]])) {
      dat[[col_name]] <- dat[[col_name]] - norm_list[[i]]['center']
      if (!is.na(norm_list[[i]]['scale'])) {
        dat[[col_name]] <- dat[[col_name]] / norm_list[[i]]['scale']
      }
    }
  }
  dat
}

train_cars_normed <- apply_norm(train_cars, norm_list)
zapsmall(c(mean(train_cars_normed$mpg), sd(train_cars_normed$mpg)))
```
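
As a quick sanity check of the `base::scale` equivalence mentioned above (using `all.equal` to allow for floating point noise):

```{r check_against_scale}
# mpg is outside [0, 1], so it was both centered and scaled; this should
# match base::scale() applied to the training column directly.
all.equal(train_cars_normed$mpg, as.numeric(scale(train_cars$mpg)))
```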

We can apply the same normalization values to a different set of data...

```{r apply_norm_different_data}
test_cars_normed <- apply_norm(test_cars, norm_list)
zapsmall(c(mean(test_cars_normed$mpg), sd(test_cars_normed$mpg)))
```
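
For contrast (not part of the workflow, just a check), normalizing the test column with its *own* mean and sd would pin its mean to exactly zero, which is a quick way to see that the training statistics really were used above:

```{r own_stats_for_contrast}
# Using the test set's own statistics would force mean 0 and sd 1; this is
# the leaky version we deliberately avoided above.
mpg_leaky <- as.numeric(scale(test_cars$mpg))
zapsmall(c(mean(mpg_leaky), sd(mpg_leaky)))
```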

And now we don't have any leakage of information from our testing data into our training!