My Profile Photo

I'm Izzy


Focusing in Business Analytics and currently MBA Candidate for USC Marshall Class of '18.


P-value elimination in R

Update: Ok… I know p-value elimination for model-building is deeply flawed per https://robjhyndman.com/hyndsight/crossvalidation/, though this nonetheless helped me save a lot of time during my midterms~

So, I’m taking the Applied Modern Statistical Learning Methods class at USC Marshall, and we have a homework where we’re going to build a regression using a dataset with 38 variables. Right from the get go, we were to exclude 5 variables that were categorical/unique etc.

I hate typing things over and over again, and I’m not a fan of copy & paste. Thus, it took me some researching to figure out how to exclude variables in a list using the dplyr package. Really handy.

# I replaced the actual variable names with var

excluded_vars <- c('var1', 'var2', 'var3', 'var4', 'var5')

df.lm <- glm(yvar ~ .,
            data = select(df.training, -one_of(excluded_vars)),
            family = 'binomial')

I was pleased at my progress (which really could have been much faster if I just typed -var1 -var2 -etc in the regression model or using select, but I was so set on having a list for some reason). Then I saw the model’s summary, and was unhappy again. That’s still at least 30 variables, and with a quick look there were around 10 predictor variables with p-values greater than 0.05. Frankly, looking at those numbers made my eyes hurt.

Thus, I decided to try writng code that would automatically identify the predictor variable with the greatest p-value. This, again, took me more time than simply squinting at the numbers and finding the largest p-value…but no regrets.

## first extract predictor coefficient values
df.lm.summary <- as.data.frame(summary(df.lm)$coefficients)

## isolate variable with largest p-value
coefs <- df.lm.summary %>%
              mutate(variable = rownames(df.lm.summary)) %>%
              arrange(desc(`Pr(>|z|)`)) %>%
              filter(row_number() == 1)

Again, I thought I’d be satisfied… but then I would need to exclude these automatically identified variables over and over again through very manual typing (or copy & paste), which will result in a very long code/R markdown file. Once again I was very annoyed by the prospect of this manual work, so I naturally tried to find a way of automating this menial task with a for loop. After ~30 minutes of trying to solve this pesky problem, I finally figured out how to do it with for loops! (By the way, I do not really like R’s for loops…)

The final code I used to build that regression model is as follows:

# Automating the model building based on significance level of 0.05

## initialize dataset to use
df.training.new <- select(df.training, -one_of(excluded_vars))

## initialize data frame to record largest p-value in regression model, nrow = ncol() because it represents number of variables
pvalues <- matrix(NA, nrow = ncol(df.training.new), ncol = 2, dimnames = list(NULL,c('variable', 'pvalue'))) 

pvalues <- as.data.frame(pvalues)

for(i in 1:ncol(df.training.new)){
  
  df.lm <- glm(yvar ~ .,
                   data = df.training.new,
                   family = 'binomial')
  
  ## extract variable summary
  df.lm.summary <- as.data.frame(summary(df.lm)$coefficients)
  
  ## isolate variable with largest p-value
  coefs <- df.lm.summary %>%
              mutate(variable = rownames(df.lm.summary)) %>%
              arrange(desc(`Pr(>|z|)`)) %>%
              filter(row_number() == 1)
  
  ## write variable name and p-value to dataframe
  pvalues[i,1] <- coefs$variable
  pvalues[i,2] <- coefs$`Pr(>|z|)`
  
  variable <- as.character(coefs$variable) # predictor variable with largest p-value in i iteration of regression
  pvalue <- as.numeric(coefs$`Pr(>|z|)`) # largest p-value in i iteration of regression
  
  
  ## remove the variable with largest p-value
  df.training.new <- df.training.new %>%
    select(-one_of(variable)) 
  
  ## end loop if the largest p-value is less than 0.05
  if (pvalue <= 0.05) break
  
}

## variables to exclude based on pvalues dataframe
exclude <- pvalues %>%
  filter(pvalue >= 0.05) %>% #because on variable will have p-value less than 0.05
  select(variable) %>% #only need variable names
  t() %>% # transform so it can be turned into a vector
  as.vector() 

excluded_vars.final <- c(excluded_vars, exclude) #combine vectors to form list of variables to exclude


df.lm.final <- glm(yvar ~ .,
            data = select(df.training, -one_of(excluded_vars.final)),
            family = 'binomial')

summary(df.lm.final) # all predictor variables in final summary results have p-values <= 0.05

Needless to say, I am very pleased with this method, even though I could’ve just done the same thing much faster if I manually looked at every summary table and typed/copy-pasted many iterations of the regression model. But the thought of doing such a repetitive task makes me feel lazy, so I would much rather spend time figuring out how to automate this p-value elimination hell…. Maybe there is an existing library/function that does something similar, but it felt good figuring this out.

I shall use this process in the future if I, for some reason, ever need to use regression on a dataset with many variables again~