R Programming By Example

Cleaning and setting up the data

Setting up the data for this example is straightforward. We will load the data, correctly label missing values, and create some new variables for our analysis. Before we start, make sure the data_brexit_referendum.csv file is in the same directory as the code you're working with, and that your working directory is properly set up. If you don't know how to do so, setting up your working directory is quite easy: you simply call the setwd() function, passing the directory you want to use as a string. For example, setwd("/home/user/examples/") would make R use the /home/user/examples directory both to look for files and to save files.
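
As a quick sanity check before loading anything, the following sketch changes the working directory, confirms it, and tests whether the data file is visible from there. The directory used here (tempdir()) is only a stand-in chosen so that the snippet runs anywhere; in practice you would pass your own project directory.

```r
# tempdir() is a placeholder for illustration; replace it with your own directory.
example_dir <- tempdir()

# setwd() returns the previous working directory, so we can restore it later.
old_dir <- setwd(example_dir)

# Confirm where R will now look for (and save) files.
current_dir <- getwd()
print(current_dir)

# Check that the data file is visible before trying to read it.
has_data <- file.exists("data_brexit_referendum.csv")
print(has_data)

# Restore the original working directory.
setwd(old_dir)
```

If has_data is FALSE, either the file is somewhere else or the working directory is not the one you intended.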

We can load the contents of the data_brexit_referendum.csv file into a data frame (the most intuitive structure to use with data in CSV format) by using the read.csv() function. Note that the data has some missing values in the Leave variable, which are coded as -1. However, the proper way to identify missing values in R is with NA, so we replace the -1 values with NA.

data <- read.csv("./data_brexit_referendum.csv") 
data[data$Leave == -1, "Leave"] <- NA
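
To see what this replacement does on a small scale, here is a sketch with a made-up data frame. The toy values are assumptions for illustration and are not taken from the referendum data.

```r
# Toy data: -1 marks a missing value, as in the Leave variable.
toy <- data.frame(ID = 1:5, Leave = c(120, -1, 87, -1, 45))

# Select the rows where Leave is -1 and overwrite that column with NA.
toy[toy$Leave == -1, "Leave"] <- NA

print(toy$Leave)
```

The second and fourth observations now hold NA, while the remaining values are untouched.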

To count the number of missing values in our data, we can use the is.na() function to get a logical (Boolean) vector that contains TRUE values to identify missing values and FALSE values to identify non-missing values. The length of such a vector will be equal to the length of the vector used as input, which is the Leave variable in our case. Then, we can use this logical vector as input for sum(), leveraging the way R treats TRUE/FALSE values in arithmetic: TRUE is treated as 1, while FALSE is treated as 0. We find that the number of missing values in the Leave variable is 267.

sum(is.na(data$Leave))
#> [1] 267
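
The same counting idea can be traced step by step on a short vector. The values below are made up for illustration.

```r
leave <- c(120, NA, 87, NA, 45)

# Logical vector: TRUE where the value is missing.
missing <- is.na(leave)
print(missing)

# It has the same length as the input vector.
print(length(missing) == length(leave))

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of missing values.
n_missing <- sum(missing)
print(n_missing)

# By the same logic, mean() gives the share of missing values.
share_missing <- mean(missing)
print(share_missing)
```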

If we want to, we can use a mechanism to fill in the missing values. A common and straightforward mechanism is to impute the variable's mean. In Chapter 3, Predicting Votes with Linear Models, we will use linear regression to estimate these missing values. However, we will keep things simple for now and just leave them as missing values.
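
Although we do not impute anything in this chapter, a minimal sketch of mean imputation could look as follows. The vector and the variable names are hypothetical, and the result is stored in a copy so that the original values are preserved.

```r
leave <- c(120, NA, 87, NA, 45)

# Mean of the observed values only; na.rm = TRUE drops the NA values.
leave_mean <- mean(leave, na.rm = TRUE)

# Work on a copy and fill each missing position with the mean.
leave_imputed <- leave
leave_imputed[is.na(leave_imputed)] <- leave_mean

print(leave_imputed)
```

Keep in mind that this kind of imputation reduces the variability of the variable, which is one reason to prefer a model-based estimate later on.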

We now proceed to define a new variable, Proportion, which will contain the proportion of votes in favor of leaving the EU. To do so, we divide the Leave variable (number of votes in favor of leaving) by the NVotes variable (number of votes in total), for each ward. Given the vectorized nature of R, this is straightforward:

data$Proportion <- data$Leave / data$NVotes

We are creating a new variable in the data frame by simply assigning to it. There's no difference between creating a new variable and modifying an existing one, which means that we need to be careful when doing so to make sure we're not overwriting an old variable by accident.
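
One simple way to guard against accidental overwrites is to check the column names before assigning. The sketch below does this on a made-up data frame and also shows that the division is performed element by element, with NA propagating to the result.

```r
# Toy data for illustration only.
toy <- data.frame(Leave = c(120, NA, 87), NVotes = c(200, 150, 300))

# Is there already a column with the name we are about to use?
already_there <- "Proportion" %in% names(toy)
print(already_there)

# Element-wise division: one result per row, NA wherever Leave is NA.
toy$Proportion <- toy$Leave / toy$NVotes
print(toy$Proportion)
```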

Now, we create a new variable that classifies each ward according to whether most of its votes were in favor of leaving or remaining in the EU. If more than 50 percent of a ward's votes were in favor of leaving, we mark the ward as having voted to leave; otherwise, we mark it as having voted to remain. Again, R makes this very simple with the use of the ifelse() function. If the mentioned condition (first parameter) holds true, then the value assigned will be "Leave" (second parameter); otherwise it will be "Remain" (third parameter). This is a vectorized operation, so it will be done for each observation in the data frame:

data$Vote <- ifelse(data$Proportion > 0.5, "Leave", "Remain")
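
It is worth seeing how ifelse() behaves at the boundary and with missing values. In the following sketch (toy values, not the referendum data), a proportion of exactly 0.5 is classified as "Remain" because the condition uses a strict inequality, and an NA proportion produces an NA classification.

```r
proportion <- c(0.60, NA, 0.29, 0.50)

# One result per element; NA in the condition yields NA in the output.
vote <- ifelse(proportion > 0.5, "Leave", "Remain")
print(vote)
```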

Sometimes, people like to use a different syntax for these types of operations; they will use a subset-assign approach, which is slightly different from what we used. We won't go into the details of the differences among these approaches, but keep in mind that the latter approach may give you an error in our case:

data[data$Proportion >  0.5, "Vote"] <- "Leave"
data[data$Proportion <= 0.5, "Vote"] <- "Remain"

#> Error in `[<-.data.frame`(`*tmp*`, data$Proportion > 0.5, "Vote", value = "Leave"):
#> missing values are not allowed in subscripted assignments of data frames

This happens because the Proportion variable contains some missing values that were consequences of the Leave variable having some NA values in the first place. Since we can't compute a Proportion value for observations with NA values in Leave, when we create it, the corresponding values also get an NA value assigned.
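
The error can be reproduced on a tiny data frame, which makes the cause easier to see. This sketch uses made-up values and wraps the assignment in tryCatch() so that the message is captured instead of stopping the script.

```r
toy <- data.frame(Proportion = c(0.60, NA, 0.29))

# The comparison itself is fine, but it contains an NA.
print(toy$Proportion > 0.5)

# Using that logical vector as a row index in an assignment fails.
msg <- tryCatch({
  toy[toy$Proportion > 0.5, "Vote"] <- "Leave"
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```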

If we insist on using the subset-assign approach, we can make it work by using the which() function. It returns only the positions where the comparison is TRUE, so observations where the comparison yields NA are dropped just as if they were FALSE. This way it won't give us an error, and we will get the same result as using the ifelse() function. We should use the ifelse() function when possible because it's simpler, easier to read, and more efficient (more about this in Chapter 9, Implementing an Efficient Simple Moving Average).

data[which(data$Proportion >  0.5), "Vote"] <- "Leave"
data[which(data$Proportion <= 0.5), "Vote"] <- "Remain"
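
To make the role of which() concrete, the sketch below applies it to a comparison that contains an NA and then repeats the subset-assign approach on a toy data frame (made-up values). The row with a missing Proportion is never selected, so its Vote stays NA.

```r
cond <- c(0.60, NA, 0.29) > 0.5
print(cond)

# which() keeps only the positions that are TRUE; NA and FALSE are dropped.
idx <- which(cond)
print(idx)

toy <- data.frame(Proportion = c(0.60, NA, 0.29))
toy[which(toy$Proportion >  0.5), "Vote"] <- "Leave"
toy[which(toy$Proportion <= 0.5), "Vote"] <- "Remain"
print(toy$Vote)
```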

Down the road, we will want to create plots that include the RegionName information, and long names will most likely make them hard to read. To fix that, we can shorten those names while we are in the process of cleaning the data.

data$RegionName <- as.character(data$RegionName)
data[data$RegionName == "London", "RegionName"] <- "L"
data[data$RegionName == "North West", "RegionName"] <- "NW"
data[data$RegionName == "North East", "RegionName"] <- "NE"
data[data$RegionName == "South West", "RegionName"] <- "SW"
data[data$RegionName == "South East", "RegionName"] <- "SE"
data[data$RegionName == "East Midlands", "RegionName"] <- "EM"
data[data$RegionName == "West Midlands", "RegionName"] <- "WM"
data[data$RegionName == "East of England", "RegionName"] <- "EE"
data[data$RegionName == "Yorkshire and The Humber", "RegionName"] <- "Y"
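
The same renaming can also be expressed with a named character vector used as a lookup table, which keeps all the abbreviations in one place. This is an alternative sketch, not the approach used in the rest of the chapter; the region_codes name and the toy data frame (including a region that has no abbreviation) are assumptions for illustration.

```r
# Lookup table: names are the long region names, values are the abbreviations.
region_codes <- c(
  "London"                   = "L",
  "North West"               = "NW",
  "North East"               = "NE",
  "South West"               = "SW",
  "South East"               = "SE",
  "East Midlands"            = "EM",
  "West Midlands"            = "WM",
  "East of England"          = "EE",
  "Yorkshire and The Humber" = "Y"
)

# Toy data; the second region is deliberately absent from the lookup table.
toy <- data.frame(
  RegionName = c("London", "Some Other Region", "North East"),
  stringsAsFactors = FALSE
)

# Only replace the names we have an abbreviation for; leave the rest as-is.
known <- toy$RegionName %in% names(region_codes)
toy$RegionName[known] <- unname(region_codes[toy$RegionName[known]])

print(toy$RegionName)
```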

Note that the first line in the previous code block converts RegionName to character type. Before we do this, the variable may be a factor (the default when reading data with read.csv() in R versions before 4.0.0), which prevents us from assigning a value different from the ones already contained in the variable. In such a case, R issues the warning invalid factor level, NA generated and stores NA instead of the new value. To avoid this problem, we need to perform the type transformation.
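
The factor problem is easy to reproduce on a two-element vector. In this sketch (toy values), the warning is captured with tryCatch(), which also abandons the assignment, and the same replacement is then done safely after converting to character.

```r
region <- factor(c("London", "North West"))

# "L" is not one of the existing levels, so the assignment triggers a warning.
warn <- tryCatch({
  region[region == "London"] <- "L"
  "no warning"
}, warning = function(w) conditionMessage(w))
print(warn)

# After converting to character, any value can be assigned.
region_chr <- as.character(region)
region_chr[region_chr == "London"] <- "L"
print(region_chr)
```

Passing stringsAsFactors = FALSE to read.csv() avoids the issue at load time, at the cost of treating every text column as character.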

We now have clean data ready for analysis. We have created a new variable of interest for us (Proportion), which will be the focus of the rest of this chapter and the next one, since in this example, we're interested in finding out the relations among other variables and how people voted in the referendum.