Skip to contents

The GEO712PACKAGE is a test R package created for the GEO712 course at McMaster University (fall term of 2023). In this R package we share the processed data of a multiple regression model. We use a data set from Kaggle, named “House Sales in King County, USA”, to predict house price using a regression model. The data set contains house sales prices for King County, which includes Seattle. The original data set includes homes sold between May 2014 and May 2015, with 21,613 registers and 21 variables regarding the houses sold.

You can install the development version of GEO712PACKAGE from GitHub with:

# install.packages("devtools")
# library("devtools")
# devtools::install_github("dias-bruno/GEO712PACKAGE")

This is a basic example which shows you how to access the data files. First, we’ll need call the library “GEO712PACKAGE:

Accessing the data:

data(kc_house_data_processed)

Visualizing the summary of the data:

summary(kc_house_data_processed)
#>      price            bedrooms        bathrooms      sqft_living   
#>  Min.   :  75000   Min.   : 0.000   Min.   :0.000   Min.   :  290  
#>  1st Qu.: 321838   1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1426  
#>  Median : 450000   Median : 3.000   Median :2.250   Median : 1910  
#>  Mean   : 538176   Mean   : 3.369   Mean   :2.114   Mean   : 2078  
#>  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550  
#>  Max.   :4668000   Max.   :11.000   Max.   :8.000   Max.   :13540  
#>     sqft_lot           floors        waterfront            view       
#>  Min.   :    520   Min.   :1.000   Min.   :0.000000   Min.   :0.0000  
#>  1st Qu.:   5040   1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000  
#>  Median :   7616   Median :1.500   Median :0.000000   Median :0.0000  
#>  Mean   :  15109   Mean   :1.494   Mean   :0.007414   Mean   :0.2335  
#>  3rd Qu.:  10684   3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000  
#>  Max.   :1651359   Max.   :3.500   Max.   :1.000000   Max.   :4.0000  
#>    condition         grade          sqft_above   sqft_basement   
#>  Min.   :1.000   Min.   : 1.000   Min.   : 290   Min.   :   0.0  
#>  1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0  
#>  Median :3.000   Median : 7.000   Median :1560   Median :   0.0  
#>  Mean   :3.409   Mean   : 7.655   Mean   :1787   Mean   : 290.9  
#>  3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0  
#>  Max.   :5.000   Max.   :13.000   Max.   :9410   Max.   :4820.0  
#>     yr_built     yr_renovated        zipcode           lat       
#>  Min.   :1900   Min.   :   0.00   Min.   :98001   Min.   :47.16  
#>  1st Qu.:1951   1st Qu.:   0.00   1st Qu.:98033   1st Qu.:47.47  
#>  Median :1975   Median :   0.00   Median :98065   Median :47.57  
#>  Mean   :1971   Mean   :  84.35   Mean   :98078   Mean   :47.56  
#>  3rd Qu.:1997   3rd Qu.:   0.00   3rd Qu.:98118   3rd Qu.:47.68  
#>  Max.   :2015   Max.   :2015.00   Max.   :98199   Max.   :47.78  
#>       long        sqft_living15    sqft_lot15     predicted_values 
#>  Min.   :-122.5   Min.   : 399   Min.   :   651   Min.   :-508596  
#>  1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100   1st Qu.: 340290  
#>  Median :-122.2   Median :1840   Median :  7620   Median : 489068  
#>  Mean   :-122.2   Mean   :1986   Mean   : 12770   Mean   : 538176  
#>  3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10080   3rd Qu.: 678469  
#>  Max.   :-121.3   Max.   :6210   Max.   :871200   Max.   :3016522

First, let’s visualize the correlation matrix. We’ll use the ggplot2 library, an R package for data visualization. Calculate the correlation matrix:

cormat <- round(cor(kc_house_data_processed),2)
melted_cormat <- melt(cormat)
head(melted_cormat)
#>          Var1  Var2 value
#> 1       price price  1.00
#> 2    bedrooms price  0.32
#> 3   bathrooms price  0.52
#> 4 sqft_living price  0.70
#> 5    sqft_lot price  0.09
#> 6      floors price  0.26

Creating a correlation heatmap:

# Function to get lower or upper triangle of the correlation matrix
get_triangle <- function(cormat, upper = TRUE){
  if (upper) cormat[lower.tri(cormat)] <- NA
  else cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Function to reorder the correlation matrix
reorder_cormat <- function(cormat){
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  return(cormat[hc$order, hc$order])
}

# Reorder and get upper triangle of the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_triangle(cormat, upper = TRUE)

# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, size = 8, hjust = 1)) +
  coord_fixed()

# Print the heatmap
print(ggheatmap)

We can see the ten variables with the highest positive correlation:

melted_cormat <- melted_cormat[order(-melted_cormat$value), ]
melted_cormat[melted_cormat$value != 1,][1:10,]
#>                 Var1             Var2 value
#> 210      sqft_living       sqft_above  0.87
#> 126            price predicted_values  0.84
#> 187 predicted_values      sqft_living  0.83
#> 167 predicted_values            grade  0.80
#> 188    sqft_living15      sqft_living  0.76
#> 189            grade      sqft_living  0.76
#> 209            grade       sqft_above  0.75
#> 250      sqft_living        bathrooms  0.75
#> 208    sqft_living15       sqft_above  0.73
#> 42          sqft_lot       sqft_lot15  0.72

The variables “sqft_living” and “sqft_above” show the highest positive correlation. The variable “sqft_living” refers to the area of the living room in square meters, while the variable “sqft_above” refers to the area of the plot in square meters. We can also see that there is a high positive linear correlation between the “price” and the “predicted_value” of the houses. This is a good result because the “predicted_value” is the result of a linear regression used as a model to predict house values.

We can also see the ten variables with the highest negative correlation:

melted_cormat <- melted_cormat[order(melted_cormat$value), ]
melted_cormat[melted_cormat$value != 1,][1:10,]
#>              Var1          Var2 value
#> 321          long       zipcode -0.56
#> 365      yr_built     condition -0.36
#> 325      yr_built       zipcode -0.35
#> 328 sqft_living15       zipcode -0.28
#> 331    sqft_above       zipcode -0.26
#> 364        floors     condition -0.26
#> 384        floors sqft_basement -0.25
#> 265      yr_built  yr_renovated -0.22
#> 330   sqft_living       zipcode -0.20
#> 333     bathrooms       zipcode -0.20

In general, the variables show more positive correlation than negative correlation. In general, the correlations are coherent, with similar variables having a high positive correlation and contrasting variables showing a negative correlation.

Let’s look at the pair graphs for the first five variables:

pairs(kc_house_data_processed[,1:5])

We can create a boxplot to analyze how the price varies depending on some characteristics of the house. Using the “waterfront” feature (meaning whether the house is located by the sea), the code below creates a boxplot chat:

kc_house_data_processed$waterfront <- factor(kc_house_data_processed$waterfront)

ggplot(kc_house_data_processed, aes(x = waterfront, y = price)) +
  geom_boxplot(fill = "skyblue", color = "blue") +
  labs(title = "Boxplot of Price by Waterfront", x = "Waterfront", y = "Price")

We can see that houses with a waterfront (=1) have higher values compared to houses without a waterfront (=0). The same analysis can be employed for the “bedrooms” variable, in which house values tend to be higher as the number of bedrooms increases, until it becomes constant (9 < n < 10) and then the house price starts to decrease:

kc_house_data_processed$bedrooms <- factor(kc_house_data_processed$bedrooms)

ggplot(kc_house_data_processed, aes(x = bedrooms, y = price)) +
  geom_boxplot(fill = "skyblue", color = "blue") +
  labs(title = "Boxplot of Price by Bedrooms", x = "Bedrooms", y = "Price")

Now, let’s take a look at some graphs of property prices and the predicted values according to the regression model:

ggplot(kc_house_data_processed, aes(x = predicted_values)) +
  geom_histogram(binwidth = 10000, fill = "skyblue", color = "blue", alpha = 0.7) +
  labs(title = "Histogram of Predicted Values", x = "Predicted Values", y = "Frequency")


ggplot(kc_house_data_processed, aes(x = price)) +
  geom_histogram(binwidth = 10000, fill = "lightcoral", color = "red", alpha = 0.7) +
  labs(title = "Histogram of Price Values", x = "Price", y = "Frequency")

Plot a scatter graph of house prices and predicted values:

ggplot(kc_house_data_processed, aes(x = predicted_values, y = price)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(title = "Scatter Plot of Price vs. Predicted Values", x = "Predicted Values", y = "Price")

It is possible to obtain the regression residuals, which are the difference between the predicted values and the actual values (real prices, in this case):

kc_house_data_processed$residuals <- kc_house_data_processed$price - kc_house_data_processed$predicted_values

# View the first few rows of the dataframe with residuals
head(kc_house_data_processed)
#>     price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
#> 1  221900        3      1.00        1180     5650      1          0    0
#> 2  538000        3      2.25        2570     7242      2          0    0
#> 3  180000        2      1.00         770    10000      1          0    0
#> 4  604000        4      3.00        1960     5000      1          0    0
#> 5  510000        3      2.00        1680     8080      1          0    0
#> 6 1225000        4      4.50        5420   101930      1          0    0
#>   condition grade sqft_above sqft_basement yr_built yr_renovated zipcode
#> 1         3     7       1180             0     1955            0   98178
#> 2         3     7       2170           400     1951         1991   98125
#> 3         3     6        770             0     1933            0   98028
#> 4         5     7       1050           910     1965            0   98136
#> 5         3     8       1680             0     1987            0   98074
#> 6         3    11       3890          1530     2001            0   98053
#>       lat     long sqft_living15 sqft_lot15 predicted_values   residuals
#> 1 47.5112 -122.257          1340       5650         216186.4    5713.553
#> 2 47.7210 -122.319          1690       7639         715102.2 -177102.187
#> 3 47.7379 -122.233          2720       8062         402052.2 -222052.186
#> 4 47.5208 -122.393          1360       5000         452637.5  151362.530
#> 5 47.6168 -122.045          1800       7503         444936.2   65063.816
#> 6 47.6561 -122.005          4760     101930        1431551.4 -206551.391

The following code provides a visual representation of the residue distribution:

kc_house_data_processed$residuals <- kc_house_data_processed$price - kc_house_data_processed$predicted_values
hist(kc_house_data_processed$residuals, main = "Histogram of Residuals", xlab = "Residuals")