Project 1 Write-up

For this first project, I scraped data from Zillow on 400 houses for sale in Charlotte, NC. For each house I collected the address, listed price, number of bedrooms, number of bathrooms, and square footage. The cleaned data can be found HERE. As a quick summary, so you don't have to scroll through every row: listed prices range from $59,000 to $69,500,000, bedroom and bathroom counts each range from 1 to 6, and house sizes range from 691 to 5,775 square feet (sqft). Because of this spread, I thought it would be interesting to look at a parallel coordinates plot (PCP), which shows relationships between variables. For the plot, I used the scaled price and grouped the larger prices together to make it easier to read. Houses at the high end of each variable tended to have a higher price. Toward the bottom of the PCP, however, it is harder to distinguish higher-priced houses from lower-priced ones. A likely reason is that, at lower room counts and square footage, houses with similar values of these variables vary widely in price, so these three variables alone are not enough to model price adequately.

PCP
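As a rough sketch of the scaling and grouping step behind a plot like this, the snippet below min-max scales a list of prices and assigns each one to a price group, with every price above the last boundary falling into a single top group. The prices, the group boundaries, and the function names `min_max_scale` and `price_group` are made up for illustration; the write-up does not say which scaling or cut-offs were actually used.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def price_group(price, edges):
    """Return the index of the first boundary the price falls below.

    Prices at or above the last boundary all share the final group,
    which lumps the most expensive listings together.
    """
    for i, edge in enumerate(edges):
        if price < edge:
            return i
    return len(edges)


# Hypothetical listing prices in dollars, not rows from the scraped data.
prices = [59_000, 185_000, 240_000, 410_000, 975_000, 2_300_000]
# Hypothetical group boundaries; everything from $500,000 up is one group.
edges = [150_000, 300_000, 500_000]

scaled = min_max_scale(prices)
groups = [price_group(p, edges) for p in prices]

for price, s, g in zip(prices, scaled, groups):
    print(f"${price:>9,}  scaled={s:.3f}  group={g}")
```

Grouping the top prices keeps a handful of very expensive listings from stretching the color scale so far that the rest of the houses look identical.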

The model used to predict each house's price is very similar to the first model covered in this class: it uses the TensorFlow package and contains only one Dense layer. The optimizer was stochastic gradient descent (sgd) and the loss function was mean squared error. As in the first example involving houses, the number of bedrooms, number of bathrooms, and sqft are stacked along one axis and trained against the actual house prices. Note that the resulting model predicts prices for the same houses it was trained on. To analyze the output, I thought a graph comparing the predicted and actual prices would help show how accurate the model is.
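To make the training setup concrete, here is a plain-Python stand-in for what a single linear unit trained with stochastic gradient descent on mean squared error computes. It is not the TensorFlow code used for this project: it assumes the Dense layer has one unit and no activation, and the feature rows, targets, learning rate, and epoch count are all invented for illustration.

```python
import random


def predict(weights, bias, features):
    """Output of one linear unit: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(weights, bias, rows, targets):
    """Average squared gap between predictions and targets."""
    errors = [predict(weights, bias, r) - t for r, t in zip(rows, targets)]
    return sum(e * e for e in errors) / len(errors)


def train_sgd(rows, targets, learning_rate=0.005, epochs=500, seed=0):
    """Fit the weights and bias with per-example stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            error = predict(weights, bias, rows[i]) - targets[i]
            # Gradient of one example's squared error is 2 * error * input.
            weights = [w - learning_rate * 2 * error * x
                       for w, x in zip(weights, rows[i])]
            bias -= learning_rate * 2 * error
    return weights, bias


# Hypothetical houses: [bedrooms, bathrooms, sqft in thousands].
rows = [[2, 1, 0.9], [3, 2, 1.5], [3, 2.5, 2.1],
        [4, 3, 2.8], [5, 4, 4.0], [1, 1, 0.7]]
# Hypothetical prices in hundreds of thousands of dollars, generated from a
# made-up linear rule so that an exact fit exists.
targets = [1.7, 2.8, 3.55, 4.6, 6.3, 1.3]

initial_mse = mean_squared_error([0.0, 0.0, 0.0], 0.0, rows, targets)
weights, bias = train_sgd(rows, targets)
final_mse = mean_squared_error(weights, bias, rows, targets)
print(f"MSE before training: {initial_mse:.3f}, after: {final_mse:.3f}")
```

The toy data uses small units (thousands of square feet, hundreds of thousands of dollars) on purpose: with a fixed learning rate, SGD can diverge when the inputs or targets are very large numbers.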

Plot

As you can see, this model was not very good at predicting house prices from the number of bedrooms, number of bathrooms, and square footage. Not only is there no apparent correlation between the actual and predicted values, but the mean squared error was also extremely high. Although this may be a problem with my model, I thought it would be interesting to look at heatmaps of the correlations between the input variables and the predicted and actual prices. On the left, in red, is the heatmap for bedrooms, bathrooms, sqft, and predicted price; on the right, in blue, is the same heatmap with actual price in place of predicted price.
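The numbers behind heatmaps of this kind are pairwise correlations. The sketch below computes a correlation matrix by hand, assuming the Pearson coefficient (the write-up does not say which correlation measure was used). The column values and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to the correlation of their values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical values for five houses, not taken from the scraped data.
columns = {
    "bedrooms": [2, 3, 3, 4, 5],
    "bathrooms": [1, 2, 2.5, 3, 4],
    "sqft": [900, 1500, 2100, 2800, 4000],
    "price": [170_000, 310_000, 290_000, 520_000, 610_000],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name.ljust(10), "  ".join(f"{v:+.2f}" for v in row.values()))
```

Running the same function once with the actual prices and once with the predicted prices in the `price` column would give the two matrices being compared.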

Heatmaps

Both heatmaps show very little correlation between price and bedrooms, and slightly more between price and bathrooms and between price and sqft. I included visuals for both actual and predicted prices because, although they resemble each other, the correlations with predicted price are much higher than those with actual price. This led me to conclude that most of the learning came from the sqft variable, because its correlation with the predicted prices was significantly higher and sqft has much more variance than bathrooms or bedrooms. I also think that adding a variable unrelated to the dimensions of the house would increase the accuracy of the model, for example proximity to good schools, downtown, or a body of water.