Boston housing prices


## Problem description

Boston is best known for its famous baked beans, Fenway Park, the Boston Marathon, and of course, for the bar from Cheers. Many Americans consider it the best city in the USA. It can also be a very expensive city to live in. House prices differ from one city zone to another, and we want to discover why and which factors influence them.

The dataset contains the following variables: CRIM (per capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 sq.ft.), INDUS (proportion of non-retail business acres per town), CHAS (Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise), NOX (nitrogen oxides concentration, in parts per 10 million), RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built prior to 1940), DIS (weighted mean of distances to five Boston employment centers), RAD (index of accessibility to radial highways), TAX (full-value property-tax rate per \$10,000), PTRATIO (pupil-teacher ratio by town), B (from the formula 1000(Bk - 0.63)², where Bk is the proportion of African American people by town) and LSTAT (percentage of lower-status population). Finally, the target variable is called ‘target’ and is the median value of owner-occupied homes in \$1000s.

As our **target variable** is a **continuous variable**, we have a **multivariate regression problem**. We call it multivariate because we have more than one predictor variable.
## Solution Strategy


- Load the **Boston** dataset from our example database. You will find it in the "**Popular examples**" window.
- Once loaded, open the dataset and go to the ‘**Data**’ section. Then open the ‘**Review and transform data**’ page.
- Take a look at the data features (or variables). You will see each column name and its data type. There are some continuous variables (for example ‘CRIM’ or ‘INDUS’) and some categorical ones (for example ‘CHAS’, ‘RAD’ or ‘TAX’).
- The first thing to do is to remove the columns that are not needed for prediction, such as identification columns (‘ID’).
- Go to ‘**Exploratory analysis**’ in the left menu. We are going to check how the different features are correlated. When two features are highly correlated, we can remove one of them from our prediction, because both supply very similar information to the algorithm. We consider a correlation high when its coefficient is higher than 0.8. For example, RAD and TAX have a **correlation coefficient** of 0.91, so we can ignore one of them in the analysis.
- Then go to the ‘**General statistics**’ menu. There we will explore the different variables more deeply. You can check **statistical parameters** such as the **mean, standard deviation, or the data skewness** (whether the distribution curve is shifted to the left or to the right). When the skewness coefficient is higher than 0.5 (or lower than -0.5), we say that the distribution is strongly skewed. In this dataset, CRIM has a skewness of 5.22 (very large), so it needs a data transformation: we will replace this variable with its natural logarithm.
- **Data transformation** is done under the overview menu. Find the variable and click ‘More’. Then click ‘Data transformation’ and choose the transformation (in our case, a natural logarithm). We can either replace the current variable or create a new variable with the transformation. We will also apply the natural logarithm to ZN, DIS, PTRATIO, B, LSTAT and target.
- In the ‘**Missing values**’ menu you can explore how many cells are empty and where. In this dataset we are lucky: there are no missing values.
- Now go to the ‘**Data insights**’ menu to get more information from the dataset before creating any model. We are going to see how different combinations of features lead to higher or lower house prices. We can see, for example, that high house prices (target value from 35 to 50) represent only 10% of the dataset. If we add the variable INDUS with values from 0 to 7 (that is, areas with few non-retail businesses), the probability of finding an expensive house (between \$35,000 and \$50,000) goes up to 20%. This means that house prices are lower in industrial areas.
- Once we have reviewed our data, we can select a target. In the ‘**Target**’ page we select the column ‘target’, as it is the one we want to predict. Once it is selected, NextBrain automatically gives you more recommendations to prepare the data and detects the problem type (multivariate regression). The problem type is detected from the nature of the target variable: if it is a continuous variable, the problem is a multivariate regression.
- Let’s check the last recommendations on how to prepare the data before training a model. It will be as if an expert data scientist were guiding you. Our expert will say that the feature pairs CRIM ≈ RAD, CRIM ≈ TAX, NOX ≈ DIS and RAD ≈ TAX are highly correlated. We are going to ignore the variables CRIM, CHAS and DIS in our model.
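
The data-preparation checks described above (correlation above 0.8, skewness outside ±0.5, natural-logarithm transformation) can be sketched outside the tool as follows. This is only an illustration: the data is synthetic, the names `rad_like`, `tax_like` and `crim_like` are made-up stand-ins rather than the real Boston columns, and the skewness formula (third standardized moment) is an assumption, since the exact definition NextBrain uses is not stated here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def skewness(xs):
    """Skewness as the third standardized moment (assumed definition)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


random.seed(0)

# Synthetic stand-ins, NOT the real Boston data.
rad_like = [random.uniform(1, 24) for _ in range(500)]
tax_like = [20 * r + random.gauss(0, 30) for r in rad_like]      # closely tied to rad_like
crim_like = [random.lognormvariate(0, 1.5) for _ in range(500)]  # long right tail

# Thresholds taken from the text above.
corr = pearson(rad_like, tax_like)
drop_one_of_pair = corr > 0.8

skew_raw = skewness(crim_like)
needs_transform = skew_raw > 0.5 or skew_raw < -0.5

# Natural-logarithm transformation (only valid for strictly positive values).
crim_log = [math.log(x) for x in crim_like]
skew_log = skewness(crim_log)

print(f"correlation: {corr:.2f}, drop one of the pair: {drop_one_of_pair}")
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```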

- Once we have followed the recommendations, we can prepare our model. Back in the target section, click ‘**Go to prepare model**’ and then ‘**Create a new model**’. In the model section we still have some decisions to make. The first one is whether we want to use all features to predict the target. In the data preparation window we already removed the feature ‘ID’ because it was an identification feature. Now, when selecting the features we want to use for prediction (the predictors), we can omit CRIM, CHAS and DIS, as we said.
- The second decision is splitting the dataset into two parts: one for **training** the algorithm and the other for **testing** (measuring the accuracy of its predictions).
- We can also select the **algorithm** and the **accuracy parameter** it is going to use for its internal optimization. If you don’t select any, NextBrain will use an ensemble of algorithms to make the prediction.
- Lastly, we can select the **time** we want the algorithm to spend finding the solution. Normally we use short training times, except when we have very large datasets. Once we make this last decision, we can train the algorithm.
- When the algorithm is trained, we can check whether it has done well and what insights we can get from its work.
- Under the ‘**Evaluation**’ button we will find the accuracy of the prediction. We can see, for example, a list of the features that had the most influence on the prediction and those that had none at all. Our model has reached a good **accuracy** (86%). In ‘**Advanced metrics**’ we can see a plot comparing the original target values with the predicted values. All points are very close to the 45-degree line in the plot.
- As an algorithm doesn't have to be a ‘black box’, we can check what difficulties the algorithm found in making this prediction. We can find this under the ‘**Explainability**’ button.
- Inside explainability we will get more insights into our problem. For example, we will see a plot where all our data features are mapped onto two dimensions (**Dimensionality reduction**). The points are then colored according to the target variable (a color scale). The prediction problem can be pictured as separating these points according to their colors. If data points with different colors are mixed, the prediction will be difficult. If points of different colors are more or less concentrated in different areas, the algorithm will predict with high accuracy. In this case we can see that darker colors (dark blue) are more or less grouped, as are lighter colors (yellow), which means that a good prediction is possible.
- In explainability we will also see which features have been most important for the algorithm (**features influence**), and in which range. We can see that LSTAT and RM are the features the algorithm mainly used to make the prediction. LSTAT has to do with the status of the population and RM is the number of rooms. Both variables make sense: people with higher status buy more expensive houses, and houses are more expensive as the number of rooms increases. LSTAT influences the algorithm in the higher range and RM in the lower range. Specifically, what the algorithm has found is that people with higher status have more expensive houses, and that houses with fewer rooms are cheaper.
- In the ‘**Features similarity**’ menu we will see how similar the variables are to each other. We can see that ZN and B are different from the rest of the variables, and that INDUS and NOX are very similar (which makes a lot of sense, because the nitrogen oxides concentration in the air depends on the concentration of non-retail businesses, that is, industries).
- In the ‘**Evaluation**’ window we can also predict whether a certain combination of feature values will lead to higher or lower prices.
- Imagine we enter some data to make the **prediction** and the outcome is that the price is high (for example 45, which means 45,000 dollars), and you wonder what makes this house expensive. To understand it better, you can check the ‘**What if tool**’. We have entered some variable values and the algorithm predicts a high price, as we said (45). Among the values we have chosen, it stands out that for RM (number of rooms) we selected 9. It’s clear that with this number of rooms the house will be expensive. Anyway, let’s use the ‘What if tool’ to revert the outcome. To do so, we select a new outcome with values from 15 to 25 (between 15,000 and 25,000 dollars). The What if tool will tell us to reduce the number of rooms below 6 (we already knew that the number of rooms makes houses expensive). It also highlights that if we move to a neighborhood with old houses (variable AGE) we can reduce the price as well, so AGE is another important variable in determining house prices. Additionally, we can see that if we move the house to a neighborhood with lower-status population, the price will also decrease.
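
The train/test split and the evaluation against held-out data can be illustrated with a small sketch. It deliberately simplifies the setting: a single made-up predictor (`rooms`, loosely standing in for RM) is fitted with ordinary least squares on synthetic data, and the score is R², which is an assumed choice because the text does not say how the 86% accuracy is computed. NextBrain itself uses its own algorithms, or an ensemble of them.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


random.seed(1)

# Synthetic data, NOT the real Boston dataset: price grows with the number of rooms.
rooms = [random.uniform(4, 9) for _ in range(400)]
price = [8 * r - 28 + random.gauss(0, 3) for r in rooms]

# Shuffle, then hold out 20% of the rows for testing.
rows = list(zip(rooms, price))
random.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]

# Fit on the training rows only.
slope, intercept = fit_line([r for r, _ in train], [p for _, p in train])

# Score on rows the model has never seen.
predicted = [slope * r + intercept for r, _ in test]
actual = [p for _, p in test]
r2 = r_squared(actual, predicted)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test R^2={r2:.2f}")
```

When R² is close to 1, plotting `predicted` against `actual` gives points near the 45-degree line, which is the pattern described for the ‘Advanced metrics’ plot.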

- Go to ‘**Predict**’ on the sidebar (you have to close the NextBrain app and open only the sidebar), select the model (the model you have trained has a specific ID number) and click on it. The target feature (target) will automatically be replaced by the model predictions through a **formula**. This formula is =PREDICT, so if you change a data point, the sheet will be refreshed and the prediction will be updated.