How to use PyCaret 2.0 to Predict House Prices
- Ben Hart
- May 15
In this article we will set up and use PyCaret 2.0 to predict house prices using data from the House Prices: Advanced Regression Techniques Kaggle competition.

What is PyCaret?
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment. — PyCaret homepage
A huge number of changes were made in the version 2 release. For details, see the release notes: https://github.com/pycaret/pycaret/releases/tag/2.0.
You should also check out the article written by the library’s author, Moez Ali, in Towards Data Science.
Just a Note
I’m using Ubuntu on my machine, and as such some of the setup code presented below may not work on a Windows machine. None of the setup code is complex, and you should be familiar with alternatives if you have ever used Anaconda and pip before. I might include some Windows-specific instructions if there is enough demand.
Also, I’m using a 5-year-old laptop with an Intel i7 CPU and 16GB of memory. When running any models I tried to keep the processing time below 10 minutes. If your machine is better than mine then go ahead and smash it :-).
I’m also going to be writing this in a Jupyter notebook.
Now on with the show.
Installing and setting up PyCaret 2.0
First we need to set up PyCaret 2.0. I would recommend using Anaconda and setting up a new conda environment to avoid any conflicts with other libraries you may use. I’m using Python 3.7.6 and my environment is called “pycaret_example”.
conda create --name pycaret_example python=3.7.6
Once that has completed we need to activate the environment.
conda activate pycaret_example
Now we can go ahead and install PyCaret 2.0 using pip; this should take around 5 minutes. It will also download all of the dependencies needed to run PyCaret 2.0.
pip install pycaret==2.0
Once that’s done we should check the version to make sure we have what we need.
> pip show pycaret
Name: pycaret
Version: 2.0
Summary: PyCaret - An open source, low-code machine learning library in Python.
Home-page: https://github.com/pycaret/pycaret
Author: Moez Ali
Author-email: moez.ali@queensu.ca
License: MIT
Location: ....
Getting the data
We are going to use the dataset from the House Prices: Advanced Regression Techniques Kaggle competition. I’d recommend spending some time reading the Overview page to get an idea of what the data is, as well as the Evaluation tab to get some background on the task.
As a very brief summary though: we are trying to predict house prices using known attributes of each house. This is a classic regression problem, so we will need to build some kind of model capable of this task. As a side note, when working with non-data-science clients, this is often as detailed as the brief gets.
So let’s look at the data. I used the Kaggle API to download it, but you can also simply download the zip from the Data tab of the competition.
> conda install -c conda-forge kaggle
> kaggle competitions download -c house-prices-advanced-regression-techniques
You should now have a zip file that you can extract. You should have the following 4 files:
train.csv
test.csv
data_description.txt
sample_submission.csv
I moved train.csv and test.csv to a new directory called data; it helps keep everything structured.
data_description.txt includes an explanation of what each column means, as well as its possible values and their meanings, e.g.:
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
...
We also have the sample_submission.csv file, which shows us the format we need to submit to the competition in, e.g.:
Id,SalePrice
1461,169277.0524984
1462,187758.393988768
1463,183583.683569555
1464,179317.47751083
1465,150730.079976501
1466,177150.989247307
...
Both files are very useful and you should try to become somewhat familiar with them.
The actual data we will be using is already split into training and test sets. We will use the test set at the end to check our final model. Let’s start looking at train.csv and leave test.csv for, well, testing.
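As a quick sanity check before we start, you can load both files and compare their shapes (a minimal sketch, assuming the files were moved into data/ as described above). test.csv should have one fewer column, since it lacks the SalePrice target.
import pandas as pd

# load both files from the data/ directory
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print(train.shape)  # expect (1460, 81) - includes the SalePrice target
print(test.shape)   # expect (1459, 80) - no target column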
Setting up your PyCaret Environment
Let’s first import the regression module from PyCaret, as well as pandas to read in the data.
# import libraries
from pycaret.regression import *
import pandas as pd
Now let’s read in the data and take a look at the top 5 rows.
# read in the data
df = pd.read_csv('data/train.csv')
pd.options.display.max_columns = 500
df.head()
You should see 81 columns, including our target column SalePrice right at the end. Without doing any transformations, let’s initialize PyCaret.
# initialize the setup
exp_reg = setup(df, target='SalePrice')
This starts off the process of checking the data; I confirmed that the columns had been inferred as the right type (i.e. numeric or categorical) and pressed enter. We then get a summary of what PyCaret did. Since we haven’t asked for any transformations, those should all be False or None. You should see that we have missing values, 21 numeric features and 59 categorical features (by the way, columns and features are the same thing).
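If you want to verify that summary yourself, a couple of plain pandas checks will do. Note that the raw dtype counts won’t exactly match PyCaret’s inferred 21/59 split, since some numeric-coded columns (MSSubClass, for example) are really categorical.
# columns with the most missing values
print(df.isnull().sum().sort_values(ascending=False).head(10))

# rough split of numeric vs. non-numeric columns
print(df.select_dtypes(include='number').shape[1])
print(df.select_dtypes(exclude='number').shape[1])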
Comparing Models
Now let’s compare models by running
compare_models(sort='RMSE', blacklist=['tr'])
We are sorting by RMSE because that is how submissions to the competition are evaluated. I also blacklisted the TheilSen Regressor ('tr') as its estimated training time was around 10 hours.
The best model from the comparison was an Orthogonal Matching Pursuit model. You can read more about the model here. The output from above is:
| Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (sec) |
|---|---|---|---|---|---|---|---|
| Orthogonal Matching Pursuit | 16374.0281 | 563482649.7289 | 23474.3808 | 0.9027 | 0.1353 | 0.0988 | 0.0250 |
| Lasso Least Angle Regression | 16272.8592 | 598135396.7780 | 23993.1535 | 0.8962 | 0.1338 | 0.0972 | 0.0821 |
| CatBoost Regressor | 15222.4519 | 624393412.6718 | 24270.6871 | 0.8961 | 0.1244 | 0.0884 | 8.1233 |
We have an RMSE of 23474.3808, or $23,474.38. Not too bad for just a few lines of code. This is fine, but we should tune these models to potentially get better results.
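One nicety of the 2.0 release: compare_models() now returns the best model (and, per the 2.0 release notes, can return the top N via the new n_select parameter), so you can capture the winner directly rather than re-creating it by hand. A small sketch:
# capture the best model by RMSE directly from the comparison
best = compare_models(sort='RMSE', blacklist=['tr'])

# or grab the top 3 in one go
top3 = compare_models(sort='RMSE', blacklist=['tr'], n_select=3)
Below I’ll stick with create_model() to mirror the original flow.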
Tuning your Models
Let’s take the top 3 models and work with those for the remainder of this article. This means we need to create models for them, which we can do using create_model() as follows.
# create the top 3 models
OMP = create_model('omp', verbose=False)
LLAR = create_model('llar', verbose=False)
CBR = create_model('catboost', verbose=False)
I’ve suppressed the output as it will be the same as when we ran compare_models().
Now we can tune the models using tune_model().
tuned_OMP = tune_model(OMP, optimize='RMSE')
tuned_LLAR = tune_model(LLAR, optimize='RMSE')
tuned_CBR = tune_model(CBR, optimize='RMSE')
I ran each of the above separately so I could see the resulting RMSE in the output. When tuned, the best performing model of the 3 was the Lasso Least Angle Regression model with an RMSE of 23000.3172.
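If you’re curious what the tuner actually changed, the object returned by tune_model() is a regular scikit-learn-style estimator, so you can inspect its hyperparameters directly:
# print the tuned estimator and its full parameter dict
print(tuned_LLAR)
print(tuned_LLAR.get_params())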
Plotting your Models
PyCaret also comes with some really useful plotting tools that can be used to diagnose and check the performance of your models. Plotting the residuals is easy:
plot_model(tuned_LLAR)
We can also see the prediction error
plot_model(tuned_LLAR, plot='error')
And the feature importance
plot_model(tuned_LLAR, plot='feature')
We can also create other plots that might be better suited to your needs (there is a small sketch after this list showing how to loop over several of them):
"residuals" Residuals Plot
"error" Prediction Error Plot
"cooks" Cook’s Distance Plot
"rfe" Recursive Feat. Selection
"learning" Learning Curve
"vc" Validation Curve
"manifold" Manifold Learning
"feature" Feature Importance
"parameter" Model Hyperparameter
Testing the Models
Before we finalize the model we need to test its performance against a test set of data the model has not seen before. By default, the setup step sets aside 30% of the data for this task. Let’s look at our top three models again, tested against this holdout set.
predict_model(tuned_OMP)
predict_model(tuned_LLAR)
predict_model(tuned_CBR)
Just like in life, things don’t always work as well in the real world. Our best performing model is now the CatBoost Regressor, with an RMSE of 39805.99. The RMSE has increased a lot from the mean cross-validation RMSE seen in training, suggesting all three models are overfitting. This is where you have to put on your data scientist cap and go back and figure out why. For now I’m just going to push forward.
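To double-check that holdout RMSE yourself, note that predict_model() returns the holdout frame with the predictions in a Label column (the same column we use for the final submission later). A small sketch, assuming (as it did for me) the returned frame keeps the true SalePrice column:
import numpy as np
from sklearn.metrics import mean_squared_error

# re-run the holdout prediction, this time capturing the returned frame
holdout = predict_model(tuned_CBR)

# recompute RMSE from the true target and the 'Label' predictions
rmse = np.sqrt(mean_squared_error(holdout['SalePrice'], holdout['Label']))
print(f'Holdout RMSE: {rmse:.2f}')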
Finalizing the Model for Deployment
Now we are at the point where we have decided on a model, CatBoost, and we want to get it ready for use in the real world. The finalize_model() function retrains your earlier tuned model, but this time on the entire dataset.
final_CBR = finalize_model(tuned_CBR)
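It’s also worth persisting the finalized model at this point. PyCaret ships save_model() and load_model() helpers that pickle the whole preprocessing pipeline along with the model; a quick sketch (the file name here is my own choice):
# persist the finalized pipeline + model to disk as a .pkl
save_model(final_CBR, 'final_catboost_model')

# reload it later, e.g. in a production script
loaded_CBR = load_model('final_catboost_model')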
Check Final Model on test.csv
Let’s read in the test set and check our model’s performance against it. We will need to create predictions so we can upload them to the Kaggle competition and check there.
df_test = pd.read_csv('data/test.csv')
test_predictions = predict_model(final_CBR, data=df_test)
submission = test_predictions[['Id', 'Label']]
submission.columns = ['Id', 'SalePrice']
submission.to_csv('submission.csv', index=False)
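Before uploading, it doesn’t hurt to confirm the file matches the expected format (assuming sample_submission.csv is still in the working directory, since we only moved the train and test files into data/):
# sanity-check our submission against the provided sample format
sample = pd.read_csv('sample_submission.csv')
ours = pd.read_csv('submission.csv')

assert list(ours.columns) == list(sample.columns)  # ['Id', 'SalePrice']
assert len(ours) == len(sample)                    # one prediction per test Id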
Jump to the Leaderboard tab on the competition and select Submit Predictions. Once submitted you will get a score. My submission scored 0.14249. Hardly a competition winner, but it placed me at 2661 out of 5117 submissions using basically no data science skills, with a few lines of code you could write in a matter of minutes.
Summary
PyCaret is a great tool, and version 2 builds on an already strong AutoML foundation.
The author, Moez Ali, should be congratulated for this awesome tool.
Since learning of PyCaret I have included it in my tool belt as it allows me to rapidly build ML solutions.
I didn’t touch on everything that PyCaret can do in this tutorial; there is a huge amount of additional functionality that could take this submission to the next level. I’d encourage you to check out the website and read through the examples and docs there.
If you’d like to have a chat about data science, or have a project you’d like to involve me in, you can contact me here.


