Musings on Data Analytics and Machine Learning: Yet another house price prediction


The background

Ever since I started looking into Data Analytics back in 2012-2013, the first example I kept seeing was house price prediction. Andrew Ng's course arrived, Coursera became a hit, WEKA became popular with non-coders, and Matlab started selling ML kits; yet it still took me some time to grasp it.

Now that I have a feel for it, let me put it in the simplest way.
  1. There is no silver bullet in machine learning (we can try asking Royal Enfield to be innovative here)
  2. Start by understanding what the data looks like: exploratory data analysis
  3. No column should be wasted; each one has to be "feature engineered"
  4. Keep adding newly engineered features until you run out of options (and then copy from the smart guys)

The problem: King County data, Kaggle

"This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015". That is how Kaggle describes the dataset.
Plenty of exploration had already been done on what the existing columns offer, and it did not excite me much. What came in handy were the suggestions on feature engineering. Thinking from a buyer's perspective, the raw columns do not tell me whether I would buy that house. What I would look for are "age" and amenities.

There were three date columns (yr_built, yr_renovated and the date of sale). I converted these into two new columns, "age" and "renovated_age". With that done, the three original columns can be removed.
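The actual code is in my Kaggle post; what follows is only a minimal sketch of the idea, not that code. It assumes the sale date is a string like "20141013T000000", that yr_renovated is 0 for a house that was never renovated, and that "renovated_age" then falls back to the build year. The function name age_features is made up for illustration.

```python
from datetime import datetime

def age_features(row):
    """Derive "age" and "renovated_age" (in years) from one house record.

    Assumptions (not confirmed by the dataset description):
    - row["date"] looks like "20141013T000000"
    - row["yr_renovated"] == 0 means "never renovated"
    """
    sale_year = datetime.strptime(row["date"], "%Y%m%dT%H%M%S").year
    age = sale_year - row["yr_built"]
    # Years since the last major work: renovation if any, else construction.
    if row["yr_renovated"] > 0:
        last_work = row["yr_renovated"]
    else:
        last_work = row["yr_built"]
    return {"age": age, "renovated_age": sale_year - last_work}

never_renovated = age_features(
    {"date": "20141013T000000", "yr_built": 1955, "yr_renovated": 0}
)
renovated = age_features(
    {"date": "20150225T000000", "yr_built": 1951, "yr_renovated": 1991}
)
```

For a house that was never renovated both values come out equal, which is one possible convention; a separate "was renovated" flag would be another.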
Then comes the location. From my earlier work on GIS and OpenStreetMap data, I knew that amenities could be fetched easily. A raw list of amenities does not make a feature, though. So I converted it into count columns, such as the number of "schools" within 2 km of each house. With that done, zip, latitude and longitude can be removed.
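A rough sketch of how such a count column could be computed, assuming the amenity coordinates have already been fetched (for example from OpenStreetMap) as (lat, lon) pairs. The great-circle (haversine) distance and the helper names haversine_km and count_within are my choices for this illustration; the coordinates are made-up sample points.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def count_within(house, amenities, radius_km=2.0):
    """Count amenities (lat, lon) within radius_km of house (lat, lon)."""
    return sum(
        1 for lat, lon in amenities
        if haversine_km(house[0], house[1], lat, lon) <= radius_km
    )

house = (47.5112, -122.257)
schools = [
    (47.5112, -122.257),  # same spot as the house
    (47.5200, -122.257),  # roughly 1 km north
    (47.6000, -122.257),  # roughly 10 km north
]
schools_2km = count_within(house, schools)
```

This loops over every amenity for every house, which is fine for a few thousand points; for larger sets a spatial index would be the usual next step.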

After these two steps, when I looked at the results, even I felt "wow": way better than the original scores.
Details on the code and the feature-engineered data can be found in my Kaggle post.

With this, I was confident enough to enter another open Kaggle competition on house prices:

House Prices: Advanced Regression Techniques


And, as I said, there is "no silver bullet": this one did not come with lat-lon :(

A related post on ML without coding is on my LinkedIn.

Also, don't miss my other open source posts on LinkedIn:
Part 1 (FreeRTOS), Part 2 (Android prototyping), Part 3 (Mobile backend as a service)