Friday, October 17, 2014

Analyzing a cross section of the Irish secondhand car market - Part 1

I've been planning to learn a bit more about R, and I'm also in the market for a new car, so what better time to play around with the data and look at what's available in the Irish car market.

Getting the data
I used Python and the Beautiful Soup package to gather some high-level details about the cars available in Ireland. The details that the script pulled were limited to:

  • Make
  • Model
  • Engine
  • Seller Type
  • County
  • Mileage
  • Year
  • Price
Selecting the data
I reduced the list to cars 20 years old and newer. This produced a list of 12,990 cars on offer.
The histogram of cars by year is produced in R with: hist(irl_cars$Year)
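A minimal sketch of how the year filter might look, using a tiny made-up data frame in place of the scraped listings. The name irl_cars comes from the hist() call; all_cars, the sample rows and the 1994 cutoff (2014 minus 20) are assumptions for illustration only.

```r
# Hypothetical stand-in for the scraped listings (the real data is far larger)
all_cars <- data.frame(
  Maker = c("Ford", "Toyota", "Opel", "Volvo"),
  Year  = c(1989, 1994, 2006, 2013),
  stringsAsFactors = FALSE
)

# Assumed cutoff: 20 years back from 2014
cutoff_year <- 2014 - 20

# Keep cars 20 years old and newer
irl_cars <- all_cars[all_cars$Year >= cutoff_year, ]

# Histogram of listings by year
hist(irl_cars$Year, main = "Cars by Year", xlab = "Year")
```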


I focused on a range of makes and models which excluded high-end vehicles. I acknowledge that this was partly to reduce the skewness of the distribution, but it also keeps the analysis focused on cars one might see regularly.

  • BMW
  • Fiat
  • Ford
  • Honda
  • Hyundai
  • Mercedes-Benz
  • Nissan
  • Opel
  • Peugeot
  • Renault
  • Seat
  • Skoda
  • Toyota
  • Volkswagen
  • Volvo
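The subset could be built along these lines. modCars and the Maker column are the names used in the plotting code; the sample rows in irl_cars are invented, so treat this as a sketch and not the exact script I ran.

```r
# Makes kept for the analysis (from the list above)
makes <- c("BMW", "Fiat", "Ford", "Honda", "Hyundai", "Mercedes-Benz",
           "Nissan", "Opel", "Peugeot", "Renault", "Seat", "Skoda",
           "Toyota", "Volkswagen", "Volvo")

# Hypothetical stand-in for the full listings
irl_cars <- data.frame(
  Maker = c("Ford", "Ferrari", "Skoda", "Porsche", "Ford"),
  Price = c(4500, 95000, 7200, 60000, 3100),
  stringsAsFactors = FALSE
)

# Keep only rows whose make is in the list
modCars <- irl_cars[irl_cars$Maker %in% makes, ]

# Count of cars per make
table(modCars$Maker)
```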


counts <- table(modCars$Maker)
# Take the labels from the table itself so they match the bar order
lablist <- names(counts)
plot(counts, xaxt="n", main="Qty of Cars by Make")
text(seq_along(lablist), par("usr")[3], labels=lablist, srt=90, pos=2, xpd=TRUE)

Mileage
Looking at the mileage, there appear to be a number of cars above 500,000 miles.
While this is not impossible, it seems unlikely that cars are doing so many miles, so I plotted mileage by year for any cars above 100k:

plot(limCar$Year, limCar$Mileage, main="Mileage plot by Year", xlab="Year", ylab="Miles")
abline(h=400000)
There is a very clear separation (which I have highlighted with a black line) where the bulk of observations lie below the 400,000 mile mark. There is still a great deal of dispersion above this line, with some values looking high but not necessarily unreasonable. At this point some common sense might help. The maximum mileage is 2,500,002 miles for an 11 year old car. This implies that the car did about 622 miles a day, every day, for 11 years: 60 miles an hour for 10 hours a day? That seems very unlikely, so it's probably safe enough to discard this as a typo.
What about the ones of 1,000,000 miles and over? The oldest of those is 13 years old, so applying the same logic this would imply about 210 miles a day, every day, for 13 years, which equally seems unlikely.
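The sanity check is just mileage divided by days on the road. A quick sketch, where miles_per_day is a helper name made up for this example and a 365-day year is assumed:

```r
# Average miles per day implied by an odometer reading and the car's age
miles_per_day <- function(mileage, age_years) {
  mileage / (age_years * 365)
}

miles_per_day(2500002, 11)  # the 11 year old car with the maximum mileage
miles_per_day(1000000, 13)  # a 13 year old car at the 1,000,000 mile mark
```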

If the above seems too much like intuition, we can apply something a little more reasoned, and I can try out 2 different methods of outlier detection to boot:

  1. Mean and standard deviation
  2. Median and median absolute deviation
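Method 1 is more common, but method 2 is more robust when there are large outliers, because the median and the median absolute deviation are barely moved by a few extreme values.
In both cases I am using a k-factor of 3 as the threshold to detect outliers.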
1) mean(modCars$Mileage) = 75481.97; sd(modCars$Mileage) = 53810.45
75481.97 + (3*53810.45) = 236,913.3 miles
2) median(modCars$Mileage) = 74936; mad(modCars$Mileage) = 43471.31
74936 + (3*43471.31) = 205,349.9 miles
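Both thresholds follow the same pattern, centre plus k times spread, so they fit in one small function. This is a sketch on invented mileages, not the real data; upper_threshold and the sample vector are names and values made up for the example.

```r
# Upper outlier threshold: centre + k * spread
upper_threshold <- function(x, k = 3, robust = FALSE) {
  if (robust) {
    median(x) + k * mad(x)   # mad() is scaled by 1.4826 by default
  } else {
    mean(x) + k * sd(x)
  }
}

# Hypothetical mileages with one typo-sized value
mileage <- c(42000, 61000, 75000, 88000, 93000, 110000, 2500002)

t_mean <- upper_threshold(mileage)                 # method 1
t_mad  <- upper_threshold(mileage, robust = TRUE)  # method 2

mileage[mileage > t_mean]
mileage[mileage > t_mad]
```

On this toy sample the single huge reading inflates the mean and standard deviation so much that method 1 flags nothing, while method 2 still catches it, which is the reason for preferring the median-based version when outliers are large.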


Method 1 exposes 32 cars and method 2 exposes 78 cars; my "intuition" exposed 13 cars, though it was based on a simple visual inspection. I'll create 3 data sets for the follow-up posts.




Sunday, August 3, 2014

The Birthday paradox at a wedding

I know I've been very quiet of late, but I started a new job 4 months ago and have been pouring myself into it to learn as much as I can. More importantly, I've been preparing for my wedding to the lovely Lelly Ann, though to be honest she has been doing most of the tough work so she deserves all the glory!

The big day is only 5 days away now. I had great hopes of writing a cool optimizer that would seat guests based on their affinity but unfortunately laziness and an over-estimation of my programming skills got in the way. However I have been thinking about the birthday paradox recently and since there will be over 100 people in the room next Friday I thought it was a nice anecdote.

The birthday paradox arises from the chances of two or more people in a group having the same birthday. Given that there are 365 days in a year (ignoring leap years for simplicity's sake), you would think that the chances that any 2 people might have the same birthday would be extremely low; however, this is not the case.
Wikipedia has a great page on it so I won't reproduce their excellent explanation but it turns out that at 23 people the odds tip over 50%, which is better than a coin toss. In terms of our day we are due to have 106 guests.
The formula is:
1 - (Permutation(365,n)/(365^n)), where n is the number of people involved.
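Since the rest of this blog has been in R lately, here is a small sketch of the formula. p_shared is a name made up for the example; pbirthday is the built-in from R's stats package and is only used as a cross-check.

```r
# Probability that at least two of n people share a birthday (365-day year)
p_shared <- function(n) {
  # Permutation(365, n) / 365^n written as a running product,
  # which avoids the huge intermediate numbers
  1 - prod((365 - seq_len(n) + 1) / 365)
}

p_shared(23)   # just over 0.5
p_shared(106)

# Cross-check against the built-in from the stats package
all.equal(p_shared(23), pbirthday(23))
```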

So 106 guests works out at 99.99999575%, or as close to 100% as makes no difference.
We also have tables of 8, 10 & 11, which work out at 7.43%, 11.69% & 14.11% respectively. I wonder would it be worth sampling each table to see how many times this actually comes through. With 10 tables we should probably see this at least once!
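To put a rough number on "at least once", here is a sketch in R. The seating plan is an assumption: one table of 8, one of 10 and eight of 11 is simply one mix that adds up to 106 guests across 10 tables, and tables are treated as independent. pbirthday comes from R's stats package.

```r
# Hypothetical seating plan: 10 tables adding up to 106 guests
table_sizes <- c(8, 10, rep(11, 8))

# Chance of a shared birthday within each table (365-day year)
p_table <- sapply(table_sizes, pbirthday)

# Chance that at least one table has a shared birthday,
# treating tables as independent
p_any <- 1 - prod(1 - p_table)

# Expected number of tables with a shared birthday
expected_tables <- sum(p_table)

round(p_table, 4)
p_any
expected_tables
```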