Getting the data
I used Python and the Beautiful Soup package to gather some high-level details about the cars available in Ireland. The details that the script pulled were limited to:
- Seller Type
Selecting the data
I reduced the list to cars 20 years old or newer, which left 12,990 cars on offer.
The above is produced in R with: hist(irl_cars$Year)
I focused on a range of makes and models that excluded high-end vehicles. I acknowledge that this was partly to reduce the skewness of the distribution, but also because this analysis focuses on cars one might see regularly.
counts <- table(modCars$Maker)
# table() sorts the makes, so take the labels from it to keep them aligned with the bars
lablist <- names(counts)
plot(counts, xaxt="n", main="Qty of Cars by Make")
text(seq_along(counts), par("usr")[3], labels=lablist, srt=90, pos=2, xpd=TRUE)
Looking at the mileage, there appear to be a number of cars above 500,000 miles.
There is a very clear separation (which I have highlighted with a black line): the bulk of observations lie below the 400,000 mile mark. There is still a great deal of dispersion above this line, with some values looking high but not necessarily unreasonable. At this point some common sense might help. The maximum mileage is 2,500,002 miles for an 11-year-old car. This implies the car did about 622 miles a day, every day, for 11 years: 60 miles an hour for 10 hours a day? That seems very unlikely, so it's probably safe to discard it as a typo.
What about the ones at 1,000,000 miles and over? The oldest of those is 13 years old, so the same logic implies at least 210 miles a day, every day, for 13 years, which seems equally unlikely.
If the above seems too much like intuition, we can apply something a little more reasoned, and I can try out two different methods of outlier detection to boot:
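The sanity check above is just total mileage divided by the number of days the car has been on the road. Here is a minimal sketch in Python (the language used for the scraping step), assuming 365-day years; the function name is purely illustrative and not part of the original script.

```python
def implied_miles_per_day(mileage, age_years, days_per_year=365):
    """Average daily distance needed to reach `mileage` in `age_years`."""
    return mileage / (age_years * days_per_year)


# Figures quoted in the text above.
max_case = implied_miles_per_day(2_500_002, 11)
million_case = implied_miles_per_day(1_000_000, 13)

print(f"{max_case:.1f} miles/day")      # about 622.7
print(f"{million_case:.1f} miles/day")  # about 210.7
```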
- Mean and standard deviation
- Median and median absolute deviation
Method 1 is more common, but method 2 is better when there are large outliers.
In both cases I am using a k-factor of 3 as the threshold to detect outliers.
1) mean(modCars$Mileage) = 75841.97; sd(modCars$Mileage) = 53810.45
75841.97 + (3*53810.45) = 237,273.32 miles
2) median(modCars$Mileage) = 74936; mad(modCars$Mileage) = 43471.31
74936 + (3*43471.31) = 205,349.93 miles
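For anyone who wants to reproduce the two cutoffs outside R, here is a rough Python sketch of both rules. It uses only the standard library. The function names and the toy mileages are invented for illustration and are not the scraped data. One detail worth knowing: R's mad() multiplies the raw median absolute deviation by 1.4826 by default, so the sketch does the same.

```python
import statistics

# R's mad() scales the raw median absolute deviation by this constant so that
# it is comparable to a standard deviation for normally distributed data.
MAD_SCALE = 1.4826


def mean_sd_threshold(values, k=3):
    """Upper cutoff: mean + k * sample standard deviation."""
    values = list(values)
    return statistics.mean(values) + k * statistics.stdev(values)


def median_mad_threshold(values, k=3):
    """Upper cutoff: median + k * scaled median absolute deviation."""
    values = list(values)
    med = statistics.median(values)
    mad = MAD_SCALE * statistics.median([abs(v - med) for v in values])
    return med + k * mad


# Toy mileages (invented), with one extreme value at the end.
mileages = [20_000, 45_000, 60_000, 75_000, 90_000, 110_000, 130_000, 2_500_000]

for name, rule in [("mean/sd", mean_sd_threshold), ("median/MAD", median_mad_threshold)]:
    cutoff = rule(mileages)
    flagged = [m for m in mileages if m > cutoff]
    print(f"{name}: cutoff {cutoff:,.0f}, flagged {flagged}")
```

On this toy list the extreme value inflates the mean and standard deviation so much that the mean/sd cutoff lands above it and nothing is flagged, while the median/MAD cutoff still catches it. That is the sense in which method 2 copes better with large outliers.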
Method 1 exposes 32 cars and method 2 exposes 78 cars; my "intuition" exposed 13 cars, though it was based on a simple visual inspection. I'll create three data sets for the follow-up posts.
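As a rough idea of how the three data sets could be built, the Python sketch below keeps only the cars at or below each cutoff. Everything in it is an assumption for illustration: the record layout (a list of dicts with a "Mileage" key), the helper name, the toy records, and the rounded cutoff values.

```python
def drop_above(cars, cutoff):
    """Return the cars whose mileage is at or below `cutoff`."""
    return [car for car in cars if car["Mileage"] <= cutoff]


# Toy records (invented); the real data has one row per advertised car.
cars = [
    {"Maker": "Make A", "Mileage": 60_000},
    {"Maker": "Make B", "Mileage": 210_000},
    {"Maker": "Make C", "Mileage": 230_000},
    {"Maker": "Make D", "Mileage": 300_000},
    {"Maker": "Make E", "Mileage": 450_000},
    {"Maker": "Make F", "Mileage": 2_500_002},
]

# Rounded, illustrative cutoffs: one per outlier rule.
cutoffs = {"visual": 400_000, "mean_sd": 237_000, "median_mad": 205_000}

# One filtered data set per rule, keyed by the rule's name.
datasets = {name: drop_above(cars, cutoff) for name, cutoff in cutoffs.items()}

for name, subset in datasets.items():
    print(f"{name}: kept {len(subset)} of {len(cars)} cars")
```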