In this post, we’ll be exploring the application of Dispersive Flies Optimisation, as originally pondered in my previous post. Specifically, we’ll discuss applying DFO to AirBnB data, as the AirBnB data is readily available with very little effort. I will be referring to the data provided for London; however, the data for the other cities should be structured the same way.
There are probably loads of ways we can apply DFO to search this information; I’m going to be looking for the best place to stay.
Each city has four different sets of information:
- A calendar that spans a period of a year.
- A complete set of listings.
- A list of neighbourhoods in the city.
- A complete set of reviews.
The complete set of listings contains the bulk of the information. It contains each property available, which user owns it, which borough it’s based in, the property’s address, some shoddy GPS information, prices, ratings, quantitative data about reviews, the available amenities, and more.
We’ll be disregarding the calendar, the neighbourhoods and the reviews files, as the information they contain is either redundant or qualitative.
Processing the Information
Because we’re going to be both navigating a multidimensional space and calculating the fitness of a listing, we need to define what data is going to represent what. Thus, if processing a piece of data is a min-max issue that isn’t directly affected by another value, it’s going to be a dimension (e.g. distance from desired location). If it’s a value weighed alongside another, it’s going to be fitness (e.g. I want internet AND cable TV, so obviously only having one means it’s less suitable).
Based on these definitions, we can say that the following could represent dimensions:
- Location, judged as distance from the desired place. Probably using the shoddy GPS data.
- How desirable a location is. After all, your conference might be in Croydon, but you probably don’t want to stay in Croydon.
- Price. Cheaper is better, duh.
- Public Transport connections. This would be awkward to represent numerically, since this information is qualitative.
And that these could determine a location’s fitness:
- The quantity of desired amenities.
- The quantity and outlook of reviews.
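To make the split between dimensions and fitness a bit more concrete, here’s a rough sketch of how a single listing might be turned into numbers. It only covers distance, price and amenities (reviews are left out), and the field names `latitude`, `longitude`, `price` and `amenities` are my own assumptions about how a listing has been parsed, not necessarily the column names in the actual files. Distance is calculated with the haversine formula, which is just one reasonable choice.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def dimensions(listing, target_lat, target_lon):
    """Position of a listing in the search space: (distance in km, price).

    The keys used here are assumed names for an already-parsed listing.
    """
    distance = haversine_km(listing["latitude"], listing["longitude"],
                            target_lat, target_lon)
    return (distance, listing["price"])

def amenity_fitness(listing, wanted):
    """Fraction of the wanted amenities that the listing offers (0.0 to 1.0)."""
    wanted = set(wanted)
    if not wanted:
        return 1.0  # nothing was asked for, so nothing is missing
    return len(wanted & set(listing["amenities"])) / len(wanted)
```

With this, a listing that offers internet but not cable TV scores 0.5 when both are wanted, which matches the "only having one means it’s less suitable" idea from earlier.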
First and foremost, however, we have to filter out absolute rejections. If we want a room, not a whole property, then we’re going to completely disregard any listings that are for a whole property.
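As a minimal sketch, that rejection step could be a simple filter that runs before anything else. The `room_type` key and the values `"Private room"` and `"Entire home/apt"` are assumptions on my part for illustration; check what the listings file actually calls them.

```python
def reject_unwanted(listings, wanted_room_type):
    """Keep only the listings whose room type matches what we want.

    Each listing is assumed to be a dict with a "room_type" key
    (a hypothetical name, used here for illustration).
    """
    return [l for l in listings if l.get("room_type") == wanted_room_type]

listings = [
    {"id": 1, "room_type": "Private room"},
    {"id": 2, "room_type": "Entire home/apt"},
    {"id": 3, "room_type": "Private room"},
]
rooms_only = reject_unwanted(listings, "Private room")
```

Doing this up front means the search never wastes time on a listing we would never book.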
We’d pass in numerical representations of all the information for both dimensions and fitness, and set the specification for how the algorithm will calculate fitness. Then – fingers crossed – it should give us a useful result.
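One thing worth considering when building those numerical representations is scale: a distance in kilometres and a price in pounds aren’t directly comparable. The sketch below uses min-max scaling to squash each dimension into the range 0 to 1. This is an assumption about how the data could be prepared, not something DFO itself requires.

```python
def min_max_scale(values):
    """Rescale a list of numbers so the smallest is 0.0 and the largest is 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so there is no spread to preserve.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [50, 100, 150]
scaled_prices = min_max_scale(prices)
```

The same function could be applied to the distances, so that neither dimension dominates purely because of its units.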
How are AirBnB allowed to share some of this data? Most of the listings include a complete postcode. That narrows the location to fewer than 100 houses, with the listed average being 15 per postcode. That’s an alarmingly small area, especially when combined with a photo of the poster, their name, and the likelihood that you can find them on a public directory service like 192.com. It adds up to a worrying lack of personal security.