Recently, like many fellow young graduates with a stable job, I’ve been thinking about moving to a new flat, maybe something affordable near my workplace. But, like many of my fellow young workers, I’m far from being super wealthy, so I have to watch out and take every dime into account.

Recently a lot of people in Spain have raised their voices against the savage real estate market, claiming that prices for renting or buying a flat are skyrocketing and thus that people, especially young workers, face tremendous difficulties when it comes to leaving their parents’ house and starting a life.

Those words made a lot of sense to me, but I always prefer to trust facts and data over plain opinions, so I decided to perform a small study just to check how expensive it is to live in my area, based on some data.

## THE DATA SOURCE

habitaclia.com is one of the most popular online sites in Spain for finding flats, apartments or houses to buy or rent. A lot of my friends use it, and after browsing the page one can see that there are a lot of different announcements. The user can filter offers by province, so I decided to look inside “Girona”.

I spent some time browsing through the offers. One can quickly see that there are always about 850-1000 renting offers divided by zones, with “Girona city” being the zone with the most offers, as it is the capital of the province.

After that initial check I came to the conclusion that a manual analysis would be extremely time-consuming, close to impossible. So I decided to write a small bot in Python using Scrapy to automatically extract all the offers in each zone and insert them into a MongoDB collection.

## GETTING THE DATA

Scrapy is a simple but powerful framework used to automatically perform requests on websites, parse their HTML responses and extract relevant information with selectors (XPath or CSS expressions, somewhat reminiscent of regexes). Based on the information extracted from a response, Scrapy can perform various tasks, such as storing that information or automatically issuing new requests derived from it.

In my case, Scrapy was used to automatically browse and store every offer on a page and then move on to the next page, repeating the whole process until no more offers were found.

Basically, first of all you have to create a bot (a spider) with Scrapy and edit its main code to make it start from some base URLs.

In my case it went like this:

```
class HabitacliadataSpider(scrapy.Spider):
    name = 'habitacliadata'
    allowed_domains = ['www.habitaclia.com']
    start_urls = ['https://www.habitaclia.com/alquiler-en-alt_emporda.htm',
                  'https://www.habitaclia.com/alquiler-en-baix_emporda.htm',
                  'https://www.habitaclia.com/alquiler-en-cerdanya.htm',
                  'https://www.habitaclia.com/alquiler-en-cerdanya_francesa.htm',
                  'https://www.habitaclia.com/alquiler-en-garrotxa.htm',
                  'https://www.habitaclia.com/alquiler-en-girones.htm',
                  'https://www.habitaclia.com/alquiler-en-pla_de_l_estany.htm',
                  'https://www.habitaclia.com/alquiler-en-ripolles.htm',
                  'https://www.habitaclia.com/alquiler-en-selva.htm']
```

Then, as my interest was to store all the relevant information in MongoDB for further processing, I initialized the Python client.

```
from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # local MongoDB instance
db = client['habitaclia2']
```

After that it was time to tell Scrapy what exactly to look for. Based on the start URLs, Scrapy would start looking for offers on a page that looks like this:

A simple page that includes a batch of offers and ends with something like this:

So, as we can see, we can browse the zone page by page until we reach a page that has no “siguiente” (next) link.

An offer page looks like this:

An offer has a title, maybe some photos, and details such as the price, the surface in m², the number of rooms, the number of toilets, the renting company and a description.

## STORING THE DATA

At that point I decided to retrieve the following values:

- **Offer title**: string
- **Renting company**: string
- **Zone (county)**: string
- **Zone (city)**: string
- **Price**: double
- **Surface (m^2)**: double
- **Number of rooms**: int
- **Number of toilets**: int
- **Price/meter**: double
- **Description (characteristics)**: string
- **Description (plain text)**: string
- **Date of posting**: date
- **Number of photos provided**: int
- **Language of the offer** (based on a Python detection library): string
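The derived “Price/meter” field is just the price divided by the surface. A minimal sketch of such a helper (the name is hypothetical; it guards against offers whose missing surface was stored as 0):

```python
def price_per_meter(precio, superficie):
    """Derived field: EUR per square meter; 0.0 when the surface is missing."""
    if superficie <= 0:
        return 0.0  # offers without a surface value are stored as 0
    return round(precio / superficie, 2)
```

For example, a 700 EUR flat of 90 m² comes out at 7.78 EUR/m².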

From that point on it is simple: we have to jump to every offer on the page, scan it, and then jump to the “siguiente” page.

With Scrapy we have to identify the HTML elements that link to each offer and to the next page. We can simply do this by looking at the source code with the browser’s HTML inspector:

Note that we can do the same with an offer page for extracting those values mentioned before.

The following code does the trick for getting offers and the next page:

```
items = '//h3[@class="list-item-title"]/a/@href'
items_contenido = response.xpath(items).extract()
comarca = '//span[@class="txt-geo"]/text()'
comarca_contenido = response.xpath(comarca).extract()[0].strip()
for item in items_contenido:
    yield scrapy.Request(url=item, callback=self.inspect_item,
                         meta={'comarca': comarca_contenido})
    time.sleep(6)  # be gentle with the server
# Move on to the next page of the listing, if there is one
siguiente = '//li[@class="next"]/a/@href'
siguiente_contenido = response.xpath(siguiente).extract()
if siguiente_contenido:
    yield scrapy.Request(url=siguiente_contenido[0])
```

I decided to use XPath to parse the HTML and get the content of the href attributes pointing to each offer and to the next page. The code is simple: we land on a page, gather all the offers, then in parallel visit each of them with one new request per offer, and finally move to the next page to repeat the process. Note that yield together with scrapy.Request is what issues the new GET requests.

So our bot extracts the data from every offer with code like this:

```
# SUPERFICIE (surface, m^2)
try:
    superficie = int(caracteristicas_contenido[0])
except (IndexError, ValueError):
    superficie = 0  # not provided in the offer
# HABITACIONES (rooms)
try:
    habitaciones = int(caracteristicas_contenido[1])
except (IndexError, ValueError):
    habitaciones = 0
```

While doing that, performing the correct string-to-int and string-to-double conversions was important, as an improper conversion can easily translate into a long and tedious data-cleaning process afterwards.
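A defensive parser along these lines (an illustrative sketch, not the bot’s exact code) keeps malformed strings from polluting the collection:

```python
import re

def to_int(value, default=0):
    """Pull the first integer out of strings like ' 90 m2 ' or '1.200 EUR'.

    Thousands separators ('.' or ',') are stripped before matching, and any
    malformed or missing value falls back to a default instead of crashing.
    """
    match = re.search(r'\d+', str(value).replace('.', '').replace(',', ''))
    return int(match.group()) if match else default
```

With a helper like this, a missing field simply becomes a 0 we can filter out later, instead of a string that breaks the analysis.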

Then the insertion into MongoDB goes like this:

```
datosinmueble = {
    'titulo': titulo_contenido, 'inmobiliaria': inmobiliaria_contenido,
    'zona': zona_contenido, 'precio': precio_contenido, 'superficie': superficie,
    'habitaciones': habitaciones, 'lavabos': lavabos, 'preciometro': preciometro,
    'caracteristicas': detalles_contenido, 'numfotos': fotos_contenido,
    'fecha_edicion': fecha_contenido, 'idioma': idioma,
    'descripcion': descripcion_contenido2, 'poblacion': comarca_contenido,
    'comarca': comar,
}
coleccion = self.db.alquiler
resultado = coleccion.insert_one(datosinmueble)
```

We simply create a dict (a JSON-like document) and use insert_one to feed our Mongo system with fresh data. Other options exist, such as gathering all the offers into a list and doing a single insert_many instead of one insert_one per offer, but I liked it better my way.
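For the record, the batched variant could look something like this (a sketch assuming pymongo and a local MongoDB, as in the post; the helper names are hypothetical):

```python
def build_offer_doc(offer):
    """Shape one scraped offer (a dict of already-converted fields) for Mongo."""
    return {'titulo': offer['titulo'], 'precio': offer['precio'],
            'superficie': offer['superficie'], 'habitaciones': offer['habitaciones']}

def insert_offers(offers, db_name='habitaclia2'):
    """One insert_many round trip instead of one insert_one per offer."""
    from pymongo import MongoClient  # imported lazily; only needed when inserting
    coleccion = MongoClient('localhost', 27017)[db_name].alquiler
    return coleccion.insert_many([build_offer_doc(o) for o in offers])
```

insert_many sends all the documents in a single round trip, which matters more and more as the number of offers grows.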

So after writing and running the bot (the code can be found here) I ended up with about 870 different offers:

## EXPLORING THE DATA

Now that we have the data in store, we would like to explore it a little. I used RStudio for my analysis, with libraries such as mongolite for data retrieval, plyr and dplyr for data manipulation, and ggplot2 for (nicer) basic plotting.

```
library(mongolite)
library(plyr)
library(dplyr)
library(ggplot2)
```

After loading the libraries I used the mongolite library to retrieve the data from my local MongoDB system.

```
con <- mongo("alquiler", url = "mongodb://127.0.0.1:27017/habitaclia2")
mydata <- con$find('{}')
```

I decided to perform a {} search to retrieve all the documents and then filter the dataset in memory according to each specific need, as I don’t have a lot of values.

```
> con$count()
[1] 872
>
```

Now that we have access to the flat rental data, we may want to describe our dataset a little. We can try the summary function on our key features, especially price and surface:

```
> summary(mydata$precio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
250.0 600.0 772.5 1289.0 1100.0 30000.0
> summary(mydata$superficie)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 70.0 90.0 118.9 124.0 938.0
> summary(mydata$numfotos)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 11.00 16.00 17.37 22.00 75.00
> summary(mydata$habitaciones)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 2.876 4.000 13.000
> summary(mydata$lavabos)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 2.00 1.83 2.00 14.00
>
```

So here we can see flats from 250 EUR/month to 30000 EUR/month, with a median of 772.5 and a mean of 1289. At first sight, based on the summary of the prices, those prices are pretty diverse, so our study will surely be affected by outliers. Regarding the surface of those flats, we find values from ~70 to 938 m²; a minimum of 0.0 is listed there, but that is surely an artifact of offers that do not include the surface value, so we may need to do some more cleaning later on for more precise analysis.

Regarding the number of photos, toilets and rooms there are no particular surprises: flats tend to have 3 rooms and 2 toilets, and people tend to post between 11 and 22 photos with their offers.

Let’s now check if we can see some correlation between our most important variables, being those: the price, the surface, and the rooms and toilets of the flat.

```
pairs(~precio + superficie + habitaciones + lavabos, data = mydata,
      main = "correlation matrix")  # scatterplot matrix of the key variables
```

One might think there must be a clear correlation between the price of a flat and its surface (the bigger the property, the more expensive), but according to what we see there, it is not quite like that. Of course we can confirm some obvious facts, such as: the more surface, the more rooms (toilets included).

To be more sure about what we see in those graphs, we can run a correlation test on the price and surface variables (note that cor.test only applies the first method in the vector, Pearson in this case):

```
> cor.test(data.matrix(mydata["superficie"]), data.matrix(mydata["precio"]),
method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: data.matrix(mydata["superficie"]) and data.matrix(mydata["precio"])
t = 7.0032, df = 870, p-value = 5.005e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1671833 0.2929060
sample estimates:
cor
0.2310088
>
```
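For reference, the Pearson coefficient reported above is defined as:

```latex
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```

where the $x_i$ are the surface values and the $y_i$ the prices; a value of about 0.23 means only a mild positive linear relation.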

As we can see, there is some correlation between the price and the surface of the flat; it is positive (larger flats tend to cost more) but also weak (r ≈ 0.23), so we can conclude that a correlation between those two variables may exist, but it is not very strong in the data we have.

We can try to explain that by considering that the price may not be directly driven by the surface, but by other factors such as the “degree of luxury” of the property, and maybe also its location.

As the huge difference between the cheapest flat and the most expensive one is a little suspicious, and the correlations we obtained were not that strong, let us inspect the distribution of the price and surface values in more detail:

```
> mean(data.matrix(mydata["precio"]))
[1] 1288.963
> sd(data.matrix(mydata["precio"]))
[1] 2075.737
>
```

```
> shapiro.test(data.matrix(mydata["precio"]))
Shapiro-Wilk normality test
data: data.matrix(mydata["precio"])
W = 0.35936, p-value < 2.2e-16
>
```

Regarding the price, we can observe that the standard deviation of the sample is almost twice the mean; that, together with the tiny p-value obtained in the Shapiro-Wilk test, tells us there must be a substantial number of outliers messing up our analysis and breaking the normality of our data.

```
> mean(data.matrix(mydata["superficie"]))
[1] 118.875
> sd(data.matrix(mydata["superficie"]))
[1] 99.73935
>
```

```
> shapiro.test(data.matrix(mydata["superficie"]))
Shapiro-Wilk normality test
data: data.matrix(mydata["superficie"])
W = 0.64201, p-value < 2.2e-16
>
```

Regarding the surface, the data may not look as extreme at first sight, but we are facing practically the same situation. To make it clearer, we can plot a couple of density graphs:

As we can see, our data does not exactly follow a normal distribution (mostly due to outliers), as almost all of the values are concentrated on the left. Since, as we said, our means are affected by extreme values, if we want a better understanding of the price of a standard flat in Girona we should use something like the median or the mode to get a realistic value. Let’s extract the most common values of every relevant variable:

```
> sort(table(mydata["precio"]),decreasing=TRUE)[1:5]
700 750 650 850 600
52 42 40 35 34
> sort(table(mydata["superficie"]),decreasing=TRUE)[1:5]
90 100 80 70 60
49 40 38 34 30
> sort(table(mydata["habitaciones"]),decreasing=TRUE)[1:5]
3 2 4 1 5
315 242 160 96 35
>
```

As we can see, 700 EUR/month is the most common price we can find for a home in Girona, and the rest of the top values do not differ much from it. Regarding the other variables, most homes range from 70 to 100 m² and tend to have 3 rooms.
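The pull that outliers exert on the mean, as opposed to the median, is easy to see on a toy sample (made-up prices, in Python since the scraping side of the post already uses it):

```python
from statistics import mean, median

# Five ordinary rents plus one luxury outlier (hypothetical numbers)
precios = [600, 650, 700, 750, 850, 30000]

media = mean(precios)      # dragged far above any typical rent
mediana = median(precios)  # stays next to the everyday values
```

Here the single 30000 EUR offer pushes the mean above 5000 while the median stays at 725, which is exactly why the median is the better summary for our skewed prices.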

From this point on there is not much more to do, unless we start filtering our data a bit further to look for extreme values, or grouping it by existing categories (such as geographical zone) or by new ones (cheap/normal/expensive, wide/small, etc.).

An interesting first approach might be to group those values by “comarca” (county), by “zona” (zone) or by real estate company (“inmobiliaria”). That will allow us to continue with our basic descriptive analysis while focusing on differences between categories.

We can do the grouping using the dplyr library:

```
comarcas <- mydata %>%
  group_by(comarca) %>%
  summarize(precio_media = mean(precio), precio_desviacion = sd(precio), precio_var = var(precio),
            precio_mediana = median(precio), total_valores = length(precio), precio_maximo = max(precio),
            precio_minimo = min(precio), precio_rango = max(precio) - min(precio),
            habitaciones_media = mean(habitaciones), lavabos_mean = mean(lavabos), fotos_mean = mean(numfotos),
            superficie_maximo = max(superficie), superficie_minimo = min(superficie),
            superficie_media = mean(superficie), superficie_mediana = median(superficie),
            superficie_desviacion = sd(superficie), superficie_varianza = var(superficie),
            superficie_rango = max(superficie) - min(superficie), precio_metro = sum(precio) / sum(superficie))

zonas <- mydata %>%
  group_by(poblacion) %>%
  summarize(precio_media = mean(precio), precio_desviacion = sd(precio), precio_var = var(precio),
            precio_mediana = median(precio), total_valores = length(precio), precio_maximo = max(precio),
            precio_minimo = min(precio), precio_rango = max(precio) - min(precio),
            habitaciones_media = mean(habitaciones), lavabos_mean = mean(lavabos), fotos_mean = mean(numfotos),
            superficie_maximo = max(superficie), superficie_minimo = min(superficie),
            superficie_media = mean(superficie), superficie_mediana = median(superficie),
            superficie_desviacion = sd(superficie), superficie_varianza = var(superficie),
            superficie_rango = max(superficie) - min(superficie), precio_metro = sum(precio) / sum(superficie))

inmobiliarias <- mydata %>%
  group_by(inmobiliaria) %>%
  summarize(precio_media = mean(precio), precio_desviacion = sd(precio), precio_var = var(precio),
            precio_mediana = median(precio), total_valores = length(precio), precio_maximo = max(precio),
            precio_minimo = min(precio), precio_rango = max(precio) - min(precio),
            habitaciones_media = mean(habitaciones), lavabos_mean = mean(lavabos), fotos_mean = mean(numfotos),
            superficie_maximo = max(superficie), superficie_minimo = min(superficie),
            superficie_media = mean(superficie), superficie_mediana = median(superficie),
            superficie_desviacion = sd(superficie), superficie_varianza = var(superficie),
            superficie_rango = max(superficie) - min(superficie), precio_metro = sum(precio) / sum(superficie))
```

Note that the syntax is really similar to an SQL statement. For every group we are especially interested in the price and the surface of the properties: we want to know the mean, standard deviation, minimum, maximum, median and range.

As there is a lot of data here, I decided to filter it a little and print some interesting views:

## DATA BY “ZONES”

Zones are like city quarters. Here we can see some interesting values:

```
> select(zonas %>% arrange(-total_valores), poblacion, total_valores)
# A tibble: 167 x 2
poblacion total_valores
<chr> <int>
1 Zona Centre 60
2 Zona Eixample Sud-Migdia 47
3 Zona Centre-Barri Vell 45
4 Zona Eixample Nord 44
5 Zona La Devesa 28
6 Calella de Palafrugell 24
7 Zona Fenals 21
8 Palamós 19
9 Zona Puig Rom-Canyelles-Almadrava 17
10 Zona Santa Margarida-Salatar 16
# … with 157 more rows
> select(zonas %>% arrange(-precio_media), poblacion, precio_media)
# A tibble: 167 x 2
poblacion precio_media
<chr> <dbl>
1 Llafranc 7962.
2 Sant Martí Vell 7500
3 Sant Gregori 7200
4 Zona Urbanitzacions 6500
5 Fontanals de Cerdanya 5500
6 Calella de Palafrugell 5324.
7 Llambilles 4500
8 Sant Antoni de Calonge 3730
9 Zona Montgoda 3050
10 S´Agaró 2766.
# … with 157 more rows
> select(zonas %>% arrange(precio_media), poblacion, precio_media)
# A tibble: 167 x 2
poblacion precio_media
<chr> <dbl>
1 Zona El Molí-El Rieral 330
2 Vilallonga de Ter 375
3 Port de la Selva (El) 390
4 Colera 400
5 Sant Joan de les Abadesses 421.
6 Jonquera (La) 425
7 Ripoll 439.
8 Lles de Cerdanya 450
9 Zona Piverd-Vila-Seca-Bruguerol 450
10 Zona Parc Bosc-Castell 460
# … with 157 more rows
> select(zonas %>% arrange(-superficie_media), poblacion, superficie_media)
# A tibble: 167 x 2
poblacion superficie_media
<chr> <dbl>
1 Besalú 800
2 Llambilles 600
3 Vilademuls 589
4 Sant Aniol de Finestres 550
5 Sant Feliu de Buixalleu 450
6 Zona Urbanitzacions 432.
7 Zona Canyelles 395.
8 Llagostera 350
9 Fogars de la Selva 336.
10 Camprodon 329.
# … with 157 more rows
```

So the zones with the most offers are “Zona Centre”, the town center of Girona, and “Barri Vell”, its old quarter. Llafranc is the most expensive zone by mean price and “El Molí-El Rieral” the cheapest one; it also looks like the widest properties are found in Besalú.

## DATA BY COMPANIES

Another interesting thing to know is which companies offer the most homes for rent, which are the most expensive (the ones I have to avoid) and which are the cheapest. Surface is also worth a look here.

```
> select(inmobiliarias %>% arrange(-total_valores), inmobiliaria, total_valores)
# A tibble: 231 x 2
inmobiliaria total_valores
<chr> <int>
1 NA 146
2 Alquilovers 75
3 130 Serveis Immobiliaris 26
4 Delfin Inmo 14
5 Cambra Propietat Urbana - Girona 12
6 AGENTHIA Serveis Immobiliaris 11
7 Ceigrup Finques Company Girona 11
8 Finques Catalunya 11
9 Finques Palafrugell 11
10 MARTIIMMOBLES 11
# … with 221 more rows
> select(inmobiliarias %>% arrange(-precio_media), inmobiliaria, precio_media)
# A tibble: 231 x 2
inmobiliaria precio_media
<chr> <dbl>
1 Finques Ginesta 12664.
2 Finques Catalonia 10533.
3 Finques Palafrugell 7773.
4 Estrabau Sa 6289.
5 Advor Immo 6000
6 InmoLobby Internacional 5325
7 Fr 5000
8 H.G. Serveis Immobiliaris 5000
9 Finques Miquel 4750
10 Dpt. Internacional 4500
# … with 221 more rows
> select(inmobiliarias %>% arrange(precio_media), inmobiliaria, precio_media)
# A tibble: 231 x 2
inmobiliaria precio_media
<chr> <dbl>
1 Rairot 375
2 Parés Assessors 380
3 Forcadell Grandes Cuentas 394
4 Cambra Propietat Urbana - Figueres 400
5 InmoMerkat.com 422
6 Agència Mediterrània 430
7 Finques Jocar 433.
8 Finques Nualart 443.
9 Inmo 10 450
10 Payet Vacances 450
# … with 221 more rows
> select(inmobiliarias %>% arrange(-superficie_media), inmobiliaria, superficie_media)
# A tibble: 231 x 2
inmobiliaria superficie_media
<chr> <dbl>
1 Finques Rustiques 800
2 Mipisonuevo.com 650
3 Dpt. Internacional 600
4 K.Teva 519
5 H.G. Serveis Immobiliaris 487
6 Immobiliaria Santamónica 450
7 Particular IF 439
8 Barcelona Realty Group 435
9 Grup GCB 415
10 Inmodet 372.
```

So it turns out that a lot of people offer their house for rent personally (the NA group) instead of going through a renting company. We can also flag “Finques Ginesta” and “Finques Catalonia” as potentially luxury-oriented companies, but it is interesting to see that the most expensive renting companies are not strictly the ones offering the widest spaces.

## DATA BY “COMARCA”

Finally, another grouping we have to look at is the data by “comarca”. In Spain, comarcas are territorial divisions, somewhat similar to counties in the USA. As we got data from only 8 comarcas, the groups may look interesting.

```
> select(comarcas %>% arrange(-total_valores), comarca, total_valores)
# A tibble: 8 x 2
comarca total_valores
<chr> <int>
1 Gironès 273
2 Baix Empordà 227
3 Selva 149
4 Alt Empordà 134
5 Cerdanya 46
6 Ripollès 22
7 Garrotxa 18
8 Pla de l´Estany 3
> select(comarcas %>% arrange(-precio_media), comarca, precio_media)
# A tibble: 8 x 2
comarca precio_media
<chr> <dbl>
1 Baix Empordà 2355.
2 Cerdanya 1071.
3 Pla de l´Estany 1043.
4 Gironès 983.
5 Selva 895.
6 Alt Empordà 800.
7 Garrotxa 756.
8 Ripollès 662.
> select(comarcas %>% arrange(precio_media), comarca, precio_media)
# A tibble: 8 x 2
comarca precio_media
<chr> <dbl>
1 Ripollès 662.
2 Garrotxa 756.
3 Alt Empordà 800.
4 Selva 895.
5 Gironès 983.
6 Pla de l´Estany 1043.
7 Cerdanya 1071.
8 Baix Empordà 2355.
> select(comarcas %>% arrange(-superficie_media), comarca, superficie_media)
# A tibble: 8 x 2
comarca superficie_media
<chr> <dbl>
1 Pla de l´Estany 436
2 Garrotxa 161.
3 Selva 128.
4 Baix Empordà 127.
5 Cerdanya 122.
6 Ripollès 122.
7 Gironès 107.
8 Alt Empordà 106.
```

As we only have 8 categories, it may be a good opportunity to try a bar graph to represent those values.

## PRICE BY COMARCA

## SURFACE(M^2) BY COMARCA

## OFFERS BY COMARCA

So it is easy to find a home in “Gironès” and “Baix Empordà”, but “Baix Empordà” also looks like an expensive place to rent one.

## IDENTIFYING PATTERNS

At this point we can be fairly sure that, according to the information we are dealing with, there is no clear overall regression, and our distributions lack normality.

But if we look closer at the graphs generated by pairs, there seems to be some correlation between the surface of a property and its price, but only for properties priced from ~400 to ~1200 EUR/month.

Beyond that range, prices vary far more wildly and surface matters less, so other features, such as the luxury of the equipment or the location of the property, may take a bigger role in the price.

But this is just a guess; let’s inspect our data a little more closely and listen to its story.

Here I decided to apply the k-means algorithm to extract some clusters and see whether they match my guesses.

```
> ds <- select(mydata, 6,11,12)
> ds <- na.omit(ds)
> dcluster <- kmeans(ds, 2, nstart = 20)
> dcluster
K-means clustering with 2 clusters of sizes 21, 851
Cluster means:
superficie habitaciones precio
1 139.2381 3.428571 12316.667
2 118.3725 2.862515 1016.834
```
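For intuition about what that R call is doing, here is a toy one-dimensional k-means in plain Python (an illustrative sketch on made-up prices, not the code used for the analysis):

```python
def kmeans_1d(values, k=2, iters=50):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster (k >= 2)."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly between the min and the max
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# A handful of ordinary rents plus two luxury outliers (hypothetical)
centroids, clusters = kmeans_1d([600, 700, 750, 800, 12000, 13000])
```

Just as in the R output, the everyday rents collapse into one big cluster while the luxury outliers end up in a small one of their own.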

Now with some more clusters:

```
> dcluster
K-means clustering with 3 clusters of sizes 41, 16, 815
Cluster means:
superficie habitaciones precio
1 234.9268 3.536585 4791.7073
2 142.8125 3.625000 13784.3750
3 112.5669 2.828221 867.4429
```

We can see that increasing the number of clusters does not reveal anything particularly special.

```
K-means clustering with 5 clusters of sizes 14, 157, 645, 36, 20
Cluster means:
superficie habitaciones precio
1 155.71429 3.857143 14360.7143
2 196.54777 3.732484 1451.2102
3 88.53333 2.570543 691.8233
4 298.47222 4.194444 3273.8889
5 138.60000 2.950000 6550.0000
```

So, as we can see, the majority of our values fall into the 400-1200 category we guessed before. Above that there are some degrees of luxury, but with far fewer values, and I am personally not that interested in them, as I’m far from being rich.

Let’s examine the correlation in our 450-1200 category:

```
mydata <- filter(mydata, precio > 450 & precio < 1200)
regresion <- lm(data.matrix(mydata["superficie"]) ~ data.matrix(mydata["precio"]))
plot(superficie ~ precio, mydata)
abline(regresion, lwd = 1, col = "red")
print(regresion)
```

```
> cor.test(data.matrix(mydata["superficie"]), data.matrix(mydata["precio"]), method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: data.matrix(mydata["superficie"]) and data.matrix(mydata["precio"])
t = 11.311, df = 581, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3558844 0.4891418
sample estimates:
cor
0.4248115
>
```

We can see that, once we reduce the dataset to this more mainstream category, surface becomes noticeably more relevant (r ≈ 0.42).

If we repeat the process, this time with prices above 1200, we can clearly see the difference:

```
mydata <- con$find('{}')
mydata <- filter(mydata, precio > 1200)
regresion <- lm(data.matrix(mydata["superficie"]) ~ data.matrix(mydata["precio"]))
plot(superficie ~ precio, mydata)
abline(regresion, lwd = 1, col = "red")
print(regresion)
```

As we can see, that regression makes no sense at all:

```
> cor.test(data.matrix(mydata["superficie"]), data.matrix(mydata["precio"]), method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: data.matrix(mydata["superficie"]) and data.matrix(mydata["precio"])
t = -1.4009, df = 173, p-value = 0.163
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.25031929 0.04311155
sample estimates:
cor
-0.1059089
>
```

So we find no regression here. On the other hand, if we do a little more research, we can see that when properties are expensive, the company offering them plays a big role in their price, so there are expensive renting companies that may be associated with more luxurious equipment.

```
mydata <- con$find('{}')
mydata["precio_inm"] <- NA
mydata <- filter(mydata, precio > 1200)
# Mean price per real estate company, attached to every row
mydata$precio_inm <- ave(mydata$precio, mydata$inmobiliaria, FUN = mean)
regresion <- lm(data.matrix(mydata["precio"]) ~ data.matrix(mydata["precio_inm"]))
plot(precio ~ precio_inm, mydata)
abline(regresion, lwd = 1, col = "red")
cor.test(data.matrix(mydata["precio"]), data.matrix(mydata["precio_inm"]),
         method = c("pearson", "kendall", "spearman"))
```

```
> cor.test(data.matrix(mydata["precio"]), data.matrix(mydata["precio_inm"]), method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: data.matrix(mydata["precio"]) and data.matrix(mydata["precio_inm"])
t = 22.482, df = 173, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8197547 0.8966694
sample estimates:
cor
0.8631362
>
```

So, running the test, we can see that for expensive properties the price is more related to the company offering them, while for average properties it is more related to the surface and the location of the property.

I would like to leave it here for now and continue in a following post, to avoid an extra-long and hard-to-read publication 🙂

Good luck with finding a home!