I found the Springboard blog (https://www.springboard.com/blog/), which is interesting, and its article “19 Free Public Data Sets for Your Data Science Project” (https://www.springboard.com/blog/free-public-data-sets-data-science-project/). I also found Inside Airbnb (http://insideairbnb.com/get-the-data.html), which publishes a lot of Airbnb data that I can use for data analysis. It will be really interesting and helpful because it is real data from Airbnb! I downloaded the data for Athens and there is so much to explore!

I also watched the Data Analysis course videos by Jeff Leek up to video 17 and learned how to download data in R directly from websites, which is a very useful thing to know. In addition, I continued the tutorial “Head Start for Data Scientist” on Kaggle, up to the exploratory data analysis part. I think this is the best way to learn data science because you are actually doing it and learning from other users and their Kernels.
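
Here is a minimal sketch of the "download data in R directly from a website" idea, written from memory rather than copied from the course. The helper name `fetch_csv` and the sample columns are my own assumptions, and a small local file served through a `file://` URL stands in for a real web address so the example runs without a network connection.

```r
# Hypothetical helper: download a CSV from a URL and read it into a data frame.
fetch_csv <- function(url, destfile = tempfile(fileext = ".csv")) {
  download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  read.csv(destfile, stringsAsFactors = FALSE)
}

# Stand-in for a real web address: write a tiny made-up CSV to a temp file
# and point a file:// URL at it.
sample_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, price = c(40, 55, 32)),
          sample_path, row.names = FALSE)
sample_path <- normalizePath(sample_path, winslash = "/")
if (!startsWith(sample_path, "/")) sample_path <- paste0("/", sample_path)
sample_url <- paste0("file://", sample_path)

listings <- fetch_csv(sample_url)
str(listings)
```

With a real dataset, `sample_url` would be replaced by the direct link to the CSV file.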

Finally, I did Exercise 10 from Chapter 3 of the book “An Introduction to Statistical Learning”. Here is the code I used:

#Exercise 10, Page 123

setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")

library(ISLR)

data("Carseats")

summary(Carseats)

reg<-lm(Sales~Price+Urban+US,data=Carseats)

summary(reg)

contrasts(Carseats$Urban)

contrasts(Carseats$US)

#(b)

#In the regression, Price has a negative coefficient: holding Urban and US
#fixed, a higher price is associated with lower sales (it is harder to sell
#more expensive car seats).
#R has also created a dummy variable UrbanYes, which takes the value 1 if the
#store is in an urban area and 0 otherwise. Its coefficient is negative, but
#it is not statistically significant in the regression model we created.
#Finally, R has created a dummy variable USYes, which takes the value 1 if the
#store is in the US and 0 otherwise. Its coefficient is positive, so sales are
#affected by the location of the store: stores in the US sell more car seats
#than stores outside the US, other things being equal.

#(c)

#Equation form: Sales = 13.043 - 0.054459*Price - 0.021916*UrbanYes + 1.200573*USYes + ε
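
To see what the equation in (c) means in practice, here is a small sketch that plugs values into it by hand. The function name `predict_sales` and the example inputs are my own assumptions; the coefficients are copied from the equation above. In the Carseats data, Sales is measured in thousands of units.

```r
# Coefficients copied from the fitted equation in part (c)
coefs <- c(intercept = 13.043, Price = -0.054459,
           UrbanYes = -0.021916, USYes = 1.200573)

# Hypothetical helper: evaluate the fitted equation for a given store.
# urban and us are TRUE/FALSE and are converted to the 1/0 dummy coding.
predict_sales <- function(price, urban, us) {
  unname(coefs["intercept"] +
           coefs["Price"] * price +
           coefs["UrbanYes"] * as.numeric(urban) +
           coefs["USYes"] * as.numeric(us))
}

# Example: an urban store in the US charging a price of 100
example <- predict_sales(price = 100, urban = TRUE, us = TRUE)
example
```

For this example store the equation gives about 8.78, that is, roughly 8,780 car seats. With the fitted model itself, `predict(reg, newdata = ...)` should give practically the same number, up to rounding of the intercept.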

#(d)

#We can reject the null hypothesis (coefficient = 0) for the intercept, Price and USYes, but not for UrbanYes.

#(e)

reg2<-lm(Sales~Price+US,data=Carseats)

summary(reg2)

#(f)

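#Both models have an R-squared of about 0.24, so each explains roughly 24% of
#the variance in Sales. The second model drops the non-significant Urban term
#and has a slightly higher adjusted R-squared, so it fits the data about as
#well with fewer predictors.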

#(g)

confint(reg2)

#(h)

plot(reg2)

plot(hatvalues(reg2))

which.max(hatvalues(reg2))

#Observation no. 43 has the highest leverage, so it may have a large
#influence on the fitted model.
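
The leverage check in (h) can be made a bit more systematic. The sketch below is my own addition, not part of the book's exercise: it compares each hat value with twice the average leverage (p+1)/n. The factor of 2 is only a common rule of thumb and the variable names are my own.

```r
library(ISLR)

# Refit the model from part (e) so this block runs on its own
reg2 <- lm(Sales ~ Price + US, data = Carseats)

lev <- hatvalues(reg2)
n <- nrow(Carseats)
p <- length(coef(reg2)) - 1       # number of predictors, excluding the intercept

avg_leverage <- (p + 1) / n       # hat values always average to (p+1)/n
threshold <- 2 * avg_leverage     # rule-of-thumb cutoff, an assumption

# Observations whose leverage exceeds the cutoff, largest first
high <- sort(lev[lev > threshold], decreasing = TRUE)
head(high)

# Cook's distance combines leverage and residual size for the same point
cooks.distance(reg2)[which.max(lev)]
```

A point with high leverage only distorts the fit if it also has a large residual, which is why looking at Cook's distance alongside the hat value is useful.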