I found the Springboard blog (https://www.springboard.com/blog/), which looks interesting, and in particular the article “19 Free Public Data Sets for Your Data Science Project” (https://www.springboard.com/blog/free-public-data-sets-data-science-project/). I also found http://insideairbnb.com/get-the-data.html, which publishes a lot of Airbnb data that I can use for data analysis. This should be both interesting and useful because it is real data from Airbnb! I downloaded the data for Athens, and there is a lot to explore.
I also watched the videos of Jeff Leek's Data Analysis course up to video 17 and learned how to download data in R directly from websites, which is very useful to know. I then continued the “Head Start for Data Scientist” tutorial on Kaggle, up to the exploratory data analysis part. I think this is the best way to learn data science, because you are actually doing it and you learn from other users and their Kernels.
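As a reminder of how downloading data from a website works in R, here is a minimal sketch. `fetch_csv` is a hypothetical helper name of my own, not something from the course, and the URL in the usage comment is only a placeholder to be replaced with a real link.

```r
# Hypothetical helper (name is my own): download a CSV file from a URL
# into a temporary file and load it as a data frame.
fetch_csv <- function(url, destfile = tempfile(fileext = ".csv")) {
  # mode = "wb" keeps the file byte-for-byte intact on every platform
  download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  read.csv(destfile, stringsAsFactors = FALSE)
}

# Usage (placeholder URL, replace with a real CSV link):
# listings <- fetch_csv("https://example.com/data.csv")
# str(listings)
```

Saving to a file first, instead of reading the URL directly, means the download happens only once and the raw file can be kept for later.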
Finally, I did Exercise 10 from Chapter 3 of the book “An Introduction to Statistical Learning”. Here is the code I used:
#Exercise 10, Page 123
setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")
library(ISLR)
data("Carseats")
summary(Carseats)
reg<-lm(Sales~Price+Urban+US,data=Carseats)
summary(reg)
contrasts(Carseats$Urban)
contrasts(Carseats$US)
#(b)
#Price has a negative coefficient (-0.054459): holding the other predictors
#fixed, a one-unit increase in Price is associated with about 54 fewer car
#seats sold per store (Sales is measured in thousands of units).
#R has also created an UrbanYes dummy variable, which takes the value 1 if the
#store is in an urban area and 0 otherwise. Its coefficient is negative, but it
#is not statistically significant in this model.
#Finally, R has created a USYes dummy variable, which takes the value 1 if the
#store is in the US and 0 otherwise. Its coefficient is positive (1.200573) and
#significant: other things equal, US stores sell about 1,200 more car seats
#on average than stores outside the US.
#(c)
#Equation form: Sales = 13.043 - 0.054459*Price - 0.021916*UrbanYes + 1.200573*USYes + ε
#where UrbanYes and USYes are the 0/1 dummy variables described in (b)
#(d)
#We can reject the null hypothesis (coefficient = 0) for the intercept, Price
#and USYes, but not for UrbanYes
#(e)
reg2<-lm(Sales~Price+US,data=Carseats)
summary(reg2)
#(f)
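#Both models have a multiple R-squared of about 0.2393, so they explain only
#about 24% of the variance in Sales. The second model has a slightly higher
#adjusted R-squared (0.2354 vs 0.2335) and a larger F-statistic (62.43 vs
#41.52), so it fits marginally better while using one predictor fewer.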
#(g)
confint(reg2)
#(h)
plot(reg2)
plot(hatvalues(reg2))
which.max(hatvalues(reg2))
#Observation no. 43 has the highest leverage, well above the average leverage
#of (p+1)/n = 3/400 = 0.0075, so it may have a strong influence on the fit
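The comparison in (f) and the leverage check in (h) can also be done programmatically. The following is a minimal sketch, not part of the original exercise code: `fit_stats` and `high_leverage` are hypothetical helper names, and the cutoff of twice the average leverage is a common rule of thumb that I am assuming here, not something the exercise prescribes.

```r
# Hypothetical helper: collect the fit statistics that summary() prints,
# so that two models can be compared side by side.
fit_stats <- function(model) {
  s <- summary(model)
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    rse           = s$sigma,
    f_statistic   = unname(s$fstatistic["value"]))
}

# Hypothetical helper: return the observations whose leverage exceeds
# `multiple` times the average leverage (p + 1) / n.
# The default multiple of 2 is an assumed rule of thumb.
high_leverage <- function(model, multiple = 2) {
  h <- hatvalues(model)
  cutoff <- multiple * length(coef(model)) / length(h)
  h[h > cutoff]
}

# Apply the helpers to the two Carseats models if ISLR is installed.
if (requireNamespace("ISLR", quietly = TRUE)) {
  carseats <- ISLR::Carseats
  reg  <- lm(Sales ~ Price + Urban + US, data = carseats)
  reg2 <- lm(Sales ~ Price + US, data = carseats)
  print(round(rbind(reg = fit_stats(reg), reg2 = fit_stats(reg2)), 4))
  print(sort(high_leverage(reg2), decreasing = TRUE))
}
```

Putting the statistics in one table makes it harder to confuse the F-statistic with R-squared when reading two separate summary() outputs.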