I did applied exercises 8 and 9 from chapter 3 of the book “An Introduction to Statistical Learning”. The exercises are very useful, and the questions guide you toward the answers in a smart way. The code I used is at the end.
I watched up to video 13 of Jeff Leek’s series on YouTube, and through the videos I found the article “Managing a statistical analysis project – guidelines and best practices” https://www.r-statistics.com/2010/09/managing-a-statistical-analysis-project-guidelines-and-best-practices/ as well as the “Simply Statistics” blog https://simplystatistics.org/, which looks pretty interesting. The videos cover data analysis only in general terms, but they’re a good intro. I’ll go through all of them to get a general idea of what data analysis is and how it is done.
In the following days I’ll explore sites like Kaggle, Analytics Vidhya, MachineLearningMastery and KDnuggets, which are some of the active communities where data scientists from all over the world enrich each other’s learning. I’ll also check https://www.freecodecamp.org/news/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e/ for online courses.
Finally, I worked a bit further through the “Head Start for Data Scientist” tutorial on Kaggle.
Code in R for the exercises from the book “An Introduction to Statistical Learning”
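#Exercise 8, Page 121
setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")
#missing horsepower values are coded as "?" in Auto.csv, so read them as NA
mydata<-read.csv("Auto.csv",na.strings="?")
#drop the incomplete rows so that horsepower stays numeric
mydata<-na.omit(mydata)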
linear<-lm(mpg~horsepower,data=mydata)
summary(linear)
str(mydata)
predict(linear,data.frame(horsepower=c(98)),interval="confidence")
predict(linear,data.frame(horsepower=c(98)),interval="prediction")
attach(mydata)
#predictor on the x axis and response on the y axis, so that abline(linear) matches the plot
plot(horsepower,mpg)
plot(horsepower,mpg,col="red",pch=17)
abline(linear,lwd=2)
par(mfrow=c(2,2))
plot(linear)
plot(hatvalues(linear))
which.max(hatvalues(linear))
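The interval and leverage steps of exercise 8 can also be tried without Auto.csv. The following is only a minimal sketch: it assumes R's built-in mtcars data (columns mpg and hp) as a stand-in for the Auto data, and the cutoff of twice the average leverage is a common rule of thumb rather than something required by the exercise.

```r
# Sketch only: built-in mtcars stands in for Auto.csv (assumption)
fit <- lm(mpg ~ hp, data = mtcars)
new_car <- data.frame(hp = 98)

# confidence interval: uncertainty about the average mpg at hp = 98
conf_int <- predict(fit, new_car, interval = "confidence")
# prediction interval: uncertainty about the mpg of one new car at hp = 98
pred_int <- predict(fit, new_car, interval = "prediction")

# the prediction interval is wider because it also includes the error variance
width_conf <- unname(conf_int[, "upr"] - conf_int[, "lwr"])
width_pred <- unname(pred_int[, "upr"] - pred_int[, "lwr"])
print(c(confidence = width_conf, prediction = width_pred))

# leverage: hat values average (p + 1) / n, so values far above that stand out
h <- hatvalues(fit)
avg_h <- length(coef(fit)) / nrow(mtcars)
print(h[h > 2 * avg_h])  # rule-of-thumb cutoff (assumption), not a formal test
```

The same comparison applies to the two predict() calls above: both intervals are centred on the same fitted value, and only their widths differ.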
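#Exercise 9, Page 121
rm(linear)
setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")
#missing horsepower values are coded as "?" in Auto.csv, so read them as NA
mydata<-read.csv("Auto.csv",na.strings="?")
par(mfrow=c(1,1))
#drop the incomplete rows so that horsepower stays numeric
mydata<-na.omit(mydata)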
plot(mydata)
cor(mydata[,-c(9)])
mlinear<-lm(mpg~.-name,data=mydata)
summary(mlinear)
#holding the other predictors fixed, each additional model year is associated with
#an increase in mpg of about 7.734e-01
plot(mydata$year,mydata$mpg)
test<-lm(mpg~year,data=mydata)
abline(test,lwd=2)
par(mfrow=c(2,2))
plot(mlinear)
#we can see that observation 14 has very high leverage
#we can also see a funnel shape in the residual plot, indicating heteroscedasticity
#(non-constant variance of the errors)
#this can often be addressed by transforming the response with log or square root
#interaction effects
intmlinear<-lm(mpg~.-name+displacement:weight,data=mydata)
summary(intmlinear)
summary(mlinear)
#it appears that the displacement:weight interaction is statistically significant
#and the regression has a higher R squared of 0.8575 instead of the 0.822 we had before,
#without the interaction.
#let's try a log transformation of the response
logmlinear<-lm(log(mpg)~.-name+displacement:weight,data=mydata)
summary(logmlinear)
plot(logmlinear)
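#we can see that the residuals now appear to have constant variance, and the model also has
#a higher R-squared of 0.8884 (measured on log(mpg), so it is not directly comparable
#with the R-squared values of the models for mpg)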
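Because the last model explains log(mpg) rather than mpg, its R-squared is measured on a different scale from the earlier ones. The following is a rough sketch of one way to put the models on the same footing. It assumes the built-in mtcars data (mpg, disp, wt) as a stand-in for Auto.csv, and it uses the squared correlation between observed mpg and back-transformed fitted values as the yardstick, which is an illustrative choice and not part of the book exercise.

```r
# Sketch only: built-in mtcars stands in for Auto.csv (assumption)
# disp * wt expands to disp + wt + disp:wt
raw_fit <- lm(mpg ~ disp * wt, data = mtcars)
log_fit <- lm(log(mpg) ~ disp * wt, data = mtcars)

# summary() reports R-squared on each model's own response scale
r2_raw <- summary(raw_fit)$r.squared
r2_log <- summary(log_fit)$r.squared

# back-transform the log-model fits to mpg and compare on the original scale
mpg_hat <- exp(fitted(log_fit))
r2_log_on_mpg <- cor(mtcars$mpg, mpg_hat)^2

print(round(c(raw = r2_raw, log = r2_log, log_on_mpg_scale = r2_log_on_mpg), 4))
```

If the back-transformed value is still higher than the R-squared of the untransformed model, the improvement is not just an artifact of the change of scale.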