27-11-2019

It’s been a while since I wrote in my blog, mainly because of some family health issues. However, I enrolled in a Udemy course called Machine Learning A-Z (https://www.udemy.com/course/machinelearning/) for just 10 euros, and I’ve made progress on it. It’s the first time I have used Python, and it has a different philosophy from R. In many cases it is more convenient for data analysis, but I need to practice a lot to get used to it. Also, today I had a job interview and I got the position! I hope it will help me on my way to learning data analysis and improving my skills!

19-11-2019

Today and yesterday I mainly worked through chapter 4 of the book “An introduction to statistical learning”, up to page 153. I also looked further into online courses, and I’m thinking of taking the course from Udemy, which is cheap and should be good for a beginner like me. The course is called Machine Learning A-Z™: Hands-On Python & R In Data Science https://www.udemy.com/course/machinelearning/ and at the moment it is offered at a low price of 13 euros. These 2 videos helped me a lot in choosing among online courses: https://www.youtube.com/watch?v=MGnXMu9GFig and https://www.youtube.com/watch?v=_XbttSk3ALs .

17-11-2019

These are some recommendations on online courses from the web. I watched the video How to Become a Data Scientist here https://www.youtube.com/watch?v=jMvhFNGGT_0&list=PLZ62S2cTm5Yp9K1g5VrmMzgPhx_BMLwi5&index=10&t=0s and also the video Best Online Data Science Courses from the same YouTube channel: https://www.youtube.com/watch?v=MGnXMu9GFig

Some of the recommendations are:

1.Introduction to Python (this is not free anymore; I watched the first lecture) https://www.datacamp.com/courses/intro-to-python-for-data-science?utm_source=learnpython_com&utm_campaign=learnpython_tutorials

2.Google’s Python Class https://developers.google.com/edu/python/

Coursera

3.Data Science Specialization Johns Hopkins  https://www.coursera.org/specializations/jhu-data-science

4.Applied Data Science with Python University of Michigan https://www.coursera.org/specializations/data-science-python

Udemy

5.Python for Data Science and Machine Learning Bootcamp https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/

Dataquest

6.Learn Data Science https://www.dataquest.io/

I also finished the tutorial “Head Start…” on Kaggle. I need to learn some theory around random forests, logistic regression and other prediction models. I also read some other tutorials, and some of them go into a lot of depth.

Watched this video about random forests -Machine learning – Random forests https://www.youtube.com/watch?v=3kYujfDgmNk&list=PLZ62S2cTm5Yp9K1g5VrmMzgPhx_BMLwi5&index=4

There are other videos about random forests that I still need to watch. I found the StatQuest tutorials on YouTube very helpful: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

Generally, I think it would be a good idea to take an online course that comes with a certificate. A cheap one to start with, maybe from Udemy; I will take some time to investigate.

15-11-2019

Read the book “An introduction to statistical learning” up to page 138. This book is more practical and does not go into the details of the mathematics used, but it emphasizes why things are done.

Watched many videos from the jbstatistics channel on YouTube, especially on Simple Linear Regression and ANOVA, and they’re really helpful.

On 16-11-2019 I learned a lot about ANOVA by trying to apply it to the Airbnb dataset in R. The assumptions (independent observations, roughly normal residuals and similar variances across groups) are important and need to be checked every time we run ANOVA. I also checked some programs in Greece for an MSc in Data Analysis and I’m thinking about applying to one of them. The cost of the full-time program is approximately 6000 euros. So better start saving 🙂
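
A minimal sketch of how such assumption checks might look in R. The Airbnb data is not included here, so the built-in PlantGrowth dataset is used as a stand-in (an assumption for illustration), and the Shapiro-Wilk and Bartlett tests are only one common choice among several.

```r
# One-way ANOVA on the built-in PlantGrowth data (stand-in for the Airbnb data)
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# Assumption: residuals are roughly normally distributed (Shapiro-Wilk test)
normality <- shapiro.test(residuals(fit))

# Assumption: the groups have similar variances (Bartlett test)
equal_var <- bartlett.test(weight ~ group, data = PlantGrowth)

# A large p-value means no evidence against the assumption
print(normality$p.value)
print(equal_var$p.value)
```

Calling plot(fit, which = 1:2) afterwards gives the visual versions of the same checks (residuals vs fitted values and a normal Q-Q plot).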

13-11-2019

Found the Springboard blog https://www.springboard.com/blog/ which is interesting, and also the article “19 Free Public Data Sets for Your Data Science Project” https://www.springboard.com/blog/free-public-data-sets-data-science-project/ . Also found this site http://insideairbnb.com/get-the-data.html that has lots of Airbnb data I can analyse. It will be really interesting and helpful because it is real data from Airbnb! I downloaded the data for Athens and there are so many things to see and do!

I also watched some videos from the Data Analysis course by Jeff Leek, up to video 17, and I learned to download data into R directly from websites, which is a very useful thing to know. Finally, I continued the tutorial “Head Start for Data Scientist” on Kaggle, up to exploratory data analysis. I think this is the best way to learn data science because you are actually doing it and learning from other users and Kernels.

Finally, I did exercise 10 from chapter 3 of the book “An introduction to statistical learning”. Here is the code I used:
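
Downloading data straight into R could look roughly like the sketch below. The helper name fetch_csv is hypothetical, and the demo uses a small local file exposed through a file:// URL so that it runs offline; with a real dataset you would pass the http(s) address of the CSV instead.

```r
# Hypothetical helper: download a CSV from a URL into a temp file, then read it
fetch_csv <- function(url, ...) {
  dest <- tempfile(fileext = ".csv")
  on.exit(unlink(dest))  # remove the temp copy once the data is in memory
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  read.csv(dest, ...)
}

# Offline demo: write a tiny CSV and point a file:// URL at it
local_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, price = c(40, 55, 72)), local_file, row.names = FALSE)
local_path <- normalizePath(local_file, winslash = "/")
# Windows paths (C:/...) need an extra slash after file://
demo_url <- paste0("file://", if (startsWith(local_path, "/")) "" else "/", local_path)

listings <- fetch_csv(demo_url)
str(listings)
```

For a one-off download, read.csv() also accepts a URL directly; going through download.file() keeps a local copy step that is easy to cache or inspect.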

#Exercise 10, Page 123

setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")

library(ISLR)

data("Carseats")

summary(Carseats)

reg<-lm(Sales~Price+Urban+US,data=Carseats)

summary(reg)

contrasts(Carseats$Urban)

contrasts(Carseats$US)

#(b)

#In the regression, Price has a negative coefficient: holding the other

#predictors fixed, a higher price is associated with lower sales of car seats.

#R has also created a UrbanYes dummy variable, which takes the value 1 if the store is

#in an urban area and 0 otherwise. Its coefficient is negative,

#but it is not significant in the regression model we created.

#Finally, R has created a USYes dummy variable, which takes the value 1 if the store is

#in the US and 0 otherwise. Its coefficient is positive and significant,

#so stores in the US sell more car seats on average than stores

#outside the US.

#(c)

#equation form: Sales = 13.043 - 0.054459*Price - 0.021916*UrbanYes + 1.200573*USYes + ε

#(d)

#we can reject the null hypothesis for the intercept, Price and USYes, but not for UrbanYes

#(e)

reg2<-lm(Sales~Price+US,data=Carseats)

summary(reg2)

#(f)

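#The first model has an F-statistic of 41.52 and the second 62.43, while the

#R-squared is about 0.24 for both. The two models fit the data about equally well,

#so the simpler second model is preferable.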

#(g)

confint(reg2)

#(h)

plot(reg2)

plot(hatvalues(reg2))

which.max(hatvalues(reg2))

#We can see that there is a high leverage observation that affects the model

#which is observation no.43
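
For part (f), the fit statistics can be pulled out of the two model summaries and placed side by side rather than read off the printout. This is only a sketch: fit_stats is a hypothetical helper, and it assumes the ISLR package is installed, as in the script above.

```r
library(ISLR)  # provides the Carseats data

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US, data = Carseats)

# Hypothetical helper: collect the usual fit statistics from summary()
fit_stats <- function(model) {
  s <- summary(model)
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    rse           = s$sigma,
    f_statistic   = unname(s$fstatistic["value"]))
}

comparison <- rbind(full = fit_stats(full), reduced = fit_stats(reduced))
round(comparison, 4)

# Partial F-test: does dropping Urban make the fit significantly worse?
anova(reduced, full)
```

R-squared can never decrease when a predictor is added, so for nested models the adjusted R-squared, the residual standard error and the partial F-test are the more informative columns.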

12-11-2019

Did the applied exercises 8 & 9 from chapter 3 of the book “An introduction to statistical learning”. The exercises are very useful and the questions guide you in a smart way to find the answers. The code I used is at the end of this post.

Watched up to video 13 of the Jeff Leek series on YouTube, and through the videos I found the article “Managing a statistical analysis project – guidelines and best practices” https://www.r-statistics.com/2010/09/managing-a-statistical-analysis-project-guidelines-and-best-practices/ as well as the “simply statistics” blog https://simplystatistics.org/ which seems pretty interesting. These videos cover data analysis in general, but they’re a good intro. I’ll go through all of them to get a general idea of what data analysis is and how it is done.

In the following days I will explore sites like Kaggle, Analytics Vidhya, MachineLearningMastery and KDnuggets, which are some of the active communities where data scientists from all over the world enrich each other’s learning. I’ll also check https://www.freecodecamp.org/news/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e/ for online courses.

Finally, I worked a bit more through the “Head Start for Data Scientist” tutorial on Kaggle.

Code in R from the book “An introduction to statistical learning”

#Exercise 8, Page 121

setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")

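#the book's Auto.csv marks missing horsepower values with "?"; reading them as NA

#keeps the column numeric instead of turning it into factor codes or text

mydata<-read.csv("Auto.csv",na.strings="?")

mydata<-na.omit(mydata)

mydata$horsepower<-as.numeric(mydata$horsepower)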

linear<-lm(mpg~horsepower,data=mydata)

summary(linear)

str(mydata)

predict(linear,data.frame(horsepower=c(98)),interval="confidence")

predict(linear,data.frame(horsepower=c(98)),interval="prediction")

attach(mydata)

#predictor on the x-axis and response on the y-axis, so that abline(linear) matches the plot

plot(horsepower,mpg)

plot(horsepower,mpg,col="red",pch=17)

abline(linear,lwd=2)

par(mfrow=c(2,2))

plot(linear)

plot(hatvalues(linear))

which.max(hatvalues(linear))

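#Exercise 9, Page 121

rm(linear)

setwd("C:/Users/User/Desktop/Άγγελος/R/Statlearning/Data")

#read the "?" entries as NA and drop the incomplete rows before modelling

mydata<-read.csv("Auto.csv",na.strings="?")

mydata<-na.omit(mydata)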

par(mfrow=c(1,1))

mydata$horsepower<-as.numeric(mydata$horsepower)

plot(mydata)

cor(mydata[,-c(9)])

mlinear<-lm(mpg~.-name,data=mydata)

summary(mlinear)

#holding the other predictors fixed, each additional model year is associated with an increase in mpg of about 7.734e-01

plot(mydata$year,mydata$mpg)

test<-lm(mpg~year,data=mydata)

abline(test,lwd=2)

par(mfrow=c(2,2))

plot(mlinear)

#we can see that observation 14 has very high leverage

#we can also see a funnel shape in the residual plot, indicating heteroscedasticity

#(non-constant variance in the errors)

#it can be addressed by transforming the response with log or square root

#interaction effects

intmlinear<-lm(mpg~.-name+displacement:weight,data=mydata)

summary(intmlinear)

summary(mlinear)

#it appears that the interaction (displacement:weight) is statistically significant

#and the regression has a higher R-squared of 0.8575 instead of the 0.822 we had before,

#without the interaction.

#let's try a log transformation of the response

logmlinear<-lm(log(mpg)~.-name+displacement:weight,data=mydata)

summary(logmlinear)

plot(logmlinear)

#we can see that our residuals now appear to have constant variance and the model also

#has a higher R-squared of 0.8884 (measured on the log(mpg) scale)
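
One caveat when comparing a model for mpg with a model for log(mpg): the two R-squared values are measured on different scales. A rough way to compare them is to back-transform the fitted values and compute the error in mpg for both. The sketch below uses the built-in mtcars data as a stand-in for Auto.csv (an assumption, since the file is local), and rmse is a hypothetical helper.

```r
# mtcars ships with R; it is used here only as a stand-in for Auto.csv
raw_fit <- lm(mpg ~ hp + wt + disp, data = mtcars)
log_fit <- lm(log(mpg) ~ hp + wt + disp, data = mtcars)

# These two R-squared values refer to different responses (mpg vs log(mpg))
c(raw = summary(raw_fit)$r.squared, log = summary(log_fit)$r.squared)

# Hypothetical helper: root mean squared error on the original mpg scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_raw <- rmse(mtcars$mpg, fitted(raw_fit))
# exp() undoes the log; this naive back-transform ignores retransformation bias
rmse_log <- rmse(mtcars$mpg, exp(fitted(log_fit)))

c(raw = rmse_raw, log = rmse_log)
```

The residual plots from plot(log_fit) remain the main evidence for whether the funnel shape has gone; the error comparison only puts the two fits on a common footing.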

11-11-2019

Read pages 104-120 of the book “An introduction to statistical learning”. I ran the commands in R from the lab starting on p.109. It will be very useful to do the linear regression exercises, especially the applied ones.

I also started a new Kaggle tutorial on the Titanic dataset named “Head Start for Data Scientist” https://www.kaggle.com/hiteshp/head-start-for-data-scientist which has lots of useful information and also lists online courses you can take in data science. I also checked this course on Coursera https://www.coursera.org/learn/machine-learning which I want to begin to see if I can manage to complete it. It is about machine learning, so in the next few days I will know whether it is too advanced for me or not. Maybe I can focus on studying both the book and the course, and sometimes run through the tutorials on Kaggle. But it will probably be better to watch the introductory videos on YouTube by Jeff Leek first and then start the course.

My First Post!

First post!

So many things today: I watched the YouTube playlist “coursera: data analysis by Jeff leek” https://www.youtube.com/playlist?list=PLXBDYmaCbeL8efhOZS4g9W6Z3m9_hFSnT
Very good series for a beginner; watched up to video 6.
Also, I downloaded this pdf http://vita.had.co.nz/papers/tidy-data.pdf (Tidy data) and read up to chapter 3.
Found the author of the previous pdf, Hadley Wickham, at https://github.com/
*Q: How do I run code in R from https://github.com/hadley/data-baby-names ? I should search GitHub a bit; I can learn a lot from there.
I also watched the videos Coin Flipping Robot https://www.youtube.com/watch?v=y-n-5Gdv-74 and How random is a coin toss? https://www.youtube.com/watch?v=AYnJv68T3MM The video showed a 51% chance!
I already have a basic statistical background; however, I can check the page “open intro statistics” https://www.openintro.org/stat/textbook.php . Finally, I checked the article R Coding Style Guide https://www.r-bloggers.com/%F0%9F%96%8A-r-coding-style-guide/ , which gives some rules about programming in R!

I have to organise my study a bit; it’s difficult to stay focused on one particular subject every time, but it is vital. I have been studying data analysis and refreshing my memory of statistics and the R language for about 2 months. I feel I’ve learned a lot since then, but of course I’m still an amateur at data analysis. In another post I’ll make a list of the books I’ve read or want to read about the subject. So that’s it for today!

My First Blog Post

Data analysis progress

Be yourself; Everyone else is already taken.

— Oscar Wilde.

This is my first post here and my very first blog! I wanted to keep track of my progress in the study of data analysis, which I find interesting. Hopefully others will be able to visit some of the sites I found useful for learning, or maybe find some answers to their questions. My goal is to write almost every day about what I have studied and what questions I have. I have a statistical background, but I need to work a lot in order to be competitive and gain knowledge. As I said, this is my very first blog, so it will take some time to improve! I can’t wait to get started!
