My First Blog Post

Data analysis progress

Be yourself; Everyone else is already taken.
— Oscar Wilde.

This is my first post here and my very first blog! I want to keep track of my progress in studying data analysis, which I find interesting. Hopefully others will be able to use some of the sites I found useful for learning, or maybe find answers to their questions. My goal is to write almost every day about what I have studied and what questions I have. I have a statistical background, but I need to work a lot in order to be competitive and gain knowledge. As I said, this is my very first blog, so it will take some time to improve! I can’t wait to get started!

20-12-2020

I have been making progress with the seminar on Business Intelligence, SQL and Power BI, and it is very interesting. Power BI is a powerful visualization tool that is quite easy to use, and I want to apply it to real data. The cool thing is that graphs update automatically when you change or add data. I also learned some things about data warehouses and ETL processes.

Apart from that, I am not working on new data analysis projects right now, because I have work to do for my other subjects in Musicology. The exam period ends in February and hopefully I won’t have more subjects to study after that (or maybe 2-3).

I also apply to as many statistician/data analyst jobs as I can, and I have improved my CV, which now looks cleaner and more professional. I also learned some new things in Excel: a friend of mine sent me an interview test for Excel and it helped me a lot.

I think that until February of 2021 I will have time just for the seminar, but after my exams in Musicology I want to continue my projects and maybe enroll in a statistics course, or maybe a programming course!

14-11-2020

Today I completed the third chapter of the book Introduction to Statistical Learning and did the exercises in R. It’s a really good book and I am reading it for the second time. I want to read the whole book again, because the second time through I understand some concepts better. Combined with the machine learning course from Coursera (where I also want to rewatch some videos), I think it gives a good idea of machine learning in general.

I was also watching a video on YouTube about choosing to be a software engineer rather than a data scientist, because data science is hyped nowadays and in the near future software engineers will have better career opportunities and replace data scientists, or something like that. I believe that even if this is true, statistics will always be in demand, as well as coding, so their combination is really strong. If I got a job by summer in something related to statistics, I could choose an MSc with a better idea of what is good for my career. An MSc in Data Science is something that I want to do for sure, but the best thing would be to get a job first. I must not tell myself that this is a hard time to find a job because of Covid restrictions; I need to keep applying.

Another factor in choosing an MSc is “flexibility”, and it is really important: even if I need to change my career goals a bit, I will be able to do so. I believe Statistics will always be in demand, especially in the future, so from this angle it is more flexible and “safer” than Data Science. As I said, a combination of coding and statistics may be better than an MSc in Data Science.

In the following days I want to continue with the book EOSL and also with the SQL, BI and Visualization seminar in which I am currently enrolled. And continue to apply every day!

31-10-2020

These days I spent many hours trying to understand how scraping works, and I understood the basics. It’s difficult to write code that works for scraping every webpage, and if the HTML is not well written, things get difficult. However, I understood some basic stuff by watching tutorials on YouTube, and I also learned to use Wordcloud in Python, which is really fun. I also learned to use choropleth maps and plotted real data on various maps!

There are so many things I’ve done these days that it is difficult for me to mention them all here. I have started a seminar on SQL and BI at the university where I studied statistics, I have completed the Dataquest Data Analyst in Python path (not sure if I’ve mentioned it here before) and did a couple of projects. The project I now want to upload will combine scraping and Wordcloud (maybe extracting the posts from my blog). Generally, my portfolio is already looking much better!

I also think a lot about an MSc in Data Science, and it would be nice to do one outside Greece. However, my bachelor’s grade is low, so the chances of being accepted to a program at a public university outside Greece are low. I never imagined that I would one day have this desire, so I never tried to get good grades. But you never know; sometimes when we want something a lot we do everything to make it happen. I cannot change the past, but I can do whatever I can now to gain knowledge and build my portfolio.

Now that I have finished Dataquest, it will be important to make a plan for progressing in Data Science and doing projects. As I said, I am currently enrolled in the SQL and BI seminar from my university, but I think I will have time if I plan smartly. My other activities are Musicology and learning German!

More good news: a company replied to my application for a customer analyst role and gave me an assignment to complete. I hope they will call me; if not, it’s ok, I will continue my work. It is very important to work smartly and have a goal. My goal is to find a data analyst job, and by building a really interesting portfolio and doing projects, I can stand out from the competition.

Finally, it has been nearly a year since I became interested in data analysis, and I have made big progress. I will continue to do so and expand my knowledge daily!

03-10-2020

These days I’ve completed the NY schools project and I’ve managed to make a summary of my data analysis projects on GitHub Pages, and it looks very nice. It is something I will constantly update with more and better projects. I also worked more on my CV and decided to take an online seminar at the University of Economics and Business, where I graduated. I think it will help my CV look better. It has to do with SQL and BI, so it will hopefully be very useful. The price is a bit high but it is an investment, so I don’t need to worry about that!

Some interesting observations from the lessons I did on Dataquest today:

When we compute it by dividing by n, the sample standard deviation usually underestimates the population standard deviation. The small correction we apply (dividing by n-1 instead of n) is called Bessel’s correction. Here is a paper for variability in categorical variables.
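
To see the effect of Bessel’s correction on actual numbers, here is a minimal sketch using only Python’s standard library. The five values are made up for illustration; they are not from the Dataquest lesson.

```python
import statistics

# Hypothetical sample of five measurements (made up for illustration).
sample = [2, 4, 4, 6, 9]

# Divide by n: the uncorrected estimate, which tends to come out too small.
sd_n = statistics.pstdev(sample)

# Divide by n - 1: the estimate with Bessel's correction.
sd_n_minus_1 = statistics.stdev(sample)

print(sd_n, sd_n_minus_1)
```

The corrected value is always the larger of the two, and the gap shrinks as the sample gets bigger.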

28-09-2020

These days I worked with Dataquest and the command line, and also with histograms in Python and distributions like the Normal and Uniform. An interesting observation about boxplots:

A value is an outlier if:

It’s larger than the upper quartile by more than 1.5 times the difference between the upper quartile and the lower quartile (this difference is also called the interquartile range).
It’s lower than the lower quartile by more than 1.5 times the interquartile range.
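
The rule above can be sketched in a few lines of Python. The values are made up, and the quartiles come from the standard library’s statistics.quantiles; other tools use slightly different quartile conventions, so the fences can differ a little from what a plotting library draws.

```python
import statistics

# Made-up values; 40 is deliberately far from the rest.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]

# With n=4, quantiles() returns the three quartile cut points.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(outliers)  # [40]
```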

This site https://fivethirtyeight.com/ presents cool graphs and statistics, and it would be helpful to read some of its articles occasionally!

I also worked with some commands like the following:

cd /home/dq/practice/wildcards

mkdir html_files archive data

mv *html html_files

mv 201[!9]* archive

mv *csv data

mv /sqlite-autoconf-3210000/tea/win/you_found_it.b64 /home

I also continued presenting the NY schools project in a Jupyter notebook in order to upload it to my GitHub profile. I try to understand everything I do and not just copy the code.

23-09-2020

I’m really excited today, because I used the Basemap library in Python as part of a project on Dataquest. It will be really cool to be able to plot statistics on actual maps in my own projects. Here is a tutorial about using the library, which I found very helpful.

I will also check this notebook and try to apply it myself.

And this is probably a treasure. It contains projects by someone from whom we can learn a lot! His projects are based on Dataquest and Datacamp online courses! This is absolutely amazing work! Here is his version of the project I’m working on.

Also this data on Github details the deaths of Marvel comic book characters between the time they joined the Avengers and April 30, 2015, the week before Secret Wars 1. Really funny!

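Finally, this is a great way to count values across several columns of each row. Below is the code:

import pandas as pd

def clean_deaths(row):
    # Count how many of the five death columns are marked 'YES' in this row.
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        # Missing values and 'NO' do not count as deaths.
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

# true_avengers is assumed to be an existing dataframe with the Death1..Death5 columns.
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)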

21-09-2020

Some interesting things I did in the last few days:

List Comprehensions

The loop below can be written with a single line of code:

ints = [1, 2, 3, 4]
times_ten = []
for i in ints:
    times_ten.append(i * 10)
print(times_ten)

[10, 20, 30, 40]

It can be written like this:

times_ten = [(i * 10) for i in ints]

So in order to transform a loop into a list comprehension, inside the brackets we:

Start with the code that transforms each item.
Continue with our for statement (without a colon).

Lambda functions

To create a (temporary) lambda function equivalent to another function, we:

Use the lambda keyword, followed by
The parameter and a colon, and then
The transformation we wish to perform on our argument
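
As a small illustration of these three steps, here is a sketch that writes the same transformation once as a named function and once as a lambda. The names times_ten and times_ten_lambda are just examples chosen for this post.

```python
# A regular named function (hypothetical example).
def times_ten(x):
    return x * 10

# The lambda equivalent: the lambda keyword, the parameter and a colon,
# then the transformation applied to the argument.
times_ten_lambda = lambda x: x * 10

ints = [1, 2, 3, 4]

# Lambdas are most useful inline, for example as the function passed to map().
result = list(map(lambda x: x * 10, ints))
print(result)  # [10, 20, 30, 40]
```

Giving a lambda a name, as in the second statement, is only done here to compare it side by side with the regular function; normally a lambda is used directly where it is needed.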

I also refreshed my memory on some statistics topics:

There are four different scales of measurement: nominal, ordinal, interval, and ratio. The characteristics of each scale pivot around three main questions:

Can we tell whether two individuals are different?
Can we tell the direction of the difference?
Can we tell the size of the difference?

What sets apart ratio scales from interval scales is the nature of the zero point.

And finally, I did some handling of missing values:

The technical name for filling in a missing value with a replacement value is imputation.

Here are some pages that contain interesting data for analysis:
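
A minimal sketch of imputation, using plain Python and made-up scores, with the mean of the observed values as the replacement. The mean is only one possible choice of replacement value; it is an assumption of this example.

```python
from statistics import mean

# Made-up scores; None marks the missing value.
scores = [10.0, None, 30.0, 20.0]

# Compute the replacement from the values we actually observed.
observed = [s for s in scores if s is not None]
fill_value = mean(observed)

# Impute: keep observed values, replace missing ones.
imputed = [fill_value if s is None else s for s in scores]
print(imputed)  # [10.0, 20.0, 30.0, 20.0]
```

In pandas the same idea is usually written with the fillna method of a Series or DataFrame.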

https://www.reddit.com/r/datasets/

https://github.com/awesomedata/awesome-public-datasets

https://rs.io/100-interesting-data-sets-for-statistics/

And especially this http://www.data.gov.gr/ has a ton of data from Greece. Really interesting!

18-09-2020

This is a very useful page for practicing regular expressions. It takes a lot of practice to be comfortable with them. I did some practice on Dataquest but that is just an introductory step.

I reviewed some statistics concepts like simple random, stratified and cluster sampling on Dataquest, which does a pretty good job of explaining these topics. We can also find information here
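
To make the three sampling methods concrete, here is a small sketch with a made-up population of 100 students from two schools. The school names, the sample sizes and the way clusters are formed (blocks of ten consecutive students) are all assumptions for illustration.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 90 students from school "A", 10 from school "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

# Simple random sampling: every individual has the same chance of selection.
simple = random.sample(population, 10)

# Stratified sampling: split into strata (schools), then sample within each one.
strata = defaultdict(list)
for school, student in population:
    strata[school].append((school, student))
stratified = [unit for group in strata.values() for unit in random.sample(group, 5)]

# Cluster sampling: pick whole clusters at random and keep everyone in them.
clusters = defaultdict(list)
for index, unit in enumerate(population):
    clusters[index // 10].append(unit)  # ten clusters of ten students
chosen = random.sample(sorted(clusters), 2)
cluster_sample = [unit for c in chosen for unit in clusters[c]]

print(len(simple), len(stratified), len(cluster_sample))  # 10 10 20
```

Note how the stratified sample is guaranteed to contain students from the small school "B", while the simple random sample may contain none.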

When we describe a sample or a population (by measuring averages, proportions, and other metrics; by visualizing properties of the data through graphs; etc.), we do descriptive statistics.

When we try to use a sample to draw conclusions about a population, we do inferential statistics (we infer information from the sample about the population).

Finally, I’ve explored the notebooks on the wine dataset and it’s cool that I can now understand the users’ basic code in R and also in Python! It’s easy to implement a machine learning algorithm in both languages; the difficult part is understanding how it works.

I must keep in mind that learning data science is not a matter of months, but a matter of years. Like in every field, you have to commit in order to become an expert. There are so many things to learn and certainly we cannot excel at everything, but consistency is the key to making steady progress. One brick each day and soon there will be a wall!

15-09-2020

Here are some key points of what I did and discovered in the last few days:

Here is a very good book of its kind https://www.edwardtufte.com/tufte/books_vdqi

From the description “The classic book on statistical graphics, charts, tables. Theory and practice in the design of data graphics, 250 illustrations of the best (and a few of the worst) statistical graphics, with detailed analysis of how to display data for precise, effective, quick analysis”.

Seaborn uses a technique called kernel density estimation, or KDE for short, to create a smoothed line chart over the histogram.
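
To get a feeling for what kernel density estimation does, here is a hand-written sketch with Gaussian kernels in plain Python. It is not Seaborn’s implementation; the data and the bandwidth of 0.5 are assumptions picked by hand for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each observation contributes a small bell curve centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

# Made-up observations and a hand-picked bandwidth (both are assumptions).
data = [1.0, 1.5, 2.0, 2.2, 5.0]
density = gaussian_kde(data, bandwidth=0.5)

# Evaluate on a grid from -3 to 10; plotting these points gives the smooth curve.
step = 0.1
grid = [i * step for i in range(-30, 101)]
curve = [density(x) for x in grid]

# Rough numerical check: the area under the curve should be close to 1.
area = sum(curve) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows the individual observations more closely.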

When we use the concat() function to combine dataframes with the same shape and index, we can think of the function as “gluing” dataframes together.

Unlike the concat function, the merge function only combines dataframes horizontally (axis=1) and can only combine two dataframes at a time.

An inner join returns only the intersection of the keys, or the elements that appear in both dataframes with a common key.

Outer join: includes all data from both dataframes

Left join: includes all of the rows from the “left” dataframe along with any rows from the “right” dataframe with a common key; the result retains all columns from both of the original dataframes.
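
The join types above can be compared on two tiny dataframes. The countries and numbers below are made up; only pd.concat and pd.merge themselves come from pandas.

```python
import pandas as pd

# Two small made-up dataframes sharing the key column "country".
left = pd.DataFrame({"country": ["Greece", "Italy", "Spain"], "score": [5.2, 6.0, 6.3]})
right = pd.DataFrame({"country": ["Italy", "Spain", "France"], "gdp": [1.9, 1.3, 2.6]})

# concat "glues" dataframes together; axis=0 stacks the rows,
# and columns that exist in only one dataframe are filled with NaN.
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# merge combines two dataframes side by side on a common key.
inner = pd.merge(left, right, on="country", how="inner")      # keys in both
outer = pd.merge(left, right, on="country", how="outer")      # keys in either
left_join = pd.merge(left, right, on="country", how="left")   # all keys from left

print(len(inner), len(outer), len(left_join))  # 2 4 3
```

In the left join, Greece is kept even though it has no match in the right dataframe, so its gdp value is NaN.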

I also started a project on Kaggle with the dataset “Red wine quality”, which I think will help me understand and apply some of the concepts of linear regression and classification methods. I also want to read again about regression in the book “Introduction to Statistical Learning” to better understand the concepts through practical applications.

11-09-2020

Today I continued the intermediate Python course in Dataquest and I will probably do it along with the statistics course.

This explains the difference between .iloc and .loc in pandas very clearly https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different
Uncomment in Python: Ctrl + /
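
A short sketch of the .loc and .iloc difference mentioned above, on a made-up dataframe whose index labels are deliberately not 0, 1, 2 so that labels and positions cannot be confused.

```python
import pandas as pd

# Made-up dataframe with index labels 10, 20, 30.
df = pd.DataFrame({"major": ["Math", "Music", "Stats"]}, index=[10, 20, 30])

by_label = df.loc[10, "major"]   # row whose index label is 10
by_position = df.iloc[0, 0]      # first row, first column, by position

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc[10:20]      # rows labelled 10 and 20
position_slice = df.iloc[0:1]    # only the first row

print(by_label, by_position, len(label_slice), len(position_slice))  # Math Math 2 1
```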

Also, I learned about frequency distributions in Python and differences between histograms and barplots (I refreshed my memory on these). Here are some of those differences:

Histograms help us visualize continuous values using bins while bar plots help us visualize discrete values.
The locations of the bars on the x-axis matter in a histogram, but they don’t in a simple bar plot.
Lastly, bar plots also have gaps between the bars, to emphasize that the values are discrete.
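
The difference also shows up in how the heights of the bars are computed. Here is a sketch without any plotting, with made-up majors and salaries: a bar plot counts each distinct value, while a histogram first groups continuous values into bins. The bin edges are an assumption chosen by hand.

```python
from collections import Counter

# Discrete values: count each distinct category (what a bar plot shows).
majors = ["Math", "Music", "Math", "Stats", "Math"]
bar_heights = Counter(majors)

# Continuous values: group into equal-width bins (what a histogram shows).
salaries = [21.5, 23.0, 29.9, 30.0, 38.2, 44.7]
bin_edges = [20, 30, 40, 50]
bin_width = 10
hist_counts = [0] * (len(bin_edges) - 1)
for s in salaries:
    # Each bin includes its left edge, so 30.0 falls into the second bin.
    hist_counts[int((s - bin_edges[0]) // bin_width)] += 1

print(dict(bar_heights))  # {'Math': 3, 'Music': 1, 'Stats': 1}
print(hist_counts)        # [3, 2, 1]
```

Reordering the majors changes nothing about the bar plot, but the bins of the histogram only make sense in their numeric order.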

Another interesting plot for exploratory analysis is the scatter matrix plot (the scatter_matrix function).

Finally, I uploaded this https://github.com/AngelosTheodorakis/Data_Analysis_Projects/tree/master/Visualizing%20Earnings%20Based%20On%20College%20Majors to GitHub, as part of a project on Dataquest, where I worked on some plots in Python (histograms, barplots etc.). It is the first project I have uploaded in Jupyter notebook format.