Worked today with Dataquest and did the statistics courses to refresh my memory. So some observations are the following:
- Simple random sampling is not a reliable sampling method when the sample size is small. Because sample means vary a lot around the population mean, there’s a good chance we’ll get an unrepresentative sample.
- When we do simple random sampling, we should try to get a sample that is as large as possible. A large sample decreases the variability of the sampling process, which in turn decreases the chances that we’ll get an unrepresentative sample.
To ensure we end up with a sample that has observations for all the categories of interest, we can change the sampling method. We can organize our data set into different groups, and then do simple random sampling for every group. We can group our data set by player position, and then sample randomly from each group. This sampling method is called stratified sampling, and each stratified group is also known as a stratum.
Also another question that I found useful in a Dataquest exercise (In python) is the following:
Question: Create a column in affordable_apps called affordability. It should have the value cheap if the price is lower than 5, and reasonable otherwise.
Answer: affordable_apps[“affordability”] = affordable_apps.apply(lambda x: “cheap” if x[“Price”] < 5 else “reasonable”, axis=1)