8 Fundamental Statistical Concepts for Data Science

Erik
9 min read · Jan 31, 2021

… explained in plain English

Photo by ThisisEngineering RAEng on Unsplash

Statistics is “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data”. Throw programming and machine learning into the mix and you have a pretty good description of the core skills for data science.


Statistics is used in almost all aspects of data science: to analyse, transform and clean data; to evaluate and optimise machine learning algorithms; and to present insights and findings.

The field of statistics is extremely broad, and determining exactly what you need to learn, and in what order, can be difficult. Additionally, a lot of the material for learning this subject is very complex and can be quite difficult to digest, particularly if you don’t have an advanced maths degree and are transitioning into data science from a field such as software engineering.

In the following article, I am going to introduce eight fundamental statistical concepts you need to grasp when learning data science. These are not particularly advanced techniques, but they are a selection of the basic requirements you need to know before moving on to more complex methods.

1. Statistical sampling

In statistics, the entire set of raw data that you may have available for a test or experiment is known as the population. For a number of reasons, you cannot necessarily measure patterns and trends across the entire population. Statistics therefore allows us to take a sample, perform some computations on that set of data and, using probability and some assumptions, understand trends for the entire population or predict future events with a reasonable degree of certainty.


Let’s say, for example, that we want to understand the prevalence of a disease such as breast cancer across the entire population of the United Kingdom. For practical reasons, it is not possible to screen everyone. Instead, we may take a random sample and measure the prevalence among that group. Assuming our sample is sufficiently randomised and representative of the entire population, we can obtain a reliable estimate of the true prevalence.
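
To make this concrete, here is a minimal sketch in Python using NumPy. The population is simulated, and its 5% prevalence is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated population: 1 = has the condition, 0 = does not.
# The 5% prevalence is an illustrative assumption, not a real figure.
population = rng.binomial(n=1, p=0.05, size=1_000_000)

# Take a random sample and estimate the prevalence from it alone.
sample = rng.choice(population, size=10_000, replace=False)

print(f"True prevalence:      {population.mean():.4f}")
print(f"Estimated prevalence: {sample.mean():.4f}")
```

With a sufficiently large, random sample, the two figures will typically be very close.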

2. Descriptive statistics

Descriptive statistics, as the name suggests, help us to describe the data. In other words, they enable us to understand its underlying characteristics. Descriptive statistics don’t predict, assume or infer anything; they simply describe what the data sample we have looks like.

Descriptive statistics are derived from calculations known as summary statistics. These include:

  • Mean — the central value, commonly called the average.
  • Median — the middle value when the data is ordered from low to high and divided exactly in half.
  • Mode — the value which occurs most often.
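
A quick sketch using Python’s built-in statistics module (the data values are made up for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 8, 8, 10]

print(statistics.mean(data))    # 6 - the average
print(statistics.median(data))  # 7 - the middle value when sorted
print(statistics.mode(data))    # 8 - the most frequent value
```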

3. Distributions

Descriptive statistics are useful, but they can hide important information about the data set. For example, if a data set contains several values that are much larger than the rest, the mean may be skewed and will not give us a true representation of the data.

A distribution is a chart, often a histogram, that displays the frequency with which each value appears in a data set. This type of chart gives us information about the spread and skewness of the data.

A distribution will usually form a curve-like graph. This may be skewed more to the left or right.

Distribution of blood volume from the “transfusion” data set. Image by author.

In some cases, the curve may not be as smooth.

Distribution of the frequency of blood donations. Image by author.

One of the most important distributions is the normal distribution, commonly referred to as the bell curve due to its shape. It is symmetrical, with most of the values clustering around the central peak and the remaining values distributed evenly on either side of it. Many variables in nature form a normal distribution, such as people’s heights and IQ scores.

A normal distribution. Image by author.
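
As a rough illustration, the sketch below generates one approximately normal variable and one right-skewed variable and plots their distributions as histograms. The parameters are arbitrary choices, not real data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic data: a symmetric bell-shaped variable (e.g. heights in cm)
# and a right-skewed variable for comparison.
normal_data = rng.normal(loc=170, scale=10, size=5_000)
skewed_data = rng.exponential(scale=2.0, size=5_000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(normal_data, bins=40)
ax1.set_title("Approximately normal")
ax2.hist(skewed_data, bins=40)
ax2.set_title("Right-skewed")
plt.show()
```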

4. Probability

Probability, in simple terms, is the likelihood of an event occurring. In statistics, an event is the outcome of an experiment, which could be something like the roll of a die or the result of an A/B test.

The probability of a single event is calculated by dividing the number of ways the event can occur by the total number of possible outcomes. Take rolling a six on a die: there are 6 possible outcomes and only one way to roll a six, so the probability is 1/6 = 0.167. This is sometimes expressed as a percentage: 16.7%.

Events can be either independent or dependent. With dependent events, a prior event influences the subsequent event. Let’s say we have a bag of M&M’s and we want to determine the probability of randomly picking a red one. If we removed each selected M&M from the bag, the probability of picking red would change with every draw due to the effect of prior events.

Independent events are not affected by prior events. In the case of the bag of M&M’s, if we put each M&M back in the bag after selecting it, the probability of selecting red would remain the same each time.

Whether an event is independent or not is important, as the way in which we calculate the probability of multiple events changes depending on the type.

The probability of multiple independent events is calculated by simply multiplying the probabilities of the individual events. In the example of the die roll, say we wanted to calculate the chance of rolling a six three times in a row. This would look like the following:

1/6 × 1/6 × 1/6 = 0.167 × 0.167 × 0.167 ≈ 0.005
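
In Python this is a one-liner; the sketch below simply mirrors the arithmetic above:

```python
# Probability of a single event: favourable outcomes / possible outcomes.
p_six = 1 / 6
print(f"P(six) = {p_six:.3f}")  # 0.167

# Independent events multiply: three sixes in a row.
print(f"P(three sixes) = {p_six ** 3:.3f}")  # 0.005
```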

The calculation is different for dependent events, where we use what is known as conditional probability. Returning to the M&M’s example, imagine we have a bag containing only two colours, 3 red and 2 yellow, and we want to calculate the probability of picking two reds in a row. On the first pick, the probability of picking a red is 3/5 = 0.6. On the second pick we have removed one M&M, which happened to be red, so the probability is now 2/4 = 0.5. The probability of picking two reds in a row is therefore 0.6 × 0.5 = 0.3.
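
We can check this both analytically and by simulation, assuming the 3-red, 2-yellow bag described above:

```python
import random

# Analytic answer: P(red, then red) = 3/5 * 2/4.
print(3 / 5 * 2 / 4)  # 0.3

# Sanity check by simulation: draw two M&M's without replacement.
bag = ["red"] * 3 + ["yellow"] * 2
trials = 100_000
hits = sum(random.sample(bag, 2) == ["red", "red"] for _ in range(trials))
print(hits / trials)  # approximately 0.3
```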

5. Bias

As we have previously discussed, in statistics we frequently use samples of data to make estimates about the whole population. Similarly, in predictive modelling, we use training data to build a model that can make predictions about new data.

Bias is the tendency of a statistical or predictive model to over- or underestimate a parameter. This is often due to the method used to obtain a sample or the way that errors are measured. There are several types of bias commonly found in statistics. Here is a brief description of two of them.

  1. Selection bias — this occurs when the sample is selected in a non-random way. In data science, examples include stopping an A/B test early or selecting training data for a machine learning model from a single time period, which could mask seasonal effects (see the simulation after this list).
  2. Confirmation bias — this occurs when the person performing some analysis has a predetermined assumption about the data. In this situation, there can be a tendency to spend more time examining variables that are likely to support this assumption.
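
Here is a minimal simulation of selection bias. The seasonal sales figures are invented purely for illustration; the point is that sampling from only one season systematically underestimates the population mean, while a truly random sample does not:

```python
import numpy as np

rng = np.random.default_rng(1)

# An invented population where the value differs by season.
summer = rng.normal(loc=100, scale=15, size=50_000)
winter = rng.normal(loc=160, scale=15, size=50_000)
population = np.concatenate([summer, winter])

random_sample = rng.choice(population, size=1_000, replace=False)
summer_only = rng.choice(summer, size=1_000, replace=False)

print(f"Population mean:    {population.mean():.1f}")
print(f"Random sample mean: {random_sample.mean():.1f}")  # close to population
print(f"Summer-only mean:   {summer_only.mean():.1f}")    # biased low
```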

6. Variance

As we discussed earlier in this article, the mean of a data sample is its central value. Variance measures how far each value in the data set is from the mean; essentially, it is a measure of the spread of the numbers in a data set.

Standard deviation is a common measure of variation for data that has a normal distribution. It is a calculation that gives a value to represent how widely distributed the values are. A low standard deviation indicates that the values tend to lie quite close to the mean, whilst a high standard deviation indicates that the values are more spread out.

If the data does not follow a normal distribution, other measures of variation are used, most commonly the interquartile range. This measurement is derived by first ordering the values by rank and then dividing the data points into four equal parts, called quartiles, each containing 25% of the data points. The interquartile range is the difference between the boundaries of the two central quartiles, known as Q1 (the 25th percentile) and Q3 (the 75th percentile), and is calculated as Q3 − Q1.

A boxplot provides a useful visualisation of the interquartile range. Image by author.
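
A short sketch computing both measures with NumPy (the data values are arbitrary; note the deliberate outlier of 40):

```python
import numpy as np

data = np.array([4, 7, 9, 11, 12, 15, 18, 21, 25, 40])

std = data.std(ddof=1)                   # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range

print(f"Standard deviation: {std:.2f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```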

7. Bias/Variance tradeoff

The concepts of bias and variance are very important in machine learning. When we build a machine learning model we use a sample of data known as the training data set. The model learns patterns in this data and generates a mathematical function that maps a set of inputs (X) to the correct target label or value (y).

When generating this mapping function the model will use a set of assumptions to better approximate the target. For example, the linear regression algorithm assumes a linear (straight line) relationship between the input and the target. These assumptions generate bias in the model.

As a computation, bias is the difference between the mean prediction generated by the model and the true value.

If we were to train a model using different samples of training data, we would get variation in the predictions returned. Variance in machine learning is a measure of how large this variation is.

In machine learning bias and variance make up the overall expected error for our predictions. In an ideal world, we would have both low bias and low variance. However, in practice minimizing bias will usually result in an increase in variance and vice versa. The bias/variance trade-off describes the process of balancing these two errors to minimise the overall error for a model.
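
The following sketch illustrates the trade-off under some simplifying assumptions: the true relationship is taken to be sin(x), and polynomial degree stands in for model complexity. Fitting many models on different training samples, the straight-line model shows higher bias, while the high-degree polynomial shows higher variance at the same test point:

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = np.sin        # assumed "true" relationship, for illustration only
x_test = 1.5            # the point where we compare predictions
preds = {1: [], 9: []}  # degree 1 (rigid) vs degree 9 (flexible)

# Train on many different samples and record each model's prediction
# at the same test point.
for _ in range(200):
    x = rng.uniform(0, np.pi, 15)
    y = true_fn(x) + rng.normal(0, 0.2, 15)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, x_test))

for degree, p in preds.items():
    p = np.array(p)
    bias = p.mean() - true_fn(x_test)  # mean prediction vs true value
    print(f"degree {degree}: bias = {bias:+.3f}, variance = {p.var():.3f}")
```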

8. Correlation

Correlation is a statistical technique that measures the relationship between two variables. The standard correlation coefficient assumes the relationship is linear (forming a straight line when displayed on a graph) and is expressed as a number between +1 and -1.

A correlation coefficient of +1 denotes a perfect positive correlation (when the value of one variable increases, the value of the second variable also increases), a coefficient of 0 denotes no correlation, and a coefficient of -1 denotes a perfect negative correlation.
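
A brief sketch using NumPy’s corrcoef on synthetic data, constructed to show the two extremes:

```python
import numpy as np

rng = np.random.default_rng(7)

x = rng.normal(size=500)
y_related = 2 * x + rng.normal(scale=0.5, size=500)  # strong positive link
y_unrelated = rng.normal(size=500)                   # no relationship

print(np.corrcoef(x, y_related)[0, 1])    # close to +1
print(np.corrcoef(x, y_unrelated)[0, 1])  # close to 0
```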

Statistics is a broad and complex field, and this article is meant only as a brief introduction to some of the most commonly used statistical concepts in data science. Data science courses often assume prior knowledge of these basics, or start with descriptions that are overly complex and difficult to grasp. I hope this article acts as a refresher on a selection of fundamental statistical techniques before you move on to more advanced topics.
