Statistics for Data Science

Data science and statistics are closely related. If you want to become a data scientist, a basic understanding of statistics is essential. Libraries such as sklearn and tensorflow hide much of the mathematical complexity from the user, but it still pays to know the core concepts and to understand how these black-box libraries work. In this blog post, we'll cover some basic statistics for data science.

Statistics for Data Science: Probability Mass Function

The probability mass function (PMF) is defined for discrete random variables.
It gives the probability that a discrete random variable is exactly equal to some value.

The value of the random variable that has the largest PMF value is called the mode.
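As a small illustration, here is a sketch that builds the PMF of the sum of two fair six-sided dice and picks out its mode. The dice example and the function name pmf_two_dice are our own choices for illustration, not part of any library.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pmf_two_dice():
    """Return the PMF of the sum of two fair dice as {value: probability}."""
    # Enumerate all 36 equally likely outcomes and record each sum.
    sums = [a + b for a, b in product(range(1, 7), repeat=2)]
    counts = Counter(sums)
    total = len(sums)
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}


pmf = pmf_two_dice()

# The mode is the value with the largest PMF value.
mode = max(pmf, key=pmf.get)

print(pmf[7])  # P(X = 7), prints 1/6
print(mode)    # prints 7
```

Exact fractions are used here so that the probabilities sum to exactly 1, which floating-point numbers would only approximate.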

Statistics for Data Science: Probability Density Function

The probability density function (PDF) is defined for continuous random variables.

The value of the PDF at any point in the sample space is interpreted as the relative likelihood that the random variable takes a value near that point. Note that the probability of a continuous random variable taking any single exact value is 0, because it can take infinitely many values. Probabilities are obtained by integrating the PDF over an interval.
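To make the difference between a density and a probability concrete, here is a minimal sketch using NormalDist from Python's standard statistics module. The normal distribution and the helper interval_probability are assumptions chosen for illustration; any continuous distribution would do.

```python
from statistics import NormalDist

# Standard normal distribution, chosen only as an example.
standard = NormalDist(mu=0.0, sigma=1.0)

density_at_0 = standard.pdf(0.0)  # relative likelihood, about 0.3989
density_at_2 = standard.pdf(2.0)  # smaller, about 0.0540

# A narrow distribution has a peak density above 1,
# so a density value is not itself a probability.
narrow = NormalDist(mu=0.0, sigma=0.1)
peak_density = narrow.pdf(0.0)    # about 3.989


def interval_probability(dist, a, b, steps=10_000):
    """Approximate P(a < X <= b) by integrating the PDF with the midpoint rule."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width


prob = interval_probability(standard, -1.0, 1.0)       # about 0.6827
point_prob = interval_probability(standard, 1.0, 1.0)  # zero-width interval gives 0.0
```

Shrinking the interval to a single point drives the probability to 0, which is the sense in which the absolute likelihood of any exact value is 0.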

Statistics for Data Science: Cumulative Distribution Function

By definition, the cumulative distribution function (CDF) of a random variable X, evaluated at x, is the probability that X takes a value less than or equal to x:

$F_X(x) = P(X \le x)$

For a continuous distribution, the cumulative distribution function gives the area under the probability density function of X from minus infinity to x.
Hence, the probability of X lying in the range (a, b] can be calculated as:
$P(a < X \le b) = F_X(b) - F_X(a)$

The cumulative distribution function is nondecreasing and right-continuous.
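The interval formula can be sketched in a few lines with NormalDist from Python's standard statistics module. The height distribution below (mean 170 cm, standard deviation 10 cm) is a made-up example, and prob_between is a hypothetical helper name.

```python
from statistics import NormalDist

# Hypothetical distribution of heights in cm, for illustration only.
heights = NormalDist(mu=170.0, sigma=10.0)


def prob_between(dist, a, b):
    """Return P(a < X <= b) as the difference of two CDF values."""
    return dist.cdf(b) - dist.cdf(a)


# Probability of a height within one standard deviation of the mean.
p = prob_between(heights, 160.0, 180.0)  # about 0.6827

# The CDF never decreases as x grows.
values = [heights.cdf(x) for x in (150.0, 160.0, 170.0, 180.0, 190.0)]
is_nondecreasing = all(lo <= hi for lo, hi in zip(values, values[1:]))
```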

Statistics for Data Science: Binomial Distribution

Consider an experiment in which only two outcomes are possible: success and failure. Let p be the probability of success in each trial, so that 1 - p is the probability of failure. Let the experiment be performed n times.

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent trials, each of which succeeds with probability p and fails otherwise.

The probability mass function of the binomial distribution, for k successes in n trials, is: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$

The cumulative distribution function of the binomial distribution is: $F(k; n, p) = P(X \le k) = I_{1-p}(n-k, 1+k)$, where I is the regularized incomplete beta function.
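Here is a minimal sketch of both functions in plain Python. The PMF is the number of ways to choose k successes out of n trials, times p to the power k, times (1 - p) to the power (n - k). The standard library has no incomplete beta function, so the CDF below simply sums the PMF from 0 to k, which gives the same value. The names binomial_pmf and binomial_cdf are our own.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """Probability of at most k successes, computed by summing the PMF."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Example: 10 fair coin flips, counting heads as successes.
pmf_3 = binomial_pmf(3, 10, 0.5)  # 120 / 1024 = 0.1171875
cdf_3 = binomial_cdf(3, 10, 0.5)  # 176 / 1024 = 0.171875
```

Summing the PMF is fine for small n. For large n, a library routine such as scipy.stats.binom is likely the better choice.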

Statistics for Data Science: Poisson Distribution

The Poisson distribution models the number of times an event occurs in a fixed interval of time, assuming the events occur independently and at a constant average rate.

The probability mass function of the Poisson distribution is:

$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$

where:
e = Euler's number, approximately 2.718
λ = expected value of the random variable
x = number of occurrences of the event (x = 0, 1, 2, ...)
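As a quick sketch, the Poisson PMF can be computed directly, assuming the standard form P(X = x) = λ^x e^(-λ) / x!. The function name poisson_pmf and the rate of 3 events per interval are illustrative choices.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x events when the expected count is lam."""
    return lam**x * exp(-lam) / factorial(x)


# Example: an average of 3 events per interval (a made-up rate).
p0 = poisson_pmf(0, 3.0)  # e^-3, about 0.0498
p3 = poisson_pmf(3, 3.0)  # about 0.2240

# The probabilities over all counts add up to 1; the tail beyond 50 is negligible here.
total = sum(poisson_pmf(x, 3.0) for x in range(51))
```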

This blog post highlighted some important concepts of statistics which are useful in data science. Hope you found it useful. Happy learning!
