Data science and statistics are closely related. If you want to become a data scientist, at least a basic understanding of statistics is essential. Although many libraries, such as sklearn and tensorflow, hide the mathematical complexity from the user, it always helps to know the core concepts and to understand how these black-box libraries work. In this blog post, we’ll cover some basic statistics for data science.

## Statistics for Data Science: Probability Mass Function

*Probability Mass Function (PMF)* is a concept of discrete random variables.

It denotes the probability of a discrete random variable being exactly equal to some value.

The value of the random variable that has the largest PMF value is called the *mode*.
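As a small illustration, the sketch below builds the PMF of the sum of two fair six-sided dice and reads off the mode. The dice example and the helper name `pmf_two_dice` are assumptions chosen for illustration; they are not part of the original post.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pmf_two_dice():
    """Return the PMF of the sum of two fair dice as {value: probability}."""
    # Count how many of the 36 equally likely outcomes produce each sum.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    total = sum(counts.values())
    return {value: Fraction(count, total) for value, count in counts.items()}


pmf = pmf_two_dice()

# The mode is the value with the largest PMF.
mode = max(pmf, key=pmf.get)

print(pmf[7])  # 1/6
print(mode)    # 7
```

Exact fractions are used so that the probabilities sum to exactly 1, which is a handy sanity check for any PMF.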

## Statistics for Data Science: Probability Density Function

*Probability Density Function* is a concept of continuous random variables.

The Probability Density Function (PDF) is a function whose value at any point in the sample space can be interpreted as the *relative likelihood* that the random variable takes a value near that point. Note that the *absolute likelihood* (the probability) of the random variable being exactly equal to any particular value is 0, because a continuous random variable can take infinitely many values. Probabilities are instead obtained by integrating the density over an interval.
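To make the "relative likelihood" reading concrete, here is a minimal sketch using `statistics.NormalDist` from the Python standard library. The normal distribution is chosen here only as a familiar example; the post does not single it out.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Density values are not probabilities; they only compare how likely
# values near one point are relative to values near another point.
density_at_0 = std_normal.pdf(0.0)  # about 0.3989
density_at_2 = std_normal.pdf(2.0)  # about 0.0540
ratio = density_at_0 / density_at_2  # values near 0 are ~7.4x as likely

# A density can exceed 1, which a probability never can.
narrow = NormalDist(mu=0.0, sigma=0.1)
peak_density = narrow.pdf(0.0)  # about 3.989

print(round(density_at_0, 4), round(ratio, 2), round(peak_density, 3))
```

The last value is a useful reminder that the PDF is a density: only its integral over an interval is bounded by 1.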

## Statistics for Data Science: Cumulative Distribution Function

By definition, the *Cumulative Distribution Function (CDF)* of a random variable X is the probability that the random variable X will take a value less than or equal to x.

F_{X}(x) = P(X <= x)

For the case of a continuous distribution, the cumulative distribution function gives the area under the probability density function graph of the random variable X from minus infinity to x.

Hence, the probability of X lying in the range (a, b] can be calculated as:

P(a < X <= b) = F_{X}(b) - F_{X}(a)
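The interval formula can be sketched in a few lines. The example below again assumes a standard normal distribution via `statistics.NormalDist`; the helper name `prob_in_interval` is hypothetical.

```python
from statistics import NormalDist


def prob_in_interval(dist, a, b):
    """Return P(a < X <= b) = F(b) - F(a) for an object exposing .cdf()."""
    return dist.cdf(b) - dist.cdf(a)


std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of falling within one standard deviation of the mean.
p = prob_in_interval(std_normal, -1.0, 1.0)
print(round(p, 4))  # 0.6827
```

Any distribution object with a `cdf` method could be passed in the same way, since the formula only relies on the CDF.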

The cumulative distribution function is a nondecreasing and right-continuous function.

Also, F_{X}(x) tends to 0 as x tends to minus infinity, and to 1 as x tends to plus infinity.

## Statistics for Data Science: Binomial Distribution

Consider an experiment in which only two outcomes are possible: success and failure. Let *p* be the probability of success in each experimental trial, so that *1-p* is the probability of failure in each trial. Let the experiment be performed *n* times.

Binomial distribution is a discrete probability distribution of the number of successes (each success has a probability p) in a sequence of n independent experimental trials, where the outcome of each experimental trial can be a success or failure.

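The Probability Mass Function of the Binomial distribution, i.e. the probability of exactly k successes in n trials, is:

P(X = k) = ^{n}C_{k} p^{k} (1-p)^{(n-k)}

The Cumulative Distribution Function of the Binomial distribution, i.e. the probability of at most k successes, is:

F_{X}(k) = P(X <= k) = I_{1-p}(n-k, 1+k)

where I denotes the regularized incomplete beta function.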
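The binomial formulas can be checked numerically with a short sketch. Rather than evaluating the incomplete beta function, the CDF below is computed by summing the PMF over 0..k, which gives the same value; the function names are illustrative only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k), computed by summing the PMF over 0..k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Ten fair coin flips (n = 10, p = 0.5).
print(binomial_pmf(3, 10, 0.5))  # 0.1171875  (120 / 1024)
print(binomial_cdf(3, 10, 0.5))  # 0.171875   (176 / 1024)
```

Summing the PMF is fine for small n; for large n a library routine based on the incomplete beta function would typically be more efficient.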

## Statistics for Data Science: Poisson Distribution

The Poisson distribution models the number of times an event occurs in a fixed interval of continuous time, assuming the events occur independently and at a constant average rate.

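The probability mass function of the Poisson distribution is:

P(X = x) = (λ^{x} e^{-λ}) / x!

Where,

e = Euler’s number, approximately 2.718

λ = Expected value of the random variable (the average number of occurrences in the interval)

x = Number of occurrences of the event (x = 0, 1, 2, ...)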
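As a rough sketch, the standard Poisson PMF, P(X = x) = λ^x e^(-λ) / x!, can be written directly with the standard library. The rate of 3 events per interval is an assumed example value, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with expected value lam."""
    return lam**x * exp(-lam) / factorial(x)


lam = 3.0  # assumed average number of events per interval

# Probability of observing exactly 2 events in the interval.
p_two = poisson_pmf(2, lam)
print(round(p_two, 4))  # 0.224

# Sanity checks: the PMF sums to (nearly) 1 and its mean is lam.
total = sum(poisson_pmf(x, lam) for x in range(50))
mean = sum(x * poisson_pmf(x, lam) for x in range(50))
```

Truncating the sums at 50 terms is an approximation, but for a small rate such as 3 the remaining tail is negligible.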

This blog post highlighted some important concepts of statistics which are useful in data science. Hope you found it useful. Happy learning!
