You can perform statistical analysis with the help of Excel. It is used by most of the data scientists who require the understanding of statistical concepts and behaviour of the data. But when the data set is huge or you need some specialized data analysis model such as linear or regression, you should go for advanced tools such as Python, R programming. Here, we will go through the basic concept of statistical analysis and will apply the concepts to our own data.
Before starting, you need to check whether Excel Analysis ToolPak is enabled in Excel or not (it is an add-in provided by Microsoft Excel). To check whether it is enabled or not, go to Excel → Data and check whether data analysis option is there or not on the top right corner. If it is not there, go to Excel → File → Options → Add-in and enable the Analysis ToolPak by selecting the Excel Add-ins option in manage tab and then, click GO. This will open a small window; select the Analysis ToolPak option and enable it.
These are the tests you can perform using Excel Statistical Analysis.
You can find descriptive analysis by going to Excel→ Data→ Data Analysis → Descriptive statistics. It is the most basic set of analysis that can be performed on any data set. It gives you the general behaviour and pattern of the data. It is helpful when you a have a set of data and want to have the summary of that dataset. This will show the following statistic data for the chosen dataset.
- Mean, Standard error and Median
- Median, Mode and Standard Deviation
- Sample Variance
- Kurtosis and Skewness
- Range, Minimum, Maximum, Sum and Count
ANOVA (Analysis Of Variance)
It is a data analysis method which shows whether the mean of two or more data set is significantly different from each other or not. In other words, it analyses two or more groups simultaneously and finds out whether any relationship is there among the groups of data set or not. For example, you can use ANOVA if you want to analyse the traffic of three different cities and find out which one is more efficient in handling the traffic (or if there are no significant differences among the traffic).
You will find three types of ANOVA in the Excel
- ANOVA single factor
- ANOVA two factor with replication
- ANOVA two factor without replication
If you have three groups of datasets and want to check whether there is any significant difference between these groups or not, you can use ANOVA single factor.
If the P-value in the ANOVA summary table is greater than 0.05, you can say that there is a significant difference between the groups.
Moving average is usually applicable for time series data such as stock price, weather report, attendance in class etc. For example, it is heavily used in stock price as a technical indicator. If you want to predict the stock price of today, the last ten days data would be more relevant than the last 1 year. So, you can plot the moving average of the stock having a 10-day time period and you can then predict the price to some extent. The same applies to the temperature of a city. The recent temperature of a city can be calculated by taking the average of last few weeks rather than previous months.
Rank and Percentile
It calculates the ranking and percentile in the data set. For example, if you are managing a business of several products and want to find out which product is contributing to a higher revenue, you can use this rank method in Excel.
In the left table, we have our data on the revenues of different products. And we want to rank this data of products based on their revenue.
With the help of rank and percentile, we can get the table shown on the right. You can observe that now the data is sorted and respective rank is also marked with each data.
Percentile shows the category in which the data belongs, such as top 50%, top 30% etc. In the summary table, the rank of product 7 is 4. As the total number of data is 7, we can easily say that it belongs to the top 50% of the data.
Regression is a process of establishing a relationship among many variables. Usually, we establish a relationship between dependent variables and independent variables. For example, cases when you want to see if there is any increase in the revenue of product, which is not due to increase in the advertisement.
This is the window you will get once you click regression option in data analysis. Here, you have to provide a dependent variable in input Y range and an independent variable in Input X range. In our example, we have revenue of the product as a dependent variable and spending on the advertisement as the independent variable (If you have a label in the data, you can mark the checkbox of the labels). And at last, provide the range of cells where you want to see the output.
Random Number Generator
Although you can find a simple function to generate a series of random numbers, this option in data analysis gives you more flexibility in the random number generation process. It gives us more control over the generated data.
This is the screenshot of the random number generation window. In the number of variables field, it requires the number of columns in which you need the random number to be generated. And in the number of random variables field, it requires the number of rows where you need random data to be filled. If I give a number of variable 10 and the number of random variables as 3, it will give the following result.
You can also select the type of data being generated such as Uniform, Normal, Bernoulli, Binomial, Poisson etc.
This option is the data analysis tool which is used for creating samples from a huge population. You can randomly select data from the dataset or select every nth item from the set. For example, if you want to measure the effectiveness of a call centre employee in a call centre, you can use this tool to randomly select few data every month and listen to their recorded calls and give a rating based on the selected call.
This is a screenshot of the sampling option in data analysis. In input range, you have to give the reference of input population data set (check the Labels checkbox if your data has labels). Next, you have to specify the sampling method. If you are selecting periodic sampling method, you have to give the number as a period. So, it will create a sample from the population taking the nth data from the population. For example, if your period is 5, it will select every 5th value from the dataset and create a sample. And if you are choosing at random, it will randomly select an n number of the data from the population dataset. The second method can give a closer idea to the actual population as the data is being selected randomly but there are chances of duplicate data in the sample dataset. And lastly, you have to specify the output location from the output options.
The statistical analysis tool is one of the most important features in Excel. It is always recommended to use this option whenever you get any kind of data to have a better understanding of it. As you saw the above examples, you can get a general sense of the data using descriptive analysis, calculate the moving average of a data to predict the future data of the dataset, rank each data of the dataset and find the percentile of data, test the two groups of the data, generate some specific kind of large random data and find relations between the two data, using the regression. There are some more options available which we are not covering here but if you wish, you may read about these on the official website of Microsoft Excel.