A Beginner Guide For Data Scientist

Hypothesis testing applies a statistical model to a question from the real world. You start by stating an assumption about the result. This assumption is called the null hypothesis.
After stating the assumption, you run an experiment to test it. Based on the results of the experiment, you either reject or fail to reject the null hypothesis.

For example, suppose the results of the experiment lead you to reject the null hypothesis. You can then say that the data support the alternate hypothesis, which is mutually exclusive with the null hypothesis.
The phrase “reject or fail to reject” is important to understand. In hypothesis testing, we never prove a hypothesis. We only reject or fail to reject the null hypothesis.

How to Convert a Real World Problem to a Hypothesis?

Step 1: At the start of the experiment, assume the null hypothesis is true. Based on the experiment, you will reject or fail to reject the null hypothesis.

Step 2: Only if the data you have collected do not support the null hypothesis do you turn to the alternative hypothesis.

Step 3: If you fail to reject the null hypothesis, the data are consistent with the original assumption.

Let’s understand this with a real-life example.

Suppose there is a claim that “a product has an average weight of 5.6 kg”.

Null Hypothesis: the average weight is equal to 5.6 kg. $H_{0}: \mu = 5.6$
Alternate Hypothesis: the average weight is not equal to 5.6 kg. $H_{1}: \mu \neq 5.6$

If you want to show that a claim is true, put the opposite of the claim in the null hypothesis and the claim itself in the alternate hypothesis.
For example, take the claim “This Machine Learning course improves coding skills.”
Null Hypothesis: Old Coding Skills >= New Coding Skills
Alternate Hypothesis: Old Coding Skills < New Coding Skills

Keep in mind that the null hypothesis always contains an equality sign (=, <=, >=) and the alternate hypothesis contains the complementary sign (!=, >, <).

How to reject the Null Hypothesis?

After assuming the null hypothesis, you run an experiment and record the results. Assume the null hypothesis is true. If the probability of observing results at least as extreme as yours is very small (less than or equal to 0.05), you reject the null hypothesis. Here 0.05 is the level of significance ($\alpha$). If the significance level is not mentioned in the problem statement, assume the default of 0.05.

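The level of significance ($\alpha$) is the area in the tail (or tails) of the distribution assumed under the null hypothesis. It is the probability of rejecting the null hypothesis when it is actually true.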

Hypothesis Example of a Fair Coin

Let’s take as the null hypothesis that a coin is fair, so that heads and tails are equally likely. Suppose we run an experiment, flip the coin 20 times in a row, and every flip comes up heads. If the null hypothesis were true, this result would be extremely unlikely.
Here the level of significance is 0.05, and it is the area inside the tail of the distribution under the null hypothesis.
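As a rough sketch of the coin example, the snippet below computes how likely 20 heads in 20 flips would be if the coin really were fair, and compares that probability with the 0.05 significance level. It assumes the experiment produced heads on every flip, which is one reading of the example rather than something the text states precisely. The variable names are illustrative.

```python
# Probability of 20 heads in 20 flips if the null hypothesis (fair coin) is true.
alpha = 0.05                    # level of significance used in the text
n_flips = 20                    # number of flips in the experiment
p_all_heads = 0.5 ** n_flips    # independent flips, each with P(heads) = 0.5

# Reject the null hypothesis when the observed result is this unlikely.
reject_null = p_all_heads < alpha

print(f"P(20 heads | fair coin) = {p_all_heads:.2e}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The probability is far below 0.05, so under this assumed outcome you would reject the null hypothesis that the coin is fair.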

If $\alpha$ is 0.05 and the alternative hypothesis states that the parameter is less than the null value, H(1): < H(0), you consider the left tail of the normal distribution, and the area of that tail is 0.05. In the same way, if $\alpha$ is 0.05 and the alternative hypothesis states that the parameter is greater than the null value, H(1): > H(0), you consider the right tail of the normal distribution, and the area of that tail is 0.05.

Finally, if the alternative hypothesis is that the parameter is not equal to the null value, H(1): != H(0), the two tails share the area equally. That means an area of 0.025 in the left tail and 0.025 in the right tail.

The boundaries of these areas are the critical values, which are expressed as z scores.
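The tail areas can be turned into critical z values with the inverse of the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with the 0.05 level from the text. The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1
alpha = 0.05                # level of significance

# Left-tailed test: the whole area alpha sits in the left tail.
left_critical = std_normal.inv_cdf(alpha)
# Right-tailed test: the whole area alpha sits in the right tail.
right_critical = std_normal.inv_cdf(1 - alpha)
# Two-tailed test: alpha / 2 in each tail, so the cut-offs are +/- this value.
two_tailed_critical = std_normal.inv_cdf(1 - alpha / 2)

print(round(left_critical, 3), round(right_critical, 3), round(two_tailed_critical, 3))
```

These are the same numbers you would read from a z table: about -1.645, 1.645 and 1.96.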

Before testing a hypothesis you should be clear about the following terms.

Mean and Proportion

Whenever you want to find the average or some specific value in a population, you are dealing with means. When the statement is about a percentage, or uses words such as “most” or “least”, you are dealing with proportions.

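When you are dealing with a mean and the population standard deviation ($\sigma$) is known, the formula for the z score is:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

When you are dealing with proportions, use the following formula:

$$z = \frac{\hat{p} - p_{0}}{\sqrt{p_{0}(1 - p_{0}) / n}}$$

Here $\bar{x}$ is the sample mean, $\mu$ is the population mean under the null hypothesis, $n$ is the sample size, $\hat{p}$ is the sample proportion and $p_{0}$ is the proportion under the null hypothesis.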
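The two z score calculations can be sketched as small Python functions. For a mean, the z score is the difference between the sample mean and the hypothesized mean divided by the standard error, sigma / sqrt(n). For a proportion, it is the difference between the sample proportion and the hypothesized proportion divided by sqrt(p0 (1 - p0) / n). The function names and the sample numbers below are made up for illustration and are not from the article.

```python
from math import sqrt

def z_for_mean(sample_mean, pop_mean, pop_sd, n):
    """z score for a sample mean when the population standard deviation is known."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))

def z_for_proportion(sample_prop, hyp_prop, n):
    """z score for a sample proportion against a hypothesized proportion."""
    return (sample_prop - hyp_prop) / sqrt(hyp_prop * (1 - hyp_prop) / n)

# Hypothetical numbers: 64 products weigh 5.5 kg on average against a
# claimed mean of 5.6 kg, with an assumed population sd of 0.4 kg.
z_mean = z_for_mean(5.5, 5.6, 0.4, 64)
# Hypothetical numbers: 55% of 100 respondents against a hypothesized 50%.
z_prop = z_for_proportion(0.55, 0.50, 100)

print(round(z_mean, 3), round(z_prop, 3))
```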

There are two ways you can test for the hypothesis.

  1. Traditional Test
  2. P value test.

Traditional Test

You use the level of significance to determine the critical values, and then compare the test statistic with the critical values.

P value test

In the P value test, you first use the test statistic to find the P-value, and then compare the P-value with the level of significance ($\alpha$).

If the P-value is lower than $\alpha$, you reject the null hypothesis H0. If the P-value is higher than $\alpha$, you fail to reject H0.
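The P value rule can be written as a short helper. This is a sketch that assumes a z test, so the P-value is the standard normal tail area beyond the test statistic. The helper names `p_value` and `decide` are hypothetical.

```python
from statistics import NormalDist

def p_value(z, tail):
    """Tail area of the standard normal beyond z; tail is 'left', 'right' or 'two'."""
    cdf = NormalDist().cdf(z)
    if tail == "left":
        return cdf
    if tail == "right":
        return 1 - cdf
    if tail == "two":
        return 2 * min(cdf, 1 - cdf)
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(p, alpha=0.05):
    """Apply the decision rule: a low P-value rejects H0."""
    return "reject H0" if p < alpha else "fail to reject H0"

print(decide(p_value(-1.0, "left")))   # z = -1.0 is an illustrative test statistic
```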

The following examples walk through each testing method.

 

Hypothesis Testing for the Mean

Suppose an E-commerce company wants to increase its sales by improving website performance. Currently, the mean download time for the website ($\mu$) is 3.125 and the standard deviation ($\sigma$) is 0.700. The level of significance ($\alpha$) is 0.01. A sample of 40 new pages is tested and has a mean download time ($\bar{x}$) of 2.875. Are the new pages faster than before?

Step-by-Step Method for Testing

Step 1: Find all the values before the testing.

Mean, $\mu$ = 3.125

Standard Deviation, $\sigma$ = 0.70

Level of Significance, $\alpha$ = 0.01

Sample Size, n = 40

Sample Mean, $\bar{x}$ = 2.875

Step 2: State the Null Hypothesis and the Alternative Hypothesis

Null Hypothesis

$H_{0}: \mu \geq 3.125$

Alternative Hypothesis

$H_{1}: \mu < 3.125$

Step 3: Set the level of significance.

Here it is $\alpha$ = 0.01

Step 4: Determine the type of the test.

Here the null hypothesis is $H_{0}: \mu \geq 3.125$, so the Alternate Hypothesis is $H_{1}: \mu < 3.125$. Thus this is a left-tailed test, rather than a right-tailed or two-tailed one.

type of the test

In the traditional method, you compute the test statistic (the z score) using the formula below and compare it with the critical value.

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Putting in all the values gives Z = -2.259. Then look up the critical value in the z table for $\alpha$ = 0.01. You will get Z = -2.326.

Thus you fail to reject the Null Hypothesis, because the test statistic (-2.259) is greater than the critical value at the level of significance $\alpha$ (-2.326), so it does not fall in the left-tail rejection region.

For the P value test, look up Z = -2.26 in the Z table to get P = 0.0119. In this example P > 0.01, so we fail to reject the null hypothesis. You cannot say that the new pages of the website are statistically faster.
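The whole download-time example can be reproduced in a few lines. This sketch uses only the values given in the example and `statistics.NormalDist` in place of a z table lookup. Because it works from the unrounded test statistic, the P-value can differ slightly in the last digit from a table lookup at a rounded z. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, alpha = 3.125, 0.70, 0.01   # population mean, sd, level of significance
n, x_bar = 40, 2.875                   # sample size and sample mean

# Test statistic for the mean with a known population standard deviation.
z = (x_bar - mu) / (sigma / sqrt(n))

# Traditional test: left-tail critical value for alpha.
z_critical = NormalDist().inv_cdf(alpha)
reject_traditional = z < z_critical

# P value test: area to the left of the test statistic.
p = NormalDist().cdf(z)
reject_p_value = p < alpha

print(f"z = {z:.3f}, critical value = {z_critical:.3f}, p = {p:.4f}")
print("Reject H0" if reject_p_value else "Fail to reject H0")
```

Both methods agree: the test statistic is not beyond the critical value and the P-value is above 0.01, so you fail to reject the null hypothesis.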

 

Hypothesis Testing for the Proportion

An E-commerce company surveys 400 of its customers and finds that 58% of the sample are teenagers. The company concludes that most of its customers are teenagers. Is that a fair conclusion?

Step 1: Find all the values and the proportion before the testing.

The claim is that most customers are teenagers, which means more than 50%. So the proportion under the null hypothesis is 50%, and the sample proportion is 58%.

Sample Size, n = 400

Step 2 : State the Null Hypothesis and the Alternative Hypothesis

Null Hypothesis

$H_{0}: P \leq 0.5$

Alternative Hypothesis

$H_{1}: P > 0.5$

Step 3: Set the Significance Level.

Here in this example, it is not mentioned, therefore you will use the default that is 0.05.

Step 4: Determine the type of test.

In $H_{1}: P > 0.5$, the alternate hypothesis uses “greater than”, so you consider the right tail of the normal distribution.

Step 5: Calculate the Test Statistic using the following formula

$$z = \frac{\hat{p} - p_{0}}{\sqrt{p_{0}(1 - p_{0}) / n}}$$

Here,

$\hat{p} = 0.58$, Sample Proportion

$p_{0} = 0.50$, Proportion under the Null Hypothesis

On solving, you will get Z = 3.2

Step 6: Now look up the Z value at $\alpha = 0.05$. You will get Z = 1.645. The Z value for the tested sample is 3.2, which is greater than the critical value. So you reject the null hypothesis and can say that most customers are teenagers.
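The proportion example can be checked the same way. The sketch below uses the numbers from the example (58% of 400 customers, a null proportion of 50%, and the default 0.05 level) and a right-tailed z test. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p_hat, p0 = 0.58, 0.50    # sample proportion and proportion under H0
n, alpha = 400, 0.05      # sample size and level of significance

# Test statistic for a proportion.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Right-tailed test: critical value and P-value come from the right tail.
z_critical = NormalDist().inv_cdf(1 - alpha)
p = 1 - NormalDist().cdf(z)

reject_null = z > z_critical
print(f"z = {z:.2f}, critical value = {z_critical:.3f}, p = {p:.5f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```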

Conclusion

Hypothesis testing is a standard method for drawing conclusions about a population from sample data. Researchers use it to finalize their analysis by testing and rejecting hypotheses. You can also apply these tests to real-world and everyday problems.

If you liked this tutorial and want to ask something on this topic, please contact us. You can also send us suggestions. Don’t forget to subscribe to get more articles on Hypothesis Testing and Statistics.


What is a Probability Distribution ? Determine its Type for Your Data

Probability distribution is an important topic that every data scientist should know for analyzing data. A probability distribution describes all the possible outcomes of a random variable and how likely each one is. In this article you will learn the types of probability distribution, which will help you determine the distribution of your dataset. There are two types of distribution.

  1. Discrete Distribution
  2. Continuous Distribution

In a discrete distribution, the sum of the probabilities of all the individual outcomes is equal to one. In a continuous distribution, there is a probability curve, and the area under the curve is equal to one.

Types of Discrete Distribution

A discrete distribution is described by a probability mass function. The following are the types of discrete distribution covered here.

  1. Uniform Distribution
  2. Binomial Distribution
  3. Poisson Distribution

Uniform Distribution

It is a type of discrete distribution in which all the outcomes have the same probability (uniform). For example, if you roll a die, the sample space is {1,2,3,4,5,6} and the probability of getting each number is 1/6, that is about 0.167. The sample space here consists of discrete values. You will also notice that the numbers are equally spaced and that each has the same probability. When you add all the probabilities, you get 1.

Rolling a Fair Dice
Uniform Distribution
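The die example can be checked with exact fractions. This short sketch builds the probability mass function of a fair six-sided die and confirms that the probabilities add up to 1. The variable names are illustrative.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]   # sample space of a fair die

# Every face gets the same probability, 1/6.
pmf = {face: Fraction(1, len(faces)) for face in faces}

total = sum(pmf.values())    # should be exactly 1
print(pmf[3], float(pmf[3]), total)
```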

Binomial Distribution

In this distribution, each trial has two discrete outcomes that are mutually exclusive. Mutually exclusive means that the two outcomes cannot both occur in the same trial. Below are examples of binomial outcomes.

  • Head or Tail
  • On or Off
  • Sick or healthy
  • Success or failure

Bernoulli Trial

It is a random experiment in which there are exactly two outcomes. One is a success and the other is a failure.

When you repeat the same Bernoulli trial n times, the number of successes follows a Binomial Distribution, provided that the probability of success (p) is constant and the trials do not depend on one another. Both conditions must be satisfied.

In a binomial distribution, you calculate the probability mass function. Suppose you have n trials and the probability of success in each trial is p. The probability mass function for x successes in n trials is given below.

$$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$

Here x is the number of successes, with x = 0, 1, ..., n.
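To see the binomial probability mass function at work without any extra library, it can be sketched with `math.comb` from the Python standard library (Python 3.8 or later). The function name `binomial_pmf` and the coin example are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Illustrative example: exactly 3 heads in 10 flips of a fair coin.
prob_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities over all possible x add up to 1.
total = sum(binomial_pmf(x, 10, 0.5) for x in range(11))

print(prob_three_heads, total)
```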

How to calculate Binomial Distribution in Excel?

You can calculate the binomial distribution using the following syntax.
=BINOM.DIST(x,n,p,FALSE)

Where,
x, the number of successes
n, the total number of trials
p, the probability of success.
FALSE returns the probability mass function, and TRUE returns the cumulative probability.

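Calculate Binomial Distribution in Python

from scipy.stats import binom
x, n, p = 3, 10, 0.5  # example values (assumed): 3 successes in 10 trials
print(binom.pmf(x, n, p))  # probability of exactly x successes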

Poisson Distribution

In the binomial distribution, we focus on the number of successes in a fixed number of trials. In the Poisson distribution, we focus on the number of successes per continuous unit, such as the number of successes per unit of time.

Calculation of Poisson Probability Mass Function.

Before calculating the Poisson probability mass function, you have to find the mean expected value ($\mu$), which is assigned to $\lambda$, the number of occurrences per interval. Remember that the interval here is the continuous unit.

The formula for the Poisson Probability Mass function is:

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

Mean Expected Value

$$E(X) = \mu = \lambda$$

 

In a probability distribution, you should also know the term cumulative mass function. It is the sum of the discrete probabilities up to a given value. For example, in a Poisson distribution the probability of fewer than 4 events is:

$$P(X < 4) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = \sum_{x=0}^{3} \frac{\lambda^{x} e^{-\lambda}}{x!}$$
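The Poisson probability mass function, lambda^x e^(-lambda) / x!, and the cumulative sum for fewer than 4 events can be sketched with the standard library alone. The function names and the rate of 2 events per interval are assumptions made for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when the mean rate per interval is lam."""
    return lam**x * exp(-lam) / factorial(x)

def poisson_fewer_than(k, lam):
    """Cumulative probability of fewer than k events: P(X = 0) + ... + P(X = k - 1)."""
    return sum(poisson_pmf(x, lam) for x in range(k))

lam = 2  # assumed mean number of events per interval
prob_fewer_than_4 = poisson_fewer_than(4, lam)
print(round(prob_fewer_than_4, 4))
```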

Real World Example of Poisson Distribution

The number of the Orders in a particular time interval. Many E-commerce companies use it for finding the number of orders received during an interval.

How to calculate Poisson distribution in Excel?

You can calculate the Poisson distribution in Excel using the following function.

= POISSON.DIST(x,lambda,FALSE)

Here, x is the number of successes.

The last argument controls whether you get the cumulative mass function. If it is TRUE, the function returns the cumulative probability of at most x successes. If it is FALSE, it returns the probability of exactly x successes. You need the cumulative form for questions phrased as “at most”, “at least”, “greater than”, “not zero”, etc.

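Calculate Poisson Distribution in Python

from scipy.stats import poisson
x, lamda = 3, 2  # example values (assumed): 3 events, mean rate of 2 per interval
print(poisson.pmf(x, lamda))  # probability of exactly x events
print(poisson.cdf(x, lamda))  # cumulative probability of at most x events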

Continuous Distribution

A continuous probability distribution is described by a probability density function. Three common types of continuous probability distribution are listed below. This article covers the normal distribution.

  1. Normal Distribution
  2. Exponential Distribution
  3. Beta Distribution

Normal Distribution

In the normal distribution, the data points cluster around a central value, the mean, and the curve takes the shape of a bell.

bell curve
Bell Curve

Keep in mind that in discrete distributions the sum of all the probabilities is equal to one. In the normal distribution (a probability density function), the area under the bell curve is 1.

With the normal distribution curve we can only state probabilities over a range of outcomes, not for a single exact value.

The mean, mode and median are all equal in the normal distribution.

Standard Normal Distribution

A Normal Distribution is a Standard Normal Distribution (SND) when the mean ($\mu$) is 0 and sigma ($\sigma$) is equal to 1. From the SND graph, 68.27% of the values lie between -sigma and +sigma. In the same way, 95.45% of the values lie between -2sigma and +2sigma.

standard normal distribution
Standard Normal Distribution
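The 68.27% and 95.45% figures are areas under the standard normal curve, so they can be checked by subtracting cumulative probabilities. The sketch below uses `statistics.NormalDist` from the Python standard library. The variable names are illustrative.

```python
from statistics import NormalDist

snd = NormalDist(mu=0, sigma=1)   # standard normal distribution

# Area between -sigma and +sigma, and between -2sigma and +2sigma.
within_one_sigma = snd.cdf(1) - snd.cdf(-1)
within_two_sigma = snd.cdf(2) - snd.cdf(-2)

print(f"{within_one_sigma:.2%}  {within_two_sigma:.2%}")
```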

A normal distribution can also be symmetric around a mean that is not zero, with sigma not equal to one. The Normal Distribution is very useful for studying a population. If a population approximately follows a normal distribution, we can find its mean and standard deviation and use them to make inferences about the population.

In real-life examples, you will mostly model data with a general normal distribution. You can then convert the values to the standard normal distribution to calculate a percentile. The formula for the normal distribution is:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$$

You know that the standard normal distribution has a mean of 0 and a sigma of 1. Therefore you have to find the z score to convert a normal distribution to the standard normal distribution.

The formula for the z score is

$$z = \frac{x - \mu}{\sigma}$$
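Converting a value to a z score and then to a percentile can be sketched as follows. The z score is the value minus the mean, divided by sigma. The test score, mean and sigma below are made-up numbers for illustration, and the variable names are hypothetical.

```python
from statistics import NormalDist

# Hypothetical test scores that are roughly normal with mean 70 and sigma 10.
mean, sigma = 70, 10
score = 85

# Standardize the value, then read the percentile from the standard normal.
z = (score - mean) / sigma
percentile = NormalDist().cdf(z)

print(f"z = {z}, percentile = {percentile:.4f}")
```

The same percentile comes out if you skip the conversion and evaluate `NormalDist(mean, sigma).cdf(score)` directly, which is a quick way to check the z score step.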

Real Life Example of the Normal Distribution

  • Measurement of People’s Height and Weight
  • Measuring the Blood Pressure
  • Test Scores – Percentile
  • Measurement Errors

How to calculate Normal Distribution in Excel?

The following Excel functions can be used with the standard normal distribution.

If you have the z score, you can find the cumulative probability using the formula below.

= NORMSDIST(B2)

If you have the cumulative probability, you can find the z score using the formula below.

= NORMSINV(B2)

Calculate Normal Distribution in Python

from scipy import stats
z, p = 1.96, 0.975  # example values (assumed)
# If you have the z score, you can find the cumulative probability.
print(stats.norm.cdf(z))
# If you have the cumulative probability, you can find the z score.
print(stats.norm.ppf(p))

Conclusion

The concept of a distribution is important for a data scientist, especially the Normal Distribution. You will see distributions used in many real-life examples. The concepts are useful when sampling data from a population. We hope you have understood the topic.

If you would like to suggest improvements to this article, please contact us. You can also subscribe to get updates on data science.

Thanks

Data Science Learner Team


Learn 15 Sampling Methods for Data Scientist

Statistics is the science that deals with developing and studying methods for collecting, analyzing and interpreting data. Put another way, in statistics we have data and want to learn something from it. As you study statistics you will use the terms Population and Sample again and again. Therefore you must clearly understand these terms before learning the types of sampling methods.

Population ( Country )

The population is every member of the group you want to study.

Sample ( State )

The sample is a subset of the population. It means you select some members from the population, for example at random.

Parameter and Statistic

A parameter is a characteristic of the population. A statistic is a characteristic of the sample, and we use it to make statistical inferences from the sample about the population.

You should also know the term Variable. It is a characteristic that describes a member of the sample, for example Age, Salary, Gender, and Place. A variable can be discrete or continuous. For example, Gender and Place are discrete, and Age and Salary are continuous.

Sampling

Sampling is a technique for inferring the results for the entire population by studying the results of a sample taken from that population.

Sampling Figure
Image Source: Wikipedia

There are two types of sampling methods

  1. Probability Sampling Method
  2. Non Probability Sampling Method

Probability Sampling Method

In probability sampling, every member of the population has a known, non-zero probability of being selected. In the simplest case, each member has an equal chance of selection, so that the sample reflects the population.

Types of Probability Sampling Method

  1. Simple Sampling
  2. Systematic Sampling
  3. Stratified Sampling
  4. Cluster Sampling

Simple Sampling

In simple sampling, all the members of the population have an equal chance of selection. It means you select members for the sample purely at random.

simple sampling visual representation
Image Source:Elgin Community College
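Simple random sampling can be sketched with `random.sample`, which draws members without replacement so that each member is equally likely to be picked. The population of 100 numbered members and the sample size of 10 are made up for illustration.

```python
import random

population = list(range(1, 101))   # hypothetical member IDs 1..100
rng = random.Random(0)             # fixed seed only so the run is repeatable

# Draw 10 distinct members, each with the same chance of selection.
sample = rng.sample(population, 10)
print(sample)
```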

Systematic Sampling

As the name suggests, in Systematic Sampling you pick members from the population through a well-defined system to make a sample. For example, you select every 10th student from the class register, starting from a randomly chosen position. You have a fixed rule, and you pick the sample according to that rule.

Systematic sampling
Image Source: Wikipedia
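One common form of systematic sampling takes every k-th member of an ordered list, starting from a random position within the first k members. The sketch below assumes that form. The function name `systematic_sample` and the population of 100 numbered members are hypothetical.

```python
import random

def systematic_sample(population, k, seed=None):
    """Take every k-th member, starting at a random offset in the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # random starting position: 0 .. k-1
    return population[start::k]

population = list(range(1, 101))  # hypothetical ordered list of member IDs
sample = systematic_sample(population, k=10, seed=1)
print(sample)
```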

Stratified Sampling

Stratified sampling can represent the groups within a population better than Simple Sampling. You first stratify the population, that is, divide it into ordered or categorized groups called strata. Each member belongs to exactly one stratum. You then choose members from each stratum to make the sample.

Stratified sampling
Image Source: Wikipedia
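Stratified sampling can be sketched by grouping the members into strata and then drawing at random from every stratum. The helper `stratified_sample` and the student data below are hypothetical, and drawing the same number from each stratum is only one possible allocation.

```python
import random
from collections import defaultdict

def stratified_sample(members, stratum_of, per_stratum, seed=None):
    """Draw `per_stratum` members at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)   # group members by stratum
    return {name: rng.sample(group, per_stratum) for name, group in strata.items()}

# Hypothetical population: (student_id, grade) pairs, 20 students in each of 3 grades.
students = [(i, f"grade_{i % 3 + 1}") for i in range(60)]
sample = stratified_sample(students, stratum_of=lambda s: s[1], per_stratum=5, seed=42)

for grade, chosen in sorted(sample.items()):
    print(grade, [student_id for student_id, _ in chosen])
```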

Cluster Sampling

It is a very important sampling technique. In this sampling, you divide the population into groups called clusters. You then take a simple random sample of the clusters (stage 1) and study the members of the selected clusters.

Cluster sampling
Image Source: Wikipedia

If you then divide the selected clusters into smaller clusters and again select at random from them, that is stage 2. This is called multi-stage sampling, and it is done to reduce the cost of sampling. This sampling method is generally used for marketing purposes.
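A single stage of cluster sampling can be sketched as below. It assumes the common form in which whole clusters are drawn at random and every member of a chosen cluster is kept. The function name `cluster_sample` and the customers grouped by city are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick `n_clusters` clusters at random and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # stage 1: choose clusters
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical clusters: customer IDs grouped by city.
clusters = {
    "city_a": [1, 2, 3],
    "city_b": [4, 5],
    "city_c": [6, 7, 8, 9],
    "city_d": [10],
}
sample = cluster_sample(clusters, n_clusters=2, seed=7)
print(sample)
```

For a second stage, you could apply a random draw again inside each chosen cluster instead of keeping every member.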

Non Probability Sampling Method

In non-probability sampling, the members of the population do not have a known or equal chance of selection. It is also called non-representative sampling.

Types of Non-Probability Sample Method

  1. Quota Sampling
  2. Convenience Sampling
  3. Judgment Sampling
  4. Purposive Sampling

Quota Sampling

In quota sampling, you divide the population into categories and give each category a weightage. You choose members from the population to make the sample, keeping the weightage in mind.

Example:

Let’s say you have 100 people, of whom 2% are upper class, 10% are middle class and 30% are lower class. Then in quota sampling, you select members so that 2% of the sample are from the upper class, 10% from the middle class and 30% from the lower class.

Convenience Sampling

In convenience sampling, you select members from the population according to your convenience.

Example

In a box there are 100 colored balls, of which 50 are red, 10 are green and 5 are yellow. Let’s say that for my convenience I do not choose yellow balls for sampling, and choose only from the 50 red and 10 green balls. This is called convenience sampling: choosing samples at my convenience.

Purposive Sampling

In Purposive Sampling, you select members for the sample on the basis of the objective of the study. It is very useful when you want to reach a particular, targeted sample quickly. It is also known as Judgemental Sampling.

Types of Purposive Sampling

Purposive Sampling has the following major types.

  1. Maximum Variation/Heterogeneous Purposive Sample
  2. Homogeneous Purposive Sample
  3. Typical Case Sampling
  4. Extreme/Deviant Case Sampling
  5. Critical case Sampling
  6. Total Population Sampling
  7. Expert Sampling

Maximum Variation/Heterogeneous Purposive Sample

As the name suggests, in this sampling technique we select cases that are as varied (heterogeneous) as possible for a particular event or phenomenon. Its main purpose is to gain insight into an event or issue from many angles.

Example:

A researcher wants to understand an issue in society. He/She polls many different kinds of people in the society to collect their views on that issue.

Homogeneous Purposive Sample

In this sampling, you select members of the population who share a set of characteristics.

Example:

You have made a protein powder as a fitness supplement, and you want to know how useful it is for bodybuilding and what people think of its quality. So you ask only the people who are using that protein powder. Here the shared characteristic is being a bodybuilder who uses the protein powder.

Typical Case Sampling

Here you select the sample from typical cases. It means you relate the sampling results to the selected cases and find the relation between them.

Example:

You want to study the effects of a changed curriculum on average students. You then make a sample from the average students (the typical case) to find the relationship.

Extreme/Deviant Case Sampling

This sampling is used when the researcher wants to study outliers, that is, how unusual cases diverge from the normal results previously obtained.

Critical case Sampling

Here you sample a specific case, and the researcher assumes that the result will be the same for similar cases.

Total Population Sampling

Here the researcher studies a set of characteristics across the entire population. It is mostly used to review an event or phenomenon and to identify groups that share the same characteristics within the population.

Expert Sampling

As the name suggests, you are an expert in a field and select the sample on the basis of your own knowledge.

Conclusion

These are the main sampling methods. As a researcher, you should know these sampling techniques before you start collecting sample data. Data scientists analyze large amounts of data, and these methods help them understand the data sample and its effect on the results.

We hope you liked the article on “Types of Sampling Method”. If you have any query or suggestion, you can reach us by contacting us. In the meantime, you can subscribe and like our Data Science Learner Page.

Thanks

Data Science Learner Team

 
