Ans:- Statistics has two main branches, namely:
- Descriptive Statistics: This summarizes the data from a sample using an
index such as the mean or standard deviation. Its methods cover organizing,
displaying, and describing the data.
- Inferential Statistics: This draws conclusions from data that are subject
to random variation, such as observational errors and sampling variation.
Q2). Enumerate various fields where statistics can be used?
Ans:- Statistics is used in many different kinds of research
fields. The fields in which statistics is used include:
- Science
- Technology
- Biology
- Computer Science
- Chemistry
- Business
It is also used in the following areas:
- Providing comparisons
- Explaining an action that has already occurred
- Predicting a future result
- Estimating quantities that are not known
Q3). What is the difference between Data Science and Statistics?
Ans:- Data Science is a science that is led by data. It is an
interdisciplinary field that combines scientific methods, algorithms, and
processes for extracting insights from data. The data can be either
structured or unstructured. Data science has much in common with data mining,
as both extract useful information from data. Data science also draws on
mathematical statistics, computer science, and their applications. It is by
combining statistics, visualization, applied mathematics, and computer science
that data science can convert a vast amount of data into insights and
knowledge. Statistics thus forms a main part of data science: it is the branch
of mathematics concerned with the collection, analysis, interpretation,
organization, and presentation of data.
Q4). What is the meaning of correlation and covariance in
statistics?
Ans:- Correlation and covariance are two mathematical concepts that are
widely used in statistics. Both describe the relationship between two random
variables and measure the dependency between them. Although the two terms do
similar work, they are quite different from each other.
- Correlation: It is a standardized measure of the quantitative relationship
between two variables. Correlation measures how strongly two variables are
related, on a unitless scale from -1 to 1.
- Covariance: It is a measure that shows the extent to which two random
variables vary together. It describes a statistical relationship between a
pair of random variables, where a change in one variable is accompanied by a
corresponding change in the other. Its value depends on the units of the
variables.
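To make the distinction concrete, here is a minimal Python sketch that computes both quantities from scratch. The function names and the sample data are hypothetical, chosen only for illustration; the covariance here is the sample version, which divides by n - 1.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]          # hypothetical data
scores = [52, 55, 61, 64, 70]    # hypothetical data
print(covariance(hours, scores))   # depends on the units of both variables
print(correlation(hours, scores))  # unitless, always between -1 and 1
```

Rescaling either variable (say, hours to minutes) changes the covariance but leaves the correlation unchanged.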
Q5). What is Bayesian?
Ans:- The Bayesian approach conditions on the data that were actually
observed and considers a probability distribution over the hypotheses.
Q6). What is Frequentist?
Ans:- The Frequentist approach conditions on a chosen hypothesis and considers
the probability distribution of the data, whether it is observed or not.
Q7). What is the Likelihood?
Ans:- The probability of the observed outcomes under specific parameter
values is regarded as the likelihood of that set of parameter values given the
observed outcomes.
Q8). What is P-value?
Ans:- In statistical significance testing, the p-value is the probability
of obtaining a test statistic at least as extreme as the one that was actually
observed, under the assumption that the null hypothesis is true.
Q9). Explain P-value with the help of an example?
Ans:- Suppose an experiment shows a coin turning up heads 14 times in 20
flips. Here is what is derived:
- Null hypothesis (H0): the coin is fair
- Observation O: 14 heads out of 20 flips
- P-value of observation O given H0 = Prob(≥ 14 heads or ≥ 14 tails) = 0.115
The p-value exceeds 0.05, so the observation is consistent with the null
hypothesis. In other words, the observed result of 14 heads in 20 flips can be
attributed to chance alone, as it falls within the range of what would happen
95% of the time if the coin were fair. In this example, we fail to reject the
null hypothesis at the 5% level. Although the coin did not fall evenly, the
deviation from the expected outcome is small enough to be reported as "not
statistically significant at the 5% level".
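The figure of 0.115 can be checked with a short Python sketch. The function name is hypothetical, and the doubling step assumes a fair coin (so the two tails are symmetric) and an observed head count of at least half the flips.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Two-sided p-value under a fair coin: P(>= heads heads or >= heads tails)."""
    # Probability of seeing at least `heads` heads in `flips` fair flips.
    upper_tail = sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips
    # For a fair coin the tails-side probability is identical, so double it.
    return min(1.0, 2 * upper_tail)

p = two_sided_p_value(14, 20)
print(round(p, 3))  # 0.115
```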
Some more probability and statistics viva questions and answers.
Q10). What do you mean by sampling?
Ans:- Sampling is the part of statistical practice concerned with selecting
an unbiased or random subset of individual observations from a population, with
the aim of gaining knowledge about the population of concern.
Q11). What are the various methods of sampling?
Ans:- Sampling can be done by 4 broad methods:
- Simple random sampling: every member has an equal chance of selection
- Systematic sampling: taking every kth member of the population
- Cluster sampling: the population is divided into groups or clusters, and
whole clusters are selected
- Stratified sampling: the population is divided into mutually exclusive
groups (strata), and a sample is drawn from each group
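The four methods can be sketched in Python with the standard `random` module. The function names, the population of 100 integers, and the grouping into tens are all hypothetical choices for illustration, not a prescribed procedure.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance of being picked."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random start, then take every kth member."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a random sample from every stratum."""
    return [m for stratum in strata for m in rng.sample(stratum, n_per_stratum)]

rng = random.Random(42)                    # seeded for repeatability
population = list(range(100))              # hypothetical population
groups = [population[i:i + 10] for i in range(0, 100, 10)]

print(simple_random_sample(population, 5, rng))
print(systematic_sample(population, 10, rng))
print(cluster_sample(groups, 2, rng))
print(stratified_sample(groups, 2, rng))
```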
Q12). What do you mean by Mode?
Ans:- The mode is the element of the data sample that appears most often
in the collection.
x = [1 5 5 6 3 2];
mode(x) % returns 5, the most frequent value
Q13). What do you mean by Median?
Ans:- The median is the numerical value that separates the higher half of
a sample, a population, or a probability distribution from the lower half. For
a finite list of numbers, the median is found by arranging all the observations
from the lowest to the highest value and picking the middle one. When the
number of observations is even, the mean of the two middle values is taken.
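As a quick check, the mode and the median can be computed with Python's standard `statistics` module. The sample reuses the six values from the mode question.

```python
import statistics

x = [1, 5, 5, 6, 3, 2]

# 5 appears twice, more often than any other value.
print(statistics.mode(x))    # 5

# Sorted: 1 2 3 5 5 6; with an even count the two middle values (3 and 5)
# are averaged.
print(statistics.median(x))  # 4.0
```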
Q14). What are the four main things we should know before
studying data analysis?
Ans:- Here is the list of those four main things, one should know
before studying data analysis:
- Inferential statistics
- Descriptive statistics
- Distributions (both normal and sampling distributions)
- Hypothesis testing
Q15). Most common characteristics used in descriptive
statistics?
Ans:- Center, spread, shape, and outliers are the most common
characteristics used in descriptive statistics.
- Center: the middle of the data. Mean, median, and mode are the most
commonly used measures.
- Spread: how the data is dispersed. Range, IQR, variance, and standard
deviation are the most commonly used measures.
- Shape: the shape of the data can be symmetric or skewed.
- Outlier: an outlier is an abnormal value.
Q16). What is the meaning of standard deviation?
Ans:- It represents how far the data points typically lie from the mean:
σ = √(∑(x − µ)² / n)
Variance is the square of the standard deviation.
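The formula can be traced step by step in Python. This is a minimal sketch of the population form (dividing by n, as in the formula above); the function name and data are hypothetical.

```python
import math
import statistics

def population_std(values):
    """Population standard deviation: square root of the mean squared deviation."""
    mu = sum(values) / len(values)
    variance = sum((v - mu) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data with mean 5
print(population_std(data))        # 2.0
print(statistics.pstdev(data))     # the standard library's population version
```

Note that `statistics.stdev` divides by n - 1 instead (the sample standard deviation), so it gives a slightly larger value on the same data.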
Q17). What is an outlier? And, mention one method to find
outliers?
Ans:- An outlier is an abnormal value, one that lies at an abnormal
distance from the rest of the data points.
The quartiles from the 5-number summary can be used to identify
outliers. A widely used rule flags any data point that lies more than
1.5 * IQR below Q1 or above Q3, where IQR = Q3 - Q1:
Lower bound = Q1 - (1.5* IQR)
Upper bound = Q3 + (1.5 * IQR)
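The two bounds translate directly into a small Python sketch. The function name and the data are hypothetical; the quartiles come from `statistics.quantiles` with its default method, and other quartile conventions can shift the bounds slightly.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))  # [100]
```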
Q18). What is the difference between a combination and a
permutation?
Ans:- A permutation of n elements is any arrangement of those n elements
in a definite order. There are n factorial (n!) ways to arrange n elements. The
total number of permutations of n things taken r at a time is the number of
ordered r-tuples of distinct elements that can be drawn from the n elements,
which is n! / (n - r)!.
Combinations refer to the number of ways to choose r out of n
objects where order does not matter. The total number of combinations of
n things taken r at a time is the number of subsets with r elements
of a set with n elements, which is n! / (r! * (n - r)!).
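A brief Python illustration of the difference, using `math.perm`, `math.comb`, and `itertools` from the standard library. The values n = 5, r = 3 and the letters are arbitrary examples.

```python
import math
from itertools import combinations, permutations

n, r = 5, 3
print(math.perm(n, r))  # 60 ordered arrangements: 5 * 4 * 3
print(math.comb(n, r))  # 10 unordered selections: 60 / 3!

items = "ABC"
# Order matters: ('A', 'B') and ('B', 'A') are different permutations.
print(list(permutations(items, 2)))
# Order does not matter: only ('A', 'B') appears among the combinations.
print(list(combinations(items, 2)))
```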
Q19). Statistics Questions: What is the Pareto principle?
Ans:- The Pareto principle is also known as the 80/20 rule. It
states that roughly 80% of the effects come from 20% of the causes. For
example, 80% of sales often come from 20% of customers.
Q20). How do you assess the statistical significance of an
insight?
Ans:- To determine statistical significance, you need to perform
hypothesis testing. The first step is to state the null hypothesis and the
alternative hypothesis. In the second step, you calculate the p-value, the
probability of obtaining results at least as extreme as those observed,
assuming that the null hypothesis is true. In the last step, you compare the
p-value with the chosen significance level (alpha); if the p-value is less than
alpha, you reject the null hypothesis.
Q21). Statistics Question and Answers: What do you mean by
skewness?
Ans:- Skewness measures the asymmetry of the data around its mean. If
skewness is negative, the data is spread out more to the left of the mean than
to the right (a longer left tail). If skewness is positive, the data is spread
out more to the right (a longer right tail).
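One common way to put a number on this is the moment coefficient of skewness, the third central moment divided by the cube of the standard deviation. The sketch below assumes that definition (other definitions apply small-sample corrections); the function name and data are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))      # symmetric data: 0.0
print(skewness([1, 1, 2, 2, 3, 10]))  # long right tail: positive
```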
Q22). Statistics Viva Question: What is the meaning of Covariance?
Ans:- Covariance is a measure of how two variables move in sync
with each other.
y2 = [1 3 4 5 6 7 8];
cov(x, y2) % returns a 2x2 matrix whose diagonal holds the variances;
           % x must be a vector with the same length as y2
Q23). Statistics Viva Question: What is the One-Sample test?
Ans:- A t-test is any statistical hypothesis test in which the test
statistic follows a Student’s t distribution if the null hypothesis is true.
A one-sample t-test checks whether the mean of a single sample differs from a
hypothesized value.
[h, p, ci] = ttest(y2, 0) % returns h = 1, p = 0.0018, ci = [2.6280 7.0863]
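The confidence interval reported above can be reproduced by hand in Python. This is a minimal sketch, not a full t-test: the function name is hypothetical, and the critical value 2.4469 (two-sided 95%, 6 degrees of freedom) is taken from a standard t-table rather than computed, since the standard library has no t distribution.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, sample mean, standard error) for H0: mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # stdev divides by n - 1
    return (mean - mu0) / se, mean, se

y2 = [1, 3, 4, 5, 6, 7, 8]
t_stat, mean, se = one_sample_t(y2, 0)

T_CRIT = 2.4469  # assumption: t-table value for 95% two-sided, df = 6
ci = (mean - T_CRIT * se, mean + T_CRIT * se)
print(round(t_stat, 2), [round(bound, 3) for bound in ci])
```

Because the t statistic is far larger than the critical value, the null hypothesis of a zero mean is rejected, which agrees with h = 1.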
Q24). Statistics Viva Question: What do you mean by Alternative
Hypothesis?
Ans:- The alternative hypothesis, which is represented by H1, is the
statement that holds true if the null hypothesis is false.
Q25). Statistics Question and Answer: What do you mean by
Significance Level?
Ans:- The probability of rejecting the null hypothesis when it is true is
known as the significance level α. Very common choices are α = 0.05 and
α = 0.01.
Q26). Statistics Viva Question: Give Examples of Central Limit
Theorem?
Ans:- Suppose that men's weights are normally distributed, with a mean of
173 lb and a standard deviation of 30 lb, and one has to find the probability
that:
- if one man is randomly selected, his weight is greater than 180 lb
- if 36 different men are randomly selected, their mean weight is more than
180 lb.
The solution will be:
For one man: z = (x − µ)/σ = (180 − 173)/30 = 0.23
For a normal distribution, P(Z > 0.23) = 0.4090
For the mean of 36 men: σx̄ = σ/√n = 30/√36 = 5
z = (180 − 173)/5 = 1.40
P(Z > 1.40) = 0.0808
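Both probabilities can be verified with `statistics.NormalDist` from the Python standard library, using the stated mean of 173 lb and standard deviation of 30 lb. The variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 173, 30, 36

# One randomly selected man: weights follow N(173, 30).
p_one_man = 1 - NormalDist(mu, sigma).cdf(180)

# Mean of 36 men: the sample mean follows N(173, 30 / sqrt(36)) = N(173, 5).
p_mean_of_36 = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(180)

print(round(p_one_man, 4), round(p_mean_of_36, 4))
```

The first result is about 0.408 rather than exactly 0.4090, because the hand calculation rounds z to 0.23 before consulting the table.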
Q27). Statistics Question and Answers: What is Binary Search?
Ans:- In any binary search, the array has to be sorted in either
ascending or descending order. In every step, the algorithm compares the search
key with the key of the middle element of the array. If the keys match, a
matching element has been found, and its index or position is returned.
Otherwise, if the search key is less than the key of the middle element, the
algorithm repeats the action on the sub-array to the left of the middle
element; if the search key is greater, it repeats on the sub-array to the
right.
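The steps described above can be sketched in Python as follows. This version assumes ascending order and returns -1 when the key is absent; both choices, like the function name, are illustrative conventions.

```python
def binary_search(sorted_items, key):
    """Return the index of key in an ascending list, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == key:
            return mid            # keys match: report the position
        if key < sorted_items[mid]:
            high = mid - 1        # continue in the left sub-array
        else:
            low = mid + 1         # continue in the right sub-array
    return -1                     # search range is empty: key not present

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))  # -1
```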
Q28). Statistics Questions: Can you throw more light on the Hash
Table?
Ans:- A hash table is a data structure that implements an associative
array, a structure that can map keys to values. A hash table uses a hash
function to compute an index into an array of buckets or slots, from which the
correct value can be obtained.
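A toy Python implementation shows the idea. It is a sketch only: the class and method names are hypothetical, collisions are handled by chaining (a list per bucket), and there is no resizing. In practice Python's built-in `dict` already provides a hash table.

```python
class HashTable:
    """Minimal hash table using separate chaining; illustrative only."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # The hash function maps a key to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("mean", 4.86)
table.put("n", 7)
print(table.get("mean"), table.get("missing"))
```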
Q29). Statistical Viva Questions and Answers: Explain the difference
between ‘long’ and ‘wide’ format data?
Ans:- In the wide format, the repeated responses of a subject fall in a
single row, and each response goes in a separate column. In the long format,
each row is one time point per subject. Wide-format data can be recognized by
the fact that its columns generally represent groups.
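The two layouts can be compared with a small pure-Python sketch. The column names (`subject`, `t1`, `t2`), the values, and the helper function are all hypothetical.

```python
# Wide format: one row per subject, one column per time point.
wide = [
    {"subject": "A", "t1": 5.1, "t2": 5.4},
    {"subject": "B", "t1": 4.8, "t2": 5.0},
]

def wide_to_long(rows, id_column, value_columns):
    """Turn each (subject, time point) pair into its own row."""
    return [
        {id_column: row[id_column], "time": column, "value": row[column]}
        for row in rows
        for column in value_columns
    ]

# Long format: one row per subject per time point.
long_rows = wide_to_long(wide, "subject", ["t1", "t2"])
for row in long_rows:
    print(row)
```

Two wide rows with two time points each become four long rows.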
Q30). Statistics Viva Questions and Answers: What is the meaning
of normal distribution?
Ans:- Data can be distributed in many ways, for example skewed to the left
or to the right. Often, however, the data is concentrated around a middle value
without any particular skew to the left or the right. Such data approximates
the normal distribution and forms a bell-shaped curve.
The normal distribution has the following properties:
- Unimodal, i.e. it has one mode.
- Symmetrical: the left and right halves are mirror images of each other.
- Bell-shaped, with a maximum height at the center.
- Mean, mode, and median are all located at the center.
- Asymptotic: the tails approach the horizontal axis but never touch it.
Q31). Data Science Statistics Viva Questions: What is the primary
goal of A/B testing?
Ans:- A/B testing refers to statistical hypothesis testing with two
variants, A and B. The primary goal of A/B testing is to identify changes to a
web page that maximize or increase the outcome of interest. A/B testing is a
practical method for finding the most suitable online promotional and marketing
strategies for a business. It is used for testing everything from website copy
to sales emails and search ads.
Q32). Statistics Viva Question: What is the meaning of statistical
power of sensitivity, and how is it calculated?
Ans:- The statistical power, or sensitivity, is used to validate the
accuracy of a classifier, which can be logistic regression, SVM, random forest,
etc. Sensitivity = correctly predicted true events / total true events.
Correctly predicted true events are those that are actually true and are also
predicted as true by the model.
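As a small illustration, sensitivity can be computed from two label lists in Python. The function name and the labels (1 = event, 0 = non-event) are hypothetical.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives: TP / (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn)

actual = [1, 1, 1, 1, 0, 0, 0, 1]     # hypothetical ground truth
predicted = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical model output

# 5 actual events, 4 of them predicted as events.
print(sensitivity(actual, predicted))  # 0.8
```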
Q33). Statistics Question and Answer: What is the central limit
theorem and why is it so important?
Ans:- The central limit theorem is quite powerful. It states that the
distribution of sample means approaches a normal distribution as the sample
size grows, whatever the shape of the original distribution.
For example, you take a sample from a data set and calculate the
mean of that sample. If you repeat this multiple times and plot all your
means and their frequencies on a graph, you will see a bell curve, also known
as a normal distribution. The mean of this distribution will closely resemble
that of the original data.
The central limit theorem is highly significant because it is used in
hypothesis testing and also to calculate confidence intervals.
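The repeated-sampling experiment described above can be simulated in a few lines of Python. This is a sketch under assumed settings: a uniform population on [0, 1), samples of size 30, 2000 repetitions, and a fixed seed for repeatability.

```python
import random
import statistics

rng = random.Random(0)  # seeded so the run is repeatable
SAMPLE_SIZE = 30
REPETITIONS = 2000

# Draw many samples from a (non-normal) uniform population and keep each mean.
sample_means = [
    statistics.mean([rng.random() for _ in range(SAMPLE_SIZE)])
    for _ in range(REPETITIONS)
]

# The means cluster around the population mean of 0.5 ...
print(statistics.mean(sample_means))
# ... with a spread close to sigma / sqrt(n) = sqrt(1/12) / sqrt(30), about 0.053.
print(statistics.stdev(sample_means))
```

A histogram of `sample_means` would show the bell shape, even though the underlying population is flat.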
Q34). Statistics Viva Question: What general conditions must
be satisfied for the central limit theorem to hold?
Ans:- The data should be sampled randomly.
The sample values must be independent of each other.
The sample size should be sufficiently large; as a rule of thumb, it
should be greater than or equal to 30.
Q35). Statistics Viva Question: How would you describe what a
‘p-value’ is to a non-technical person?
Ans:- The easiest way to describe a p-value to a non-technical person is
to convey it through an example. In practice, if the p-value is less than the
alpha, say 0.05, then a result at least this extreme would turn up less than 5%
of the time if chance alone were at work. In the same way, a p-value of 0.05
means that such a result would appear by chance about 5% of the time.
Q36). Statistics Viva Question: What is the difference between
observational and experimental data?
Ans:- The major difference lies in how the data are collected.
Observational data comes from observational studies, in which you observe
certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies, in which you
control some variables and hold them fixed to figure out if there is any
causality.
Q37). Data Science Statistics Viva Questions: What do we mean by –
making a decision based on comparing p-value with significance level?
Ans:- Here, we mean to say that:
If the p-value is greater than the significance level, then we fail
to reject H0.
But if the p-value is lower than the significance level, then we
reject H0.
Q38). What is the meaning of selection bias?
Ans:- It is the phenomenon of choosing individuals, groups of people, or
data for analysis in such a way that proper randomization is not achieved,
ultimately creating a sample that is not representative of the population.
It is important to understand selection bias because it can
skew results and lead to false insights about a particular
population group.
Q39). Statistics Questions: What is the meaning of six sigma in
statistics?
Ans:- Six sigma is a quality control method that aims at nearly error-free
or defect-free output. Sigma denotes the standard deviation. The larger the
standard deviation, the less likely the process is to perform accurately. A
process at the six sigma level keeps its nearest specification limit six
standard deviations away from the process mean, which makes it reliable enough
to produce almost defect-free work.
Q40). Statistics Viva Question: What is the Binomial Distribution
Formula?
Ans:- Here is the Binomial Distribution Formula:
b(x; n, P) = nCx * P^x * (1 – P)^(n – x)
- b stands for the binomial probability
- x stands for the number of successes
- P stands for the probability of success on an individual trial
- n stands for the number of trials
- nCx stands for the number of combinations of n things taken x at a time
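The formula translates directly into Python using `math.comb` for the nCx term. The function name is hypothetical, and the example values reuse the coin from the p-value question (14 heads in 20 fair flips).

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 14 heads in 20 flips of a fair coin.
print(binomial_pmf(14, 20, 0.5))

# The probabilities over all possible x sum to 1.
print(sum(binomial_pmf(x, 20, 0.5) for x in range(21)))
```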