Statistics 101 — Part 1 (Basics)

Vishnu Kakaraparthi
6 min read · Sep 29, 2018

Types of Data:

Nominal: Data whose categories have no implied ordering.

Other Example: Political affiliations of a population

Ordinal: Data that has a specified order, but no specified distance metric.

Other Example: Beverage sizes at McDonald’s (Small, Medium, Large)

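Interval: Data that has a specified order and measurable, equal distances between values, but no true zero point.

Example: Celsius temperature or calendar dates. The zero point is arbitrary, so 20 °C is not “twice as hot” as 10 °C.

Ratio: Same as interval, but with a true zero point that means “none of the quantity”, so ratios between values are meaningful.

Example: Height, weight, or a duration of time (seconds, minutes, etc.)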

Important Terms

Population: The entire group one desires information about.

Sample: A subset of the population, taken because the entire population is usually too large to analyze. Its characteristics are taken to be representative of the population.
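To make the population/sample distinction concrete, here is a minimal sketch using Python's standard library. The population (the integers 1 to 1000) and the sample size of 50 are assumptions chosen purely for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: the integers 1..1000.
population = list(range(1, 1001))

# Simple random sample of 50 values, drawn without replacement.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)  # the true value
sample_mean = statistics.mean(sample)          # an estimate of the true value

print(f"population mean: {population_mean}")
print(f"sample mean:     {sample_mean}")
```

The sample mean will usually land near the population mean but not exactly on it, which is the sense in which a sample is "taken to be representative".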

Mean Median Mode

Mean/Average: The sum of all the values in the sample/population divided by the number of values. μ is the mean of the population; x̄ is the mean of the sample.

Example:
Data: 13, 18, 13, 14, 13, 16, 14, 21, 13
Mean
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Median: The value separating the higher half of a sample/population from the lower half. Found by arranging all the values from lowest to highest and taking the middle one (or the mean of the middle two if there is an even number of values).

Example:
Data (Odd Number): 13, 13, 13, 13, 14, 14, 16, 18, 21
13, 13, 13, 13, (14), 14, 16, 18, 21
Median: 14
Data (Even Number): 13, 13, 13, 13, 13, 14, 14, 16, 18, 21
13, 13, 13, 13, (13, 14), 14, 16, 18, 21
Median: (13 + 14) / 2 = 13.5

Mode: The mode is the number that is repeated more often than any other.

Example:
Data: 13, 13, 13, 13, 14, 14, 16, 18, 21
Mode = 13
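The three worked examples above can be checked with Python's built-in `statistics` module. This is a small sketch that reuses the article's data; the variable names are only illustrative.

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # sorts internally, then takes the middle value
mode = statistics.mode(data)      # the most frequently occurring value

# With an even number of values, the median is the mean of the two middle ones.
even_data = [13, 13, 13, 13, 13, 14, 14, 16, 18, 21]
even_median = statistics.median(even_data)

print(mean, median, mode, even_median)
```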

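Variance: Measures dispersion around the mean. Determined by averaging the squared differences of all the values from the mean. The variance of a population is σ²; the variance of a sample is s², where the sum of squared differences is divided by n − 1 instead of n. The population variance can also be calculated by subtracting the square of the mean from the average of the squared values:

σ² = (Σx²)/N − μ²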

Standard Deviation: Square root of the variance. Also measures dispersion around the mean, but in the same units as the values (instead of square units, as with variance). σ is the standard deviation of the population and s is the standard deviation of the sample.
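As a rough illustration of variance and standard deviation, the sketch below computes both by hand for the data set used in the mean example (mean 15) and cross-checks the results against the `statistics` module. It also checks the shortcut of subtracting the square of the mean from the average of the squared values.

```python
import math
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]
n = len(data)
mu = sum(data) / n

# Population variance: average squared difference from the mean.
pop_var = sum((x - mu) ** 2 for x in data) / n

# Shortcut: mean of the squares minus the square of the mean.
pop_var_shortcut = sum(x * x for x in data) / n - mu ** 2

# Sample variance divides by n - 1 instead of n.
sample_var = sum((x - mu) ** 2 for x in data) / (n - 1)

# Standard deviation is the square root of the variance.
pop_std = math.sqrt(pop_var)
sample_std = math.sqrt(sample_var)

# The standard library agrees with the hand-rolled versions.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(sample_var, statistics.variance(data))

print(pop_var, pop_var_shortcut, sample_var, pop_std, sample_std)
```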

Standard Error: An estimate of the standard deviation of the sampling distribution (the distribution of a statistic over the set of all samples of size n that can be taken from a population). Reflects the extent to which a statistic changes from sample to sample. For the sample mean, it is estimated as s/√n.
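For the most common case, the standard error of the mean, the usual estimate is the sample standard deviation divided by the square root of the sample size. A minimal sketch, assuming the nine values from the mean example are a sample:

```python
import math
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]
n = len(data)

s = statistics.stdev(data)         # sample standard deviation (divides by n - 1)
standard_error = s / math.sqrt(n)  # standard error of the mean

print(standard_error)
```

A larger sample shrinks the standard error, because the denominator grows with the square root of n.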

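The best measure of central tendency depends on the type of data:

  • Nominal: Mode
  • Ordinal: Median
  • Interval/Ratio (not skewed): Mean
  • Interval/Ratio (skewed): Median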

Mean, Median and Mode for Grouped Frequencies

Covariance: Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
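A small sketch of the sample covariance calculation follows. The study-hours and exam-score values are hypothetical, and the helper name `covariance` is just an illustration, not a reference to any particular library function.

```python
def covariance(xs, ys):
    """Sample covariance: average product of paired deviations (n - 1 denominator)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

cov = covariance(hours, scores)
print(cov)  # positive: scores tend to rise as hours rise
```

The covariance of a variable with itself is its variance, which is one way to see the connection between the two ideas.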

Correlation: A correlation coefficient measures how strong the linear relationship is between two variables. The formulas return a value between -1 and 1.

  • A correlation coefficient of 1 means that for every increase in one variable, there is an increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
  • A correlation coefficient of -1 means that for every increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with the distance driven.
  • Zero means that an increase in one variable is not associated with a consistent increase or decrease in the other. The two have no linear relationship.
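The three cases above can be illustrated with a short sketch of the Pearson correlation coefficient, which is one common correlation formula. All data values here are made up for illustration, and `pearson` is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # y rises by a fixed amount per step in x
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # y falls by a fixed amount per step in x
r_zero = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared: no linear trend

print(r_positive, r_negative, r_zero)
```

The last case is worth noting: y depends entirely on x, yet the coefficient is zero, because the Pearson coefficient only captures linear relationships.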

Percentile and Quartiles:

Percentile: The rth percentile is the value such that r percent of the data fall at or below that value.

Example 1: If you score in the 75th percentile, then 75% of the population scored at or below your score.

Example 2: 22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, 99
If your score was 75, in what percentile did you score?
There were 14 scores reported and there were 4 scores at or below yours. We divide:
(4/14)*100 = 28.57
So you scored in the 29th percentile.

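The percentile calculation in Example 2 can be sketched as a small function. The name `percentile_rank` is hypothetical; it simply follows the "at or below" definition used above.

```python
def percentile_rank(values, score):
    """Percentage of values that fall at or below the given score."""
    at_or_below = sum(1 for v in values if v <= score)
    return at_or_below / len(values) * 100

scores = [22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, 99]

rank = percentile_rank(scores, 75)
print(round(rank, 2))  # 28.57
```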
  1. The second quartile (Q2) is the median or the 50th percentile. (The median splits the data into two halves.)
  2. The first quartile (Q1) is the median of the data that falls below the median. This is the 25th percentile.
  3. The third quartile (Q3) is the median of the data falling above the median. This is the 75th percentile.

The interquartile range is IQR = Q3 − Q1
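Here is a minimal sketch of the "median of each half" method described above, applied to the 14 scores from the percentile example. One assumption to flag: when the number of values is odd, this version leaves the middle value out of both halves. Other conventions exist, and library functions such as `statistics.quantiles` may return slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method (needs 2+ values)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median
    upper = ordered[half + n % 2:]  # values above the median (skips the middle one if n is odd)
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

scores = [22, 34, 68, 75, 79, 79, 81, 83, 84, 87, 90, 92, 96, 99]

q1, q2, q3 = quartiles(scores)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```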

Box Plot:

Null Hypothesis: The null hypothesis, H0 is the commonly accepted fact; it is the opposite of the alternate hypothesis (H1). Researchers work to reject, nullify or disprove the null hypothesis. Researchers come up with an alternate hypothesis, one that they think explains a phenomenon, and then work to reject the null hypothesis.

Example: Null hypothesis, H0: The world is flat. Alternate hypothesis, H1: The world is round.

p-value: A p-value is used in hypothesis testing to help you decide whether to reject the null hypothesis. It is the probability of seeing results at least as extreme as the ones observed, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

  • If p > .10 → “not significant”
  • If .05 < p ≤ .10 → “marginally significant”
  • If .01 < p ≤ .05 → “significant”
  • If p ≤ .01 → “highly significant”
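To show where a p-value might come from, the sketch below runs an exact two-sided binomial test on a hypothetical experiment: H0 says a coin is fair, and we observe 60 heads in 100 flips. The scenario, the function names, and the choice of test are all assumptions made for illustration. The `label` helper applies the conventional thresholds listed above.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """P-value for H0: the coin is fair (p = 0.5), two-sided."""
    observed_gap = abs(heads - flips / 2)
    # Count every outcome at least as far from flips/2 as the observed one.
    # Under H0 each sequence of flips is equally likely, so counting suffices.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= observed_gap)
    return extreme / 2 ** flips

def label(p):
    """Map a p-value to the conventional significance labels."""
    if p <= 0.01:
        return "highly significant"
    if p <= 0.05:
        return "significant"
    if p <= 0.10:
        return "marginally significant"
    return "not significant"

p = two_sided_binomial_p(60, 100)
print(p, label(p))
```

These cut-offs are conventions rather than laws, so a result just above or below a threshold deserves the same scrutiny either way.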

R-square: R-square measures how well the model fits the data. It is the proportion of the variation in the dependent variable that is explained by a set of independent variables.

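SST = Sum of Squares Total = Σ(y − ȳ)²

SSE = Sum of Squared Errors = Σ(y − ŷ)²

SSR = Sum of Squares due to Regression = Σ(ŷ − ȳ)²

R² = 1 − (SSE/SST) = SSR/SST (the two forms agree for a least-squares fit with an intercept)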
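The following sketch fits a simple least-squares line to a tiny, made-up data set and computes SST, SSE, and SSR, to show that both forms of the R² formula give the same answer. The data and variable names are hypothetical.

```python
# Hypothetical data: one independent variable x, one dependent variable y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least-squares slope and intercept for a single predictor.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]  # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation

r2_from_sse = 1 - sse / sst
r2_from_ssr = ssr / sst
print(r2_from_sse, r2_from_ssr)
```

Here SST = SSE + SSR, which is why the two expressions coincide.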

Multicollinearity: Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, which can make the coefficient estimates in a regression model unstable and hard to interpret. Examples of correlated predictor variables (also called multicollinear predictors) are a person’s height and weight, age and sales price of a car, or years of education and annual income.
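One simple way to spot multicollinearity is to check the correlation between predictors. The sketch below does this for hypothetical height and weight values, and also computes the variance inflation factor (VIF), a common diagnostic that the text above does not mention. With exactly two predictors, VIF reduces to 1 / (1 − r²), where r is their correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictors: height (cm) and weight (kg) for five people.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 69, 80, 92]

r = pearson(heights, weights)
vif = 1 / (1 - r ** 2)  # two-predictor case only

print(f"correlation between predictors: {r:.3f}")
print(f"variance inflation factor:      {vif:.1f}")
```

A correlation this close to 1 means the two predictors carry nearly the same information, so a model would gain little from including both.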

Vishnu Kakaraparthi

Data Scientist with experience in solving many real-world business problems across different domains interested in writing articles and sharing knowledge.