## Tutorial on Data Science and Machine Learning

## Types of Data

Discrete Numerical Data: Integer-based values, typically counts.

Example: Population of cities

Continuous Numerical Data: Can take infinitely many possible values within a range, i.e., it may contain fractions.

Example: Height of a person

Categorical Data: Doesn’t have inherent numeric meaning.

Example: Gender, Product category

Ordinal Data: A mixture of numerical and categorical data. The values are categories, but they have a meaningful order.

Example: Movie Ratings on a scale of 1-5

## Mean, Median, Mode

Mean: The average of all the sample values, i.e. the sum divided by the number of samples.

Example: 2, 5, 3, 7, 4, 8, 4, 1, 9

Mean => (2+5+3+7+4+8+4+1+9)/9 = 43/9 ≈ 4.78

Median: The middle value of the samples once they are sorted.

Example: 2, 5, 3, 7, 4, 8, 4, 1, 9

Sorted: 1, 2, 3, 4, 4, 5, 7, 8, 9

Median is 4

If the number of sample values is even, take the average of the two middle values.

The median is considered better than the mean when there are outliers in the samples, because a few extreme values pull the mean towards them while the median barely moves.
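Both points about the median can be checked with a small sketch using Python's standard `statistics` module. The salary figures below are made-up values chosen only to include one obvious outlier.

```python
import statistics

# Even number of samples: the median is the average of the two middle values.
even_samples = [1, 2, 3, 4, 5, 7, 8, 9]
even_median = statistics.median(even_samples)  # (4 + 5) / 2

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 35, 38, 40, 1000]
salary_mean = statistics.mean(salaries)
salary_median = statistics.median(salaries)

print("Median of even-sized sample:", even_median)
print("Mean with outlier:", salary_mean)
print("Median with outlier:", salary_median)
```

The single large value drags the mean far above every typical salary, while the median stays among the typical values.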

Mode: The sample value that occurs most often.

Example: 2, 5, 3, 7, 4, 8, 4, 1, 9

Mode: 4, as it appears twice whereas all other sample values appear only once.
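As a rough illustration, the mode of the example can be found by counting occurrences with `collections.Counter` from the standard library. The second list is an invented sample used only to show what happens when two values tie.

```python
from collections import Counter
import statistics

samples = [2, 5, 3, 7, 4, 8, 4, 1, 9]

# Count how often each value occurs and take the most frequent one
counts = Counter(samples)
value, occurrences = counts.most_common(1)[0]
print("Mode:", value, "appears", occurrences, "times")

# A data set can have more than one mode; multimode returns every value
# tied for the highest count (requires Python 3.8 or later)
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
print("Modes of the tied sample:", tied_modes)
```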

## Mean, Median and Mode in Python

**Program**

import numpy as np

from scipy import stats as st

age = [20, 40, 30, 60, 85, 64, 23, 56, 78, 56, 34, 56, 78, 34, 67, 65]

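print("Mean:", np.mean(age))

print("Median:", np.median(age))

print("Mode:", st.mode(age))

**Output**

Mean: 52.875

Median: 56.0

Mode: ModeResult(mode=array([56]), count=array([3]))

Note: Newer SciPy releases return plain values instead of arrays for the mode and count, so the last line may be printed differently.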

## Histogram

In a set of values, count how often each value (or each range of values) occurs and plot the counts as a bar graph, with each bar representing the frequency of one value or range.

Ex: Below is the list of ages of the people attending an event

25, 30, 45, 60, 25, 45, 34, 56, 25, 30, 30, 25, 45, 60, 25

You can make a table of ranges of ages and number of values in each range

Age Range and Frequency

20-29: 5

30-39: 4

40-49: 3

50-59: 1

60-69: 2

If you plot these values on a bar graph, with the age range on X and the frequency on Y, you get a histogram.
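The frequency table can be reproduced with a few lines of plain Python. This is only a sketch: it assumes fixed 10-year ranges, as in the table, and prints a text version of the bars instead of drawing a plot.

```python
from collections import Counter

ages = [25, 30, 45, 60, 25, 45, 34, 56, 25, 30, 30, 25, 45, 60, 25]

# Map each age to the lower edge of its 10-year range (25 -> 20, 34 -> 30, ...)
bins = Counter((age // 10) * 10 for age in ages)

# Print one row per range, with a bar of '#' characters as long as the count
for start in sorted(bins):
    count = bins[start]
    print(f"{start}-{start + 9}: {count} {'#' * count}")
```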

## Program: Plot a Histogram

import numpy as np

import matplotlib.pyplot as plt

#Generate ages instead of hard coding them, so we get more meaningful values

#40 is the center (mean) value

#5 is the standard deviation

#1000 is the number of values

ages = np.random.normal(40, 5, 1000)

plt.hist(ages, 50)

plt.show()

**Output:**

## Variance

According to Wikipedia, variance measures how far a set of (random) numbers is spread out from its average value. In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

## Standard Deviation

According to Wikipedia, in statistics the standard deviation (SD, also represented by the lowercase Greek letter sigma σ or the Latin letter s) is a measure used to quantify the amount of variation or dispersion of a set of data values.

A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

The SD is the square root of the variance.
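To make the definitions concrete, here is a minimal sketch that computes both quantities by hand for the small sample used in the Mean, Median, Mode section. It uses the population form (divide by n), which is also the default of NumPy's `var()` and `std()`. The variable names are illustrative only.

```python
import math

samples = [2, 5, 3, 7, 4, 8, 4, 1, 9]

mean = sum(samples) / len(samples)

# Squared deviation of every sample from the mean
squared_deviations = [(x - mean) ** 2 for x in samples]

# Population variance: the average squared deviation (divide by n)
variance = sum(squared_deviations) / len(samples)

# Standard deviation: the square root of the variance
std_dev = math.sqrt(variance)

# Sample variance divides by n - 1 instead of n
sample_variance = sum(squared_deviations) / (len(samples) - 1)

print("Variance:", variance)
print("Standard Deviation:", std_dev)
print("Sample Variance:", sample_variance)
```

The sample form is usually preferred when the data is a sample drawn from a larger population. In NumPy it is selected with `ddof=1`.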

## Program: Standard Deviation and Variance in Python

import numpy as np

import matplotlib.pyplot as plt

#Generate ages instead of hard coding them, so we get more meaningful values

#40 is the center (mean) value

#5 is the standard deviation

#1000 is the number of values

ages = np.random.normal(40, 5, 1000)

#Now ages is a NumPy array. Get its standard deviation and variance

print("Standard Deviation:", ages.std())

print("Variance:", ages.var())

**Output: (this may vary as we used random values)**

Standard Deviation: 5.24306013408

Variance: 27.4896795696

## Probability Density Function (PDF)

Referring to Wikipedia: For continuous data, the probability that the random variable takes any one specific value is 0. However, there is a positive probability of the random variable falling within a particular range of values.

A Probability Density Function is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking any one value. That probability is the area under the PDF curve over the range.
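A short sketch of this idea, using `NormalDist` from Python's standard `statistics` module. The height distribution (mean 170 cm, standard deviation 10 cm) is an assumed example, not data from this tutorial.

```python
from statistics import NormalDist

# Hypothetical distribution of heights in cm
heights = NormalDist(mu=170, sigma=10)

# Probability of a height between 160 and 180: the area under the PDF,
# computed as a difference of cumulative probabilities
p_range = heights.cdf(180) - heights.cdf(160)

# The PDF value at a single point is the height of the curve there,
# not the probability of observing exactly that value
density_at_170 = heights.pdf(170)

print("P(160 <= height <= 180):", p_range)
print("Density at 170:", density_at_170)
```

The range covers one standard deviation on either side of the mean, so the probability comes out at roughly 68%.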

## Probability Mass Function (PMF)

Referring to Wikipedia: This is used for a set of discrete values. A probability mass function gives the probability that a discrete random variable is exactly equal to some value.
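As an illustration, the sketch below writes out the binomial PMF by hand with `math.comb` (Python 3.8 or later). The coin-flip setting and the helper name `binomial_pmf` are assumptions made for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Ten fair coin flips
n, p = 10, 0.5

# Probability of getting exactly 5 heads
p_five_heads = binomial_pmf(5, n, p)

# The PMF values over all possible outcomes add up to 1
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print("P(exactly 5 heads):", p_five_heads)
print("Sum over all outcomes:", total)
```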

## Draw a Uniform Distribution Curve

**Program**

#Draw a uniform distribution

import numpy as np

import matplotlib.pyplot as plt

#Get an array of random values using the uniform function

#Arguments: start value, end value and number of points

values = np.random.uniform(-10.0, 20.0, 100000)

#Plot a histogram with 50 bins

plt.hist(values, 50)

#Show the histogram

plt.show()

**Output:**

## Draw a Normal Distribution Curve: using Probability Density Function

**Program**

from scipy.stats import norm

import matplotlib.pyplot as plt

import numpy as np

#Get evenly spaced values between -5 and 5 with an interval of 0.1

x = np.arange(-5, 5, 0.1)

#Use the normal probability density function to get the curve

plt.plot(x, norm.pdf(x))

#Show the plot

plt.show()

**Output:**

## Draw a Binomial Distribution Curve: using Probability Mass Function

**Program**

#Binomial Distribution

from scipy.stats import binom

import matplotlib.pyplot as plt

import numpy as np

n, p = 10, 0.5

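#The PMF is defined only for whole numbers of successes, so evaluate it at 0, 1, ..., n

x = np.arange(0, n + 1)

#Mark each point; the connecting line is only a visual guide

plt.plot(x, binom.pmf(x, n, p), 'o-')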

plt.show()

**Output:**

## Poisson Probability Mass Function

**Program**

#A restaurant gets 200 guests on average per day.

#What is the probability of getting 220 on a day?

from scipy.stats import poisson

import matplotlib.pyplot as plt

import numpy as np

mu = 200

#Use whole-number guest counts; the PMF is zero at non-integer values

x = np.arange(140, 270)

plt.plot(x, poisson.pmf(x, mu))

plt.show()

**Output:**
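The plot shows the whole distribution, but the question in the program's comment asks for a single number. One way to get it, sketched here with only the standard library, is to evaluate the Poisson PMF at 220 directly. The helper `poisson_pmf` is a hand-written stand-in for `scipy.stats.poisson.pmf`, which gives the same value as `poisson.pmf(220, 200)`.

```python
import math

def poisson_pmf(k, mu):
    """Probability of exactly k events when the average is mu."""
    # Work in logarithms so that mu**k and k! do not overflow
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

mu = 200  # average guests per day

p_220 = poisson_pmf(220, mu)
print("P(exactly 220 guests):", p_220)
```

The probability of any single exact count is small (about 1% here) because the total probability is spread over many possible guest counts.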

## Percentile

A percentile (or a centile) is a measure used in statistics to indicate the value below which a given percentage of the observations fall.

For example: A student with an 80th percentile score in an exam scored higher than 80% of the students.

The 50th percentile is equivalent to the median, i.e. the middle value among all.
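The exam example can be sketched in a few lines. The scores are invented for illustration, and `percent_below` is a hypothetical helper that simply counts how many values lie strictly below a given score.

```python
def percent_below(values, score):
    """Percentage of values that are strictly below the given score."""
    below = sum(1 for v in values if v < score)
    return 100.0 * below / len(values)

# Hypothetical exam scores of ten students
scores = [35, 42, 50, 58, 61, 67, 73, 80, 88, 95]

# Eight of the ten scores are below 88, so 88 sits at the 80th percentile
rank_of_88 = percent_below(scores, 88)
print("Percent of scores below 88:", rank_of_88)
```

Libraries such as NumPy go the other way round: `np.percentile` takes a percentage and returns the corresponding value, interpolating between data points where needed.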

## Percentile in Python

**Program**

import matplotlib.pyplot as plt

import numpy as np

vals = np.random.normal(50, 4, 10000)

print("50th percentile:", np.percentile(vals, 50))

print("10th percentile:", np.percentile(vals, 10))

print("90th percentile:", np.percentile(vals, 90))

**Output: (this may vary as we used random values)**

50th percentile: 50.0183934715

10th percentile: 44.8104131231

90th percentile: 55.2289663083

## Moments in Statistics

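The 1st moment is the mean.

The 2nd moment (about the mean) is the variance.

The 3rd moment (standardized) is the skew.

The 4th moment (standardized) is the kurtosis.

Skew and kurtosis describe the shape of the distribution: skew measures its asymmetry, and kurtosis measures how sharp its peak is and how heavy its tails are.

Skew may be negative (longer tail on the left) or positive (longer tail on the right).

The higher the kurtosis, the sharper the peak of the curve.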
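A minimal sketch of how skew and kurtosis can be computed from the data, assuming the standardized-moment definition (the k-th central moment divided by the standard deviation raised to the power k). The helper name and both sample lists are made up for illustration.

```python
def standardized_moment(values, k):
    """k-th central moment divided by the standard deviation to the power k."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    central = sum((x - mean) ** k for x in values) / n
    return central / variance ** (k / 2)

symmetric = [1, 2, 3, 4, 5]      # evenly spread around the mean
right_skewed = [1, 1, 1, 2, 10]  # one large value stretches the right tail

print("Skew (symmetric):", standardized_moment(symmetric, 3))
print("Skew (right-skewed):", standardized_moment(right_skewed, 3))
print("Kurtosis (symmetric):", standardized_moment(symmetric, 4))
```

The symmetric sample has zero skew, while the sample with a long right tail has positive skew. For a normal distribution this definition of kurtosis gives 3. Some libraries, including `scipy.stats.kurtosis` by default, report the excess over 3 instead.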
