# Data Science and Machine Learning Tutorial

## Types of Data

Discrete Numerical Data: Integer-based values that can be counted.
Example: Population of cities

Continuous Numerical Data: Values that can take any of an infinite number of possible values within a range, i.e., they may contain fractions.
Example: Height of a person

Categorical Data: Values that have no inherent numeric meaning.
Example: Gender, Product category

Ordinal Data: A mixture of numerical and categorical data, i.e., categories that have a meaningful order.
Example: Movie ratings on a scale of 1-5

## Mean, Median, Mode

Mean: The average of all the sample values, i.e., sum of the values / number of samples.
Example: 2, 5, 3, 7, 4, 8, 4, 1, 9
Mean => (2+5+3+7+4+8+4+1+9)/9 = 43/9 ≈ 4.78

Median: The middle value of all the samples once they are sorted.
Example: 2, 5, 3, 7, 4, 8, 4, 1, 9
Sorted: 1, 2, 3, 4, 4, 5, 7, 8, 9
Median is 4
If the number of sample values is even, take the average of the two middle values.
The median is considered better than the mean when there are outliers in the samples, because outliers skew the mean.

Mode: The sample value that has the highest number of occurrences.
Example: 2, 5, 3, 7, 4, 8, 4, 1, 9
Mode: 4, as it appears two times whereas all other sample values appear only once.
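The hand calculations above can be double-checked with Python's built-in `statistics` module. This is only a minimal sketch: the extra value 100 is a made-up outlier, not part of the original samples, added to illustrate how an outlier pulls the mean while the median barely moves.

```python
import statistics

samples = [2, 5, 3, 7, 4, 8, 4, 1, 9]

mean_value = statistics.mean(samples)      # 43 / 9
median_value = statistics.median(samples)  # middle value of the sorted samples
mode_value = statistics.mode(samples)      # most frequent value

print("Mean:", round(mean_value, 2))
print("Median:", median_value)
print("Mode:", mode_value)

# Hypothetical outlier (not in the original samples) to show its effect.
# The list now has an even number of values, so the median is the
# average of the two middle values.
with_outlier = samples + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print("Mean with outlier:", mean_with_outlier)
print("Median with outlier:", median_with_outlier)
```

With the outlier included, the mean rises from about 4.78 to 14.3, while the median only moves from 4 to 4.5.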

## Mean, Median and Mode in Python

Program
import numpy as np
from scipy import stats as st

age = [20, 40, 30, 60, 85, 64, 23, 56, 78, 56, 34, 56, 78, 34, 67, 65]
print("Mean:", np.mean(age))
print("Median:", np.median(age))
# st.mode returns a ModeResult holding the most frequent value and its count
print("Mode:", st.mode(age))

Output
Mean: 52.875
Median: 56.0
Mode: ModeResult(mode=array([56]), count=array([3]))
Newer SciPy versions report the mode (56) and the count (3) as scalars instead of arrays.

## Histogram

A histogram groups a set of values into ranges (bins), counts how many values fall into each range, and plots those counts as a bar graph, with each bar representing the frequency of one range.
Ex: Below is the list of ages of the people attending an event
25, 30, 45, 60, 25, 45, 34, 56, 25, 30, 30, 25, 45, 60, 25
You can make a table of age ranges and the number of values in each range

| Age Range | Frequency |
| --- | --- |
| 20-29 | 5 |
| 30-39 | 4 |
| 40-49 | 3 |
| 50-59 | 1 |
| 60-69 | 2 |

If you plot these values on a bar graph, with the age range on the X axis and the frequency on the Y axis, you get a histogram.
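The frequency table for the ages listed above can be reproduced with `np.histogram`, which does the counting without drawing anything. This is a minimal sketch; the bin edges are an assumption chosen to match the age ranges used in this section.

```python
import numpy as np

ages = [25, 30, 45, 60, 25, 45, 34, 56, 25, 30, 30, 25, 45, 60, 25]

# Bin edges for the ranges 20-29, 30-39, 40-49, 50-59 and 60-69.
# Each bin includes its left edge and excludes its right edge,
# except the last bin, which also includes 70.
bin_edges = [20, 30, 40, 50, 60, 70]

counts, edges = np.histogram(ages, bins=bin_edges)

# Print one line per age range with its frequency
for left, count in zip(edges[:-1], counts):
    print(f"{left}-{left + 9}: {count}")
```

Passing the same `bins` list to `plt.hist` would draw these counts as bars.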

## Program: Plot a Histogram

import numpy as np
import matplotlib.pyplot as plt

# Generate ages instead of hard coding them, so we get more meaningful values
# 40 is the mean (center) of the distribution
# 5 is the standard deviation
# 1000 is the number of values
ages = np.random.normal(40, 5, 1000)
# Plot the histogram with 50 bins
plt.hist(ages, 50)
plt.show()

Output: (histogram plot of the generated ages)

## Variance

According to Wikipedia, variance measures how far a set of (random) numbers is spread out from its average value. In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

## Standard Deviation

According to Wikipedia, in statistics the standard deviation (SD, also represented by the lower case Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
The standard deviation is the square root of the variance.
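To make the two definitions concrete, here is a small sketch that computes both by hand for the sample values used in the Mean, Median, Mode section. It divides by the number of values (the population variance), which is also what NumPy's `var()` and `std()` do by default; the sample variance would divide by one less than that.

```python
import math

samples = [2, 5, 3, 7, 4, 8, 4, 1, 9]

mean = sum(samples) / len(samples)

# Population variance: the average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in samples) / len(samples)

# Standard deviation: the square root of the variance
std_dev = math.sqrt(variance)

print("Variance:", round(variance, 3))
print("Standard Deviation:", round(std_dev, 3))
```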

## Program: Standard Deviation and Variance in Python

import numpy as np

# Generate ages instead of hard coding them, so we get more meaningful values
# 40 is the mean (center) of the distribution
# 5 is the standard deviation
# 1000 is the number of values
ages = np.random.normal(40, 5, 1000)
# ages is a NumPy array; std() and var() return its standard deviation and variance

print("Standard Deviation:", ages.std())
print("Variance:", ages.var())

Output: (this may vary as we used random values)
Standard Deviation: 5.24306013408
Variance: 27.4896795696

## Probability Density Function (PDF)

Referring to Wikipedia: For continuous data, the probability of a random variable taking any one specific value is 0. However, there is a positive probability of the random variable falling within a particular range of values.
The probability density function is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking any one value. That probability is the area under the PDF curve over the range.
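The idea of "probability over a range" can be sketched with SciPy. The example assumes a standard normal distribution (mean 0, standard deviation 1) and the range -1 to 1, both picked only for illustration. It computes the probability in two ways: by integrating the PDF over the range, and by subtracting cumulative distribution function values.

```python
from scipy.stats import norm
from scipy.integrate import quad

# Illustrative range on a standard normal distribution
a, b = -1.0, 1.0

# Probability of falling in [a, b] as the area under the PDF
area, _ = quad(norm.pdf, a, b)

# The same probability from the cumulative distribution function
from_cdf = norm.cdf(b) - norm.cdf(a)

print("P(-1 <= X <= 1) via integration:", round(area, 4))
print("P(-1 <= X <= 1) via CDF:", round(from_cdf, 4))

# The PDF value at a single point is the height of the curve there,
# not a probability
density_at_zero = norm.pdf(0)
print("PDF at 0:", round(density_at_zero, 4))
```

Both approaches give roughly 0.6827, the familiar "68% within one standard deviation" figure.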

## Probability Mass Function (PMF)

Referring to Wikipedia: This is used for a set of discrete values. The probability mass function gives the probability that a discrete random variable is exactly equal to some value.
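As a quick illustration, the sketch below evaluates a binomial PMF at one exact value and checks it against the binomial formula. The scenario (10 fair coin flips, exactly 5 heads) is an assumption chosen for the example.

```python
from math import comb
from scipy.stats import binom

n, p = 10, 0.5   # 10 fair coin flips (illustrative values)
k = 5            # exactly 5 heads

# PMF from SciPy
pmf_value = binom.pmf(k, n, p)

# The same value from the formula C(n, k) * p^k * (1 - p)^(n - k)
by_formula = comb(n, k) * p**k * (1 - p) ** (n - k)

print("P(X = 5):", pmf_value)
print("By formula:", by_formula)

# The probabilities of all possible outcomes add up to 1
total = sum(binom.pmf(i, n, p) for i in range(n + 1))
print("Sum over all outcomes:", round(total, 6))
```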

## Draw a Uniform Distribution Curve

Program
# Draw a uniform distribution
import numpy as np
import matplotlib.pyplot as plt

# Get an array of random values using the uniform function
# Arguments: start value, end value and number of points
values = np.random.uniform(-10.0, 20.0, 100000)
# Plot a histogram with 50 bins
plt.hist(values, 50)
# Show the histogram
plt.show()

Output: (histogram plot of the uniform values)

## Draw a Normal Distribution Curve: using Probability Density Function

Program
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

# Get evenly spaced values from -5 up to (but not including) 5, with an interval of 0.1
x = np.arange(-5, 5, 0.1)

# Use the normal probability density function to get the curve
plt.plot(x, norm.pdf(x))

# Show the curve
plt.show()

Output: (plot of the normal curve)

## Draw a Binomial Distribution Curve: using Probability Mass Function

Program
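# Binomial Distribution
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np

n, p = 10, 0.5
# The PMF is non-zero only for whole numbers of successes, so use the integers 0 to n
x = np.arange(0, n + 1)

# Mark each integer outcome and join the points with a line
plt.plot(x, binom.pmf(x, n, p), 'o-')
plt.show()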

Output: (plot of the binomial PMF)

## Poisson Probability Mass Function

Program
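# A restaurant gets 200 guests on average per day.
# What is the probability of getting 220 on a day?
from scipy.stats import poisson
import matplotlib.pyplot as plt
import numpy as np

mu = 200
# The PMF is non-zero only for whole numbers of guests, so step by 1
x = np.arange(140, 270)
plt.plot(x, poisson.pmf(x, mu))
# Probability of exactly 220 guests on a day
print("P(220 guests):", poisson.pmf(220, mu))
plt.show()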

Output: (plot of the Poisson PMF)

## Percentile

A percentile (or a centile) is a measure used in statistics to indicate the value below which a given percentage of the values fall.
For example: If a student got an 80th percentile score in an exam, 80% of the students scored below that student.

The 50th percentile is equivalent to the median, i.e., the mid value among all.
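Both statements can be checked on a small data set. The sketch below uses made-up exam scores (the integers 1 to 100) as an assumption, so the numbers are easy to follow.

```python
import numpy as np

# Illustrative exam scores: the integers 1 to 100 (made-up data)
scores = np.arange(1, 101)

p80 = np.percentile(scores, 80)

# Share of scores that fall below the 80th percentile value
share_below = np.mean(scores < p80) * 100

print("80th percentile:", p80)
print("Percent of scores below it:", share_below)

# The 50th percentile matches the median
p50 = np.percentile(scores, 50)
median = np.median(scores)
print("50th percentile:", p50)
print("Median:", median)
```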

## Percentile in Python

Program
import numpy as np

# 10000 random values with mean 50 and standard deviation 4
vals = np.random.normal(50, 4, 10000)
print("50th percentile:", np.percentile(vals, 50))
print("10th percentile:", np.percentile(vals, 10))
print("90th percentile:", np.percentile(vals, 90))

Output: (this may vary as we used random values)
50th percentile: 50.0183934715
10th percentile: 44.8104131231
90th percentile: 55.2289663083

## Moments in Statistics

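1st moment is the mean
2nd moment (about the mean) is the variance
3rd moment (standardized) is the skew
4th moment (standardized) is the kurtosis

Skew and kurtosis describe the shape of the curve of a histogram.
Skew measures asymmetry and may be negative (longer tail on the left) or positive (longer tail on the right).
The higher the kurtosis, the sharper the peak and the heavier the tails of the curve.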
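The four moments can be computed with NumPy and SciPy. This is a sketch under stated assumptions: the two data sets (a normal sample and an exponential sample) and the fixed seed are illustrative choices, not part of the tutorial's own data.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

symmetric = rng.normal(50, 4, 100_000)       # bell-shaped, symmetric
right_skewed = rng.exponential(4, 100_000)   # long tail to the right

for name, data in [("normal", symmetric), ("exponential", right_skewed)]:
    print(name)
    print("  mean:", data.mean())
    print("  variance:", data.var())
    print("  skew:", skew(data))
    # kurtosis() returns excess kurtosis by default,
    # which is 0 for a normal distribution
    print("  kurtosis:", kurtosis(data))
```

The normal sample should show skew and excess kurtosis close to 0, while the exponential sample should show a clearly positive skew and a larger kurtosis.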