************************************
************************************
***THE NORMAL CURVE (chapter 5)
************************************
************************************

************************************
***OPENING COMMANDS
************************************
***Clear memory
clear all

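***Start saving results window
***Run only the line for your operating system; the Macintosh line is
***commented out because a second "log using" fails while a log is open
log using "C:\course\progs\Stata05.log", replace text // Windows
*log using "/course/progs/Stata05.log", replace text // Macintosh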

************************************
***GRAPH COMMAND TO GENERATE NORMAL DISTRIBUTION
************************************
***Plot two normal distributions
***IQ scores for females and males
graph twoway (function y=normalden(x,100,10), range(40 160) lcolor(maroon) lw(medthick)) ///
             (function y=normalden(x,100,20), range(40 160) lcolor(navy) lw(medthick)), ///
  title("Normal density of IQ scores for females and males", color(black)) ///
  xtitle("IQ Units", size(medlarge)) ytitle("") xlabel(40(10)160) ///
  xscale(lw(medthick)) yscale(lw(medthick)) ///
  legend(order(1 "Females" 2 "Males")) graphregion(fcolor(white))

************************************
***AREA UNDER THE NORMAL CURVE
***normal(z) returns the area below z (cumulative probability)
************************************
***Survey in a community
***Mean age = 35.5
***Standard deviation = 10

************************************
***What's the probability of finding someone
***who is younger than 44 years of age?

*Estimate Z = (x - mean) / standard deviation
di (44-35.5)/10

*Area below Z=0.85
*("di" is the abbreviation of "display"; both lines give the same result)
display normal(0.85)
di normal(0.85)
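
***Sketch (not part of the original exercise): the two steps above can be
***combined by passing the Z expression directly to normal(), which
***avoids rounding Z by hand
di normal((44-35.5)/10)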

************************************
***What's the probability of finding someone
***who is older than 40 years of age?

*Estimate Z = (x - mean) / standard deviation
di (40-35.5)/10

*Area above Z=0.45
di 1-normal(0.45)
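
***Sketch: the normal curve is symmetric around 0, so the area above Z
***equals the area below -Z; this line should reproduce the result above
di normal(-0.45)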

************************************
***What's the probability of finding someone
***who is younger than 22 years of age?

*Estimate Z = (x - mean) / standard deviation
di (22-35.5)/10

*Area below Z=-1.35
di normal(-1.35)

************************************
***What's the probability of finding someone
***who is between 32 and 42 years of age?

*Estimate Z = (x - mean) / standard deviation
di (32-35.5)/10
di (42-35.5)/10

*Area between Z=-0.35 and Z=0.65
di normal(0.65)-normal(-0.35)
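
***Sketch: the same interval probability in one step, computing both
***Z-scores inside normal() (area below the upper bound minus area
***below the lower bound)
di normal((42-35.5)/10) - normal((32-35.5)/10)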

************************************
***What's the probability of finding someone
***who is between 42 and 46 years of age?

*Estimate Z = (x - mean) / standard deviation
di (42-35.5)/10
di (46-35.5)/10

*Area between Z=0.65 and Z=1.05
di normal(1.05)-normal(0.65)

************************************
***What's the probability of finding someone
***who is above 50 years of age?

*Estimate Z = (x - mean) / standard deviation
di (50-35.5)/10

*Area above Z=1.45
di 1-normal(1.45)
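
***Sketch: shade the area just computed (age above 50) under the normal
***curve with mean 35.5 and SD 10. The display range of 0 to 75 and the
***colors are arbitrary choices for illustration.
graph twoway (function y=normalden(x,35.5,10), range(50 75) recast(area) color(navy)) ///
             (function y=normalden(x,35.5,10), range(0 75) lcolor(black) lw(medthick)), ///
  xtitle("Age") ytitle("") legend(off) graphregion(fcolor(white))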

************************************
***DISTRIBUTION OF INCOME
***GENERAL SOCIAL SURVEY
************************************
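***Open 2016 GSS
***Run only the line for your operating system; the Macintosh line is
***commented out so the do-file does not stop on a path that does not exist
use "C:\course\data\GSS2016.dta", clear // Windows
*use "/course/data/GSS2016.dta", clear // Macintosh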

***Histogram of income
hist conrinc, norm percent

***Boxplot of income
graph hbox conrinc

***Quantile-normal plot of income
qnorm conrinc

***Power transformation (ladder of powers): y^q
***q<1 reduces positive skew
***q=0 corresponds to the log transformation, ln(y)
gen lnconrinc = ln(conrinc)

***Histogram of log of income
hist lnconrinc, norm percent

***Boxplot of log of income
graph hbox lnconrinc

***Quantile-normal plot of log income
qnorm lnconrinc
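
***Sketch: compare the two variables numerically as a complement to the
***plots; skewness closer to 0 indicates a more symmetric distribution
tabstat conrinc lnconrinc, statistics(mean sd skewness kurtosis)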

************************************
***What's the probability of finding someone
***who makes more than $50,000 per year?

***Original income variable (conrinc)
***This variable does not have a normal distribution
sum conrinc

***Log of income (lnconrinc)
***This variable has a distribution closer to normal
sum lnconrinc

*From the summarize output: mean = 9.95, standard deviation = 1.16

*$50,000 in log scale (approximately 10.82)
di ln(50000)

*Estimate Z = (x - mean) / standard deviation
di (10.82-9.95)/1.16

*Area above Z=0.75
di 1-normal(0.75)
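
***Sketch: the same calculation without hand rounding, using the results
***that summarize stores in r(mean) and r(sd). The answer may differ
***slightly from the rounded version above.
quietly summarize lnconrinc
di 1 - normal((ln(50000) - r(mean)) / r(sd))

***Sketch: observed share of respondents above $50,000, for comparison
***with the normal approximation. Missing values are excluded explicitly
***because Stata treats missing as larger than any number.
***The scalar name n_above is an illustrative choice.
quietly count if conrinc > 50000 & !missing(conrinc)
scalar n_above = r(N)
quietly count if !missing(conrinc)
di n_above / r(N)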

************************************
***CLOSING COMMANDS
************************************
***Save data (written to the current working directory)
save "Stata05.dta", replace

***Save log
log close