************************************
************************************
***ORDINARY LEAST SQUARES REGRESSION
************************************
************************************

************************************
***CLEAR MEMORY
************************************

clear all

***Run only the block below (Windows or Macintosh) that matches your computer

************************************
***WINDOWS
************************************

***Start saving results window
log using "C:\course\programs\Stata05.log", replace text

***Shortcut for folders
global data = "C:\course\data"
global output = "C:\course\output"

************************************
***MACINTOSH
************************************

***Start saving results window
log using "/course/programs/Stata05.log", replace text

***Shortcut for folders
global data = "/course/data"
global output = "/course/output"

************************************
***OPENING COMMANDS
************************************

***Tell Stata to not pause for "more" messages
set more off

***Change directory
cd "$data"

***Open 2018 ACS (only Texas)
use "ACS2018TX.dta", clear

************************************
***GENERATE VARIABLES
************************************

***Sex
gen female=.
replace female=0 if sex==1 // Male
replace female=1 if sex==2 // Female
label define female 0 "Male" 1 "Female"
label values female female

***Race/ethnicity
gen raceth=.
replace raceth=1 if race==1 & hispan==0 // White
replace raceth=2 if race==2 & hispan==0 // Black
replace raceth=3 if hispan>=1 & hispan<=4 // Hispanic
replace raceth=4 if (race==4 | race==5 | race==6) & hispan==0 // Asian
replace raceth=5 if race==3 & hispan==0 // Native American
replace raceth=6 if (race==7 | race==8 | race==9) & hispan==0 // Other
label define raceth 1 "White" 2 "African American" 3 "Hispanic" ///
	4 "Asian" 5 "Native American" 6 "Other races"
label values raceth raceth

***Age
egen agegr = cut(age), at(0,16,20,25,35,45,55,65,100)
label define agegr 0 "0-15" 16 "16-19" 20 "20-24" 25 "25-34" ///
	35 "35-44" 45 "45-54" 55 "55-64" 65 "65-100"
label values agegr agegr

***Educational attainment
gen educgr=.
replace educgr=1 if educ>=0 & educ<=5 // Less than high school
replace educgr=2 if educ==6 // High school
replace educgr=3 if educ==7 | educ==8 // Some college
replace educgr=4 if educ==10 // College
replace educgr=5 if educ==11 // 5+ years of college, graduate school
label define educgr 1 "Less than high school" 2 "High school" ///
	3 "Some college" 4 "College" 5 "Graduate school"
label values educgr educgr

***Marital status
gen marital=.
replace marital=1 if marst==1 | marst==2 // Married
replace marital=2 if marst>=3 & marst<=5 // Separated, divorced, widowed
replace marital=3 if marst==6 // Never married, single
label define marital 1 "Married" 2 "Separated, divorced, widowed" 3 "Never married"
label values marital marital

***Wage and salary income
gen income=.
replace income=incwage if incwage!=999999

***Migration status
gen migrant=.
replace migrant=1 if migrate1d==10 | migrate1d==23 // same house or within PUMA
replace migrant=2 if migrate1d>=24 & migrate1d<=32 // internal migrant
replace migrant=3 if migrate1d==40 // international migrant
label define migrant 1 "Non-migrant" 2 "Internal migrant" 3 "International migrant"
label values migrant migrant

***Internal migration status (domestic migration)
gen dommig=.
replace dommig=0 if migrant==1 // non-migrant
replace dommig=1 if migrant==2 // internal migrant
label define dommig 0 "Non-migrant" 1 "Internal migrant"
label values dommig dommig
tab migrant dommig, m

***International migration status
gen intmig=.
replace intmig=0 if migrant==1 // non-migrant
replace intmig=1 if migrant==3 // international migrant
label define intmig 0 "Non-migrant" 1 "International migrant"
label values intmig intmig
tab migrant intmig, m

************************************
***COMPLEX SAMPLE DESIGN
************************************

svyset cluster [pweight=perwt], strata(strata)

************************************
***ORDINARY LEAST SQUARES (OLS) REGRESSION
************************************

***Sample size
count

***Keep only observations with non-missing values
keep if female!=. & raceth!=. & age!=. & agegr!=. & ///
	educgr!=. & marital!=. & income!=. & income!=0 & migrant!=.
count

***Drop observations with missing values
***Same as above
drop if female==. | raceth==. | age==. | agegr==. | ///
	educgr==. | marital==. | income==. | income==0 | migrant==.
count

************************************
***OLS WITH INCOME, AGE, AND EDUCATION
************************************

***Use complex survey design
svy: reg income age educgr

***Standardized regression coefficients
***(i.e., standardized partial slopes, beta-weights)
***The beta option does not allow the use of complex survey design
***Use pweight to maintain sample size and estimate robust standard errors
reg income age educgr [pweight=perwt], beta

***Use aweight to estimate adjusted R-squared
***pweight and complex survey design omit sum of squares and adjusted R-squared
reg income age educgr [aweight=perwt]

************************************
***DETERMINING NORMALITY
************************************

***Histogram of wage and salary income
hist income [fweight=perwt] if income!=0, percent normal ylabel(0(2.5)12.5) xtitle(Wage and salary income)

***Boxplot of wage and salary income
graph hbox income if income!=0 [fweight=perwt], ytitle(Wage and salary income)

***Quantile-normal plot of wage and salary income
qnorm income if income!=0, ytitle(Wage and salary income)

***Skewness and kurtosis
sum income if income!=0 [fweight=perwt], d
sum income if income!=0 [aweight=perwt], d

***Power transformation
***q<1 (reduce positive skew)
***log(y): q=0
gen lnincome = ln(income)

***Histogram of log of wage and salary income
hist lnincome [fweight=perwt], percent normal xtitle(Natural logarithm of wage and salary income)

***Boxplot of log of wage and salary income
graph hbox lnincome [fweight=perwt], ytitle(Natural logarithm of wage and salary income)

***Quantile-normal plot of log of wage and salary income
qnorm lnincome, ytitle(Natural logarithm of wage and salary income)

***Skewness and kurtosis
sum lnincome [fweight=perwt], d
sum lnincome [aweight=perwt], d

************************************
***OLS WITH NATURAL LOGARITHM OF INCOME, AGE, AND EDUCATION
************************************

***Use complex survey design
svy: reg lnincome age educgr

***Automatically see exponential of coefficients
svy: reg lnincome age educgr, eform(Exp. Coef.)

***Standardized regression coefficients
***(i.e., standardized partial slopes, beta-weights)
***The beta option does not allow the use of complex survey design
***Use pweight to maintain sample size and estimate robust standard errors
reg lnincome age educgr [pweight=perwt], beta

***Use aweight to estimate adjusted R-squared
***pweight and complex survey design omit sum of squares and adjusted R-squared
reg lnincome age educgr [aweight=perwt]

************************************
***Interpret coefficients with log of income
************************************

***When x increases by 1,
***y changes by 100*[exp(coefficient)-1] percent,
***controlling for the effects of all other independent variables

***Example of coefficient for age
di exp(0.0225)

***Percentage interpretation
di 100*(exp(0.0225)-1)

***When coefficient has a small magnitude,
***we can use 100*coefficient
di 100*(0.0225)

***Example of coefficient for education group (educgr)
di exp(0.3382)
di 100*(exp(0.3382)-1)
di 100*(0.3382)

************************************
***PREDICTED VALUES AND RESIDUAL ANALYSIS
************************************

************************************
***OLS with income, age, and education
************************************

svy: reg income age educgr

***Predicted income of someone with 45 years of age and college education
di -31880.99 + (796.34)*(45) + (16863.33)*(4)

***Save predicted values as a new variable after the estimation of a regression model
predict predincome
label variable predincome ""

***Scatterplot of predicted income by age
twoway (scatter predincome age)

***Scatterplot of predicted income by education
twoway (scatter predincome educgr) (lfit predincome educgr)

***Save residual values as a new variable after the estimation of a regression model
predict resincome, res
label variable resincome ""

***Scatterplot of residuals by predicted income
scatter resincome predincome, yline(0)
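
***Optional sketch (an addition, not part of the original steps):
***the prediction typed by hand above can also be computed from the
***coefficients that Stata stores after estimation (_b[varname]),
***which avoids rounding and retyping the numbers.
***Assumption: the last estimated model is still "svy: reg income age educgr"
di _b[_cons] + _b[age]*45 + _b[educgr]*4

***The margins command gives the same prediction with a standard error
margins, at(age=45 educgr=4)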

************************************
***OLS with natural logarithm of income, age, and education
************************************

svy: reg lnincome age educgr

***Predicted log of income of someone with 45 years of age and college education
di 8.3488 + (0.0225)*(45) + (0.3382)*(4)

***Exponential of predicted log of income
di exp(10.7141)

***Save predicted values as a new variable after the estimation of a regression model
predict predlnincome
label variable predlnincome ""

***Generate variable with exponential of predicted log of income
gen exppredlnincome = exp(predlnincome)

***Scatterplot of predicted log of income by age
twoway (scatter predlnincome age)

***Scatterplot of exponential of predicted log of income by age
twoway (scatter exppredlnincome age)

***Scatterplot of predicted log of income by education
twoway (scatter predlnincome educgr) (lfit predlnincome educgr)

***Scatterplot of exponential of predicted log of income by education
twoway (scatter exppredlnincome educgr)

***Save residual values as a new variable after the estimation of a regression model
predict reslnincome, res
label variable reslnincome ""

***Scatterplot of residuals by predicted log of income
scatter reslnincome predlnincome, yline(0)

***Browse data
browse age educgr income predincome resincome lnincome predlnincome reslnincome, nolabel

************************************
***OLS WITH SQUARED INDEPENDENT VARIABLE (AGE + AGE SQUARED)
************************************

***Generate variable with mean income by age
bysort age: egen mincage=mean(income) if income!=0
sum mincage, d

***Line graph of mean income by age
twoway line mincage age [fweight=perwt], ///
	ytitle("Mean wage and salary income") ylabel(0(20000)80000)

***Generate age squared variable
gen agesq = age * age

***OLS with natural logarithm of income, age, and age squared
svy: reg lnincome age agesq

***Save predicted values as a new variable after the estimation of a regression model
predict predlnincome2
label variable predlnincome2 ""

***Line graph of predicted log of income by age
line predlnincome2 age, sort

***Generate variable with exponential of predicted log of income
gen exppredlnincome2 = exp(predlnincome2)

***Line graph of exponential of predicted log of income by age
line exppredlnincome2 age, sort

***Save residual values as a new variable after the estimation of a regression model
predict reslnincome2, res
label variable reslnincome2 ""

***Scatterplot of residuals by predicted log of income
scatter reslnincome2 predlnincome2, yline(0)

***Scatterplot of residuals by exponential of predicted log of income
scatter reslnincome2 exppredlnincome2, yline(0)

************************************
***DUMMY VARIABLES
************************************

************************************
***Age
************************************

***Age does not have a normal distribution
hist age [fweight=perwt], percent normal

***Utilize age group variable
***16-19; 20-24; 25-34; 35-44; 45-54; 55-64; 65+
***Note: the contents() option is the syntax of Stata 16 and earlier;
***in Stata 17 or later, use statistic() instead, for example:
***table agegr, statistic(min age) statistic(max age) statistic(count age)
table agegr, contents(min age max age count age)

***Generate dummy variables for age (manually)
gen agegr16=0
replace agegr16=1 if agegr==16
tab agegr agegr16, m

gen agegr20=0
replace agegr20=1 if agegr==20
tab agegr agegr20, m

gen agegr25=0
replace agegr25=1 if agegr==25
tab agegr agegr25, m

gen agegr35=0
replace agegr35=1 if agegr==35
tab agegr agegr35, m

gen agegr45=0
replace agegr45=1 if agegr==45
tab agegr agegr45, m

gen agegr55=0
replace agegr55=1 if agegr==55
tab agegr agegr55, m

gen agegr65=0
replace agegr65=1 if agegr==65
tab agegr agegr65, m

***Generate dummy variables for age (automatically)
***The new variables agegr1-agegr7 follow the order of the categories
***(agegr1 = 16-19, ..., agegr7 = 65+)
tab agegr, gen(agegr)
tab agegr agegr1, m
tab agegr agegr2, m
tab agegr agegr3, m
tab agegr agegr4, m
tab agegr agegr5, m
tab agegr agegr6, m
tab agegr agegr7, m

***Choose reference category for age
***Use the category with the largest sample size as the reference (25–34)
tab agegr, m

***Or category with large sample and meaningful interpretation for your problem
***(age group with the highest average income: 45–54)
table agegr, c(mean income)

************************************
***Education
************************************

***Education does not have a normal distribution
hist educ [fweight=perwt], percent normal

***Utilize education group variable
***Less than high school; high school; some college; college; graduate school
tab educgr, m

***Generate dummy variables for education (automatically)
tab educgr, gen(educgr)
tab educgr educgr1, m
tab educgr educgr2, m
tab educgr educgr3, m
tab educgr educgr4, m
tab educgr educgr5, m

***Choose reference category for education
***Use the category with the largest sample size as the reference (high school)
tab educgr, m

************************************
***OLS with natural logarithm of income and dummy independent variables
************************************

***45-54 as reference group (agegr5): combination of large sample size and meaningful interpretation
tab agegr
table agegr, c(mean income)

***High school as reference group (educgr2): largest sample size
tab educgr

***Regression using dummies previously generated
svy: reg lnincome agegr1 agegr2 agegr3 agegr4 agegr6 agegr7 educgr1 educgr3 educgr4 educgr5

***Regression with dummies and reference indicated within "reg" command
***"i." marks a categorical variable, which Stata expands into dummy variables
***"ib#." sets category # as the reference (base) category
svy: reg lnincome ib45.agegr ib2.educgr

***Automatically see exponential of coefficients
svy: reg lnincome ib45.agegr ib2.educgr, eform(Exp. Coef.)
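
***Optional sketch (an addition, not part of the original steps):
***after the factor-variable regression above, Stata stores each
***coefficient as _b[level.variable], so the percentage difference
***relative to the reference category can be computed without
***retyping coefficients by hand.
***Assumption: the last estimated model is "svy: reg lnincome ib45.agegr ib2.educgr"

***College (educgr=4) compared to high school (educgr=2)
di 100*(exp(_b[4.educgr])-1)

***16-19 age group (agegr=16) compared to 45-54 age group (agegr=45)
di 100*(exp(_b[16.agegr])-1)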

************************************
***Interpret coefficients with log of income
************************************

***When x increases by 1,
***y changes by 100*[exp(coefficient)-1] percent,
***controlling for the effects of all other independent variables

***Example of coefficient for 16-19 age group
***compared to 45-54 age group
di exp(-2.2230)

***Percentage interpretation
di 100*(exp(-2.2230)-1)

***Since the coefficient for 16-19 age group has a large magnitude,
***we cannot use the approximation of 100*coefficient
di 100*(-2.2230)

***Example of coefficient for educgr4 (college)
***compared to educgr2 (high school)
di exp(0.5445)

***Percentage interpretation
di 100*(exp(0.5445)-1)

***Since the coefficient for college has a large magnitude,
***we cannot use the approximation of 100*coefficient
di 100*(0.5445)

************************************
***Standardized regression coefficients
************************************

***Standardized regression coefficients
***(i.e., standardized partial slopes, beta-weights)
***The beta option does not allow the use of complex survey design
***Use pweight to maintain sample size
reg lnincome ib45.agegr ib2.educgr [pweight=perwt], beta

************************************
***Predicted values and residual analysis
************************************

svy: reg lnincome ib45.agegr ib2.educgr

***Save predicted values as a new variable after the estimation of a regression model
predict predlnincome3
label variable predlnincome3 ""

***Generate variable with exponential of predicted log of income
gen exppredlnincome3 = exp(predlnincome3)

***Save residual values as a new variable after the estimation of a regression model
predict reslnincome3, res
label variable reslnincome3 ""

***Scatterplot of residuals by predicted log of income
scatter reslnincome3 predlnincome3, yline(0)

***Scatterplot of residuals by exponential of predicted log of income
scatter reslnincome3 exppredlnincome3, yline(0)

************************************
***FULL OLS MODEL
************************************

***Reference: sex (men = 0)
tab female

***Reference: race/ethnicity (white = 1)
tab raceth

***Reference: age group (45-54 = 45)
tab agegr
table agegr, c(mean income)

***Reference: education group (high school = 2)
tab educgr

***Reference: marital status (married = 1)
tab marital

***Reference: migration status (non-migrant = 1)
tab migrant

***OLS regression
svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant

***Automatically see exponential of coefficients
svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant, eform(Exp. Coef.)

***Standardized coefficients
reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant [pweight=perwt], beta

***Line graphs for predicted values don't look good with all these categorical variables
predict predlnincome4
line predlnincome4 agegr, sort
gen exppredlnincome4 = exp(predlnincome4)
line exppredlnincome4 age, sort

***Let's explore the Spost13 commands

************************************
***PREDICTED VALUES WITH SPOST13 COMMANDS
***From Long and Freese (2014)
************************************

***If your Stata doesn't have the Spost13 commands,
***type "net install spost13_ado.pkg" to install it.
*net install spost13_ado.pkg

***Full OLS model
svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant

************************************
***Predicted values by age group - ONLY WOMEN
************************************

***References: Female (female=1), White (raceth=1), 45-54 (agegr=45),
***High school (educgr=2), Married (marital=1), Non-migrant (migrant=1)
mgen, stub(F) at(agegr=(16 20 25 35 45 55 65) female=1 ///
	raceth=1 educgr=2 marital=1 migrant=1) allstats

***Predicted income
gen Fpredlnincome = exp(Fxb)

***Approximate standard error in dollars (delta method: exp(xb)*se)
gen Fsedollar = exp(Fxb)*Fse

***Label for age group
label values Fagegr agegr

***Graph: Log of income
graph bar Fxb, over(Fagegr) ///
	ytitle("Predicted log of income")

***Graph: Income
graph bar Fpredlnincome, over(Fagegr) ///
	ylabel(0(10000)50000) ///
	ytitle("Predicted income")

************************************
***Predicted values by age group - ONLY MEN
************************************

***References: Male (female=0), White (raceth=1), 45-54 (agegr=45),
***High school (educgr=2), Married (marital=1), Non-migrant (migrant=1)
mgen, stub(M) at(agegr=(16 20 25 35 45 55 65) female=0 ///
	raceth=1 educgr=2 marital=1 migrant=1) allstats

***Predicted income
gen Mpredlnincome = exp(Mxb)

***Approximate standard error in dollars (delta method: exp(xb)*se)
gen Msedollar = exp(Mxb)*Mse

***Label for age group
label values Magegr agegr

***Graph: Log of income
graph bar Mxb, over(Magegr) ///
	ytitle("Predicted log of income")

***Graph: Income
graph bar Mpredlnincome, over(Magegr) ///
	ylabel(0(10000)50000) ///
	ytitle("Predicted income")

************************************
***Predicted values by age group and sex
************************************

***References: both sexes (female=0 and female=1), White (raceth=1), 45-54 (agegr=45),
***High school (educgr=2), Married (marital=1), Non-migrant (migrant=1)
mgen, stub(A) at(agegr=(16 20 25 35 45 55 65) female=(0 1) ///
	raceth=1 educgr=2 marital=1 migrant=1) allstats

***Predicted income
gen Apredlnincome = exp(Axb)

***Approximate standard error in dollars (delta method: exp(xb)*se)
gen Asedollar = exp(Axb)*Ase

***Create interaction between age group and sex
gen agesex=.
replace agesex=1 if Aagegr==16 & Afemale==1 // 16-19 female
replace agesex=2 if Aagegr==16 & Afemale==0 // 16-19 male
replace agesex=3 if Aagegr==20 & Afemale==1 // 20-24 female
replace agesex=4 if Aagegr==20 & Afemale==0 // 20-24 male
replace agesex=5 if Aagegr==25 & Afemale==1 // 25-34 female
replace agesex=6 if Aagegr==25 & Afemale==0 // 25-34 male
replace agesex=7 if Aagegr==35 & Afemale==1 // 35-44 female
replace agesex=8 if Aagegr==35 & Afemale==0 // 35-44 male
replace agesex=9 if Aagegr==45 & Afemale==1 // 45-54 female
replace agesex=10 if Aagegr==45 & Afemale==0 // 45-54 male
replace agesex=11 if Aagegr==55 & Afemale==1 // 55-64 female
replace agesex=12 if Aagegr==55 & Afemale==0 // 55-64 male
replace agesex=13 if Aagegr==65 & Afemale==1 // 65+ female
replace agesex=14 if Aagegr==65 & Afemale==0 // 65+ male
tab agesex Aagegr, m
tab agesex Afemale, m

***Label for age group and sex variable
label define agesex 1 "Female, 16-19" 2 "Male, 16-19" 3 "Female, 20-24" 4 "Male, 20-24" ///
	5 "Female, 25-34" 6 "Male, 25-34" 7 "Female, 35-44" 8 "Male, 35-44" ///
	9 "Female, 45-54" 10 "Male, 45-54" 11 "Female, 55-64" 12 "Male, 55-64" ///
	13 "Female, 65+" 14 "Male, 65+"
label values agesex agesex

***Graph of predicted income by age and sex
graph bar Apredlnincome, over(agesex, label(angle(45))) ///
	ylabel(0(10000)50000) ///
	ytitle("Predicted income")

************************************
***Suggestion: export these predicted values to Excel
***Then, make better-looking graphs with dots for point estimates with confidence intervals
************************************

sort agesex
browse Axb-agesex

************************************
***Residual analysis
************************************

svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant

***Save predicted values as a new variable after the estimation of a regression model
predict predlnincome5
label variable predlnincome5 ""

***Generate variable with exponential of predicted log of income
gen exppredlnincome5 = exp(predlnincome5)

***Save residual values as a new variable after the estimation of a regression model
predict reslnincome5, res
label variable reslnincome5 ""

***Scatterplot of residuals by predicted log of income
scatter reslnincome5 predlnincome5, yline(0)

***Scatterplot of residuals by exponential of predicted log of income
scatter reslnincome5 exppredlnincome5, yline(0)

************************************
***TEST OF COLLINEARITY WITH VARIANCE INFLATION FACTOR (VIF)
************************************

***The VIF estimates how much the variance of a coefficient is inflated
***by multicollinearity in the linear regression.
***Collinearity increases standard errors,
***i.e., it produces smaller test statistics (smaller t-values)
***Rule of thumb: VIF > 5 indicates multicollinearity
***Rule of thumb: VIF > 10 indicates severe multicollinearity

***OLS model with pweight, because VIF does not allow complex survey design
reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant [pweight=perwt]

***Calculate variance inflation factors (VIFs) for the independent variables
***specified in the previous linear regression model
vif

***Variance equals the standard error squared.
***A VIF of 1.50 for the 16-19 age group means that
***the standard error of its coefficient is 1.22 times larger (square root of VIF)
***than it would have been if this variable were not correlated
***with any of the other independent variables in the model.

***Estimate the square root of VIF
di sqrt(1.50)

***Example with age squared
reg lnincome i.female age agesq ib2.educgr i.raceth i.marital i.migrant [pweight=perwt]

***Estimate VIF from previous model
vif

***Square root of VIF is high for age and age squared.
***In this case, this is not a problem because
***we intentionally included these variables to estimate
***the quadratic association of experience in the labor market with earnings
di sqrt(38.90)
di sqrt(35.77)

************************************
***EXPORT RESULTS TO WORD/EXCEL WITH OUTREG2 COMMAND
************************************

***If your Stata doesn't have the outreg2 command,
***type "ssc install outreg2" to install it.
*ssc install outreg2

************************************
***Sex, age group, education group
************************************

svy: reg lnincome i.female ib45.agegr ib2.educgr

***Export to Excel
outreg2 using "$output\OLS.xls", replace excel // Windows
outreg2 using "$output/OLS.xls", replace excel // Macintosh

************************************
***Sex, age group, education group, race/ethnicity
************************************

svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth

***Export to Excel
outreg2 using "$output\OLS.xls", append excel // Windows
outreg2 using "$output/OLS.xls", append excel // Macintosh

************************************
***Sex, age group, education group, race/ethnicity, marital status
************************************

svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital

***Export to Excel
outreg2 using "$output\OLS.xls", append excel // Windows
outreg2 using "$output/OLS.xls", append excel // Macintosh

************************************
***Sex, age group, education group, race/ethnicity, marital status, migration status
************************************

svy: reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant

***Export to Excel
outreg2 using "$output\OLS.xls", append excel // Windows
outreg2 using "$output/OLS.xls", append excel // Macintosh

************************************
***Standardized coefficients
************************************

***Outreg2 doesn't allow pweight to estimate standardized coefficients
reg lnincome i.female ib45.agegr ib2.educgr i.raceth i.marital i.migrant [aweight=perwt], beta

***Export to Excel
outreg2 using "$output\OLS.xls", append excel stat(beta) // Windows
outreg2 using "$output/OLS.xls", append excel stat(beta) // Macintosh

************************************
***CLOSING COMMANDS
************************************

***Save data
save "Stata05.dta", replace

***Close log
log close