I am a young and new Stata user trying to learn how to draw conclusions from datasets.
I have a small set of survey responses (attached) that I am trying to analyze.
As I am self-taught, could a more experienced user review my work/reasoning and offer corrections/suggestions?
I am interested in exploring the following variables: Q2, Q11, Q14, and Q15.
I will focus on Q14 (social media as a news source) and Q15 (trust in the news).
I see that missing responses are coded as negative values, which I will recode to missing so that I can run correlations and regressions.
I am interested in how strongly related these variables are.
A Pearson’s correlation indicates a moderate positive correlation between Q14 and Q15 (r = .6, p < .00005), with social media as a news source explaining 36% of the variation in trusting the news.
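For reference, the 36% figure is the squared correlation coefficient, which equals the share of variance explained only in a simple linear relationship between the two variables:

```latex
r = 0.6 \quad\Longrightarrow\quad r^2 = (0.6)^2 = 0.36 \approx 36\%
```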
I run regressions using robust standard errors to control for heteroskedasticity. The model explains nearly 40% of the variance in trust and shows a statistically significant relationship between Q15 and Q14 (p < .00005). The RMSE value (.61) indicates that the model can predict the data fairly accurately.
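Whether an RMSE of .61 counts as accurate depends on the scale and spread of Q15. One rough way to judge it, sketched here under assumptions, is to compare the RMSE with the standard deviation of the outcome in the estimation sample. The built-in auto dataset and its variables (price, mpg, foreign) are stand-ins so that the sketch runs on its own; they are not the survey data.

```stata
* Stand-in data: the built-in auto dataset replaces the survey here
sysuse auto, clear

* Any OLS model with robust standard errors (stand-in for the Q15 model)
regress price mpg i.foreign, vce(robust)

* Spread of the outcome among the observations actually used in the model
quietly summarize price if e(sample)

* A ratio well below 1 means the model predicts better than the mean alone
display "RMSE = " e(rmse) "  SD of outcome = " r(sd) "  ratio = " e(rmse)/r(sd)
```

With the survey data in memory, the regress line would be replaced by the model from this post and price by Q15.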
The degree to which news is acquired from social media, ideology, race, gender, and generation (Generation X and Baby Boomers) are statistically significant in explaining trust.
None of the predictors have VIF > 10 or 1/VIF < .1, suggesting that there is no multicollinearity.
I run a correlation matrix for all variables in the model. There is a strong relationship between using social media as a news source and trusting the news, more moderate correlations between generation and ideology, and inverse relationships for gender and race.
I can conclude that the degree to which one trusts social media as a news source depends on their race, gender, age, and ideology.
I am interested to know:
- What errors I am making
- What issues I am not addressing
- How my thinking can be more sophisticated
- Whether conclusions can be drawn from this regression model
I will explore descriptive statistics to become familiar with the data.
Code:
codebook
sum
misstable sum
Code:
tab1 Q2 Q11 Q14 Q15, miss
bysort Q11: tab Q2 Q14
bysort Q15: tab Q2 Q11
Code:
sum Q14, detail
sum Q15, detail
tab Q14 Q15, chi2 lrchi2
Code:
replace Q14 = . if Q14 < 0
replace Q15 = . if Q15 < 0
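A minimal sketch of how the recode above could be checked. The toy data and the specific negative codes (-1 and -9) are assumptions for illustration only; the actual missing-value codes in the survey are not stated in this post.

```stata
* Hypothetical toy data standing in for the survey responses
* (-1 and -9 are assumed missing-value codes)
clear
input Q14 Q15
1 2
3 3
-1 4
2 -9
4 4
end

* Recode every negative value to system missing in both variables
foreach v of varlist Q14 Q15 {
    replace `v' = . if `v' < 0
}

* Confirm no negative codes survive; missing values sort above all numbers,
* so they pass these checks
assert Q14 >= 0
assert Q15 >= 0

* Report how many responses are now missing in each variable
misstable summarize Q14 Q15
```

If the negative codes have different meanings (for example refused versus not asked), mvdecode with extended missing values such as .a and .b would keep them distinguishable.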
Code:
pwcorr Q14 Q15, sig star(.05)
I will further explore this relationship by building the following models:
Code:
* asdoc is user-written; install it once with: ssc install asdoc
set showbaselevels on
asdoc reg Q15 i.Q14, robust nest append
asdoc reg Q15 i.Q14 i.ideocat, robust nest append
asdoc reg Q15 i.Q14 i.ideocat i.gencat i.racecat i.gender, robust nest append
estat vif
Code:
pwcorr Q15 Q14 gencat ideocat racecat gender, sig star(.05)
