
  • OLS regression

    Currently, I am doing an analysis of the extent of accrual accounting disclosures over three financial years for 26 organisations. The dependent variable (i.e. accrual accounting disclosures) is measured using dichotomous scoring. The five independent variables consist of two dummy variables (1, 0), two categorical variables labelled 1, 2 and 3, and one continuous variable (i.e. revenue), which is transformed into a natural logarithm. I am contemplating using Stata to run the OLS regression.

    My question is whether it is possible to run an OLS regression if the independent variables include more than two dummy/categorical variables. Does this have any impact on normality, heteroscedasticity or serial correlation?

    Kindly advise.

  • #2
    AFAIK that's all fine.


    • #3
      Your variable is a dummy, so you are talking about the probability of adopting the accounting disclosures given a number of other variables.
      If you had only dummies as independent variables, OLS (which in this case is called the linear probability model) would be fine. With other types of variables, it can be argued that the LPM is not the best model (see, for example, Horrace, W. C., and R. L. Oaxaca. 2006. "Results on the Bias and Inconsistency of Ordinary Least Squares for the Linear Probability Model." Economics Letters, 90, 321-327). As you also have categorical and continuous variables, you should think about probit or logit (some people disagree, but this is what I would do).
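
      For illustration, a minimal sketch of what a probit or logit specification might look like (all variable names below are hypothetical):

      Code:
      * binary disclosure indicator on two dummies, two factor-coded categorical variables and log revenue
      probit disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue
      logit disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue
      The i. prefix tells Stata to treat the categorical predictors as sets of indicators rather than as continuous scores.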

      About your other questions: you need to test for heteroscedasticity and serial correlation. If they are present, you need to check whether there are remedies (or, if not, how to account for them). The presence of such issues will depend on the design of your experiment and on how your data were collected.


      • #4
        PS. I understand you have 26 organizations x 3 years = fewer than 100 observations. If that is the case, you need to take that into account too.


        • #5
          Hadysyam:
          welcome to the list.
          Since you seem to be dealing with panel data, why not consider -xtlogit- (or -xtreg- for a linear probability model)?
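          A minimal sketch of that setup (the panel identifier, time variable and regressor names below are all hypothetical):
          Code:
          * declare the panel structure
          xtset orgid year
          * random-effects logit for a binary disclosure indicator
          xtlogit disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue, re
          * or a random-effects linear probability model
          xtreg disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue, re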
          Kind regards,
          Carlo
          (Stata 19.0)


          • #6
            I have a few questions before deciding to deploy Stata:

            (i) It seems that some accounting disclosure studies have used OLS panel regression with clustered robust standard errors. May I know what the advantage of adopting this method is?
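
            For reference, a minimal sketch of what such a specification looks like (the variable names and the organisation identifier used for clustering are hypothetical):

            Code:
            * pooled OLS with standard errors clustered by organisation
            regress disclosure x1 x2 x3, vce(cluster orgid)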

            (ii) For the two categorical variables which are labelled 1, 2 and 3 (e.g. 3 is given to full accrual features of the computerized accounting system, 2 to partial accrual features and 1 to non-accrual features), do I have to create dummy variables before running the OLS linear regression?

            (iii) In the case of the dependent variable (a dichotomous score of '1' is awarded if an accrual accounting item is disclosed, and '0' otherwise), do I have to transform the variable if the assumption of normality is not met? What type of transformation should be used after performing the 'ladder' step? Is the transformation with the smallest chi-square (e.g. reciprocal cube) chosen? Cooke. 1998. "Regression Analysis in Accounting Disclosure Studies". Accounting and Business Research, 28 (3), 209-224 suggests that a possible transformation in disclosure studies is the log of the odds ratio of the dependent variable.

            (iv) Before performing the OLS panel regression with clustered robust standard errors, I presume the tests outlined below need to be executed (a sketch of the corresponding commands follows this list):
            - White's test and the Breusch-Pagan test to check for heteroscedasticity,
            - the Shapiro-Wilk test and a kernel density estimate for the assumption of normality, and
            - the Variance Inflation Factor for multicollinearity diagnostics
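
            A minimal sketch of those diagnostics in Stata, run after a pooled OLS fit (variable names are hypothetical):

            Code:
            * pooled OLS, then postestimation diagnostics
            regress disclosure x1 x2 x3
            estat imtest, white      // White's test for heteroscedasticity
            estat hettest            // Breusch-Pagan test for heteroscedasticity
            estat vif                // variance inflation factors
            * normality of the residuals
            predict ehat, residuals
            swilk ehat               // Shapiro-Wilk test
            kdensity ehat, normal    // kernel density against a normal curve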

            (v) In my PhD study, the study population involves only 26 organisations x 3 years of financial statements, which is fewer than 100 observations. Does this have any impact on the results of the regression analysis?

            Please enlighten me on the issues mentioned above, as I am still a novice in academia and in Stata as well.

            Kind regards,

            Hadysyam


            • #7
              Regarding your question number (ii), the answer is "no, you don't need to". You can do

              Code:
              i.categoricalvariable
              if you want to consider your first category as the base category.

              If you want to consider another category as the base category, you just need to type

              Code:
              b2.categoricalvariable
              b3.categoricalvariable
              for the second and the third category as base, respectively.
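
              For instance, a sketch of how this factor-variable notation fits into a full command (variable names here are hypothetical):

              Code:
              * catvar1 with its default base (lowest level), catvar2 with level 2 as the base
              regress disclosure i.catvar1 ib2.catvar2 i.dummy1 i.dummy2 lnrevenue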

              For your question (iii): if you are talking about the error term, normality is an assumption about the errors, not something the dependent variable itself has to satisfy.

              For your question (v): you need to search for works dealing with small samples and check what they recommend.


              • #8
                Hadysyam:
                (i): there's no actual gain in preferring OLS to -xtreg-, unless the F-test at the foot of the -xtreg- output table lacks statistical significance (please see the examples under the -xtreg- entry in the Stata .pdf manual);
                (iv): -regress postestimation- tests should be performed after OLS is run. Besides, if you impose clustered standard errors, you cannot investigate heteroskedasticity via -estat hettest- (see the sketch below);
                (v): with 26 clusters, you should not exceed 3 predictors (rule of thumb: 1 predictor every 10 clusters).
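                A sketch of the sequence implied by (iv): run the heteroskedasticity test on the unclustered fit, then re-estimate with clustered standard errors (variable names are hypothetical):
                Code:
                * plain OLS first, so that -estat hettest- is available
                regress depvar x1 x2 x3
                estat hettest
                * then re-estimate with standard errors clustered by organisation
                regress depvar x1 x2 x3, vce(cluster orgid)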
                Kind regards,
                Carlo
                (Stata 19.0)


                • #9
                  Dear Carlo & Mari,

                  Thank you for your reply. Once I have purchased the Small Stata version, I will get back to you all for further advice.

                  I believe the book 'A Gentle Introduction to Stata' (Fifth Edition, Acock, 2016) will help me a great deal in mastering the art of handling Stata.

                  Regards,

                  Hadysyam


                  • #10
                    Hadysyam:
                    I would also recommend Cameron and Trivedi's textbook "Microeconometrics Using Stata" (for further details, take a look at the Stata Bookstore).
                    Kind regards,
                    Carlo
                    (Stata 19.0)


                    • #11
                      Dear all Statalist members,

                      First of all, I would like to apologize for the lengthy posting. As I highlighted before, I am analysing the extent of accrual accounting disclosures of 26 local authorities in Malaysia, concentrating on annual financial statements which have now expanded to 4 financial years (104 observations). The dependent variable (SLADI) is the ratio of accrual accounting disclosures, measured using dichotomous scores of 0 and 1. There are 5 independent variables: (i) technology infrastructure (TI), labelled 1, 2 and 3; (ii) personnel qualification (QP), labelled 1, 2 and 3; (iii) size (SZ), the natural logarithm of revenue; (iv) audit size (AI), a dummy variable of 0 and 1; and (v) regulations (RG), a dummy variable of 0 and 1.

                      The descriptive statistics of the dependent variable are as follows, and indicate that it is not normally distributed:
                      Median .3182
                      Mean .3636192
                      Std. Dev. .1519788
                      Variance .0230976
                      Skewness 3.153982
                      Kurtosis 11.00086

                      As such, a transformation of the variable was attempted, starting with the 'ladder sladi' command to help in the process:
                      Transformation formula chi2(2) P(chi2)
                      ------------------------------------------------------------------
                      cubic sladi^3 62.84 0.000
                      square sladi^2 62.72 0.000
                      identity sladi 62.43 0.000
                      square root sqrt(sladi) 62.14 0.000
                      log log(sladi) 61.69 0.000
                      1/(square root) 1/sqrt(sladi) 61.03 0.000
                      inverse 1/sladi 60.12 0.000
                      1/square 1/(sladi^2) 57.41 0.000
                      1/cubic 1/(sladi^3) 53.47 0.000

                      * Do I have to select the smallest chi-square?

                      Since the dependent variable is a ratio constructed from dichotomous scores and is bounded between 0 and 1, and to avoid the multivariate OLS becoming an ineffective estimation technique, many previous studies (e.g. Ahmed and Nicholls, 1994, and Cooke, 1998) have performed a logit transformation of the dependent variable, which I have also done in the analysis (a sketch of this transformation in Stata follows the statistics below). The results are as follows:
                      Median -1.145075
                      Mean -1.061349
                      Std. Dev. .2744756
                      Variance .0753369
                      Skewness 3.114155
                      Kurtosis 10.83984
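
                      A minimal sketch of that transformation in Stata (assuming sladi lies strictly between 0 and 1 and that lsladi is the name given to the transformed variable):

                      Code:
                      * logit (log-odds) transformation of the bounded disclosure ratio
                      generate lsladi = ln(sladi/(1 - sladi))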

                      In order to select a suitable model for the linear panel regression, the following steps were conducted and the results are as follows:
                      1. Pooled OLS
                      . regress lsladi ti lsz qp ai rg

                      2. Pooled OLS versus Random Effects
                      . xtreg lsladi ti lsz qp ai rg, re

                      * The p-value is < 0.05; thus the random-effects model is chosen over pooled OLS, indicating organisation-specific effects in the data

                      3. Breusch-Pagan LM test
                      . xttest0

                      Prob > chibar2 = 0.0000


                      4. Random versus Fixed Effects Model: Hausman Test
                      . xtreg lsladi ti lsz qp ai rg, fe

                      note: rg omitted because of collinearity

                      Fixed-effects (within) regression Number of obs = 104
                      Group variable: code Number of groups = 26

                      R-sq: Obs per group:
                      within = 0.0042 min = 4
                      between = 0.2336 avg = 4.0
                      overall = 0.2319 max = 4

                      F(4,74) = 0.08
                      corr(u_i, Xb) = -0.5142 Prob > F = 0.9888

                      ------------------------------------------------------------------------------
                      lsladi | Coef. Std. Err. t P>|t| [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                      ti | .0025058 .0062604 0.40 0.690 -.0099684 .0149799
                      lsz | -.0071901 .0168941 -0.43 0.672 -.0408524 .0264722
                      qp | -.0014392 .0147954 -0.10 0.923 -.0309198 .0280414
                      ai | -.0002862 .0065554 -0.04 0.965 -.0133481 .0127757
                      rg | 0 (omitted)
                      _cons | -.956881 .2544114 -3.76 0.000 -1.463807 -.4499553
                      -------------+----------------------------------------------------------------
                      sigma_u | .28428561
                      sigma_e | .01412062
                      rho | .99753891 (fraction of variance due to u_i)
                      ------------------------------------------------------------------------------
                      F test that all u_i=0: F(25, 74) = 1004.71 Prob > F = 0.0000

                      * What should be done about 'rg', which is omitted due to collinearity?

                      . est store fixed

                      . xtreg lsladi ti lsz qp ai rg, re

                      Random-effects GLS regression Number of obs = 104
                      Group variable: code Number of groups = 26

                      R-sq: Obs per group:
                      within = 0.0004 min = 4
                      between = 0.9921 avg = 4.0
                      overall = 0.9901 max = 4

                      Wald chi2(5) = 2676.26
                      corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

                      ------------------------------------------------------------------------------
                      lsladi | Coef. Std. Err. z P>|z| [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                      ti | .0025326 .0056746 0.45 0.655 -.0085894 .0136547
                      lsz | -.0017625 .0052271 -0.34 0.736 -.0120074 .0084824
                      qp | .0136659 .009894 1.38 0.167 -.005726 .0330579
                      ai | .0007057 .0059986 0.12 0.906 -.0110513 .0124627
                      rg | 1.000169 .023057 43.38 0.000 .9549783 1.04536
                      _cons | -1.137271 .0663371 -17.14 0.000 -1.267289 -1.007253
                      -------------+----------------------------------------------------------------
                      sigma_u | .02611504
                      sigma_e | .01412062
                      rho | .77377484 (fraction of variance due to u_i)
                      ------------------------------------------------------------------------------

                      . hausman fixed

                      ---- Coefficients ----
                      | (b) (B) (b-B) sqrt(diag(V_b-V_B))
                      | fixed . Difference S.E.
                      -------------+----------------------------------------------------------------
                      ti | .0025058 .0025326 -.0000269 .0026442
                      lsz | -.0071901 -.0017625 -.0054276 .0160652
                      qp | -.0014392 .0136659 -.0151051 .0110006
                      ai | -.0002862 .0007057 -.0009919 .0026439
                      ------------------------------------------------------------------------------
                      b = consistent under Ho and Ha; obtained from xtreg
                      B = inconsistent under Ha, efficient under Ho; obtained from xtreg

                      Test: Ho: difference in coefficients not systematic

                      chi2(4) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      = 2.03
                      Prob>chi2 = 0.7298

                      * The p-value is > 0.05; thus the study uses the random-effects model

                      5. Diagnostic checks:

                      (i) Multicollinearity
                      . regress lsladi ti lsz qp ai rg
                      . vif

                      Variable | VIF 1/VIF
                      -------------+----------------------
                      qp | 5.26 0.190003
                      lsz | 4.97 0.201375
                      ti | 1.71 0.584359
                      rg | 1.48 0.677186
                      ai | 1.17 0.857526
                      -------------+----------------------
                      Mean VIF | 2.92

                      (ii) Heteroskedasticity
                      . xtreg lsladi ti lsz qp ai rg, fe
                      . xttest3

                      Modified Wald test for groupwise heteroskedasticity
                      in fixed effect regression model

                      H0: sigma(i)^2 = sigma^2 for all i

                      chi2 (26) = 5.4e+09
                      Prob>chi2 = 0.0000

                      (iii) Serial correlation
                      . xtserial lsladi ti lsz qp ai rg

                      Wooldridge test for autocorrelation in panel data
                      H0: no first-order autocorrelation
                      F( 1, 25) = 5.028
                      Prob > F = 0.0341


                      * The diagnostic checks indicate heteroskedasticity and serial correlation problems, as both p-values are < 0.05

                      6. To rectify: perform OLS with heteroskedasticity- and serial-correlation-robust (clustered) standard errors
                      . regress lsladi ti lsz qp ai rg, cluster (code)

                      Linear regression Number of obs = 104
                      F(5, 25) = 11362.50
                      Prob > F = 0.0000
                      R-squared = 0.9904
                      Root MSE = .02751

                      (Std. Err. adjusted for 26 clusters in code)
                      ------------------------------------------------------------------------------
                      | Robust
                      lsladi | Coef. Std. Err. t P>|t| [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                      ti | .0019701 .0077007 0.26 0.800 -.0138897 .0178299
                      lsz | -.007022 .0063009 -1.11 0.276 -.0199989 .0059549
                      qp | .0271579 .0165082 1.65 0.112 -.0068414 .0611571
                      ai | .0031823 .0098827 0.32 0.750 -.0171714 .0235361
                      rg | .9947972 .0248116 40.09 0.000 .9436968 1.045898
                      _cons | -1.079265 .0766559 -14.08 0.000 -1.23714 -.9213888
                      ------------------------------------------------------------------------------


                      From the above results, I have a few questions for the Statalist members.
                      1. What is the best test to assess the assumption of normality? Is the test confined only to the dependent variable? If the results still show a lack of normality in the residual errors after performing the data transformation (e.g. a log transformation), does this affect the OLS regression results?

                      2. Does the measurement of the variables (1 continuous variable, 2 categorical variables and 2 dummy variables) impact the OLS analysis?

                      3. Lastly, based on the steps shown above, am I on the right track with the Stata commands?

                      Thank you and I really hope that I will get a favourable reply.

                      Regards,

                      Hadysyam


                      • #12
                        Dear all,

                        May I get some comments from Statalist members on the above queries?

                        Your valuable input would be highly appreciated, as it would help me address my difficulties in conducting the multivariate analysis.

                        Regards,

                        Hadysyam


                        • #13
                          Hadysyam:
                          your query has not received any reply so far because, I assume, it is too long.
                          You would be better off re-posting a shorter version of it, focusing on one or two topics.
                          I would also recommend that you read the FAQ about how to post more effectively and how to report what you typed and what Stata gave you back via CODE delimiters. Thanks.
                          Kind regards,
                          Carlo
                          (Stata 19.0)


                          • #14
                            Dear all Statalist members,

                            Sorry for the lengthy posting. The data are a balanced panel with N = 26 and T = 4, giving a total of 104 observations.

                            The dependent variable is measured as a ratio that is bounded between 0 and 1. Based on my understanding of many previous disclosure studies, when the dependent variable takes values between 0 and 1, the multivariate OLS model becomes an ineffective estimation technique. To counter this, I have applied a natural logarithmic transformation to reduce the effect of skewness. However, the results still indicate a non-normal distribution, with skewness = 3.114155 and kurtosis = 10.83984.

                            On the other hand, the independent variables consist of 2 categorical variables (coded 1, 2 and 3), 1 continuous variable (the natural logarithm of total revenue) and 2 dummy variables (coded 0 and 1).

                            Given the measurement of the dependent and independent variables as stated above, what is the best method for analysing the effect of the independent variables on the dependent variable (i.e. the extent of accrual accounting disclosure)?

                            Regards,

                            Hadysyam


                            • #15
                              Hadysyam:
                              thanks for providing more details (posting what you typed and what Stata gave you back remains the best approach to let other listers know what you're after, though).
                              Some remarks about your updates:
                              - if you have a long N, short T panel dataset, you would be better off with -xtreg-, as your -depvar- is a continuous one (a sketch follows below);
                              - it is not among the OLS prerequisites that -depvar- should follow a normal distribution (whereas the residuals should);
                              - I'm not clear on the role of the dummies: do you mean that -depvar- is the ratio between the 2 dummies (measurement of 0 and 1)? Or what else?
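                              A minimal sketch of what that -xtreg- specification might look like, reusing the variable names from your output above (the time variable 'year' and the choice of clustered standard errors are my assumptions):
                              Code:
                              xtset code year
                              xtreg lsladi i.ti i.qp lsz i.ai i.rg, re vce(cluster code)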
                              Kind regards,
                              Carlo
                              (Stata 19.0)
