
  • OLS regression

    Currently, I am doing an analysis of the extent of accrual accounting disclosures over three financial years for 26 organisations. The dependent variable (i.e. accrual accounting disclosures) is measured using dichotomous scoring. The five independent variables consist of two dummy variables (1, 0), two categorical variables labelled 1, 2 and 3, and one continuous variable (i.e. revenue), which is transformed into a natural logarithm. I am contemplating using Stata to run the OLS regression.

    My question is whether it is possible to run an OLS regression if the independent variables include more than two dummy/categorical variables. Does this have any impact on normality, heteroscedasticity or serial correlation?

    Kindly advise.

  • #2
    AFAIK that's all fine.


    • #3
      Your variable is a dummy, so you are talking about the probability of adopting the accounting disclosures given a number of other variables.
      If you had only dummies as independent variables, OLS (which in this case is called the linear probability model) would be fine. With other types of variables, it can be argued that the LPM is not the best model (see, for example, Horrace, W. C., and R. L. Oaxaca. 2006. "Results on the Bias and Inconsistency of Ordinary Least Squares for the Linear Probability Model." Economics Letters, 90, 321-327). As you also have categorical and continuous variables, you should think about probit or logit (some people disagree, but this is what I would do).
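
      For illustration, a minimal sketch of what a probit or logit specification might look like (all variable names below are hypothetical):

      Code:
      * binary disclosure indicator on two dummies, two factor-coded categorical variables and log revenue
      probit disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue
      logit disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue
      The i. prefix tells Stata to treat the categorical predictors as sets of indicators rather than as continuous scores.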

      About your other questions: you need to test for heteroscedasticity and serial correlation. If they are present, you need to check whether there are remedies (or, if not, how to account for them). The presence of such issues will depend on the design of your experiment and on how your data were collected.


      • #4
        PS. I understand you have 26 organizations x 3 years = fewer than 100 observations. If that is the case, you need to take that into account too.


        • #5
          Hadysyam:
          welcome to the list.
          Since you seem to be dealing with panel data, why not consider -xtlogit- (or -xtreg- for a linear probability model)?
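          A minimal sketch of that setup (the panel identifier, time variable and regressor names below are all hypothetical):
          Code:
          * declare the panel structure
          xtset orgid year
          * random-effects logit for a binary disclosure indicator
          xtlogit disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue, re
          * or a random-effects linear probability model
          xtreg disclosed i.dummy1 i.dummy2 i.cat1 i.cat2 lnrevenue, re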
          Kind regards,
          Carlo
          (Stata 19.0)


          • #6
            I have a few questions before deciding to deploy Stata:

            (i) It seems that some accounting disclosure studies have used OLS panel regression with clustered robust standard errors. May I know what the advantage of adopting this method is?
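
            For reference, a minimal sketch of what such a specification looks like (the variable names and the organisation identifier used for clustering are hypothetical):

            Code:
            * pooled OLS with standard errors clustered by organisation
            regress disclosure x1 x2 x3, vce(cluster orgid)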

            (ii) For the two categorical variables which are labelled 1, 2 and 3 (e.g. 3 is given to full accrual features of the computerized accounting system, 2 to partial accrual features and 1 to non-accrual features), do I have to create dummy variables before running the OLS linear regression?

            (iii) In the case of the dependent variable (a dichotomous score of '1' is awarded if an accrual accounting item is disclosed, and '0' otherwise), do I have to transform the variable if the assumption of normality is not met? What type of transformation should be used after performing the 'ladder' step? Is the transformation with the smallest chi-square (e.g. reciprocal cube) chosen? Cooke. 1998. "Regression Analysis in Accounting Disclosure Studies". Accounting and Business Research, 28 (3), 209-224 suggests that a possible transformation in disclosure studies is the log of the odds ratio of the dependent variable.

            (iv) Before performing the OLS panel regression with clustered robust standard errors, I presume the tests outlined below need to be executed (a sketch of the corresponding commands follows this list):
            - White's test and the Breusch-Pagan test to check for heteroscedasticity,
            - the Shapiro-Wilk test and a kernel density estimate for the assumption of normality, and
            - the Variance Inflation Factor for multicollinearity diagnostics
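
            A minimal sketch of those diagnostics in Stata, run after a pooled OLS fit (variable names are hypothetical):

            Code:
            * pooled OLS, then postestimation diagnostics
            regress disclosure x1 x2 x3
            estat imtest, white      // White's test for heteroscedasticity
            estat hettest            // Breusch-Pagan test for heteroscedasticity
            estat vif                // variance inflation factors
            * normality of the residuals
            predict ehat, residuals
            swilk ehat               // Shapiro-Wilk test
            kdensity ehat, normal    // kernel density against a normal curve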

            (v) In my PhD study, the study population involves only 26 organisations x 3 years of financial statements, which is fewer than 100 observations. Does this have any impact on the results of the regression analysis?

            Please enlighten me on the issues mentioned above, as I am still a novice in academia and in Stata as well.

            Kind regards,

            Hadysyam


            • #7
              Regarding your question number (ii), the answer is "no, you don't need to". You can do

              Code:
              i.categoricalvariable
              if you want to consider your first category as the base category.

              If you want to consider another category as the base category, you just need to type

              Code:
              b2.categoricalvariable
              b3.categoricalvariable
              for the second and the third category as base, respectively.
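
              For instance, a sketch of how this factor-variable notation fits into a full command (variable names here are hypothetical):

              Code:
              * catvar1 with its default base (lowest level), catvar2 with level 2 as the base
              regress disclosure i.catvar1 ib2.catvar2 i.dummy1 i.dummy2 lnrevenue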

              For your question (iii): if you are talking about the error term, normality is an assumption about the errors, not something the dependent variable itself has to satisfy.

              For your question (v): you need to search for works dealing with small samples and check what they recommend.


              • #8
                Hadysyam:
                (i): there's no actual gain in preferring OLS to -xtreg-, unless the F-test at the foot of the -xtreg- output table lacks statistical significance (please see the examples under the -xtreg- entry in the Stata .pdf manual);
                (iv): -regress postestimation- tests should be performed after OLS is run. Besides, if you impose clustered standard errors, you cannot investigate heteroskedasticity via -estat hettest- (see the sketch below);
                (v): with 26 clusters, you should not exceed 3 predictors (rule of thumb: 1 predictor every 10 clusters).
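                A sketch of the sequence implied by (iv): run the heteroskedasticity test on the unclustered fit, then re-estimate with clustered standard errors (variable names are hypothetical):
                Code:
                * plain OLS first, so that -estat hettest- is available
                regress depvar x1 x2 x3
                estat hettest
                * then re-estimate with standard errors clustered by organisation
                regress depvar x1 x2 x3, vce(cluster orgid)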
                Kind regards,
                Carlo
                (Stata 19.0)


                • #9
                  Dear Carlo & Mari,

                  Thank you for your reply. Once I have purchased the Small Stata version, I will get back to you all for further advice.

                  I believe the book 'A Gentle Introduction to Stata' (Fifth Edition, Acock, 2016) will help me a great deal in mastering the art of handling Stata.

                  Regards,

                  Hadysyam


                  • #10
                    Hadysyam:
                    I would also recommend Cameron and Trivedi's textbook "Microeconometrics Using Stata" (for further details, take a look at the Stata Bookstore).
                    Kind regards,
                    Carlo
                    (Stata 19.0)


                    • #11
                      Dear all Statalist members,

                      First of all, I would like to apologize for the lengthy posting. As I highlighted before, I am analysing the extent of accrual accounting disclosures of 26 local authorities in Malaysia, concentrating on annual financial statements which have now expanded to 4 financial years (104 observations). The dependent variable (SLADI) is the ratio of accrual accounting disclosures, measured using dichotomous scores of 0 and 1. There are 5 independent variables: (i) technology infrastructure (TI), labelled 1, 2 and 3; (ii) personnel qualification (QP), labelled 1, 2 and 3; (iii) size (SZ), the natural logarithm of revenue; (iv) audit size (AI), a dummy variable of 0 and 1; and (v) regulations (RG), a dummy variable of 0 and 1.

                      The descriptive statistics of the dependent variable are as follows, and indicate that it is not normally distributed:
                      Median .3182
                      Mean .3636192
                      Std. Dev. .1519788
                      Variance .0230976
                      Skewness 3.153982
                      Kurtosis 11.00086

                      As such, a transformation of the variable was attempted, starting with the 'ladder sladi' command to help in the process:
                      Transformation formula chi2(2) P(chi2)
                      ------------------------------------------------------------------
                      cubic sladi^3 62.84 0.000
                      square sladi^2 62.72 0.000
                      identity sladi 62.43 0.000
                      square root sqrt(sladi) 62.14 0.000
                      log log(sladi) 61.69 0.000
                      1/(square root) 1/sqrt(sladi) 61.03 0.000
                      inverse 1/sladi 60.12 0.000
                      1/square 1/(sladi^2) 57.41 0.000
                      1/cubic 1/(sladi^3) 53.47 0.000

                      * Do I have to select the smallest chi-square?

                      Since the dependent variable is a ratio constructed from dichotomous scores and is bounded between 0 and 1, and to avoid the multivariate OLS becoming an ineffective estimation technique, many previous studies (e.g. Ahmed and Nicholls, 1994, and Cooke, 1998) have performed a logit transformation of the dependent variable, which I have also done in the analysis (a sketch of this transformation in Stata follows the statistics below). The results are as follows:
                      Median -1.145075
                      Mean -1.061349
                      Std. Dev. .2744756
                      Variance .0753369
                      Skewness 3.114155
                      Kurtosis 10.83984
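
                      A minimal sketch of that transformation in Stata (assuming sladi lies strictly between 0 and 1 and that lsladi is the name given to the transformed variable):

                      Code:
                      * logit (log-odds) transformation of the bounded disclosure ratio
                      generate lsladi = ln(sladi/(1 - sladi))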

                      In order to select a suitable model for the linear panel regression, the following steps were conducted and the results are as follows:
                      1. Pooled OLS
                      . regress lsladi ti lsz qp ai rg

                      2. Pooled OLS versus Random Effects
                      . xtreg lsladi ti lsz qp ai rg, re

                      * The p-value is < 0.05; thus the random-effects model is chosen over pooled OLS, indicating organisation-specific effects in the data

                      3. Breusch-Pagan LM test
                      . xttest0

                      Prob > chibar2 = 0.0000


                      4. Random versus Fixed Effects Model: Hausman Test
                      . xtreg lsladi ti lsz qp ai rg, fe

                      note: rg omitted because of collinearity

                      Fixed-effects (within) regression Number of obs = 104
                      Group variable: code Number of groups = 26

                      R-sq: Obs per group:
                      within = 0.0042 min = 4
                      between = 0.2336 avg = 4.0
                      overall = 0.2319 max = 4

                      F(4,74) = 0.08
                      corr(u_i, Xb) = -0.5142 Prob > F = 0.9888

                      ------------------------------------------------------------------------------
                      lsladi | Coef. Std. Err. t P>|t| [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                      ti | .0025058 .0062604 0.40 0.690 -.0099684 .0149799
                      lsz | -.0071901 .0168941 -0.43 0.672 -.0408524 .0264722
                      qp | -.0014392 .0147954 -0.10 0.923 -.0309198 .0280414
                      ai | -.0002862 .0065554 -0.04 0.965 -.0133481 .0127757
                      rg | 0 (omitted)
                      _cons | -.956881 .2544114 -3.76 0.000 -1.463807 -.4499553
                      -------------+----------------------------------------------------------------
                      sigma_u | .28428561
                      sigma_e | .01412062
                      rho | .99753891 (fraction of variance due to u_i)
                      ------------------------------------------------------------------------------
                      F test that all u_i=0: F(25, 74) = 1004.71 Prob > F = 0.0000

                      * What should be done about 'rg', which is omitted due to collinearity?

                      . est store fixed

                      . xtreg lsladi ti lsz qp ai rg, re

                      Random-effects GLS regression Number of obs = 104
                      Group variable: code Number of groups = 26

                      R-sq: Obs per group:
                      within = 0.0004 min = 4
                      between = 0.9921 avg = 4.0
                      overall = 0.9901 max = 4

                      Wald chi2(5) = 2676.26
                      corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

                      ------------------------------------------------------------------------------
                      lsladi | Coef. Std. Err. z P>|z| [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                      ti | .0025326 .0056746 0.45 0.655 -.0085894 .0136547
                      lsz | -.0017625 .0052271 -0.34 0.736 -.0120074 .0084824
                      qp | .0136659 .009894 1.38 0.167 -.005726 .0330579
                      ai | .0007057 .0059986 0.12 0.906 -.0110513 .0124627
                      rg | 1.000169 .023057 43.38 0.000 .9549783 1.04536
                      _cons | -1.137271 .0663371 -17.14 0.000 -1.267289 -1.007253
                      -------------+----------------------------------------------------------------
                      sigma_u | .02611504
                      sigma_e | .01412062
                      rho | .77377484 (fraction of variance due to u_i)
                      ------------------------------------------------------------------------------

                      . hausman fixed

                      ---- Coefficients ----
                      | (b) (B) (b-B) sqrt(diag(V_b-V_B))
                      | fixed . Difference S.E.
                      -------------+----------------------------------------------------------------
                      ti | .0025058 .0025326 -.0000269 .0026442
                      lsz | -.0071901 -.0017625 -.0054276 .0160652
                      qp | -.0014392 .0136659 -.0151051 .0110006
                      ai | -.0002862 .0007057 -.0009919 .0026439
                      ------------------------------------------------------------------------------
                      b = consistent under Ho and Ha; obtained from xtreg
                      B = inconsistent under Ha, efficient under Ho; obtained from xtreg

                      Test: Ho: difference in coefficients not systematic

                      chi2(4) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      = 2.03
                      Prob>chi2 = 0.7298

                      * The p-value is > 0.05; thus the study uses the random-effects model

                      5. Diagnostic checks:

                      (i) Multicollinearity
                      . regress lsladi ti lsz qp ai rg
                      . vif

                      Variable | VIF 1/VIF
                      -------------+----------------------
                      qp | 5.26 0.190003
                      lsz | 4.97 0.201375
                      ti | 1.71 0.584359
                      rg | 1.48 0.677186
                      ai | 1.17 0.857526
                      -------------+----------------------
                      Mean VIF | 2.92

                      (ii) Heteroskedasticity
                      . xtreg lsladi ti lsz qp ai rg, fe
                      . xttest3

                      Modified Wald test for groupwise heteroskedasticity
                      in fixed effect regression model

                      H0: sigma(i)^2 = sigma^2 for all i

                      chi2 (26) = 5.4e+09
                      Prob>chi2 = 0.0000

                      (iii) Serial correlation
                      . xtserial lsladi ti lsz qp ai rg

                      Wooldridge test for autocorrelation in panel data
                      H0: no first-order autocorrelation
                      F( 1, 25) = 5.028
                      Prob > F = 0.0341


                      * The diagnostic checks indicate heteroskedasticity and serial correlation problems, as both p-values are < 0.05

                      6. To rectify: perform OLS with heteroskedasticity- and serial-correlation-robust (clustered) standard errors
                      . regress lsladi ti lsz qp ai rg, cluster (code)

                      Linear regression Number of obs = 104
                      F(5, 25) = 11362.50
                      Prob > F = 0.0000
                      R-squared = 0.9904
                      Root MSE = .02751

                      (Std. Err. adjusted for 26 clusters in code)
                      ------------------------------------------------------------------------------
                      | Robust
                      lsladi | Coef. Std. Err. t P>|t| [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                      ti | .0019701 .0077007 0.26 0.800 -.0138897 .0178299
                      lsz | -.007022 .0063009 -1.11 0.276 -.0199989 .0059549
                      qp | .0271579 .0165082 1.65 0.112 -.0068414 .0611571
                      ai | .0031823 .0098827 0.32 0.750 -.0171714 .0235361
                      rg | .9947972 .0248116 40.09 0.000 .9436968 1.045898
                      _cons | -1.079265 .0766559 -14.08 0.000 -1.23714 -.9213888
                      ------------------------------------------------------------------------------


                      From the above results, I have a few questions for the Statalist members.
                      1. What is the best test to assess the assumption of normality? Is the test confined only to the dependent variable? If the results still show a lack of normality in the residual errors after performing the data transformation (e.g. a log transformation), does this affect the OLS regression results?

                      2. Does the measurement of the variables (1 continuous variable, 2 categorical variables and 2 dummy variables) impact the OLS analysis?

                      3. Lastly, based on the steps shown above, am I on the right track with the Stata commands?

                      Thank you and I really hope that I will get a favourable reply.

                      Regards,

                      Hadysyam


                      • #12
                        Dear all,

                        May I get some comments from Statalist members on the above queries?

                        Your valuable input would be highly appreciated, as it would help me address my difficulties in conducting the multivariate analysis.

                        Regards,

                        Hadysyam


                        • #13
                          Hadysyam:
                          your query has not received any reply so far because, I assume, it is too long.
                          You would be better off re-posting a shorter version of it, focusing on one or two topics.
                          I would also recommend that you read the FAQ about how to post more effectively and how to report what you typed and what Stata gave you back via CODE delimiters. Thanks.
                          Kind regards,
                          Carlo
                          (Stata 19.0)


                          • #14
                            Dear all Statalist members,

                            Sorry for the lengthy posting. The data are a balanced panel with N = 26 and T = 4, giving a total of 104 observations.

                            The dependent variable is measured as a ratio that is bounded between 0 and 1. Based on my understanding of many previous disclosure studies, when the dependent variable takes values between 0 and 1, the multivariate OLS model becomes an ineffective estimation technique. To counter this, I have applied a natural logarithmic transformation to reduce the effect of skewness. However, the results still indicate a non-normal distribution, with skewness = 3.114155 and kurtosis = 10.83984.

                            On the other hand, the independent variables consist of 2 categorical variables (coded 1, 2 and 3), 1 continuous variable (the natural logarithm of total revenue) and 2 dummy variables (coded 0 and 1).

                            Given the measurement of the dependent and independent variables as stated above, what is the best method for analysing the effect of the independent variables on the dependent variable (i.e. the extent of accrual accounting disclosure)?

                            Regards,

                            Hadysyam


                            • #15
                              Hadysyam:
                              thanks for providing more details (posting what you typed and what Stata gave you back remains the best approach to let other listers know what you're after, though).
                              Some remarks about your updates:
                              - if you have a long N, short T panel dataset, you would be better off with -xtreg-, as your -depvar- is a continuous one (a sketch follows below);
                              - it is not among the OLS prerequisites that -depvar- should follow a normal distribution (whereas the residuals should);
                              - I'm not clear on the role of the dummies: do you mean that -depvar- is the ratio between the 2 dummies (measurement of 0 and 1)? Or what else?
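                              A minimal sketch of what that -xtreg- specification might look like, reusing the variable names from your output above (the time variable 'year' and the choice of clustered standard errors are my assumptions):
                              Code:
                              xtset code year
                              xtreg lsladi i.ti i.qp lsz i.ai i.rg, re vce(cluster code)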
                              Kind regards,
                              Carlo
                              (Stata 19.0)
