
  • Some help required regarding uni/multivariate regression

    Hello all,

    I am writing my accounting thesis and I have to run some regressions in Stata and I can use some help.

    I am examining whether there is a difference between growth in revenue and growth in non-financial measures (such as the number of employees) between fraud and non-fraud firms, comparing the year prior to the fraud with the fraud year.

    I have compiled a sample of firms that overstated their revenue in certain years and matched each firm with a non-fraudulent competitor (each firm has a unique firm identifier, a gvkey).

    I require:
    - the descriptive statistics of my sample, differentiating between fraud and non-fraud firms.
    - A univariate analysis that tests the differences in means between the two groups
    - A correlation matrix, comparing all control variables with a measure which is called 'capacity diff' and measures: revenue growth - non-financial measure growth
    - A multivariate regression which looks as follows:
    P(fraud) = B0 + B1 Capacity Diff + Bi Control variables
    Where P(fraud) denotes a dummy variable coded 1 for fraud firms and 0 for non-fraud firms.

    I don't know much about Stata or how it works. I am willing to read and google a lot, but I thought I would make a separate topic for this.
    What I am wondering is how to run such regressions comparing the differences between the matched pairs (it is some sort of matched-pair sample regression): how do I tell Stata which fraud firm belongs to which non-fraud competitor? (If possible/necessary I can edit the data by hand, since I only have 30 pairs.) Also, how do I tell which control variables belong to the fraud firms and which to the non-fraud firms (variables such as the leverage ratio, Altman Z-score, etc.)?
    I know that since my dependent variable is either 1 or 0 (fraud or non-fraud) I need a probit/logit regression, but how do I choose between the two?

    And finally, can I just run something like:
    Code:
    probit/logit P(fraud) Capacity Diff Leverage Altman Z etc.
    Or isn't it that simple?

    Any help is much appreciated!

    Thomas

  • #2
    You need to describe your data better in order to get answers to your questions. Showing us the output of -describe- and maybe -list-ing out a small representative set of relevant variables for a few observations, along with explanations of what each variable is/does/means (if it isn't obvious from the variable labels shown in -describe-), would be helpful.
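
    For example, something along these lines (substitute your actual variable names; gvkey, fyear, and fraud here are just guesses at what your data contain):

    ```stata
    * show the structure of the data set
    describe

    * list a small representative set of relevant variables for a few observations
    list gvkey fyear fraud in 1/10, sepby(gvkey)
    ```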

    As for logit vs probit, the choice between them is often just what is preferred in your discipline. In an ordinary logit or probit regression, or in multilevel mixed-effects models, the results typically come out about the same except for a scaling factor reflecting the ratio of the variances of the standard normal (1) and standard logistic (pi^2/3) distributions. In theory one might reach different conclusions from the two models, but it is a rare occurrence, and in a career now spanning 3 decades, I have yet to see it even once.

    That said, your hand may be forced here. You have matched-pair data, and a relatively small number of pairs. Due to the small number of pairs, and the general preference within econometrics for consistency (fixed effects) over efficiency (random effects), you will likely have to use a fixed-effects model. Stata does not have a fixed-effects probit regression. So you will probably have to do conditional logistic regression (-xtlogit, fe-). If you are OK with using a random effects model instead (say, you have the blessing of the Hausman test), then you can choose between -xtlogit, re- and -xtprobit, re-. In that case, the choice is mostly a matter of taste, as noted above.
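
    For concreteness, a minimal sketch of these options (capacity_diff, leverage, and altz are stand-ins for whatever your variables are actually called, and pair is assumed to be a variable identifying each matched fraud/non-fraud couple):

    ```stata
    * declare the pair identifier as the grouping variable
    xtset pair

    * conditional (fixed-effects) logistic regression on the matched pairs
    xtlogit fraud capacity_diff leverage altz, fe

    * random-effects alternatives, if you are comfortable with that assumption
    xtlogit fraud capacity_diff leverage altz, re
    xtprobit fraud capacity_diff leverage altz, re
    ```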





    • #3
      Dear Thomas,

      Besides echoing Clyde's comments, I would like to point out that your data were collected using a form of choice-based sampling. That is, the data were collected in a way that ensures that 50% of the firms in the sample committed fraud, which presumably is not true in the population. Unless you explicitly account for this form of sampling in the estimation, you will get a biased estimate of at least the intercept, and therefore you won't be able to make predictions about what happens in the population and you won't be able to compute marginal effects.

      Going back to what Clyde said, it looks as if you want to treat your data as a panel, but I am not sure that is appropriate here. It would be good if you provided more information about the data and about how the matching was done.

      Best of luck,

      Joao



      • #4
        Alright, better descriptions will follow later. I have another question. I have the data for all of my fraud firms from 2000 up to 2014. I want to keep only the data regarding the fraud year and the previous 3 years. How do I do that?
        The year of the fraud is denoted by variable 'year' and the firm year observations are ordered by a year variable 'fyear'. I think it should look something like:
        Code:
        drop if year <=fyear[_n-3]
        So, delete everything that does not fall within the range of the year the fraud was committed minus three. I'm not really sure how the '==' thing works, though.
        (It should also delete years later than the fraud year.)




        • #5
          No, your proposed code will look at the value of fyear 3 observations back, and if that result is greater than the value of year in the current observation, it will drop that observation. I don't think that's even close to what you want. Not to mention, it takes no account of the fact that sometimes the observation 3 observations back will belong to a different firm!

          I will assume that the fraud year variable, year, is coded non-missing and has the same value in every observation for a given firm. To keep only those observations in the fraud year and three preceding years you could do:

          Code:
          // VERIFY year IS CONSTANT WITHIN FIRM
          by firm (year), sort: assert year[1] == year[_N]
          
          // KEEP ONLY OBSERVATIONS FOR FRAUD YEAR AND 3 PRECEDING
          keep if inrange(fyear, year-3, year)
          There is one major difficulty you need to think about here: what happens to the non-fraud firms under this code? If the fraud year variable, year, is missing for these firms, then you will end up dropping all the non-fraud firm observations. Or is that variable set to the same year as the fraud-firm to which you have matched it in the work you describe earlier? In that case it will retain the observations in the matched fraud-firm's fraud year and the three preceding years--which seems to me like a sensible thing to do--but is it what you want? Or do you have some other way to identify which firms in your data are fraud firms? And if so, which observations do you want to keep for the non-fraud firms?

          As an aside, if I were managing a data set like this, my inclination would be to use the name fyear for the year of the fraud, and the name year for the regular calendar year variable. It doesn't matter to Stata, of course, but unless you have some other rationale for what you've done, I fear you will end up confusing yourself, either now or when you come back to look at this in the future.
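
          If year does turn out to be missing for the non-fraud firms, one hypothetical way to propagate the matched fraud firm's fraud year (assuming a pair identifier links each matched couple) would be:

          ```stata
          * -egen, min()- ignores missing values, so each pair inherits the
          * fraud firm's (unique) non-missing fraud year
          egen fraud_year = min(year), by(pair)
          keep if inrange(fyear, fraud_year-3, fraud_year)
          ```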



          • #6
            Hi Clyde,

            Thanks for your quick response; I think I should mention you in the preface :p. The reason I want to drop the observations outside of that time window is that I need to compute some control variables which are based on the fraud year t and on t-1 (in one case t-3), and I figured the easiest way was to delete all irrelevant data and compute my variables without doing all sorts of coding to only consider a certain year (due to my limited knowledge of Stata).
            For example, one variable would be return on assets (ROA), computed as Net Income_t (NI) / Total Assets_t-1 (AT). Given that the frauds occur in different years, I thought I would only keep the data in the way you described.
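
            I suppose an alternative would be to declare the panel structure and use lag operators, something like this (if I understand correctly; I'm assuming gvkey and fyear identify the panel, and NI is the net income variable):

            ```stata
            * if gvkey is a string, first: encode gvkey, gen(firm_id)
            xtset gvkey fyear
            gen ROA = NI / L.AT    // Net Income_t / Total Assets_t-1
            ```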

            To answer your questions: the firms in my sample are coded 1 if fraud was committed and 0 otherwise (the variable is 'fraud'). Furthermore, the fraud firm and its competitor are matched by a variable 'pair', which gives the same value to both firms. The variable is called fyear because fiscal years are concerned.
            There are no missing data for either variable fyear (downloaded from Compustat) or year (manually entered), so there's no issue there, I guess.

            After the first part of your code Stata states:
            1 contradiction in 59 by-groups
            assertion is false
            r(9);
            I tried:
            Code:
            list year if !(year < . & year > 0)
            But of course nothing happened. So I can't see where the contradiction is.



            • #7
              The contradiction is that there is some firm for which the variable year takes on more than one value. You can find it by doing the following:

              Code:
              by firm (year), sort: gen problem = (year[1] != year[_N])
              list if problem
              Stata will show you a listing of all observations for any firm (apparently there is only 1) for which the variable year is not constant. You will then need to figure out how that happened, and fix it.



              • #8
                Okay, here I am again.

                describe gives this:
                Code:
                 storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------------------------------------------------------------------------------------
                fraud           byte    %10.0g                Fraud
                emp_diff        float   %9.0g                 
                find            float   %9.0g                 dummy if freecash <-.5
                lev             float   %9.0g                 
                altz            float   %9.0g                 Altman Z Score
                MVE             float   %9.0g                 Market Value of Equity
                BTM             float   %9.0g                 Book-to-Market
                ETP             float   %9.0g                 Earnings-to-Price
                ROA             float   %9.0g                 
                age             float   %9.0g                 age of firm
                MAdum           float   %9.0g                 M&A in year prior to Fraud
                B4              float   %9.0g                 Auditor is Big Four
                TA              float   %9.0g                 Total Accruals
                SIdum           float   %9.0g                 dummy if Special Items
                rev_g           float   %9.0g                 revenue growth
                AT              double  %10.0g                Assets - Total
                nnfm            float   %9.0g                 dummy = 1 if nfm_g<0
                Where fraud is a dummy equal to 1 for a fraud firm and 0 otherwise (and is the dependent variable).
                emp_diff is a variable which captures: revenue growth - employee growth and is the main variable of interest.
                The rest of the variables are control variables and are, according to literature, known to be associated with financial statement fraud.
                find = dummy if the amount of free cash, scaled by current assets, is below -0.5
                lev = leverage, total debt / total assets
                altz = Altman Z score, known for being associated with financial distress
                MVE, BTM, ETP and ROA control for abnormal market performance
                age = age of firm, lower firm age is associated with higher manipulation of earnings (higher IPO price)
                MAdum = is dummy if an M&A took place in the year before the fraud (increases incentive to inflate earnings)
                SIdum = is dummy if special items appeared in statement of income (earnings management indicator)
                rev_g = to check whether fraud firms aren't just high growth firms
                AT = controlling for size
                nnfm = dummy if the non-financial measure growth (employees in this case) is negative and downsizing explains the result.

                when I just do:
                Code:
                logit fraud emp_diff find lev altz MVE BTM ETP ROA age MAdum B4 TA SIdum rev_g AT nnfm
                estimates table, star(.1 .05 .01)
                I get:
                Code:
                logit fraud emp_diff find lev altz MVE BTM ETP ROA age MAdum B4 TA SIdum rev_g AT nnfm
                
                Iteration 0:   log likelihood = -37.392902  
                Iteration 1:   log likelihood =  -25.69289  
                Iteration 2:   log likelihood = -24.519038  
                Iteration 3:   log likelihood = -23.909381  
                Iteration 4:   log likelihood = -23.882307  
                Iteration 5:   log likelihood = -23.882112  
                Iteration 6:   log likelihood = -23.882111  
                
                Logistic regression                               Number of obs   =         54
                                                                  LR chi2(16)     =      27.02
                                                                  Prob > chi2     =     0.0412
                Log likelihood = -23.882111                       Pseudo R2       =     0.3613
                
                ------------------------------------------------------------------------------
                       fraud |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                    emp_diff |    -4.1983   3.109527    -1.35   0.177    -10.29286    1.896261
                        find |   -2.95828   4.409608    -0.67   0.502    -11.60095    5.684392
                         lev |   1.863242   2.703586     0.69   0.491    -3.435689    7.162173
                        altz |   .0308986   .0274076     1.13   0.260    -.0228193    .0846164
                         MVE |  -.0004969   .0003526    -1.41   0.159     -.001188    .0001942
                         BTM |  -.8395334   .9631364    -0.87   0.383    -2.727246    1.048179
                         ETP |  -.0097791   .0156988    -0.62   0.533    -.0405481    .0209899
                         ROA |  -4.377158   5.498204    -0.80   0.426    -15.15344    6.399125
                         age |  -.0262698    .040653    -0.65   0.518    -.1059483    .0534087
                       MAdum |  -.5870859    1.06695    -0.55   0.582     -2.67827    1.504099
                          B4 |  -2.311186   1.510504    -1.53   0.126     -5.27172    .6493482
                          TA |   12.10962   9.437233     1.28   0.199    -6.387021    30.60625
                       SIdum |  -1.534439   1.288782    -1.19   0.234    -4.060406    .9915277
                       rev_g |   3.022402   2.742113     1.10   0.270     -2.35204    8.396844
                          AT |   .0004096   .0002989     1.37   0.171    -.0001762    .0009955
                        nnfm |   .2155724   1.152204     0.19   0.852    -2.042705     2.47385
                       _cons |   2.837296   2.418026     1.17   0.241    -1.901948    7.576541
                ------------------------------------------------------------------------------
                Note: 0 failures and 1 success completely determined.
                
                . estimates table, star(.1 .05 .01)
                
                ------------------------------
                    Variable |    active      
                -------------+----------------
                    emp_diff | -4.1983001     
                        find | -2.9582801     
                         lev |  1.8632418     
                        altz |  .03089858     
                         MVE | -.00049693     
                         BTM | -.83953342     
                         ETP | -.00977908     
                         ROA |  -4.377158     
                         age |  -.0262698     
                       MAdum | -.58708586     
                          B4 | -2.3111859     
                          TA |  12.109617     
                       SIdum | -1.5344392     
                       rev_g |  3.0224021     
                          AT |  .00040965     
                        nnfm |  .21557236     
                       _cons |  2.8372961     
                ------------------------------
                legend: * p<.1; ** p<.05; *** p<.01
                Can anyone tell me if my regression is technically correct, ignoring factors such as sample size etc.? Besides the technicality, how would you run the regression (i.e., adding fixed effects etc.)?

                I'm also not sure what the log likelihood tells me. What would it mean if in another model the log likelihood were -25.00? And what does the chi-squared tell me?

                When looking at the results, my first impression would be that it is not a very useful model given that none of the variables are significant, do you guys agree or should I interpret the results differently?

                Kind Regards,
                Thomas



                • #9
                  Well, assuming that you have not one but several observations for each firm, then you would need to account for the non-independence of observations. One popular way to do that is with a fixed effects model. You need an additional numeric variable that identifies which firm each observation pertains to. Let's call that firm for the sake of illustration. It is also usually a good idea to use the clustered robust variance estimator in such circumstances.

                  Code:
                  xtset firm
                  xtlogit fraud emp_diff find lev altz MVE BTM ETP ROA age MAdum B4 TA SIdum rev_g AT nnfm, fe vce(cluster firm)
                  However, if your data set has only one observation per firm, then you can stick with plain -logit-.

                  Either way, have you explored graphically the relationships between your continuous predictors and your outcome? -lowess, logit- is a handy way to see if there are substantial non-linearities lurking. If there are, entering the covariates as is can lead to inconsistent estimates in that the actual effect you are interested in might be confounded with capturing the non-linearity. So if you have strongly non-linear relationships, particularly non-monotonic ones, you should consider transformations or splines or something like that.
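
                  For instance, to eyeball each continuous predictor against the outcome on the logit scale (predictor names abbreviated; adjust to your own list):

                  ```stata
                  * one lowess plot per continuous predictor, on the logit scale
                  foreach v of varlist emp_diff lev altz ROA rev_g {
                      lowess fraud `v', logit name(lw_`v', replace)
                  }
                  ```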

                  The familiar linear regression is fit by finding coefficients that minimize the sum of the squared residuals. Logistic regression is estimated in a different way. Instead, the probability of observing a data set like the one you have, given any hypothetical set of coefficients from the model, can be calculated as a binomial probability for each observation, and as the product of those for the data set as a whole. (The last clause about products, by the way, is why logistic regression is only suitable when all the observations are independent draws from the population of interest. With dependent draws, as with clustering within firms, the probability of A&B is no longer the probability of A times the probability of B.)

                  That overall probability of observing the data set, as a function of the hypothetical coefficients, is referred to as the likelihood function. Because it typically takes on ridiculously small values if the sample has more than a handful of observations, it is usually more convenient to work with its logarithm, which, of course, is the log-likelihood. Logistic regression models are estimated using iterative procedures to zero in on the coefficients that maximize the (log) likelihood.

                  [Digression: I said earlier that OLS is different because the coefficients are chosen to minimize the sum of squared residuals. But actually, it can be shown that those same coefficients are also the ones that maximize the log-likelihood for the ordinary linear model if you assume a normal distribution for the residuals.] There are many, many statistical models for which estimation is carried out by finding the parameter estimates that maximize the (log)-likelihood.
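
                  You can see this relationship for yourself: summing the observation-level log binomial probabilities reproduces the log-likelihood Stata reports (a sketch, to be run right after a -logit- command like the one in #8):

                  ```stata
                  predict double phat, pr
                  gen double ll_i = fraud*ln(phat) + (1-fraud)*ln(1-phat)
                  quietly summarize ll_i
                  display r(sum)    // should equal e(ll), the reported log likelihood
                  ```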

                  So the log-likelihood is in a sense a measure of the fit of the model. But unlike R2, which has an absolute scale of 0-1, the log-likelihood does not have a fixed natural scale. So it is not really useful in its own right. But there are some other theorems about its probability distribution that are of interest. The most useful result in your context is this: if you fit a logistic regression model with just a constant term, and then fit your real model, then 2*(difference between the log-likelihoods of the two models) has, asymptotically, a chi square distribution with degrees of freedom equal to the number of predictors in your real model if the null hypothesis that all of the predictors' coefficients are zero is true. So that statistic, which is reported in Stata's output as LR chi2, is an omnibus test of the joint significance of all your predictors.
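
                  That statistic can be reproduced by hand, e.g. (covariate list abbreviated; note that both models must be fit on the same estimation sample):

                  ```stata
                  quietly logit fraud emp_diff lev altz   // full model
                  estimates store full
                  quietly logit fraud if e(sample)        // constant-only model, same sample
                  lrtest full .                           // reproduces the LR chi2 test
                  ```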

                  As for whether this model is useful, there are several issues. One is, useful for what purpose? A model may not be very valuable for understanding causal relationships, but still work well for prediction. Or vice versa. So other attributes of your model such as the area under the ROC curve may be more important than the statistical significance of any or all variables if prediction is your main interest. The other issue is that 54 observations is a very small sample size for fitting a model with 16 predictors. That is not even 4 observations per variable. If you can't get more data, look into a more parsimonious model. If you can't posit a more parsimonious model and keep a straight face, then you need (a lot) more data before you proceed. The risk, by the way, with so few observations per variable is not just that you are underpowered to detect effects but that you are likely to overfit the noise in your data!
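
                  If prediction is your main interest, the area under the ROC curve is available directly after fitting, e.g.:

                  ```stata
                  quietly logit fraud emp_diff lev altz   // covariates abbreviated
                  lroc                                    // plots the ROC curve and reports the AUC
                  ```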



                  • #10
                    Hello, I have some related doubts.

                    I have 20 years of data, where the first 16 years are divided into a pre-crisis period (2000-2007) and a post-crisis period (2008-2015). I have used both univariate and multivariate analysis, but I have found an inconsistency between the two results.

                    Under the univariate analysis, I compared the mean performance between the two periods using a t-test, and the results show that average performance declined significantly post-crisis.

                    But when I run the regression using other firm-specific factors with the crisis dummy (2008-2015), the dummy coefficient is positive and highly significant, suggesting a positive impact of the crisis on performance. This contradicts the univariate result.

                    How can this result be justified? By the way, to make the regression result comparable with the univariate one, I restricted the data to end in 2015, so that the non-crisis period comprises only the pre-crisis period.

                    Any insights or suggestions on how to reconcile these findings would be greatly appreciated.



                    • #11
                      Hi Professor @Clyde,

                      I would greatly appreciate your input on the above query. Your insights would be very helpful.



                      • #12
                        Based on the limited information provided, only a limited response can be given. (Even if full information had been provided, to fully resolve your question would almost certainly require advice from somebody knowledgeable in finance or economics, whichever field you are working in.)

                        There is nothing paradoxical or unexpected here. The implication is that the negative performance effect of the post-crisis years is fully accounted for, and even overridden, by differences in the additional firm-specific characteristics that were included in the second model. To really understand what is going on, one would need to draw a causal diagram (directed acyclic graph) of the relationships among the variables in the full model. Some of the variables included may have been included inappropriately, if they are mediators of the crisis effect or if they are colliders of the crisis:performance association. But even if all of them are appropriately included, there is still no paradox: it simply means that some of the differences are themselves sufficient to account for the crisis effect itself, and some even have effects in the opposite direction. This phenomenon would be particularly pronounced, I think, if your data are serial cross sections, and perhaps less pronounced in longitudinal data. But either way, you would certainly expect that following the onset of the crisis there would be many changes in firm-specific attributes that might mitigate the adverse crisis effects. This could arise either due to the selective survival of firms based on these attributes or due to some firms' changing their attributes as a response to the crisis and its knock-on effects.

                        To give a simple and transparent example of this kind of phenomenon, if we were to compare the cancer mortality rates in the past five years between people who identify the Beatles as their favorite pop-music group ever, and those who identify Taylor Swift, we would see that it is astronomically greater among the Beatles fans. But that's just because the Beatles fans are mostly very old and the Taylor Swift fans mostly young. If you were to then add age to the model, the Beatles vs Swift effect would largely if not totally disappear. Something analogous to this is going on in your data.
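
                        A toy simulation makes the mechanism concrete (entirely made-up data: old is the confounder, and beatles has no causal effect on mortality at all):

                        ```stata
                        clear
                        set seed 12345
                        set obs 2000
                        gen old = _n <= 1000                          // age group (the confounder)
                        gen beatles   = runiform() < cond(old, .9, .1)   // fandom depends on age
                        gen mortality = runiform() < cond(old, .3, .01)  // mortality depends on age only
                        logit mortality beatles        // unadjusted: large positive coefficient
                        logit mortality beatles old    // adjusted: the beatles effect collapses
                        ```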



                        • #13
                          Basically, my objective is to test whether the crisis had a significant impact on firm efficiency. For that, I first did the univariate analysis to study the trend in the performance metric and test whether average efficiency differs significantly between the pre- and post-crisis periods. The result shows that average efficiency during the post-crisis period is significantly lower than in the pre-crisis period. Several similar works in the literature have stopped there, concluding that the crisis had a significant negative impact on firm efficiency. However, in my limited understanding, such conclusions are not exactly warranted, because there are several factors that determine the performance of a firm, and the observed decline could be driven by these factors rather than the crisis itself.

                          For that reason I have turned to multiple regression analysis, estimating a model similar to the one in the attached picture, where the crisis dummy is interacted with several determinants of firm performance to see if the crisis has altered the relationships. But I am confused about the interpretation of the crisis dummy: what exactly does it show? Does it show the average change in the dependent variable compared to the pre-crisis period? If that is the case, then how come in my case the crisis coefficient is positive and significant whereas the average has significantly declined?

                          Image of the model which I am using as a reference

                          [Attachment: GFC.PNG (model specification)]

                          My base line result
                          Code:
                          ------------------------------------------------------------------------------
                                 INCRS | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                                 ROA_w |   .0026273   .0039329     0.67   0.504     -.005081    .0103356
                                ZROA_w |  -.0003164   .0005421    -0.58   0.559     -.001379    .0007462
                                 CAR_w |  -.0015536    .000853    -1.82   0.069    -.0032254    .0001182
                          Liquidity2_w |   .3154634   .0461338     6.84   0.000     .2250429    .4058839
                               LLPTL_w |  -.2465933   .1889663    -1.30   0.192    -.6169604    .1237739
                                  MQ_w |  -.9728294   .3753977    -2.59   0.010    -1.708595   -.2370635
                                lnta_w |   .0330016   .0063222     5.22   0.000     .0206103    .0453929
                                 ETA_w |    .911227   .1078076     8.45   0.000      .699928    1.122526
                               BooneTA |   .0291774   .0177778     1.64   0.101    -.0056665    .0640213
                              BooneADV |  -.0287212   .0198192    -1.45   0.147    -.0675662    .0101238
                              BooneDep |  -.0474446   .0202943    -2.34   0.019    -.0872207   -.0076686
                                   BSD |  -.0124707   .0014397    -8.66   0.000    -.0152924   -.0096489
                                   GDP |  -.0006011   .0019953    -0.30   0.763    -.0045118    .0033096
                             Inflation |  -.0117203   .0025033    -4.68   0.000    -.0166266   -.0068139
                                   SMD |   .0012972    .000193     6.72   0.000      .000919    .0016754
                               PostGFC |   .1358606   .0221602     6.13   0.000     .0924274    .1792937
                                 _cons |   .6944597    .066164    10.50   0.000     .5647805    .8241388
                          -------------+----------------------------------------------------------------
                              /sigma_u |    .101775   .0092293    11.03   0.000     .0836859    .1198641
                              /sigma_e |   .1035896   .0024126    42.94   0.000     .0988609    .1083183
                          -------------+----------------------------------------------------------------
                                   rho |   .4911647   .0469566                      .4001733    .5826201
                          ------------------------------------------------------------------------------
                          LR test of sigma_u=0: chibar2(01) = 439.30             Prob >= chibar2 = 0.000


                          My interaction result
                          Code:
                          ------------------------------------------------------------------------------------
                                       INCRS | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                          -------------------+----------------------------------------------------------------
                                       ROA_w |   .0009375   .0039111     0.24   0.811    -.0067281    .0086031
                                      ZROA_w |   .0006202   .0006272     0.99   0.323    -.0006091    .0018495
                                             |
                          c.ZROA_w#c.PostGFC |  -.0010373   .0003891    -2.67   0.008    -.0017999   -.0002747
                                             |
                                       CAR_w |  -.0017855   .0008541    -2.09   0.037    -.0034595   -.0001114
                                Liquidity2_w |   .2874544   .0462602     6.21   0.000     .1967861    .3781228
                                     LLPTL_w |   -.128059   .1902931    -0.67   0.501    -.5010267    .2449086
                                        MQ_w |  -1.148161    .377696    -3.04   0.002    -1.888432   -.4078908
                                       ETA_w |   .8706207   .1082076     8.05   0.000     .6585376    1.082704
                                      lnta_w |   .0402727   .0068562     5.87   0.000     .0268348    .0537106
                                             |
                          c.lnta_w#c.PostGFC |  -.0144917   .0040807    -3.55   0.000    -.0224896   -.0064937
                                             |
                                     BooneTA |   .0295007   .0175681     1.68   0.093    -.0049322    .0639336
                                    BooneADV |  -.0374101   .0198678    -1.88   0.060    -.0763502      .00153
                                    BooneDep |  -.0404795   .0201511    -2.01   0.045     -.079975    -.000984
                                         BSD |  -.0125994   .0014296    -8.81   0.000    -.0154015   -.0097974
                                         GDP |  -.0006901   .0019729    -0.35   0.726    -.0045569    .0031767
                                   Inflation |  -.0112497   .0024788    -4.54   0.000     -.016108   -.0063914
                                         SMD |   .0012268   .0001917     6.40   0.000     .0008512    .0016025
                                     PostGFC |   .3007419   .0473832     6.35   0.000     .2078725    .3936112
                                       _cons |   .6378883   .0683892     9.33   0.000     .5038479    .7719287
                          -------------------+----------------------------------------------------------------
                                    /sigma_u |   .1058681   .0098641    10.73   0.000     .0865349    .1252014
                                    /sigma_e |   .1023555   .0023928    42.78   0.000     .0976658    .1070452
                          -------------------+----------------------------------------------------------------
                                         rho |   .5168649   .0483593                      .4224968    .6102933
                          ------------------------------------------------------------------------------------
                          LR test of sigma_u=0: chibar2(01) = 449.79             Prob >= chibar2 = 0.000
                          Please pardon the length of this post, and I appreciate any assistance you can provide.



                          • #14
                             Here the crisis dummy is interacted with several firm-performance variables to see whether the crisis has altered those relationships. But I am confused about the interpretation of the crisis dummy: what exactly does it show? Does it show the average change in the dependent variable compared to the pre-crisis period? If so, how can the crisis coefficient in my case be positive and significant when the average has significantly declined?
                            You did not mention interaction in #1. This puts it in an entirely different light, and your new questions have hit precisely the right note.

                             No. In a model that contains X, c.X#c.PostGFC, and PostGFC, the coefficient of PostGFC definitely does not represent the average change in the dependent variable compared to the pre-crisis period. It represents the change in the dependent variable conditional on X being zero. For some variables, X = 0 is not even possible in the real world, so sometimes the coefficient of PostGFC represents a change in the dependent variable that cannot be realized in the real world--it is a hypothetical, a counterfactual only. If X = 0 is possible, then you can interpret the coefficient of PostGFC as the predicted change in the dependent variable in the sub-population for which X = 0 holds. But it is definitely not the average. So in this case, you are comparing two things that are not comparable.
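                             For concreteness, with your two interactions the crisis effect at non-zero values of the interacted variables can be recovered after estimation with -lincom-. This is a sketch only; the values 5 and 12 below are placeholders for whatever values of ZROA_w and lnta_w are substantively meaningful in your data, not numbers taken from your output:
                             Code:
                             * Crisis effect evaluated at ZROA_w = 5 and lnta_w = 12 (placeholders):
                             *   _b[PostGFC] + 5*_b[c.ZROA_w#c.PostGFC] + 12*_b[c.lnta_w#c.PostGFC]
                             lincom _b[PostGFC] + 5*_b[c.ZROA_w#c.PostGFC] + 12*_b[c.lnta_w#c.PostGFC]
                             The coefficient on PostGFC alone is just this expression with both placeholders set to zero, which is why it is only the X = 0 effect.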

                            Now, I have no idea what ZROA or lnta are, so I don't know if they can take on zero values or not, nor, even if they can, whether the case where they are zero is of any real interest or not. Those are substantive questions that you can answer (or, if you cannot, you need advice from somebody in your field about that, not statistical advice).

                            To get an average marginal effect of the crisis on performance (INCRS) from your interaction model, you can run -margins, dydx(PostGFC)- following the regression. Added: What I said in #12 still remains true, and it is entirely possible that the inclusion of all of these variables in the model may well cause the average marginal effect estimate to differ greatly (even opposite in sign) from what you got from a crude bivariate comparison.
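                             In command form (a sketch; run immediately after re-estimating your interaction model):
                             Code:
                             * Average marginal effect of PostGFC, averaged over the sample
                             * distribution of ZROA_w and lnta_w:
                             margins, dydx(PostGFC)

                             * Or evaluated at the sample means of all covariates instead:
                             margins, dydx(PostGFC) atmeans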

                            By the way, I was under the impression from #1 that PostCrisis is a dichotomous yes/no variable. If so, you should be representing it with i.PostGFC, not c.PostGFC in your regression.
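                             Something like the following, using factor-variable notation throughout. This is a sketch only: the estimation command (xtreg, mle) is my guess from the /sigma_u and /sigma_e lines in your posted output, so substitute whatever you actually ran:
                             Code:
                             * i.PostGFC declares the crisis dummy as an indicator; ## adds the
                             * main effects and the interaction in one step:
                             xtreg INCRS ROA_w c.ZROA_w##i.PostGFC CAR_w Liquidity2_w LLPTL_w ///
                                 MQ_w ETA_w c.lnta_w##i.PostGFC BooneTA BooneADV BooneDep ///
                                 BSD GDP Inflation SMD, mle

                             * With i.PostGFC, -margins- computes the effect as a discrete
                             * 0 -> 1 change rather than an infinitesimal one:
                             margins, dydx(PostGFC)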



                            • #15
                               Thank you so much, Prof. Clyde.

                               By the way, what is your take on the first model (model (1)) in the image I attached? Model (1) has only the crisis dummy; model (2) includes the interaction term. So correct me if I am wrong: what you said applies to model (2), with the interaction term, where the crisis dummy cannot be interpreted in isolation. One more thing: is the interpretation of the variables involved in the interaction term two-way? On one side, as you said, the coefficient of PostGFC is the average impact on the dependent variable conditional on X being zero; on the other side, the coefficient on X represents the average change in the dependent variable during the non-crisis period.

                               My real doubt is the interpretation of model (1), where only the crisis dummy is included, without the interaction term. How should I interpret that crisis dummy? What exactly does it show? In my baseline result, PostGFC is positive and significant. If it represents the average change in the dependent variable during the crisis period compared to the non-crisis period, then at least in my case it seems counterintuitive, given that the univariate analysis already found the average value of INCRS (the dependent variable) to be significantly lower than in the pre-GFC period.

                               Am I making a mistake by comparing the two results? Or should I go with the idea that average INCRS has declined, and that the multiple regression is there to identify the factors that caused the decline, interpreting the crisis dummy like any other independent variable in the model?

                               Kind regards,
                              Minhaj
