May I use PCA to create an Index and use that as one of my independent variables?

Pete Henry

Join Date: Feb 2022

Posts: 10
#16

26 Mar 2022, 05:09

Dear all,

I have a question that I think fits this thread's topic. It also concerns using the pc1 score, transformed into a dummy, as an independent variable.

I have a component that measures the complexity of a company as a function of firm size, the number of business and geographic segments, etc. This score is then used to create a dummy variable that equals 1 (0) if the focal company's pc1 value, as a measure of complexity, is higher (lower) than the industry median pc1 value (with the median generated with "excludeself", i.e. excluding the focal company).
This approach follows Cheng/Lee/Shevlin (2016), "Internal Governance and Real Earnings Management" (The Accounting Review), p. 1071 f.

The question is:
Does it seem reasonable to you to subdivide the complete company sample into complex / non-complex firms and then run the regression once with

Code:

xtset Industry
xtreg Y x x i.Panel_Year if Complex==1, fe

and once with

Code:

xtreg Y x x i.Panel_Year, fe

so as to compare the effects of x on Y for complex firms with those for the sample as a whole, by computing the differences between the coefficients and then performing a t-test for significant differences in the coefficient estimates?
And would this make the direct inclusion of the pc1 value redundant, since firm complexity is already captured through the dummy?

Question 2:
If the pc1 value has missing values resulting from missing values in the underlying variables, the dummy variable also takes the values 1/0 and ".". Is this a problem if these missing values are included in the whole-sample regression, or is it not a problem since Stata ignores missing values in the regression?

Thanks in advance!

Stata 17/BE
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#17

26 Mar 2022, 10:59

Question 1. I would not take this approach. There are two levels on which I disagree with it. First, running the two separate -xtreg- commands will leave you with separate results for the complex and non-complex, but no way to then compare them. There will be no t-test for differences in the coefficient estimates coming from the two separate equations. So you will be at a dead end. Instead, use an interaction model:

Code:

xtset Industry
xtreg Y i.Complex##(explanatory variables)..., fe

Then to compare the marginal effect of a variable X among the complex and non-complex you look at the Complex#X interaction coefficient and its associated statistics: that is the difference you want to capture. Be sure to remember to prefix the explanatory variables with c. or i. as the case may be.

There is another problem with this approach. Although it is very commonly done, dichotomizing continuous variables is usually a bad idea. It discards information and introduces noise. I'd be inclined to just use the index itself as a continuous variable. Again, an interaction model (this time with c.Complexity_index) will enable you to answer your question.
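
A minimal sketch of what the continuous-index version of the interaction model could look like. The data are simulated, and every variable name (Complexity_index standing in for the pc1 score, a single regressor x) is an assumption made for illustration, not taken from the actual data set:

```stata
* Hypothetical simulated data: 30 industries, 20 observations each
clear
set seed 12345
set obs 600
generate Industry = ceil(_n/20)
generate Panel_Year = 2005 + mod(_n, 5)
generate x = rnormal()
* Complexity_index stands in for the pc1 score
generate Complexity_index = rnormal()
generate Y = 0.5*x + 0.3*Complexity_index + 0.2*x*Complexity_index + rnormal()

xtset Industry
* ## includes both main effects and their product
xtreg Y c.Complexity_index##c.x i.Panel_Year, fe vce(cluster Industry)

* Marginal effect of x at selected values of the index
margins, dydx(x) at(Complexity_index=(-1 0 1))
```

The coefficient on c.Complexity_index#c.x is then the change in the marginal effect of x per unit difference in the index, which plays the role that the Complex#X coefficient plays in the dichotomized model.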

Question 2: Stata omits observations containing missing values on any regression variable when doing the estimation. In that sense it is not a problem. From a higher perspective, however, missing data is always a problem because of the potential to leave you with a biased sample for your analysis. What, if anything, can be done about that depends on lots of details and can't be dealt with here.
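
To illustrate the mechanics only, here is a hedged sketch on simulated data. The names pc1, med and D_Complex are hypothetical, and the plain industry median used here is a simplification of the "excludeself" median described in #16:

```stata
clear
set seed 2022
set obs 200
generate Industry = ceil(_n/20)
generate pc1 = rnormal()
* Pretend some component inputs were missing
replace pc1 = . in 1/10
egen med = median(pc1), by(Industry)
* Without the if-qualifier a missing pc1 would be coded 1,
* because Stata treats . as larger than any number
generate byte D_Complex = pc1 > med if !missing(pc1, med)
generate x = rnormal()
generate Y = x + rnormal()

regress Y x i.D_Complex
* Observations actually used in the estimation
count if e(sample)
misstable summarize pc1 D_Complex
```

Comparing the e(sample) observations with the excluded ones on fully observed variables is one simple way to get a first impression of whether the estimation sample looks selective.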
Pete Henry

Join Date: Feb 2022

Posts: 10
#18

31 Mar 2022, 15:25

Clyde,

many thanks not only for the helpful answers to my questions but also for the (plausible) suggestions beyond them!

On dichotomizing continuous variables: your approach does seem plausible. But I'm still wondering whether this is more a matter of "taste" or of "methodological correctness", since the journal I mentioned is one of the highest ranked in the accounting and finance field. So I wonder how this could pass the peer review process if it were a methodological flaw.

As for running the regressions, if I understand you correctly, I could run, for example, the following as a kind of baseline / whole-group regression:

Code:

xtset Industry
xtreg Y x x x i.Panel_Year, fe vce(cluster Industry)
* with variation / additional variables
xtreg Y x x x Dummies_of_interest i.Panel_Year, fe vce(cluster Industry)
esttab ....

and for the group of firms defined as complex (dummy: D_Complex), analogously (with the dichotomized variant):

Code:

xtreg Y D_Complex#c.(all continuous explanatory variables) i.Panel_Year if D_Complex==1, fe vce(cluster Industry)
xtreg Y D_Complex#c.(all continuous explanatory variables) D_Complex#i.(Dummies_of_interest) i.Panel_Year if D_Complex==1, fe vce(cluster Industry)
esttab ....

Is the impact of firm complexity on the respective variables then directly identifiable from the different estimates and p-values and/or further tests (chi2 difference)? And, in the case of marginal effects, from the -margins, dydx()- command?

Thanks again!
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#19

31 Mar 2022, 16:12

For the perils of dichotomization, see Harrell's "Dichotomania" post: https://www.fharrell.com/post/errmed/#catg. Harrell is commenting specifically on the medical literature, which is my own area, so I know it best. But if you read all of his points, you will see that none of them is at all specific to medicine or any other knowledge domain--they are perfectly general statistical concerns.

The notion that a paper is free of methodological flaws, or even free of horrendous methodological problems, simply because it passes review in a prominent journal is, well, laughable. There are tons of papers in prestigious journals that are flat-out crappy. It's a sad state of affairs, but that's the way it is. In the case of dichotomization, it is one of those things that has become so widespread that it is difficult to root it out of the literature now, as people are prone to model their current projects on what they see in the literature. But it is still a bad idea.

Concerning the whole group regressions, don't forget the i. prefixes on the Dummies_of_interest.

Concerning the contrast between the complex and non-complex subsets, you don't have it quite right. It should be

Code:

xtreg Y D_Complex##c.(all continuous explanatory variables) D_Complex##i.(Dummies_of_interest Panel_Year), fe vce(cluster Industry)

Note the use of ## rather than #, and the inclusion of Panel_Year within the categorical variable list interacted with D_Complex. Crucially note also the absence of an -if- clause restricting to D_Complex == 1. This code, with the interactions, and without the restriction, will, in effect, simultaneously estimate the same model in the D_Complex = 0 and D_Complex = 1 groups and enable you to compare them. (I'm assuming that contra my advice you are staying with dichotomized complexity here.) You can see them by running:

Code:

margins D_Complex, dydx(*) noestimcheck

after the regression.
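
A sketch of the whole sequence on simulated data, in case it helps to see it run end to end. The variables x1 and x2 and the data-generating process are hypothetical; only the structure of the commands mirrors the code above:

```stata
clear
set seed 4321
set obs 900
generate Industry = ceil(_n/30)
generate Panel_Year = 2005 + mod(_n, 5)
generate byte D_Complex = runiform() < 0.5
generate x1 = rnormal()
generate x2 = rnormal()
generate Y = 0.4*x1 + 0.2*x2 + 0.3*D_Complex*x1 + rnormal()

xtset Industry
xtreg Y i.D_Complex##c.(x1 x2) i.D_Complex##i.Panel_Year, fe vce(cluster Industry)

* Slope of x1 within each group
margins D_Complex, dydx(x1)
* Difference in the x1 slope between the groups (the interaction coefficient)
lincom 1.D_Complex#c.x1
* Joint test that neither continuous slope differs between the groups
testparm i.D_Complex#c.x1 i.D_Complex#c.x2
```

The -lincom- line simply re-displays the interaction coefficient with its test statistics, and -testparm- gives a joint test in place of the "chi2 difference" asked about in #18.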
Pete Henry

Join Date: Feb 2022

Posts: 10
#20

05 Apr 2022, 09:25

Thanks again for the valuable suggestions!

"It's a sad state of affairs, but that's the way it is."

Indeed. Still being a student makes it difficult to assess the quality of studies in detail, and such flaws are obviously hard to detect even for the reviewers of the respective journals. Universities strongly advise relying on these "A journals". These papers often emphasize economic implications for possible future legislation, such as a change in corporate board composition. Since these investigations aim to be practically meaningful rather than an academic exercise or an end in themselves, I would have expected a kind of "methodological review" as well.

"I'm assuming that contra my advice you are staying with dichotomized complexity here."

No, my question was more about the general methodological approach. I tried both the dichotomized and the continuous variant with different measures of my dependent variable, and the results actually seem to fit my hypotheses better with the continuous variable.

A follow-up question on the interpretation of marginal effects, which also builds on this thread: https://www.statalist.org/forums/forum/general-stata-discussion/general/1543220-interpretation-of-coefficient-of-independent-variable-which-is-measured-as-a-of-gdp

Code:

                 |               Robust
     JM_ABS_DACC | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
    Brd_SIZE_Ref |  -.0013204   .0008042    -1.64   0.111    -.0029629    .0003221
Brd_INDEP_R_REFI |  -.0061074   .0085991    -0.71   0.483    -.0236691    .0114544
     Avg_Dir_TEN |   .0000837   .0006858     0.12   0.904    -.0013168    .0014842
   Avg_I_Dir_App |   .0011617   .0012528     0.93   0.361    -.0013969    .0037204
     Avg_Dir_Age |  -.0000465   .0002393    -0.19   0.847    -.0005353    .0004423
    Long_I_Dir_R |  -.0220877   .0092888    -2.38   0.024     -.041058   -.0031173
       FIRM_SIZE |  -.0102655   .0056754    -1.81   0.081    -.0218563    .0013253
             LEV |   .0539474   .0234049     2.30   0.028     .0061482    .1017466

JM_ABS_DACC, the DV, is measured as a percentage of total assets (it is ~ 0.06). Analogous to the aforementioned thread, a 1 percentage point increase in board independence would lead to a β percentage point decrease in the DV, and an additional person on the board would lead to a β percentage point decrease in the DV (in this case, economically, a very small impact).

Does this work the same way for logarithmic variables? In the case of firm size (measured as the ln of total assets), does a 1 percent increase in firm size lead to a β percentage point decrease in the DV? So does the difference between interpreting variables measured as proportions (0-1) and variables measured as ln(..) lie in percent versus percentage points?

Besides the interpretation:

Most studies also include industry and year fixed effects. I have confirmed with testparm that they matter in my case as well, and have therefore included them in the model above.

However, because "xtset Industry" has no time variable and the time dummies are included manually, a test for autocorrelation (e.g. with "xtserial") is not possible. Since I detected autocorrelation in the dependent variable through the significance of L.DV in " reg DV L.DV x x x " in a revised analysis, it seems to me, already independently of "xtserial", that including the lagged DV in the main model is necessary. Would it therefore be appropriate to use

Code:

xtset CompanyID Panel_Year

instead of

Code:

xtset Industry

and then regress

Code:

xtreg Y x x x i.dummy_Vars, fe vce(cluster Industry)

? In this case "xtserial" and the lagged DV would be possible, and industry fixed effects would still be accounted for, because companies virtually never change the industry they belong to, as described in #3 https://www.statalist.org/forums/for...gression-model.

Thanks a lot, and sorry for the slight thematic deviation from the original PCA-index question.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#21

05 Apr 2022, 09:57

First let's deal with the interpretation of the coefficients. I'm inferring from your description that the board size variable is a count of the number of members of the board. So a difference of 1 in that variable is associated with a difference of -.0013 in your outcome variable. Since your outcome variable is itself a percentage, this would be a difference of -.0013 percentage points.

It is less clear to me what the independence variable is. Let me suppose that it is also a percentage. Then a 1 percentage point difference in board independence will be associated with a -.0061 percentage point difference in your outcome variable.

I also infer from your description that firm size is actually not firm size but is the natural logarithm of firm size. In that case, a 1% (not percentage point) difference in firm size corresponds to a difference of ln(1.01) = 0.00995 in the log of firm size. This, in turn, is associated with a difference of 0.00995*-.0102655 = -0.00010215 percentage points in the outcome variable (which you could round to -0.0001 percentage points). Notice that because the difference in firm size is small, the end result is very close to what you would get by applying a rule of thumb that multiplies the coefficient by .01 to correspond to a 1% difference in untransformed size. That approximation works well only for small percentage differences, so I avoid using it given that it is so simple to just do the calculation directly.
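
The direct calculation can be done with -display-. This is only a sketch: the coefficient is copied by hand from the FIRM_SIZE row of the output in #20, and the 10% case is an added illustration of where the rule of thumb starts to drift:

```stata
* Coefficient on FIRM_SIZE (log of total assets), copied from the table in #20
scalar b = -.0102655

* Exact difference in the outcome for a 1% larger firm
display "exact, 1% larger:         " b*ln(1.01)
* Rule-of-thumb approximation: coefficient times .01
display "rule of thumb, 1% larger: " b*0.01

* For a 10% larger firm the two versions differ more noticeably
display "exact, 10% larger:         " b*ln(1.10)
display "rule of thumb, 10% larger: " b*0.10
```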

Notice in all cases that I refrain from using causal language. Unless your study is the result of an experiment, you cannot assert causality on the basis of this analysis.

As for your second question, I can only give you a partial answer. Using -xtset Company_ID Panel_Year- is only possible if Company_ID and Panel_Year jointly identify unique observations in your data. That may well be the case, and if so, it will open up the possibility of using commands that can estimate auto-regressive structure. And you will be able to use -cluster(Industry)- only if each company appears consistently in the same industry throughout the data set. If any company is simultaneously (or at different times) in two different industries, you won't be able to do that. If your data meet the requirement and you do that, then you are correct that industry level effects will still be accounted for by absorption into the company level fixed effects. Industry level effects, however, will not be estimable in this model.
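
Both requirements can be checked before estimating. The following is a sketch on simulated data with hypothetical names (150 firms, 10 years each, 5 firms per industry), not a description of the actual data set:

```stata
clear
set seed 99
set obs 1500
generate CompanyID = ceil(_n/10)
bysort CompanyID: generate Panel_Year = 2004 + _n
generate Industry = ceil(CompanyID/5)
generate x = rnormal()
generate Y = x + rnormal()

* Errors out if CompanyID and Panel_Year do not uniquely identify observations
isid CompanyID Panel_Year
* Errors out if any firm appears in more than one industry
bysort CompanyID (Industry): assert Industry[1] == Industry[_N]

xtset CompanyID Panel_Year
xtreg Y x i.Panel_Year, fe vce(cluster Industry)
* Adding i.Industry changes nothing: the time-invariant industry
* indicators are collinear with the firm effects and are omitted
xtreg Y x i.Panel_Year i.Industry, fe vce(cluster Industry)
```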

I can't advise you further on managing the autocorrelation. I am an epidemiologist, and we are very rarely in a position in our discipline to have high-frequency data that exhibits autocorrelation. Our data gathering mechanisms are sufficiently cumbersome and expensive that our longitudinal data sets usually gather data at long intervals, such that autocorrelation is simply not an issue. I have very little experience working with the kind of data where autocorrelation has to be taken into account, so I won't pretend I can give you guidance on it.
Pete Henry

Join Date: Feb 2022

Posts: 10
#22

09 Apr 2022, 09:09

Clyde, thanks again!

Yes, my data are structured such that there are about 30 industries, each of which comprises roughly 8 to 400 companies, with each company covering more or less 15 years from 2005 to 2019 (unbalanced).

"then you are correct that industry level effects will still be accounted for by absorption into the company level fixed effects. Industry level effects, however, will not be estimable in this model."

So this means that with

Code:

xtset CompanyID Panel_Year
xtreg Y x x x, fe vce(cluster Industry)

I would end up with firm and year fixed effects. Since, as you already guessed, each firm is uniquely assigned to one industry, firm fixed effects would also account for industry-level fixed effects.

But if I run a, let's call it, "true industry fixed effects" model (as I guess most journal papers do when they state "Industry and Year fixed effects included", or sometimes "Industry and Year dummies included"), I would think of regression models that look like this and seem to me (almost) equivalent/identical:

Code:

areg Y x x x i.Panel_Year, absorb(Industry) vce(cluster Industry)
reg Y x x x i.Panel_Year i.Industry, vce(cluster Industry)
xtset Industry
xtreg Y x x x i.Panel_Year, fe vce(cluster Industry)
reghdfe Y x x x , absorb(Industry Panel_Year) vce(cluster Industry)

The "true" industry and year fixed effects models would in this case yield different estimates (β) compared to the "company-year" approach with its (somewhat different?) consideration of industry fixed effects. Is this what you mean by "Industry level effects, however, will not be estimable in this model."?
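
A sketch on simulated data (all names and the data-generating process are hypothetical) that stores the slope on x from the three built-in industry fixed effects variants, so that they can be compared with each other and with the firm-level model. -reghdfe- is left out here only because it is a community-contributed command that may not be installed:

```stata
clear
set seed 2468
set obs 1500
generate CompanyID = ceil(_n/10)
bysort CompanyID: generate Panel_Year = 2004 + _n
generate Industry = ceil(CompanyID/5)
generate x = rnormal() + Industry/10
generate Y = 0.5*x + Industry/5 + rnormal()

areg Y x i.Panel_Year, absorb(Industry) vce(cluster Industry)
scalar b_areg = _b[x]
regress Y x i.Panel_Year i.Industry, vce(cluster Industry)
scalar b_reg = _b[x]
xtset Industry
xtreg Y x i.Panel_Year, fe vce(cluster Industry)
scalar b_xt = _b[x]
* The three industry fixed effects slopes, side by side
display b_areg "  " b_reg "  " b_xt

* Firm fixed effects use only within-firm variation,
* so this slope need not match the ones above
xtset CompanyID Panel_Year
xtreg Y x i.Panel_Year, fe vce(cluster Industry)
```

The three industry-level variants are the same within-industry estimator and should agree on the point estimate; their clustered standard errors may differ somewhat because the commands apply different finite-sample adjustments.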

To me, xtreg seems the handiest, since it also allows other tests like hausman (fe/re) or xttest3 (ssc) for groupwise heteroskedasticity. But, as also stated in other threads, xtset with only a panelvar comes at the cost of not allowing any lag or first-difference operators (L.Var; D.Var).

So somehow I can't think of any possibility that allows for both industry and year fixed effects and lags/leads (which, in my understanding, are necessary for testing autocorrelation in an industry-year setting, manually or with xtserial). Therefore, even if you are not familiar with autocorrelation approaches: can you think of any regression model setting that enables industry and year fixed effects while allowing for lags, or would you think that this simply isn't possible?

I first thought of " egen industry_firm = group(CompanyID Industry) " and then use " xtset industry_firm Panel_Year ", but in essence this seems to me the same as "xtset companyid Panel_Year".
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#23

09 Apr 2022, 10:49

If you use -xtset Industry-, or, equivalently use -areg- or -reghdfe- and absorb Industry, your model will be ignoring the non-independence of observations within firms. It will give estimates of outcome differences among the industries. And you will not be able to deal with autoregression/serial correlation issues. The inability to handle autoregression/serial correlation, by the way, is not some idiosyncrasy of the -xtset- command. Lags and leads are mathematically undefinable in this model because if I am looking at a 2009 observation for industry X, there are many 2008 observations for industry X--namely one for each firm!--so, which one is "the" lag for a given observation? Evidently you will say that the lag should be the one from the same firm. Agreed! But that's why -xtset- would require firm as the panel variable, so it would know to track things on firms, not industries.
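
The point about lags can be seen on a tiny simulated data set (hypothetical names; 10 firms with 4 years each, 5 firms per industry):

```stata
clear
set obs 40
generate CompanyID = ceil(_n/4)
bysort CompanyID: generate Panel_Year = 2004 + _n
generate Industry = ceil(CompanyID/5)

* Several firms share each Industry-year combination,
* so Stata refuses this declaration
capture noisily xtset Industry Panel_Year
* A nonzero return code signals the failure
display _rc

* With the firm as the panel variable, L. and D. are well defined
xtset CompanyID Panel_Year
generate lag_year = L.Panel_Year
```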

Ultimately your dilemma arises from the fact that you are trying to fit what is, in reality, a three-level model into the Procrustean bed of a two-level analysis. You are being pushed into this dilemma by restricting your approaches to fixed-effects models. There are no three-level fixed-effects models, because linear algebra. Random-effects (mixed) models do not suffer these limitations; though they have different problems of their own and are strongly (in my view, excessively rigidly) deprecated in some disciplines.
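
For orientation only, a three-level random-intercept model could be sketched as follows. The data are simulated with hypothetical names, and whether such a specification is acceptable in a given discipline is a separate question:

```stata
clear
set seed 777
set obs 1500
generate CompanyID = ceil(_n/10)
bysort CompanyID: generate Panel_Year = 2004 + _n
generate Industry = ceil(CompanyID/5)

* Simulated industry-level and firm-level random intercepts
generate u_ind = rnormal()
bysort Industry: replace u_ind = u_ind[1]
generate u_firm = rnormal()
bysort CompanyID (Panel_Year): replace u_firm = u_firm[1]

generate x = rnormal()
generate Y = 0.5*x + u_ind + u_firm + rnormal()

* Firms nested within industries, observations nested within firms
mixed Y x i.Panel_Year || Industry: || CompanyID:
```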
Pete Henry

Join Date: Feb 2022

Posts: 10
#24

10 Apr 2022, 09:05

"But that's why -xtset- would require firm as the panel variable, so it would know to track things on firms, not industries."

Thanks for that clarification!

Since my perspective is at the company level, as in "how could a change in x affect the y in a company", "xtset industry" does not seem to be the appropriate approach to me, as it only looks at the industry differences, not at the individual differences.

Thus, the industry fixed effect arises in the firm-side approach ("xtset Company Year") because firms are, by their very nature, tied to their industry to some degree in their actions and operations. Let's say a certain law is enacted explicitly for an industry group that, e.g., determines the minimum required board size (as number of persons), which affects all firms together. Fixed industry effects arise here because, even if fixed firm effects are the dominant view, all firms in the time-fixed industry assignment are affected. However, the time-variant estimators are only estimated for the variables that individually affect these firms. From Stata's point of view, this is evident from the fact that if I set "xtset Company Year" and run the regression "xtreg y x x i.Panel_Year i.Industry, fe vce(cluster Industry)", all but one of the industry dummies are "omitted". Agreed?
(i.Panel_Year added because "xtset Company Year" automatically includes only company fixed effects).

Finally: allowing for correlated standard errors by clustering at the industry level would then be justified, or even preferable, because companies operate within an industry context (as in the standard textbook example: study hours -> test score for students of one class vs. students of other classes, where std. errors are clustered at the class level - https://www.statology.org/clustered-standard-errors/). So this would be a deviation from the common practice of clustering std. errors at the same level as the xtset panel variable, namely "xtset companyid year" and "xtreg y x x i.Panel_Year, fe vce(cluster companyid)".

Does this seem plausible?
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#25

10 Apr 2022, 10:41

I don't know how to respond to #24 other than by repeating #23. You are looking for the least wrong version of a two-level analysis of an inherently three-level data generating process. It is not uncommon to see this here on Statalist--these dilemmas are inescapable if you restrict yourself to using fixed-effects models. It seems that most of the time people settle on -xtset-ing at the firm level and clustering errors at the industry level (or the analogous levels in their problem.) Whether that is actually the least wrong approach probably depends on detailed specifics of the problem and deciding about it requires detailed substantive considerations about what is related to what and how strongly in the real world, as these determine which of the inconsistencies between the analysis and the data generating process has the least distorting effects on the results.