Subsample analysis using a binary variable to split the sample

Rochelle Zhang

Join Date: Jun 2025

Posts: 0
#1

19 Jul 2024, 09:58

Dear all,
This is not a Stata coding question but an econometric one.

My sample consists of firm-year observations. My main explanatory variables are NET1 and NET2, and my model looks like this:

Y = NET1 + NET2 + control variables

Step 1: I run a panel regression to test the effects of NET1 and NET2.

Step 2: I split my sample at the median value of NET1 into Hi_NET1=1 (i.e., NET1 > sample median) and Hi_NET1=0 (i.e., NET1 <= sample median), then run the regression separately for each group.

I want to run two regressions:
Y = NET1 + NET2 + control variables if Hi_NET1=1
Y = NET1 + NET2 + control variables if Hi_NET1=0

I want to see whether the effect of NET2 is more or less important in each subsample.

My coauthor says I can’t include NET1 as an explanatory variable because I used it to partition the sample. Is that true?

Thanks,
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#2

19 Jul 2024, 10:16

Since there is variation in NET1 within each subsample, you should still be able to estimate the models. I can also understand why your coauthor might be concerned: in a sense, you use NET1 to determine which observations enter each model when you create the two samples, and then you use NET1 again to predict your outcome. My take is that what you are doing is okay as long as you think variation in NET1 predicts the outcome to some extent regardless of whether NET1 is low or high.
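To make the estimability point concrete, here is a minimal sketch in Python (standard library only). The numbers are made up for illustration and are not the OP's data; NET2, the controls, and the panel structure are left out, and `ols_slope` is a hypothetical helper. The toy sample is split at the median of NET1 and a one-regressor OLS slope is fitted in each half. Because NET1 still varies inside each half, both slopes can be computed.

```python
import statistics

def ols_slope(x, y):
    """Slope from a one-regressor OLS fit of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("no variation in x: slope is not identified")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

# Toy data (illustrative only): built with slope 0.5 below the median
# of NET1 and slope 2.0 above it.
net1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.5, 2.0, 2.5, 3.0, 4.25, 6.25, 8.25, 10.25]

cutoff = statistics.median(net1)
low = [(a, b) for a, b in zip(net1, y) if a <= cutoff]   # Hi_NET1 = 0
high = [(a, b) for a, b in zip(net1, y) if a > cutoff]   # Hi_NET1 = 1

# zip(*pairs) turns the list of (x, y) pairs back into an x tuple and a y tuple.
slope_low = ols_slope(*zip(*low))
slope_high = ols_slope(*zip(*high))
print(slope_low, slope_high)
```

If NET1 were constant within a subsample, `ols_slope` would raise an error instead, which is the situation in which the coauthor's objection would bind mechanically.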

That said, if I were you I would not use two models for this, because I would prefer to use all of the information available in a single model. Instead, I would estimate a model where I interact NET1 and Hi_NET1 to see if the size of the relationship depends on Hi_NET1. Then I would follow up with a model that includes a quadratic term for NET1. Then I would plot the predicted values by values of NET1 and mark the point where low becomes high with a vertical dashed line on the plot. If you are right and there is a difference in the size of the effect depending on low or high, I'd expect that dashed line to mark an inflection point in the plot where the effect either accelerates or decelerates at the inflection point. I might then check the results against your two model approach just to see if the models agree as a robustness test. I would probably present the curvilinear model in a paper.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#3

19 Jul 2024, 12:08

I basically endorse the recommendations in #2, but with a few tweaks.

1. If you include c.Net1##i.Hi_NET1 in the model, then the model will also include a term for Hi_NET1 itself. (If you don't know factor variable notation well enough to understand this, read -help fvvarlist- before proceeding.) This means that in addition to allowing the marginal effect of Net1 to differ between low and high values of Net1, you are also modeling a "jump" in the outcome at the median value of NET1. That may be perfectly reasonable as a model of the real-world data generating process, but typically it is not. More typically, you want the Net1:outcome relationship to be continuous there, in effect constraining the "jump" to be 0. So you could do that by avoiding the ## notation and using instead c.Net1 and c.Net1#i.Hi_NET1. Alternatively, you could get the same kind of results by specifying the effects of Net1 in your model with a linear spline instead. And either way, you might find your results simpler to interpret if you first center Net1 at the median, so that your Hi_Net1 variable becomes an indicator of positive vs negative values of the variable, and the model intercept becomes the expected outcome at the median value of Net1.

2. I would explore the relationship graphically before including a quadratic term in Net1. But assuming that there is reason to have the quadratic term, it is important that you interact both the linear and quadratic components with Hi_Net1. To interact a variable with the linear component but not the quadratic is to fit a model that constrains the parabola in a bizarre way that seldom reflects real-world conditions. By interacting both the linear and quadratic terms, you specify a fit among all possible parabolas that pass through the same y-intercept--a far less constrained situation, and that constraint is justified by the continuity considerations mentioned in the preceding paragraph. And again, I would probably use Net1 centered at its median for these quadratic analyses as well.
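The continuity argument in points 1 and 2 can be checked numerically. The following is a minimal sketch in Python, not Stata output: the coefficient values, the cutoff of 2.5, and the function names are all hypothetical. Each function omits a separate Hi term and interacts Hi only with the Net1 terms. In this sketch the two sides meet at the cutoff when Net1 is centered at the median; with raw (uncentered) Net1, the product of Net1 and Hi itself steps from 0 to the median at the cutoff, so a jump remains.

```python
def piecewise_linear(net1, cutoff, center, b0=1.0, b1=0.5, d1=0.8):
    """y = b0 + b1*x + d1*(x*Hi), with no separate Hi term."""
    hi = 1 if net1 > cutoff else 0
    x = net1 - cutoff if center else net1
    return b0 + b1 * x + d1 * x * hi

def piecewise_quadratic(net1, cutoff, center,
                        b0=1.0, b1=0.5, b2=0.1, d1=0.8, d2=0.3):
    """Both the linear and the squared term are interacted with Hi."""
    hi = 1 if net1 > cutoff else 0
    x = net1 - cutoff if center else net1
    return b0 + b1 * x + b2 * x**2 + hi * (d1 * x + d2 * x**2)

def jump_at_cutoff(model, cutoff, center, eps=1e-9):
    """Prediction just above the cutoff minus prediction just below it."""
    return model(cutoff + eps, cutoff, center) - model(cutoff - eps, cutoff, center)

cutoff = 2.5  # hypothetical sample median of Net1
for model in (piecewise_linear, piecewise_quadratic):
    print(model.__name__,
          "centered:", round(jump_at_cutoff(model, cutoff, True), 6),
          "uncentered:", round(jump_at_cutoff(model, cutoff, False), 6))
```

With centering, both parabolas (or both lines) pass through the same intercept `b0` at the cutoff, which is the constraint described in point 2.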
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#4

19 Jul 2024, 14:01

In the second quadratic model I would not include the Hi_Net1 as a predictor at all. The idea here is to use the quadratic curve to model the continuous (rather than discrete) change in the effect of net1 as net1 increases. For example, if OP thinks the effect should be small for values considered "low", then that part of the curve should be relatively flat. Likewise if OP believes values considered "hi" should have a large effect, then that part of the curve should be relatively steep. You should even be able to model a case where the low side should be negative and high should be positive with a quadratic curve, which does not necessarily need to look like a parabola if the domain is bounded by the range given by real observations. Notice that the quadratic model without the Hi_Net1 variable is very similar to the model interacting Hi_Net1 and Net1. In the former we interact the variable with a continuous version of itself, while in the latter case we interact the variable with a discrete version of itself. Notice as well, in principle it should be possible to take the average instantaneous slope of the curved line under the quadratic model before and after the hi/low cutoff to get something that should be equivalent to the splines idea in #3, unless I'm misunderstanding something there.
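The "average instantaneous slope" idea can be sketched as follows. This is a minimal Python illustration with hypothetical coefficients and a hypothetical range for net1 centered at its median (so the hi/low cutoff sits at 0); it is not fitted to any data. It averages the derivative of the quadratic uniformly over each side of the cutoff. An average over the observed values of net1 would instead weight by where the data actually fall, and the sketch does not claim that these averages equal the slopes a fitted spline would report.

```python
def quad(x, a=1.0, b=0.5, c=0.25):
    """Hypothetical quadratic prediction a + b*x + c*x^2 in centered net1."""
    return a + b * x + c * x * x

def quad_slope(x, b=0.5, c=0.25):
    """Instantaneous slope (first derivative) of quad at x."""
    return b + 2 * c * x

def average_slope(lo, hi, n=1000):
    """Mean of the instantaneous slope at n evenly spaced midpoints on [lo, hi]."""
    step = (hi - lo) / n
    return sum(quad_slope(lo + (i + 0.5) * step) for i in range(n)) / n

def secant_slope(lo, hi):
    """Slope of the straight line joining the curve's endpoints on [lo, hi]."""
    return (quad(hi) - quad(lo)) / (hi - lo)

# Hypothetical observed range of centered net1: -2 to 3, cutoff at 0.
avg_low, avg_high = average_slope(-2.0, 0.0), average_slope(0.0, 3.0)
print(avg_low, secant_slope(-2.0, 0.0))
print(avg_high, secant_slope(0.0, 3.0))
```

For a quadratic, the uniform average of the slope over an interval coincides with the secant slope across that interval, so each side of the cutoff can be summarized by a single "low" or "high" slope.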

I agree with the rest of #3, especially the point about centering variables.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#5

19 Jul 2024, 14:44

From #2:

That said, if I were you I would not use two models for this, because I would prefer to use all of the information available in a single model. Instead, I would estimate a model where I interact NET1 and Hi_NET1 to see if the size of the relationship depends on Hi_NET1. Then I would follow up with a model that includes a quadratic term for NET1.

I (mis)interpreted that to mean adding a quadratic term to the interaction model. I agree with #4 that a non-interacted model with a quadratic term is, in a sense, a generalization of the interaction model itself. And I agree that a quadratic term with no interaction is more sensible than one with interaction. My concern was with the idea of an interaction involving the linear term but not the quadratic one.

Also from #2:

Then I would plot the predicted values by values of NET1 and mark the point where low becomes high with a vertical dashed line on the plot. If you are right and there is a difference in the size of the effect depending on low or high, I'd expect that dashed line to mark an inflection point in the plot where the effect either accelerates or decelerates at the inflection point.

A quadratic curve will never have an inflection point. The lowest degree polynomial that can do that is a cubic. If an inflection point is truly expected, you need to go to the third power.
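Spelled out for a generic quadratic and cubic in x = NET1 (the coefficient symbols a, b, c, d are notation introduced here for illustration, not taken from the thread):

```latex
\begin{aligned}
f(x) &= a + bx + cx^2
  &&\Rightarrow\quad f''(x) = 2c
  \quad\text{(constant, so the curvature never changes sign)} \\
g(x) &= a + bx + cx^2 + dx^3,\; d \neq 0
  &&\Rightarrow\quad g''(x) = 2c + 6dx,
  \quad\text{which changes sign at } x = -\frac{c}{3d}
\end{aligned}
```

An inflection point requires the second derivative to change sign, which the quadratic cannot do and the cubic does exactly once.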
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#6

19 Jul 2024, 15:17

I think I am misusing the term "inflection point" here honestly. I'm expecting more of an "elbow" based on OP's description of the modeling problem, which would not constitute a change in the direction of curvature (i.e. an inflection point).

If there really is some kind of sharp discontinuity between the two portions of net1 (basically, a sharp turn in the effect size after transitioning from low to high), a quadratic model might not be appropriate. In that case you'd probably want to stick with the discrete model or consider using splines. I don't know whether this is the case, but it is something OP might want to consider.
Rochelle Zhang

Join Date: Jun 2025

Posts: 0
#7

19 Jul 2024, 16:40

Many thanks to Daniel and Clyde! Let me carefully go through your posts and change my model accordingly!

Best,