  • Logit with multiple (3+) fixed effects in context of dyadic data

    Hi everyone,

    I am studying group behaviour such as herding and the impact of these “biases” on the decision-making behaviour of individuals. My data set consists of 360 daily observations of investments made by individuals in firms (the same individual can invest in multiple firms -> not one-to-many but many-to-many). I extended the data set to not only include the realized investments (~24,000) but also unrealized ones representing potential alternative investments individuals decided not to pursue (~1,000,000).

    My data looks as follows: I look at realized (tie = 1) and unrealized (tie = 0) investments made by investors (invest_id) into companies (name) that exhibit certain characteristics at t-1, such as the number of already committed investors (lag_inv_co~t). To restrict the sample size, I randomly selected 5 unrealized ties for each realized one (following prior literature).
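    The 5:1 sampling step could be sketched in Stata roughly as follows. This is only a sketch under assumptions: it presumes a variable case_id linking each realized tie to its pool of candidate unrealized ties (case_id and the seed value are illustrative inventions, not variables from the actual data set):

```stata
* Hedged sketch: for each realized tie (tie == 1), keep 5 randomly
* chosen unrealized ties (tie == 0) from the same case_id pool
set seed 12345
generate double u = runiform()
sort case_id tie u
by case_id tie: generate n = _n
keep if tie == 1 | n <= 5    // the realized tie plus 5 random controls
drop u n
```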
    [Screenshot of example data: data.jpg]

    As I want to study effects such as herding, best measured with lagged variables, I want to control as best as possible for any unobserved time-invariant heterogeneity across firms (“name”) and investors (“invest_id”).

    Naturally I thought of using two fixed effects at the firm and investor level, which is, however, very problematic in the context of a logistic regression (incidental parameter problem: this would be fine for the "name" FE since T > ~40, but for the investor FE with T = 6 it very likely leads to biased coefficients).

    A natural alternative would be to estimate a conditional logit, but to date there is no practical or even theoretical implementation that allows for two or more FEs (in addition to the time FE). Additionally, I want to cluster SEs across both FEs, which is also complicated to implement in the context of a conditional logit (the clus_nway package exists, but does not work with clogit).

    In a perfect world I would be looking for a command like the following:

    clogit tie lag_raised_amount lag_inv_count lag_target i.day i.date, group(name invest_id) vce(cluster name invest_id)
    -> clogit only allows for one fixed effect
    or
    clogit tie lag_raised_amount lag_inv_count lag_target i.day i.date i.name, group(invest_id) vce(cluster name invest_id)
    -> clogit works assuming coefficient estimates are not biased as T>40 for i.name; but clustered standard errors are only allowed for variables that also appear in group()

    Ideally this would also account for the rarity of the events (e.g., via relogit), as realized events make up only about 2.4% of the whole data set.

    Does anyone have an idea how I could solve this problem? Or is there perhaps a totally different approach to modelling this data that would still let me draw a conclusion about how individuals are influenced in their investment decisions by the number of committed investors at t-1 (i.e., herding)?

    Any hints are appreciated!
    Jan
    Last edited by Jan Meyer; 19 Oct 2017, 06:46.

  • #2
    Well, I think you're trying to put a square peg into a round hole here. In fact, your presentation may even oversimplify the difficulties. If your selection of 5 controls (tie = 0) for each case (tie = 1) was truly completely random and just used to reduce the sample size to manageable levels, then it raises no other issues. But if this is actually a 5:1 match based on some characteristics of investor, firm, or both, then it introduces still more dependencies that you have not addressed in the model.

    What you have here is a crossed effects (perhaps reduced to multiple membership by the selection of 5 controls per case) multi-level model, and I would model it as such using -melogit- or -meqrlogit-. This becomes even more urgent if you really have 5:1 matched 6-tuples, which form yet another level nested within the crossed investor and firm effects. I realize that the finance community looks skeptically at multi-level models. But the alternative is to force your data into some mis-specified model. I'm not sure why consistent estimates of a mis-specified model are preferable to possibly inconsistent estimates of a more correct model.



    • #3
      Thank you so much for the suggestions, Clyde. I agree with you that one rarely finds multi-level models in the finance literature, but I am absolutely willing to pursue the "correct" approach (even if this means that in a potential review process the model requires more explaining ;-) )

      Let me give you a bit more detail about how I selected the 5 controls for each case:
      On day t, investor i makes an investment in the funding campaign of company c. I then look back 7 days (let's assume for now that this is the correct time frame) and consider every other funding campaign of companies cn as a potential alternative to the investment Y_itc. This includes different daily observations of the same funding campaign, e.g., on t-2 and t-5.
      This results in ~200 unrealized ties for each realized tie.
      I then sample 5 of these potential unrealized investments for each realized investment (depending on the final model, I may enforce a restriction that only one observation per unique funding campaign is allowed).

      Is this something you would define as "truly random", or does the fact that each unrealized tie can only be sampled from funding campaigns active in the prior week already constitute "matching" and thus violate the assumption of randomness?

      Thanks a lot!
      Jan
      Last edited by Jan Meyer; 20 Oct 2017, 10:14.



      • #4
        Is this something you would define as "truly random", or does the fact that each unrealized tie can only be sampled from funding campaigns active in the prior week already constitute "matching" and thus violate the assumption of randomness?
        A truly random match would be generated without any reference to a particular case in selecting controls. That is, one might segregate out all the potential controls and just select a random subset of the desired size, with no conditioning at all on anything having to do with the cases. But what you describe involves interval matching on the time period, so that is not truly random. These are matched 6-tuples. It may well be that the matching is loose and the intra-tuple correlation of the outcome is very low. It may even turn out that when you run the multi-level model the intra-tuple correlation is so low that you decide to remove that level from the model. But we can't assume it will work out that way. So I would start out with a multi-level model that includes the 6-tuple as a level.
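        Concretely, a specification along these lines might look as follows. This is only a sketch: tuple_id (identifying each matched 6-tuple) is an assumed variable name, and whether this exact arrangement of levels is appropriate depends on how the tuples relate to the crossed factors:

```stata
* Sketch: crossed investor and firm effects, with the matched
* 6-tuple (tuple_id, an assumed identifier) as an additional level
melogit tie lag_raised_amount lag_inv_count lag_target ///
    || _all: R.invest_id || _all: R.name || tuple_id:, or
```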



        • #5
          Over the weekend I fitted the following model (melogit does not converge):

          Code:
          meqrlogit tie lag_raised_log lag_inv_count_log i.lag_target lag_pct_needed || _all:inv_id || name:, or
          [Screenshot of -meqrlogit- output: Picture1.jpg]
          To my rather untrained "multi-level eye", this looks alright except for the _all random-effects parameter (the CI for the constant is 0 to Inf). Additionally, this model already took 1 hour to converge although it did not include any fixed effects (day, weekday, month) and was based on only a subset of n = 6,000 instead of ~43,000. Extrapolating from that, I guess that fitting the full model will take >> 1 day (also because meqrlogit is not parallelized).

          I additionally spent some time studying multi-level models. I found one paper in Administrative Science Quarterly (Hallen, 2008) that uses a structurally comparable data set and states the following: "Random effects are only appropriate in contexts involving a single source of interdependence and are thus inappropriate for reducing autocorrelation among network dyads (Gulati and Gargiulo, 1999)."
          Is this statement simply incorrect or is that something I need to consider?



          • #6
            Before you go about interpreting this model: it appears to be mis-specified. I don't know what the variables in your model are, so perhaps I am missing something here, but I'm concerned about the random-effects specification. It looks as if your intent was to specify crossed random effects (or multiple membership) at the inv_id and name levels. If so, you omitted the crucial R. in front of inv_id. Alternatively, you were looking for a random slope for some predictor variable inv_id at the _all: level, but then you omitted including inv_id among the fixed effects. In either case, what you have specified is not what you want, and the model as written does not appear to be a valid model of anything.

            Note also that with this corrected, you may find that -melogit- will converge. (Although, to be honest, there aren't a lot of reasons to prefer -melogit- to -meqrlogit- or vice versa, unless you need to use some features that one supports but not the other.) So you need to fix the code and re-run it.

            Yes, multilevel logistic models can be very slow. If you are trying to fit a random-slopes model here, and if the variance components are peripheral to your research goals and do not have to be very precisely estimated, you could use the -laplace- option to speed things up. If you are trying to do crossed effects, then -melogit- will default to the Laplace approximation and that will speed things up compared to what you have. It may still be very slow, but not quite as bad.
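            For instance, if crossed effects were the intent, the corrected command would presumably be the model from #5 with the R. prefix supplied (a sketch, not an endorsement of this exact specification):

```stata
* The #5 model with the missing R. prefix added before inv_id
meqrlogit tie lag_raised_log lag_inv_count_log i.lag_target lag_pct_needed ///
    || _all: R.inv_id || name:, or
```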

            As for your question about the appropriateness of random effects, I don't know. The statement you quote is fairly vague. I don't know what "involving a single source of interdependence" means. If I were inclined to check out the references you mention (and I don't have time to do that today), I would not be able to find them based on the fragmentary information you have provided.

            So let's talk about good posting practices. Better yet, read the FAQ which covers it in depth. Among the points mentioned that would improve your post:

            1. Spell out abbreviations on first use: this is an interdisciplinary forum and abbreviations that everyone in your discipline understands may well be utterly unknown to the majority of Forum participants. Only use abbreviations if you can reasonably assume that every educated person would instantly recognize them.

            2. When giving references provide complete references. Again, this is a multidisciplinary forum. Papers that may have folklore status in your discipline are likely to be unheard of by most readers. Just citing author name(s) and dates is nowhere near sufficient information for people to find them.

            3. Don't use screenshots to show Stata output. Copy directly from Stata's Results window or your log file and paste into the Forum editor between code delimiters. (FAQ #12 explains how to use those.) Screenshots are often difficult or impossible to read. (Yours was just barely readable on my computer.)

            4. Don't use screenshots to show data examples. This is even more important than #3. Even when the screenshot is legible, it is impossible to import the data into Stata. So if somebody who wants to help you needs to try out and test a solution to make sure it works with your data, a screenshot prevents them from doing that unless they are willing to manually type your data into the data editor. Few people are willing to do that. The helpful way to show data examples is to use the -dataex- command. Run -ssc install dataex- and then -help dataex- for instructions.

            If you follow all the guidance in the FAQ you improve your chances of getting timely and helpful responses.






            • #7
              Thanks for the suggestions, Clyde. For any future posts I will consider the posting practices you mentioned (unfortunately, it is apparently not possible to edit old posts after a certain time).

              Re: meqrlogit: I was indeed aiming to specify crossed random effects and simply forgot the "R.". I probably read the vignette twice over the weekend and still managed to miss that... I will re-estimate the model with that specification.

              Re: the quote I mentioned: the referenced paper (Gulati, R., & Gargiulo, M. (1999). Where Do Interorganizational Networks Come From? American Journal of Sociology, 104(5), 1439–1493.) uses a "random-effects panel probit model". I am not sure how this relates to the quote I mentioned from Hallen, B. L. (2008). The Causes and Consequences of the Initial Network Positions of New Organizations: From Whom Do Entrepreneurs Receive Investments? Administrative Science Quarterly, 53(4), 685–718.



              • #8
                Thanks for the complete references. The Administrative Science Quarterly is not available through my institution. I did download the Gulati & Gargiulo paper. I cannot find anything in there that sounds like it supports what you say Hallen said. Perhaps I am missing something, or perhaps Hallen has misinterpreted them, or perhaps you have misinterpreted Hallen. Based on information available to me I can't tell.

                That said, I am not aware of any reasons, beyond the general limitations of random effects models (which rely on the assumption that the random effects are independent of the measured predictors), why they would not be applicable in your situation. If someone else following this thread knows otherwise, I hope he/she will chime in, as I would love to learn about it too.
