Logistic/probit regression with unbalanced panel data

Paolo Maldini

Join Date: Feb 2022

Posts: 49
#1

08 Feb 2023, 03:59

I want to run a logistic regression where my outcome variable (i.e. drop_status) equals 1 if an individual stops using a given sub-reddit after a government policy is introduced, and 0 otherwise.

If my objective is to learn about each user's probability of dropping out of Reddit based on their online behavior, I am not sure whether my unit of analysis should be the post or the username.
Specifically, each post is scored with a sentiment variable (i.e. 1 for positive, -1 for negative, or 0 for neutral sentiment), so if I want to learn about Kenny's probability of dropout based on the sentiment of all of his posts, should I compute an average sentiment for each username and then run the model with a continuous variable based on that average?

Would this mean running the following where each row would represent one user and their average sentiment:
```
* one row per username: mean of mood, keeping the user-level outcome
collapse (mean) mood (first) drop_status, by(username)
```

Here is a data example:
```
dataex username date mood drop_status
```
```
----------------------- copy starting from the next line -----------------------
* Example generated by -dataex-. For more info, type help dataex
clear
input str36 username int date long mood byte drop_status
username date mood drop_status
Kenny 2020-09-02. -1 1
Kenny 2020-09-03. -1 1
Kenny 2020-09-07. 1 1
Cartman 2020-09-03. -1 0
Cartman 2020-09-06. -1 0
Cartman 2020-09-08. -1 0
Mackey 2020-09-03. 0 0
Mackey 2020-09-04. 0 0
Mackey 2020-09-08. 1 0
Kyle 2020-09-13. -1 1
Kyle 2020-09-14. -1 1
------------------ copy up to and including the previous line ------------------
```
Tags: logit, panel, panel data, probit, regression
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

08 Feb 2023, 04:28

Paolo:
Do you have a panel dataset?
Stata does not care whether it's balanced or not.
If you have, as it seems, a binary regressand, go with -xtlogit-.
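
A minimal sketch of what this suggestion could look like, assuming post-level data shaped like the example in #1. The toy rows, the string-to-date conversion, and the variable name `user_id` are illustrative assumptions, not the poster's actual setup.

```stata
* Toy post-level data modelled on the example in #1 (illustrative only)
clear
input str10 username str10 datestr byte mood byte drop_status
"Kenny"   "2020-09-02" -1 1
"Kenny"   "2020-09-03" -1 1
"Kenny"   "2020-09-07"  1 1
"Cartman" "2020-09-03" -1 0
"Cartman" "2020-09-06" -1 0
"Cartman" "2020-09-08" -1 0
"Mackey"  "2020-09-03"  0 0
"Mackey"  "2020-09-04"  0 0
"Mackey"  "2020-09-08"  1 0
"Kyle"    "2020-09-13" -1 1
"Kyle"    "2020-09-14" -1 1
end

* Convert the string date to a Stata daily date
gen date = date(datestr, "YMD")
format date %td

* xtset needs a numeric panel identifier (user_id is a hypothetical name)
egen long user_id = group(username)

* Declare the panel; Stata accepts it and simply reports it as unbalanced.
* If a user can post more than once per day, declare only the panel
* variable instead: xtset user_id
xtset user_id date

* Random-effects panel logit. -capture noisily- lets the sketch run to the
* end even if this tiny toy sample fails to converge.
capture noisily xtlogit drop_status mood, re
```

One caveat worth checking on the real data: if drop_status never changes within a username, as in the data example, `xtlogit, fe` would drop every group, and a random-effects fit may be poorly identified.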

Last edited by Carlo Lazzaro; 08 Feb 2023, 04:38.

Kind regards,
Carlo
(Stata 19.0)
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#3

08 Feb 2023, 05:29

Originally posted by Carlo Lazzaro

Paolo:
Do you have a panel dataset?
Stata does not care whether it's balanced or not.
If you have, as it seems, a binary regressand, go with -xtlogit-.

Thanks Carlo for your great support!
Perhaps panel data is an incorrect description. It's more that I want to predict a binary outcome variable for each username, but my independent variables (e.g. sentiment per post) are essentially repeated observations for each username. Therefore, I thought that I would first need to collapse my repeated sentiment observations by username, and then run the logistic regression. However, the downside of aggregating to the username level is that I would lose information, so my preference is to run a model predicting dropout status while considering the sentiment of each post written by a person.
Happy to clarify.
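
The two options weighed above can be put side by side in a short sketch. This is only an illustration on toy rows modelled on the example in #1: the names `user_id`, `avg_mood`, and `n_posts` are hypothetical, and the post-level logit with standard errors clustered by user is one possible way to keep every post, not something proposed in this thread.

```stata
* Toy post-level data modelled on the example in #1 (illustrative only)
clear
input str10 username byte mood byte drop_status
"Kenny"   -1 1
"Kenny"   -1 1
"Kenny"    1 1
"Cartman" -1 0
"Cartman" -1 0
"Cartman" -1 0
"Mackey"   0 0
"Mackey"   0 0
"Mackey"   1 0
"Kyle"    -1 1
"Kyle"    -1 1
end

* Numeric user identifier for clustering (hypothetical name)
egen long user_id = group(username)

* Option 1: keep every post and allow for within-user correlation
* by clustering the standard errors on the user
logit drop_status mood, vce(cluster user_id)

* Option 2: one row per user, with average sentiment and number of posts
collapse (mean) avg_mood = mood (count) n_posts = mood ///
    (first) drop_status, by(username)
logit drop_status avg_mood
```

With only four users the estimates here mean nothing; the point is the shape of the two datasets. Option 2 discards the post-to-post variation in mood, which is the loss of information described above.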

Last edited by Paolo Maldini; 08 Feb 2023, 05:32.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

08 Feb 2023, 11:40

Paolo:
If we assume that each message is actually a measurement instance of sentiment (or whatever; I'm probably too much of an old fogey to be familiar with this kind of, I presume, social media), and if the dates on which these messages are posted do not differ that much among users, then, provided the users are more or less (allowing for attrition) the same ones as in the starting sample, I'd say that you have panel data.

Kind regards,
Carlo
(Stata 19.0)
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#5

09 Feb 2023, 06:05

Originally posted by Carlo Lazzaro

Paolo:
If we assume that each message is actually a measurement instance of sentiment (or whatever; I'm probably too much of an old fogey to be familiar with this kind of, I presume, social media), and if the dates on which these messages are posted do not differ that much among users, then, provided the users are more or less (allowing for attrition) the same ones as in the starting sample, I'd say that you have panel data.

Carlo: thanks for the thorough and clear explanation as always!
You are correct that 1) each message measures a specific social media post, and 2) the dates of Reddit posts don't vary much across users. However, 3) users change over time (i.e. attrition) in my sample, which tracks users for 3 years, so perhaps this is not panel data. For instance, only 40% of the users in my first-year sample remain active by year 3, which is the end of my dataset.
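
A retention figure like the 40% above could be computed along these lines. This is a hedged sketch on made-up rows: the usernames and dates are hypothetical, and "active by year 3" is assumed here to mean at least one post in the final calendar year of the data.

```stata
* Hypothetical post dates spanning three calendar years (illustrative only)
clear
input str10 username str10 datestr
"Kenny"   "2020-09-02"
"Kenny"   "2021-03-15"
"Kenny"   "2022-11-01"
"Cartman" "2020-09-03"
"Cartman" "2020-12-24"
"Mackey"  "2020-10-10"
"Mackey"  "2022-01-05"
"Kyle"    "2021-02-14"
"Kyle"    "2022-06-30"
"Stan"    "2020-09-20"
end

gen date = date(datestr, "YMD")
format date %td
gen year = year(date)

* First and last calendar year in which each user posted
bysort username: egen first_year = min(year)
bysort username: egen last_year  = max(year)

* Flag exactly one row per user so that users, not posts, are counted
egen byte one_per_user = tag(username)

* First and last year covered by the data
summarize year, meanonly
local y_start = r(min)
local y_end   = r(max)

* Users present in the first year, and those of them still posting in the last
count if one_per_user & first_year == `y_start'
local n_start = r(N)
count if one_per_user & first_year == `y_start' & last_year == `y_end'
local n_stay = r(N)

display "Share of first-year users still active in the final year: " ///
    %5.2f `n_stay'/`n_start'
```

Users who first appear after the first year (Kyle in this toy data) are left out of the denominator, which matches the "first year sample" wording above.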
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

09 Feb 2023, 12:08

Paolo:
Personally, I would investigate your dataset as panel data and try to delve into (or speculate on) the possible reasons for the remarkable 3-year attrition rate (or its opposite, that is, what makes some posters "adherent" to Reddit).
It may well be that some literature on this topic is available.

Kind regards,
Carlo
(Stata 19.0)
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#7

09 Feb 2023, 12:23

Originally posted by Carlo Lazzaro

Paolo:
Personally, I would investigate your dataset as panel data and try to delve into (or speculate on) the possible reasons for the remarkable 3-year attrition rate (or its opposite, that is, what makes some posters "adherent" to Reddit).
It may well be that some literature on this topic is available.

Thanks, this is really helpful.