Collapse/generate mean for only some rows in long format?

Anne Christensen

Join Date: Apr 2023

Posts: 28
#1

Collapse/generate mean for only some rows in long format?

09 Oct 2023, 12:43

Hi,
I have repeated measures data on sleep times in long format. I need to generate a baseline mean of the variable tib_2 and tst_2 if they were registered in week 0, and have that mean be on a single row, followed by the remaining observations.

My data look like this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10 id float(days week tib_2 tst_2) str7 pt_responded "QyJPARcsAx" -5 0 540 450 "Ja" "QyJPARcsAx" -4 0 540 300 "Ja" "QyJPARcsAx" -3 0 540 390 "Ja" "QyJPARcsAx" -2 0 375 280 "Ja" "QyJPARcsAx" -1 0 385 255 "Ja" "QyJPARcsAx" 0 0 430 390 "Ja" "QyJPARcsAx" 1 1 550 375 "Ja" "QyJPARcsAx" 2 1 415 360 "Ja" "QyJPARcsAx" 3 1 475 385 "Ja" "QyJPARcsAx" 4 1 645 505 "Ja" "QyJPARcsAx" 6 1 485 430 "Ja" "QyJPARcsAx" 7 1 550 490 "Ja" "QyJPARcsAx" 8 2 550 490 "Ja" "TkcjjAJi72" -7 0 500 300 "Ja" "TkcjjAJi72" -6 0 505 230 "Ja" "TkcjjAJi72" -5 0 520 225 "Ja" "TkcjjAJi72" -4 0 470 350 "Ja" "TkcjjAJi72" -2 0 550 350 "Ja" "TkcjjAJi72" 0 0 525 230 "Ja" "TkcjjAJi72" 1 1 345 305 "Ja" "TkcjjAJi72" 2 1 290 155 "Ja" "TkcjjAJi72" 3 1 310 270 "Ja" "TkcjjAJi72" 4 1 340 210 "Ja" "TkcjjAJi72" 5 1 275 115 "Ja" "TkcjjAJi72" 7 1 300 275 "Ja" end

So far I have done it by generating new variables, computing means for the days before day 0, and then generating yet another variable in which the mean repeats for the days in the first week and is then followed by the consecute observations.

However, I need to drop repeating rows and have only one row for week 0, and this is not so straightforward because respondents have af different number of days in the first week - so I can't just drop if days < 0 for example, because for some their baseline observation is on day -2... I hope this makes sense.

Anyway I am going for something like this:

input str10 id float(days week tib_2 tst_2) str7 pt_responded

"QyJPARcsAx" 0 0 468.333 344.166 "Ja"
"QyJPARcsAx" 1 1 550 375 "Ja"
"QyJPARcsAx" 2 1 415 360 "Ja"
"QyJPARcsAx" 3 1 475 385 "Ja"
"QyJPARcsAx" 4 1 645 505 "Ja"
"QyJPARcsAx" 5 1 565 385 "Ja"
"QyJPARcsAx" 6 1 485 430 "Ja"

I am familiar with the collapse command, is it perhaps possible to use that within only week 0?

Thank you in advance!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

09 Oct 2023, 13:06

First, there is a contradiction between what you show as the desired result and what you say you want. You say you want to average over days < 0. But the results you show have the average over days <= 0. In this case, I'll assume you really want what you show, not what you said. If you meant what you said, you can modify the code accordingly.

Code:

foreach v of varlist tib_2 tst_2 { by id (days), sort: egen baseline = mean(cond(days <= 0, `v', .)) by id (days): replace `v' = baseline if days <= 0 drop baseline } by id week (days), sort: egen awt = count(days) by id week (days): replace awt = 1 if week > 0 drop if days < 0

This reproduces the desired results you show.

But be careful how you use this. I don't know what tib_2 and tst_2 are, nor how you will use them. But because these variables, for week 0, are now averages, rather than individual day observations, if you use them as outcome variables in a regression, you will be introducing heteroscedasticity. To handle this type of heteroscedasticity, you should use -aweight-s in the regression to correct for this. It is for this reason that the code also generates the awt_* variables. (If you are using them as predictor/explanatory variables, then you do not need the -aweight-s.

Last edited by Clyde Schechter; 09 Oct 2023, 13:08. Reason: Changed code, taking the -by id week(days)... commands out of the loop, as they are only needed once.
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#3

09 Oct 2023, 13:23

Dear Clyde
Thank you very much, from a quick run this looks like exactly what I wanted (and you were right to assume that I wanted avg over days <= 0). Is this what is sometimes referred to as a loop?

Thank you very much for alerting me to the issue on heteroscedasticity - my supervisors have not mentioned this when they suggested this method, and I do indeed plan to use these variables as outcomes.
If you could point me to how I include -aweight-s in a -mixed- command, that would be great.

Thanks again, much appreciated!

Last edited by Anne Christensen; 09 Oct 2023, 13:29.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

09 Oct 2023, 13:35

OK, I was thinking of a simple OLS regression. To handle this in -mixed- is a bit more complicated, as -mixed- does not allow -aweight-s.

Code:

mixed ...., residuals(, by(week)) ...

This will be "overkill" to some extent in that Stata will estimate a separate residual variance for each level of the week variable instantiated in the data. That is only a problem to the extent that it will slow down the calculations somewhat. But I wouldn't worry about that: the sample sizes of data sets with data being directly measured on human subjects are seldom large enough for this slowdown to be a real problem. It's more of a problem with huge databases.

But let me ask why your supervisors recommended doing this in the first place? It would really be simpler to just use all of the original observations. What is gained by this aggregation of the data in week 0?
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#5

10 Oct 2023, 01:35

Okay, thank you for that information. If I plan to use days, not weeks as my time variable, will it still make sense to include residuals for week? The -mixed- command will look something like:
mixed tst_2 c.days##p1_random_group|| id: days , robust (the study is an RCT).

May I just ask again if that foreach command is what is sometimes referred to as a loop? I am asking because I am new to working with long form data, and these foreach commands come up regularly when I am trying to fix something - but the code is quite unintuitve to me, compared to other Stata codes, so I have been hesitant to use it.

Well, I suppose there are 2 main reasons. For our primary outcomes, we have only one baseline measurement. However, for these data, the baseline is a sleep diary completed every day for 7 days (ideally, some have fewer days, some actually have more because their baseline completion is also contingent on other data being complete), and for some of our results we need to report just one baseline value. Related, sleep diary data are generally reported and analyzed on a weekly mean basis in the litterature (we are interested in means sleep times, not individual days), but I guess for -mixed- commands that is not necessarily relevant.

I realize we could still use all values for the -mixed- command, but the rationale there is that the data before day 0 are pre-treatment data (treatment begins on day 1), in which we have no reson to believe that there is any (meaningful) association between days/time and our outcome - or at least that is not what we want to model. Days will be the time variable in the mixed command (not weeks), and so from my understanding, if we use non-aggregated data on days -7 to 0, we would also be modelling the association of these days with change in outcome - correct?
We are considering some type of sensitivity analysis to model the effect of baseline time though, that is,- the effect of just completing sleep diary with no treatment.

I am very grateful for any thoughts you have on this.
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#6

10 Oct 2023, 05:01

Edit just in case anyone else will use this: 'drop if days < 0' only works if every ID has data on day 0 - if not, you need to include if-statements to avoid them being dropped.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#7

10 Oct 2023, 11:13

If I plan to use days, not weeks as my time variable, will it still make sense to include residuals for week?

Yes, because it is the week, not the day, that determines whether the observation represents a single observation or an average of several.

May I just ask again if that foreach command is what is sometimes referred to as a loop?

Yes, -foreach- is Stata's main command for looping. (There is also -forvalues-, and, much less commonly used, -while-.) -foreach- is really modeled on the looping commands (often also called -foreach-, or just -for- or something like that) widely used in nearly all major programming languages. I think many people have difficulty grasping the concept of looping. Once you have that down, the specific syntax of -foreach- will feel natural (or, as natural as working with local macros feels to you.)

Concerning the explanation for aggregating the baseline week to a single observation but having daily observations for the others, I don't find it persuasive. First, the representation of the association of outcomes with time isn't married to the frequency with which observations are recorded in the data. You can have daily observations and still represent time in weeklong blocks if that is appropriate for the time-outcome relationships. In fact, you could represent those relationships using time represented in monthlong or yearlong blocks, or whatever. And when you say that it is conventional in sleep studies to use weeklong averages, that suggests that you should be aggregating the post-baseline data as well.

Now, you have raised, perhaps unintentionally, another point that bears on this issue. If the baseline data are from a sleep diary recorded by the participant, but the subsequent data are from direct observation, then you have yet another source of heteroscedasticity. So even if the seven baseline observations were retained (or, doing it the other way, even if all the data were aggregated to weeks), you would still need to do the residuals separately by week. (Or by an indicator variable that distinguishes the baseline from other observations.)
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#8

10 Oct 2023, 11:32

Originally posted by Clyde Schechter View Post

Yes, because it is the week, not the day, that determines whether the observation represents a single observation or an average of several.

Okay thank you - and it is not a problem that week is not otherwise included in the -mixed- command?

Thank you for the clarifying comment on -foreach-. Looks like I have to get familiar with it.

Originally posted by Clyde Schechter View Post

Concerning the explanation for aggregating the baseline week to a single observation but having daily observations for the others, I don't find it persuasive. First, the representation of the association of outcomes with time isn't married to the frequency with which observations are recorded in the data. You can have daily observations and still represent time in weeklong blocks if that is appropriate for the time-outcome relationships. In fact, you could represent those relationships using time represented in monthlong or yearlong blocks, or whatever. And when you say that it is conventional in sleep studies to use weeklong averages, that suggests that you should be aggregating the post-baseline data as well.

In a sense I think this is true, we should perhaps aggregate by week throughout the analyses on sleep diary outcomes, based purely on the general convention in the literature - however, when we have daily data available (and these are recorded by the participant throughout btw, so no additional heteroscedasticity there), is it not correct that it is more powerful to use these rather than collapse them in weeks?

I don't see why it isn't relevant that the hypothesis we want to test pertains to the association of treatment time and effect, and not baseline time. If I were to model time beginning from eg. day -7 to day 50, but I am really only interested in what happens between day 0 to 50 (only here do I assume there is an association), then why model the effect from day -7 to 0 as well, what do I gain?
I guess you could say then why have a 7-day mean baseline, but the rationale is that such a baseline is less sensitive to outliers and thus hopefulle more accurate for the general pre-treatment sleep of the participant.

I am very new to MLM, so I might be completely off - I appreciate your thoughts.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#9

10 Oct 2023, 11:46

I don't see why it isn't relevant that the hypothesis we want to test pertains to the association of treatment time and effect, and not baseline time. If I were to model time beginning from eg. day -7 to day 50, but I am really only interested in what happens between day 0 to 50 (only here do I assume there is an association), then why model the effect from day -7 to 0 as well, what do I gain?
I guess you could say then why have a 7-day mean baseline, but the rationale is that such a baseline is less sensitive to outliers and thus hopefulle more accurate for the general pre-treatment sleep of the participant.

Somewhere in your model, there will be a time under treatment variable that represents how effective you expect the treatment to be at that time. Typically this is just a dichotomous variable that is 0 before treatment begins and 1 afterwards. But it doesn't have to be that. If one expects treatment effectiveness to rise gradually after treatment starts and then, say, reach a plateau at week 3, you can build a variable that does that. If you expect treatment effectiveness to spike one week after treatment begins and then decay gradually thereafter, you can build a variable that does that. Whatever relationship between time and treatment effectiveness you can imagine, you can build a variable that reflects that. And it is that variable, not the frequency of the data, that will model treatment effectiveness in your study. The frequency of the data does not constrain the way you model the association between treatment effect and time, except that you cannot model treatment effect at a finer level of precision than the frequency of the data.

So, I'm not suggesting you model time from day -7 to day 50. I'm suggesting you include daily data from day -7 to day 50. But you can still model the effects of treatment as only beginning at day 0.

From a purely modeling and statistical perspective, I think it would be best to use the daily data throughout. It does provide better statistical precision than aggregating the data to the weekly level. The only reason I suggested aggregating all the data is that you said it is conventional to do with sleep data. (But disciplinary conventions often lack a good statistical rationale, and are often, statistically, counterproductive.)
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#10

10 Oct 2023, 12:01

Thank you Clyde. So, just to make sure I understand, if I were to include all available data and not aggregate the baseline, but not model the effect of time as beginning from before treatment, then I would have to use another variable than days (here) to set time in my mixed command, right?

From the discussions we have had thus far in the group, the rationale of simply using the days variable is that we suppose the effect of treatment to increase somewhat linearly with time - though it could look very different for each participant (I have understood this to be related to the fact that we expect random slopes to be the best fit - not sure if that is accurate). In fact we haven't discussed any other way, certainly not a dichotomous time variable. I do understand that our time variable doesn't need to be base don 'actual' time, as it is currently, but I have to say this is still confusing to me, and I suspect it is because I am still quite unfamiliar with how the analyses, or the -mixed- command works.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#11

10 Oct 2023, 12:46

Thank you Clyde. So, just to make sure I understand, if I were to include all available data and not aggregate the baseline, but not model the effect of time as beginning from before treatment, then I would have to use another variable than days (here) to set time in my mixed command, right?

Yes, that's precisely what I'm saying.

Based on what you say about expecting a roughly linear effect once treatment begins, you could construct the rx_time variable from the days variable as -gen rx_time = max(0, days). This would set rx_time to 0 for all days in baseline (because they are days -7 through 0) and then rise linearly with days thereafter. And you can model a random slope for this variable as well if that seems appropriate.

Multi-level modeling is not simple, neither in concept nor in implementation. So don't be too hard on yourself. If this is your first time doing it, you should expect it to be difficult and confusing. It takes a fair amount of experience with this to become comfortable with it. And, unfortunately, translating the concepts into code also requires a learning curve. So you're trying to learn two complicated things simultaneously.
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#12

11 Oct 2023, 04:00

Thank you very much for this tip, also for the words of encouragment. Well needed!
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#13

31 Oct 2023, 08:42

Hi, I hope it is OK with a follow up to this topic.

I am comparing different covariance structures based on LL, AIC and BIC following recommendations in Rabe-Hesketh and Skrondal (Multilevel and longitudinal modeling using stata). However, using the rx_time variable as suggested above (which otherwise works well) is not possible when specifying an autoregressive (AR 1) structure due to repeated time values within lowest-level panels (mixed command: mixed tib_2 rx_time|| pilot_id: rx_time, residuals(ar 1, t(rx_time)).

Is there any chance you Clyde, or someone else could help me understand why that is only a problem for that covariance structure? If I specify "if rx_time >0" the command works, but that would be a different model then right?

Thank you in advance.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#14

31 Oct 2023, 09:35

So, it seems your data has multiple observations with rx_time = 0 for some values of pilot_id. The reason things fail only when you try autoregressive residual covariance estimation is that it is undefinable in this situation. Autoregressive residual structure means estimating the correlation between the residual at time t and the residual for the same pilot_id at time t-1. But what happens at time 1? There is more than one time 0 residual for the same pilot_id: which of those residuals should be taken into account for this calculation? There is no way to answer that question. So the time1:time0 correlation is undefinable.

If you restrict to rx_time > 0, then, yes, you are fitting a different model. Or, more precisely put, you are fitting your model to a subset of the data that excludes time 0, so its results are not generalizable to time 0.
Comment
Anne Christensen

Join Date: Apr 2023

Posts: 28
#15

01 Nov 2023, 02:54

Thank you for explaining why this is problematic for the autoregressive structure, that makes sense.

This is related to a question I have about which consequences different operationalisations of time have for this model

What I am thinking them is that if we decide on the AR 1 structure, we need to set up the data for that, this is, have only one observation for 'day 0' pr ID - can you perhaps tell me what, if any, difference this makes compared to a model including the non-aggregated observations for days -10 to 0 for example, that is simply using rx_time as you helped define above (rx_time = max(0, days)?
Reminder, my data look like this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float pilot_id double mcheck_createddate float(days rx_time week bedtime) 1 . . 0 . . 2 1978315739066 -7 0 0 82087504 2 1978405251349 -6 0 0 82087504 2 1978498510312 -5 0 0 82087504 2 1978575898339 -4 0 0 82087504 2 1978663788416 -3 0 0 82087504 2 1978747364817 -2 0 0 82087504 2 1978840544624 -1 0 0 82087504 2 1978925071258 0 0 0 82087504 2 1979024778656 1 1 1 8.61e+07 end format %tc mcheck_createddate format %tcHH:MM bedtime

Essentially I am unsure if, by definding rx_time this way, is day 0 the defined/modelled (unsure of terminology) as a mean of previous days anyway? In other words, how does it influence our analyses, eg the statistical power, whether we whether we collapse the baseline days into 1 row/observation, or keep them but model time as rx_time?

Thank you very much.
Comment

Announcement

Collapse/generate mean for only some rows in long format?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment