
  • wide to long I think?

    I am attempting to create a data set that collapses a large data set that has multiple entries for each person (identified by FellowID), where the unique observations are identified by ProcID. My ultimate goal is to analyze this data set looking at the volume of entries for each Fellow and assessing predictors that changed the volume (ie year AdminState etc), so I need to create a single entry for each FellowID that includes the year when the volumes happened but an individual variable that summarizes the total volume per ProcDate. I tried rangestat (count) ProcID, by(FellowID) but I don't really have an interval and am not sure whether this will create the volume indicator for me.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int ProgramID str32 ProgramType str2 AdminState int(FellowID YEARBEGIN YEAREND ProcDate) str7 Role str13 Type str43 category str80 subcategory str101 subcat2 long ProcID float(region COVID)
    1020 "Bariatrics/Foregut" "PA" 1951 2016 2017 21001 "Primary" "Laparoscopic" "HPB - Biliary" "Gallbladder" "Cholecystectomy without cholangiogram" 864902 0 0
    1230 "Bariatrics" "TX" 2064 2016 2017 21001 "Primary" "Laparoscopic" "Bariatric" "Sleeve gastrectomy" "n/a" 864913 2 0
    1230 "Bariatrics" "TX" 2064 2016 2017 21001 "Primary" "Laparoscopic" "Bariatric" "Sleeve gastrectomy" "n/a" 864914 2 0
    1230 "Bariatrics" "TX" 2064 2016 2017 21001 "Primary" "Laparoscopic" "Bariatric" "Bariatric procedure, revisional" "Lap-Band Removal w/o Reconstruction" 864915 2 0
    1230 "Bariatrics" "TX" 2064 2016 2017 21001 "Primary" "Endoscopic" "GI Endoscopy" "Endoscopy, upper, diagnostic" "Intraoperative" 864916 2 0
    1115 "Advanced GI MIS" "OH" 2012 2016 2017 21003 "Primary" "Laparoscopic" "Abdominal Wall" "Hernia, inguinal" "Repair with mesh" 865334 1 0
    1115 "Advanced GI MIS" "OH" 2012 2016 2017 21003 "Teach" "Laparoscopic" "Jejunum/Ileum" "Small bowel feeding tube" "n/a" 865335 1 0
    1115 "Advanced GI MIS" "OH" 2012 2016 2017 21003 "Teach" "Laparoscopic" "Jejunum/Ileum" "Small bowel feeding tube" "n/a" 865338 1 0
    1115 "Advanced GI MIS" "OH" 2012 2016 2017 21003 "Teach" "Laparoscopic" "GI Endoscopy" "Endoscopy, upper, therapeutic" "With dilatation of stricture"
    end


  • #2
    Welcome to the forum. It sounds like you are trying to reshape long to wide, since you want a single row for each FellowID. Actually, you almost certainly want to keep your data in long format, since many operations on this kind of data are easier with long data. That said, it's not entirely clear what you're trying to do. I'm having trouble parsing this (run-on) sentence:

    My ultimate goal is to analyze this data set looking at the volume of entries for each Fellow and assessing predictors that changed the volume (ie year AdminState etc), so I need to create a single entry for each FellowID that includes the year when the volumes happened but an individual variable that summarizes the total volume per ProcDate.
    Could you please say a little bit more about how your data is organized? For example, there are multiple entries of FellowID, right? Usually this happens when some other variable is clustered within FellowID, like years.

    Second, could you please say a little more about what you are trying to do? It sounds like you are trying to make a new column that provides some extra information about different FellowIDs, but I have no idea what "total volume" means in this context.



    • #3
      Thanks for the post. I think you are correct that I want to keep the data long. Let me clarify my goals, as I still have some outstanding questions. My goal was to create a volume variable (how many procedures each FellowID had performed), which I was ultimately successful in doing using the code below:

      // gen number of cases variable for each Fellow
      levelsof FellowID          // display each individual fellow
      gen casevolume = .         // create case volume counter
      quietly {
          foreach lev in `r(levels)' {
              di `lev'
              di _n _n as txt "FellowID = " as result `lev'
              count if FellowID == `lev'
              replace casevolume = `r(N)' if FellowID == `lev'
          }
      }

      Each FellowID has multiple ProcID to account for each different procedure that same fellow performed.

      Now in my long data set each entry has a volume variable (it is the same value for every row of a given FellowID). My ultimate goals are as follows:
      1. evaluate predictors of increased or decreased case volume looking at region, time, COVID
      2. Did the types of cases (var category) change following COVID?

      Is it better to keep the data set long in this situation and simply use code like this to evaluate goal 1? (Here aboveavg is a binary variable indicating fellows with above- or below-average case counts, relative to the mean of the data set.)

      Code:
      regress casevolume COVID
      regress casevolume i.region
      regress casevolume i.YEARBEGIN i.region
      logistic aboveavg COVID i.region
      regress casevolume COVID i.region

      I believe that for goal 2 above I need to keep the data long, with one row per procedure within each FellowID. Thoughts?
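      For goal 1, one possible sketch (an editorial suggestion, not from the thread, so treat the specifics as assumptions): since casevolume repeats within FellowID in the long layout, the fellow-level regressions could be run on a collapsed copy so that each fellow contributes exactly one observation. This assumes region, COVID, and YEARBEGIN are constant within each fellow.

      Code:
      * Sketch: fellow-level analysis for goal 1 (one row per FellowID).
      preserve
      collapse (first) casevolume region COVID YEARBEGIN, by(FellowID)
      summarize casevolume, meanonly
      gen aboveavg = casevolume > r(mean)   // above-average case volume
      regress casevolume i.COVID i.region
      logistic aboveavg i.COVID i.region
      restore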



      • #4
        Originally posted by Yewande Alimi

        // gen number of cases variable for each Fellow
        levelsof FellowID          // display each individual fellow
        gen casevolume = .         // create case volume counter
        quietly {
            foreach lev in `r(levels)' {
                di `lev'
                di _n _n as txt "FellowID = " as result `lev'
                count if FellowID == `lev'
                replace casevolume = `r(N)' if FellowID == `lev'
            }
        }
        This is not a standard way to deal with these kinds of problems. You want to use the -bysort- prefix, and it is a one-liner.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int ProgramID str32 ProgramType str2 AdminState int(FellowID YEARBEGIN YEAREND ProcDate) str7 Role str13 Type str43 category str80 subcategory str101 subcat2 long ProcID float(region COVID)
        1020 "Bariatrics/Foregut" "PA" 1951 2016 2017 21001 "Primary" "Laparoscopic" "HPB - Biliary"  "Gallbladder"                     "Cholecystectomy without cholangiogram" 864902 0 0
        1230 "Bariatrics"         "TX" 2064 2016 2017 21001 "Primary" "Laparoscopic" "Bariatric"      "Sleeve gastrectomy"              "n/a"                                   864913 2 0
        1230 "Bariatrics"         "TX" 2064 2016 2017 21001 "Primary" "Laparoscopic" "Bariatric"      "Sleeve gastrectomy"              "n/a"                                   864914 2 0
        1230 "Bariatrics"         "TX" 2064 2016 2017 21001 "Primary" "Laparoscopic" "Bariatric"      "Bariatric procedure, revisional" "Lap-Band Removal w/o Reconstruction"   864915 2 0
        1230 "Bariatrics"         "TX" 2064 2016 2017 21001 "Primary" "Endoscopic"   "GI Endoscopy"   "Endoscopy, upper, diagnostic"    "Intraoperative"                        864916 2 0
        1115 "Advanced GI MIS"    "OH" 2012 2016 2017 21003 "Primary" "Laparoscopic" "Abdominal Wall" "Hernia, inguinal"                "Repair with mesh"                      865334 1 0
        1115 "Advanced GI MIS"    "OH" 2012 2016 2017 21003 "Teach"   "Laparoscopic" "Jejunum/Ileum"  "Small bowel feeding tube"        "n/a"                                   865335 1 0
        1115 "Advanced GI MIS"    "OH" 2012 2016 2017 21003 "Teach"   "Laparoscopic" "Jejunum/Ileum"  "Small bowel feeding tube"        "n/a"                                   865338 1 0
        end
        
        bysort FellowID: gen wanted=_N
        See

        Code:
        help bysort
        and

        Code:
        help _variables
        You have multilevel data (repeated observations of the same individuals). If you are using linear regression, at least cluster the errors at the individual level as these observations are not independent. But yes, you will want to keep a long data layout for analysis.
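        As a sketch of that suggestion (the model specification here is an assumption, not from the thread), clustering on FellowID with the per-fellow count from above might look like:

        Code:
        * Rows are procedures, so observations within a fellow are not
        * independent: cluster the standard errors on FellowID.
        bysort FellowID: gen wanted = _N
        regress wanted i.COVID i.region, vce(cluster FellowID)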



        • #5
          Thank you that was very helpful!

