  • Generating variable

    Hi everyone,

    I've been struggling with a project, and would appreciate some help.

    In brief, I’m working with a large healthcare database, and I have to define “hospital volume” for my analysis. The database provides a "Hospital ID" ("PUF Facility ID") that is unique to each hospital, and it contains many entries (different subjects) for the same hospital ID. I’ve used the "egen" command to create a “volume” variable based on the frequency of the hospital ID.
    The data are from 2004-2015, and I also have a “year” (Year of Diagnosis) variable. The issue is that not all hospitals contributed data every year, so I can’t just divide each hospital's number of entries by the total number of years in the period. I need to create a variable holding the number of years each specific hospital contributed, and then average over that number. I’m sure there is a very simple way to do it; I just don’t know how.

    I've copied my code below, along with what it yields:


    egen volume=group(PUF_FACILITY_ID), label

    [Screenshot of the output: Screen Shot 2019-03-06 at 14.20.07.png]


    Now for the "year part". If I tabulate

    bysort volume: tab YEAR_OF_DIAGNOSIS

    I get the following:

    [Screenshot of the output: Screen Shot 2019-03-06 at 14.23.09.png]


    As we can see, some institutions have subjects every year, others do not.

    What I need to complete this is a variable identifying how many years each institution contributed to the data (e.g. 1, 2, 3 years), so I can then divide the volume for each institution by its own number of years.

    I hope this is clear enough. I'm sure there is an easy way to do this, likely using a loop.

    I would appreciate any help.

  • #2
    While there is nothing illegal about -egen volume=group(PUF_FACILITY_ID), label-, the use of the name volume for the resulting variable is at best confusing, and at worst misleading. The -egen, group()- function simply assigns consecutive numbers starting from 1 to the different PUF Facilities in your data set. When you -tab- that variable, you do indeed get "volume" reported in the Freq. column of the output table, but the variable itself doesn't contain anything that could be remotely considered the facility's volume. If you want the number of cases in each facility in each year, you can get that with:

    Code:
    by PUF_FACILITY_ID year, sort: gen cases_here_this_year = _N
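
    To see the difference on a small scale, here is a minimal sketch using made-up data. The names facility_id, facility_group and cases_here are hypothetical stand-ins, not variables from the actual database.

    ```stata
    * Toy data: facility_id stands in for PUF_FACILITY_ID
    clear
    input facility_id year
    101 2004
    101 2004
    101 2005
    250 2004
    250 2006
    end

    * group() only numbers the facilities 1, 2, ... in sorted order
    egen facility_group = group(facility_id)

    * _N under by: is the number of observations in each by-group
    by facility_id, sort: gen cases_here = _N
    by facility_id year, sort: gen cases_here_this_year = _N

    list, sepby(facility_id) noobs
    ```

    In this toy example facility_group is just 1 for facility 101 and 2 for facility 250, whereas cases_here is 3 and 2 respectively, which is the quantity that actually reflects volume.
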
    If you need the number of distinct years that each facility appears in the data set you can do this:
    Code:
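    * Flag the first observation in each facility-year (year = the year-of-diagnosis variable)
    by PUF_FACILITY_ID year, sort: gen n_years = 1 if (_n == 1)
    * Running count of flagged years within facility (sum() treats missing as 0)
    by PUF_FACILITY_ID: replace n_years = sum(n_years)
    * Copy the facility's final count to all of its observations
    by PUF_FACILITY_ID: replace n_years = n_years[_N]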
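
    An alternative that should give the same count uses -egen-'s tag() and total() functions. This is only a sketch on made-up data; facility_id, first_in_year and one_per_facility are hypothetical names.

    ```stata
    * Toy data: facility_id and year stand in for PUF_FACILITY_ID
    * and the year-of-diagnosis variable
    clear
    input facility_id year
    1 2004
    1 2004
    1 2005
    2 2004
    2 2006
    2 2006
    2 2007
    end

    * Flag exactly one observation per facility-year combination
    egen byte first_in_year = tag(facility_id year)

    * Sum the flags within facility: the number of distinct years
    egen n_years = total(first_in_year), by(facility_id)

    * Show one row per facility
    egen byte one_per_facility = tag(facility_id)
    list facility_id n_years if one_per_facility, noobs
    ```

    Here facility 1 appears in 2 distinct years and facility 2 in 3, so n_years is 2 and 3 respectively. Unlike the -by:- version, this does not rely on the current sort order of the data.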


    • #3
      In reply to #2, originally posted by Clyde Schechter:

      Clyde,

      Thank you for your feedback. The second part was exactly what I needed. I had already assigned the total number of "cases" by hospital to every subject as a continuous variable. Using the code you suggested, I was able to count the number of years each institution contributed to the data, and then divide the total number of cases by this. I now have an institutional cases/year value for every subject.

      Once again, thank you.
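
      A minimal sketch of that calculation on made-up data follows. The names facility_id, total_cases and cases_per_year are hypothetical, not variables from the actual database, and the distinct-year count follows the approach from #2.

      ```stata
      * Toy data: facility_id and year stand in for PUF_FACILITY_ID
      * and the year-of-diagnosis variable
      clear
      input facility_id year
      1 2004
      1 2004
      1 2005
      2 2004
      2 2006
      2 2006
      2 2007
      end

      * Number of distinct years per facility, as suggested in #2
      by facility_id year, sort: gen n_years = 1 if _n == 1
      by facility_id: replace n_years = sum(n_years)
      by facility_id: replace n_years = n_years[_N]

      * Total cases per facility, then average cases per contributing year
      by facility_id: gen total_cases = _N
      gen cases_per_year = total_cases / n_years
      ```

      In the toy data, facility 1 has 3 cases over 2 years (1.5 per year) and facility 2 has 4 cases over 3 years (about 1.33 per year), and each subject carries its facility's value.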