  • Reshape? Collapse?

    Hi, everyone! I was wondering if I could get some advice. I am working with the AHRQ's MEPS data, and I am stuck trying to figure out how to shape (for lack of a better word) my dataset into what I want. My data currently looks like this:
    dupersid  condidx  icd9codx  cccodex  ccnum  female     rrace     marry1x    marry2x
    1         1        272       53       53     Male(0)    White(0)  1 MARRIED  1 MARRIED
    1         2        692       253      253    Male(0)    White(0)  1 MARRIED  1 MARRIED
    1         3        460       126      126    Male(0)    White(0)  1 MARRIED  1 MARRIED
    2         1        460       126      126    Female(1)  White(0)  1 MARRIED  1 MARRIED
    2         2        524       136      136    Female(1)  White(0)  1 MARRIED  1 MARRIED
    2         3        599       159      159    Female(1)  White(0)  1 MARRIED  1 MARRIED
    2         4        599       159      159    Female(1)  White(0)  1 MARRIED  1 MARRIED
    2         5        599       159      159    Female(1)  White(0)  1 MARRIED  1 MARRIED
    And I am trying to make a dataset in which all condidx associated with the dupersid is in one row (and just create a flag for a particular condition in my analysis that I am interested in). I have tried to reshape the data many times and have gotten errors each time. The more I look at it, the more I realize that I will not need multiple rounds of marital status, age, etc., although I thought I did when I was cleaning the data last week.

    Any advice/guidance would be great. I've been trying to figure out this problem for a while, and I thought that taking a week off from it would give me a fresh set of eyes, but I am back where I started.

    Thank you!

  • #2
    Seems like you want something like this:

    Code:
    reshape wide icd9codx cccodex ccnum female rrace marry1x marry2x, i(condidx) j(dupersid)
    Please reread the FAQ linked at the top of the forum. You provide data, but you don't give us correctly formatted example data generated with the dataex command, as you are asked. You mention you've tried other code, but you don't post the code as you are asked. You mention that you get errors, but you don't post the errors as you are asked. Help us help you by giving us all of the relevant information!

    • #3
      Your description of what you want to do is not 100% clear. But it sounds like you are thinking of -reshape-ing your data to wide layout. You can do that with:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte(dupersid condidx) int(icd9codx cccodex ccnum) str10 female str9 rrace str10 marry1x str9 marry2x
      1 1 272  53  53 "Male(0) "   "White(0) " "1 MARRIED " "1 MARRIED"
      1 2 692 253 253 "Male(0) "   "White(0) " "1 MARRIED " "1 MARRIED"
      1 3 460 126 126 "Male(0) "   "White(0) " "1 MARRIED " "1 MARRIED"
      2 1 460 126 126 "Female(1) " "White(0) " "1 MARRIED " "1 MARRIED"
      2 2 524 136 136 "Female(1) " "White(0) " "1 MARRIED " "1 MARRIED"
      2 3 599 159 159 "Female(1) " "White(0) " "1 MARRIED " "1 MARRIED"
      2 4 599 159 159 "Female(1) " "White(0) " "1 MARRIED " "1 MARRIED"
      2 5 599 159 159 "Female(1) " "White(0) " "1 MARRIED " "1 MARRIED"
      end
      
      reshape wide icd9codx cccodex ccnum, i(dupersid) j(condidx)
      But here's my advice: don't do it--leave your data as it is. You haven't said what you plan to do in terms of further management or analysis, but there are only a few things in Stata that work better (or even at all) in wide layout. Stata commands generally work best (or only) with data in long layout. Unless you know for a fact that you will be doing something that requires the wide layout, you are going to be better off with the data as it is. Yes, it entails having repetitious variables like sex, race, and (perhaps) marital status that are the same for all observations of the same person. But unless you are using a data set that is straining towards the limits of memory, that is not really a problem.
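
      To make the point about staying in long layout concrete, here is a minimal sketch that builds a person-level flag without reshaping. It is only an illustration under assumptions: the condition of interest is taken to be icd9codx == 599 (picked because it appears in the example data), and the names has599 and first_row are made up for the sketch.

      ```stata
      * Numeric columns of the example data only
      clear
      input byte(dupersid condidx) int(icd9codx cccodex)
      1 1 272  53
      1 2 692 253
      1 3 460 126
      2 1 460 126
      2 2 524 136
      2 3 599 159
      2 4 599 159
      2 5 599 159
      end

      * Person-level flag: 1 on every row of a person with any icd9codx == 599
      * (a missing icd9codx simply counts as "not 599" here)
      bysort dupersid: egen byte has599 = max(icd9codx == 599)

      * Mark one row per person so that person-level summaries are not
      * weighted by the number of condition records
      egen byte first_row = tag(dupersid)
      tabulate has599 if first_row
      ```

      The data stay long, and anything that needs one row per person can be restricted with "if first_row" rather than by reshaping.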

      By the way, it is not a good idea to use abbreviations or jargon here. I'm an epidemiologist in the USA, so I know what AHRQ and MEPS are. But this is a multi-disciplinary international forum, and most of the people here will not. In general, it is best to write posts in language that would be understood by any college-educated adult anywhere in the world. The only specialized knowledge you should assume everyone here shares is basic statistics and at least a little bit about Stata.

      In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      • #4
        Notice the line of code Clyde provides is slightly different than the one I provide. The line I give in #2 is based on the way I interpret the wording here:

        all condidx associated with the dupersid is in one row
        But I initially wrote the line like Clyde has it because that's what it looks like you want from the data. Please take care when deciding which line best fits your problem.

        Edit: it's not within the culture on this forum to "empty quote" to signify agreement with or high quality of a post, but this is nonetheless worth repeating for emphasis:

        But here's my advice: don't do it--leave your data as it is. You haven't said what you plan to do in terms of further management or analysis, but there are only a few things in Stata that work better (or even at all) in wide layout. Stata commands generally work best (or only) with data in long layout.
        Even in cases where it seems like it is convenient to convert from long to wide, usually there are more idiomatic conventions for the long format that you may not be aware of.
        Last edited by Daniel Schaefer; 07 Aug 2023, 09:38.

        • #5
          Originally posted by Clyde Schechter
          Your description of what you want to do is not 100% clear. But it sounds like you are thinking of -reshape-ing your data to wide layout. You can do that with:
          [code]

          But here's my advice: don't do it--leave your data as it is. You haven't said what you plan to do in terms of further management or analysis, but there are only a few things in Stata that work better (or even at all) in wide layout. Stata commands generally work best (or only) with data in long layout. Unless you know for a fact that you will be doing something that requires the wide layout, you are going to be better off with the data as it is. Yes, it entails having repetitious variables like sex, race, and (perhaps) marital status that are the same for all observations of the same person. But unless you are using a data set that is straining towards the limits of memory, that is not really a problem.

          By the way, it is not a good idea to use abbreviations or jargon here. I'm an epidemiologist in the USA, so I know what AHRQ and MEPS are. But this is a multi-disciplinary international forum, and most of the people here will not. In general, it is best to write posts in language that would be understood by any college-educated adult anywhere in the world. The only specialized knowledge you should assume everyone here shares is basic statistics and at least a little bit about Stata.

          In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
          In terms of further management and analysis, I plan to run some basic regression analyses to look at predictors of use and expenditures and also get summary statistics. And in my head, I needed to get the dataset into a format where each ID was a single case (if that makes sense.) I will also make sure to post in the future in the correct format. And I will also make sure to avoid jargon.

          • #6
            Originally posted by Daniel Schaefer
            Seems like you want something like this:

            Code:
            reshape wide icd9codx cccodex ccnum female rrace marry1x marry2x, i(condidx) j(dupersid)
            Please reread the FAQ linked at the top of the forum. You provide data, but you don't give us correctly formatted example data generated with the dataex command, as you are asked. You mention you've tried other code, but you don't post the code as you are asked. You mention that you get errors, but you don't post the errors as you are asked. Help us help you by giving us all of the relevant information!
            I'll make sure to do so in the future. I tried fixing my code and following your example, and I will have to sit down and fix my dataset. I am running into an error saying that dupersid doesn't represent unique IDs.

            • #7
              I plan to run some basic regression analyses to look at predictors of use and expenditures and also get summary statistics.
              Depending on the specifics, this suggests that you might want to reduce the data to one observation per ID, but by aggregating the expenditures and perhaps counting the utilization of key services. You would do that with -collapse-, perhaps after replacing condidx, icd9codx, cccodex, or ccnum with indicator variables for some specific conditions that you want to focus on. But without knowing the details, it's not possible to give more specific advice.

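              For example, if among the predictor variables you would like to use in your regression are the number of transactions with cccodex == 126 or icd9codx == 599, you could do

              Code:
              * 126.cccodex is 1 where cccodex == 126, 0 otherwise, and missing if cccodex is missing
              gen byte code126 = 126.cccodex
              * in the example data, 599 is a value of icd9codx (condidx only runs from 1 to 5)
              gen byte cond599 = 599.icd9codx
              collapse (sum) code126 cond599, by(dupersid)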
              Similarly, summing the expenditures (I do not see an expenditure variable in your example data, so I don't illustrate it) for a given person might make sense as a way of calculating your dependent variable in a one-observation-per-person regression. That would be done the same way.
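
              As a rough sketch of what that could look like, the example below adds a hypothetical expenditure variable, exp_amt, with invented values, since the example data contain no such variable. The names exp_amt, n_code126, and total_exp are assumptions made purely for illustration.

              ```stata
              * Toy data: exp_amt is a HYPOTHETICAL variable with invented values
              clear
              input byte(dupersid condidx) int(icd9codx cccodex) double exp_amt
              1 1 272  53 100
              1 2 692 253  50
              1 3 460 126  25
              2 1 460 126  40
              2 2 524 136  60
              2 3 599 159  10
              end

              * Indicator for the condition of interest, built before collapsing
              gen byte code126 = (cccodex == 126) if !missing(cccodex)

              * One observation per person: count of flagged records, total spending
              collapse (sum) n_code126 = code126 total_exp = exp_amt, by(dupersid)
              list, noobs
              ```

              Note that -collapse- replaces the data in memory, so it is worth saving the long data set first or wrapping the step in -preserve- and -restore-.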

              Evidently, though, you will have to think through the details of exactly what the predictors and outcome metric need to be in order to answer your research questions and tailor your data set accordingly.
              Last edited by Clyde Schechter; 07 Aug 2023, 10:52.

              • #8
                Do you expect condidx and dupersid to uniquely identify the observations? If so, you can double-check by running the following:

                Code:
                isid condidx dupersid
                If you get an error back, then condidx and dupersid don't uniquely identify your rows, but they must in order to reshape. This isn't just an arbitrary limitation of reshape: if condidx and dupersid don't uniquely identify your observations, there is a fundamental logical issue with what you are asking for. I would also check these two variables for missing values. If there are missing values, they may not be automatically dropped (I can't remember offhand whether reshape drops these for you), in which case that is probably your problem.

                Although, this may be a red herring since you might not need to reshape, depending on the details of what you are trying to do.
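
                If the uniqueness check does fail, something along these lines may help locate the offending rows. This is only a sketch: the toy data deliberately contain one duplicated row so the commands have something to report, and dup is a made-up variable name.

                ```stata
                * Numeric columns of the example data, with one row duplicated on purpose
                clear
                input byte(dupersid condidx) int icd9codx
                1 1 272
                1 2 692
                2 1 460
                2 1 460
                end

                * isid exits with an error when the variables do not uniquely identify
                * rows; capture lets a do-file continue and stores the return code in _rc
                capture isid dupersid condidx
                display "isid return code: " _rc

                * Report how many surplus rows share a dupersid-condidx pair
                duplicates report dupersid condidx

                * Tag and list the offending rows
                duplicates tag dupersid condidx, generate(dup)
                list if dup > 0, sepby(dupersid)

                * Count rows where either identifier is missing
                count if missing(dupersid, condidx)
                ```

                Whether duplicated rows should be dropped or kept is a substantive decision about the data, so it is worth looking at the listed rows before changing anything.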
