  • How to balance this dataset

    Hi all

    I am having a hard time thinking about the best way to balance this dataset. The format is:
    plan ID State Year Drug
    1a AL 2012 Abilify
    1a AL 2012 Humalog
    2a AL 2012 Abilify
    2a AL 2012 Novolog
    1a AL 2013 Abilify
    1a AL 2013 Humalog
    1a AL 2013 Humira
    I need each plan/state/year to have an observation for every drug that is listed on any plan in that year. In my example table, Plan 1a in AL in 2012 would need to also have a row for Novolog, because Plan 2a has Novolog in 2012. Plan 2a in AL in 2012 would conversely need a row for Humalog, because Plan 1a has it in that year.

    Any advice for how I could code this? Much appreciated, thank you!

  • #2
    The substance of this is handled with the -fillin- command. However, you don't want to completely rectangularize the data, because you don't need any Novolog observations in 2013. So you want to rectangularize your data set separately for each year. Unfortunately, -fillin- does not support the -by- prefix. This is a job for -runby-.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3(planid state) int year str7 drug
    "1a " "AL " 2012 "Abilify"
    "1a " "AL " 2012 "Humalog"
    "2a " "AL " 2012 "Abilify"
    "2a " "AL " 2012 "Novolog"
    "1a " "AL " 2013 "Abilify"
    "1a " "AL " 2013 "Humalog"
    "1a " "AL " 2013 "Humira"
    end
    
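    * Rectangularize the data for a single year (-runby- calls this once per year)
    capture program drop one_year
    program define one_year
        fillin planid state year drug
        exit
    end
    
    * Apply one_year to each year separately and stack the results
    runby one_year, by(year)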
    -runby- is written by Robert Picard and me, and is available from SSC.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    • #3
      Replying to Clyde Schechter (#2):
      Thank you for responding! So, would I need to manually enter every drug name for each plan/state/year? My dataset has 38E6 observations so that is not really an option unfortunately.

      • #4
        No, no, no, no, no. Everything in the code from *Example generated by dataex down through the first -end- command is just a way to get some example data into Stata to illustrate the code. Replace all of that with just -use-ing your actual data set. In other words, start from where it says -capture program drop one_year- and take it from there after loading your data set into Stata.

        -dataex- is a convenience command for Statalist users. What you saw there is a short block of code that loads a small data set into Stata. -dataex- is a program that produces code like that from a real data set. That way people asking questions on Statalist can show an example of their data in a way that adequately conveys all the information necessary for others to work with it, and those who answer questions can use it to replicate that example data in their own Stata setup to try out code on it.
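
        As a minimal sketch of that workflow, assuming the real data live in a file called plans.dta (a hypothetical name), that the variables are named as in the example, and that -runby- has been installed from SSC:

        ```stata
        * Assumption: "plans.dta" is a placeholder for your actual data set
        use "plans.dta", clear

        * Rectangularize one year's worth of data
        capture program drop one_year
        program define one_year
            fillin planid state year drug
            exit
        end

        * Run one_year separately for each year
        runby one_year, by(year)
        ```

        -fillin- also adds a variable named _fillin, equal to 1 in the observations it created and 0 in the original ones, so the added rows can be identified afterwards.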

        • #5
          Replying to Clyde Schechter (#4):
          Haha I must be sleep deprived, sorry for the misunderstanding! I am trying to run the code now, and it has been running for almost an hour. Would you expect that command to take a while with a large dataset?

          • #6
            In a data set with 38,000,000 observations, yes, I would expect this to take a very long time. It might well be days, rather than hours. If, by the time you get this, it has not finished and you are concerned that it is hung, and are willing to start over, add the -status- option to the -runby- command. That way you will get a periodic progress report showing how many observations have been processed and an estimate of the remaining time.
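
            For reference, the restarted call would look something like this, assuming the data are in memory and one_year is defined as in #2:

            ```stata
            * Same call as in #2, with periodic progress reports switched on
            runby one_year, by(year) status
            ```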

            • #7
              Thanks for the continued help here. The code finally finished running but I got a r(3900) error:

              store_data(): 3900 unable to allocate string <tmp>[1614142241,1]
              runby_main(): - function returned error
              <istmt>: - function returned error

              Any ideas?

              • #8
                Not a clue. I've never seen that before. Sounds like a memory issue, but I can't be sure. Did you get results, or did the code stop without finishing the job?

                • #9
                  It didn't have any other output than the message I copied in my previous post, but it seemed like the code was able to run? Not exactly sure. It did say "end of do-file" so I think that means it finished.

                  • #10
                    I'm pretty sure it's a memory issue, since the first of those messages says that it tried to create a 1.61 billion by 1 matrix of strings in Mata and failed. Which makes perfect sense to me.

                    There are two possibilities here. One is that the resulting data set would be too large for Stata no matter how you tried to build it. In that case, it isn't going to happen and we can quit trying now. So you should do a back of the envelope calculation of the number of observations that will be in the resulting data set and compare that to the limit for your flavor of Stata. (See -help limits- to find how many observations you can have.) Also make sure you have enough space on your mass storage device to save the file once it is created in memory.
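
                    One way to do that calculation is sketched below. It relies on the fact that -fillin planid state year drug-, run within a single year, produces every combination of the planid, state, and drug values seen in that year, so the per-year count is the product of the three distinct counts. The example data from #1 stand in for the real file, and the names first_*, n_*, and expected_obs are just illustrative.

                    ```stata
                    * Example data from #1 (replace with -use- of the real file)
                    clear
                    input str3(planid state) int year str7 drug
                    "1a " "AL " 2012 "Abilify"
                    "1a " "AL " 2012 "Humalog"
                    "2a " "AL " 2012 "Abilify"
                    "2a " "AL " 2012 "Novolog"
                    "1a " "AL " 2013 "Abilify"
                    "1a " "AL " 2013 "Humalog"
                    "1a " "AL " 2013 "Humira"
                    end

                    * Tag the first occurrence of each planid, state, and drug within a year
                    foreach v in planid state drug {
                        bysort year `v': gen byte first_`v' = (_n == 1)
                    }

                    * Distinct values per year, then their product: the number of
                    * observations -fillin- would leave in memory for that year
                    collapse (sum) n_planid = first_planid n_state = first_state n_drug = first_drug, by(year)
                    gen double n_filled = n_planid * n_state * n_drug
                    list, noobs

                    * Total across all years
                    summarize n_filled, meanonly
                    scalar expected_obs = r(sum)
                    display %15.0fc expected_obs
                    ```

                    For the seven-observation example this gives 6 observations for 2012 and 3 for 2013, or 9 in total. Note that -collapse- replaces the data in memory, so run this before the main job or wrap it in -preserve- and -restore-.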

                    More optimistically, it can be done but needs another way that will be gentler on memory requirements along the way. So what I would do is break up the file you are starting from into several smaller files. Each of the smaller files should consist of all observations for a selected range of years. So, if the years in your data range from, say, 2000-2020, I would make one data set for 2000-2004, another for 2005-2009, etc. Then run the code from #2 separately for each of these smaller data sets. And then append all the results together.

                    Added: Oh, and remember to add the -status- option to the -runby- command so you can see how things are progressing as you run.
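
                    A rough sketch of the split-and-append approach follows. The file names plans.dta and plans_balanced.dta are hypothetical, and the five-year chunk width simply mirrors the example above; whether chunks of that size are small enough for the available memory is something to check by trial. Run it as a single do-file so the temporary files survive until the final -append-.

                    ```stata
                    capture program drop one_year
                    program define one_year
                        fillin planid state year drug
                        exit
                    end

                    local source "plans.dta"   // assumption: placeholder for the real file
                    local width 5              // years per chunk, an arbitrary choice

                    * Find the range of years without loading the other variables
                    use year using "`source'", clear
                    summarize year, meanonly
                    local first = r(min)
                    local last  = r(max)

                    local chunks
                    forvalues start = `first'(`width')`last' {
                        local stop = `start' + `width' - 1

                        * Load only the observations for this range of years
                        use if inrange(year, `start', `stop') using "`source'", clear
                        if _N == 0 {
                            continue
                        }

                        * Rectangularize each year in the chunk, with progress reports
                        runby one_year, by(year) status

                        tempfile chunk`start'
                        save "`chunk`start''"
                        local chunks `chunks' "`chunk`start''"
                    }

                    * Stack the processed chunks and save the result
                    clear
                    append using `chunks'
                    save "plans_balanced.dta", replace
                    ```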
