Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Turning half-year and quarterly data into monthly data

    Suppose that I have many variables (in the following example, only one variable x1 is present) for a lot of companies (id) across many years (year).
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int id float tm long x1
    1101 564        .
    1101 565        .
    1101 566        .
    1101 567        .
    1101 568        .
    1101 569 11922113
    1101 570        .
    1101 571        .
    1101 572        .
    1101 573        .
    1101 574        .
    1101 575  9817626
    1101 576        .
    1101 577        .
    1101 578 12423117
    1101 579        .
    1101 580        .
    1101 581 11760556
    1101 582        .
    1101 583        .
    1101 584 10615331
    1101 585        .
    1101 586        .
    1101 587 13033535
    1102 564        .
    1102 565        .
    1102 566        .
    1102 567        .
    1102 568        .
    1102 569  6037845
    1102 570        .
    1102 571        .
    1102 572        .
    1102 573        .
    1102 574        .
    1102 575  6207117
    1102 576        .
    1102 577        .
    1102 578 12815413
    1102 579        .
    1102 580        .
    1102 581 20180703
    1102 582        .
    1102 583        .
    1102 584 17554606
    1102 585        .
    1102 586        .
    1102 587 16931368
    end
    format %tm tm
    In the year 2007 (and before), the data on x1 (and many others) are in half-year frequency; while in the year 2008 (and after) the data on x1 (and many others) are in quarterly frequency.
    I'd like to construct monthly data so that in the same half-year (2007 and before) or the same quarter, the monthly observations are the same as the end observations of the half-year or quarterly observations. Any suggestions?
    Ho-Chuan (River) Huang
    Stata 17.0, MP(4)

  • #2
    The expected results are "y1" as below.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int id float tm long(x1 y1)
    1101 564        . 11922113
    1101 565        . 11922113
    1101 566        . 11922113
    1101 567        . 11922113
    1101 568        . 11922113
    1101 569 11922113 11922113
    1101 570        .  9817626
    1101 571        .  9817626
    1101 572        .  9817626
    1101 573        .  9817626
    1101 574        .  9817626
    1101 575  9817626  9817626
    1101 576        . 12423117
    1101 577        . 12423117
    1101 578 12423117 12423117
    1101 579        . 11760556
    1101 580        . 11760556
    1101 581 11760556 11760556
    1101 582        . 10615331
    1101 583        . 10615331
    1101 584 10615331 10615331
    1101 585        . 13033535
    1101 586        . 13033535
    1101 587 13033535 13033535
    1102 564        .  6037845
    1102 565        .  6037845
    1102 566        .  6037845
    1102 567        .  6037845
    1102 568        .  6037845
    1102 569  6037845  6037845
    1102 570        .  6207117
    1102 571        .  6207117
    1102 572        .  6207117
    1102 573        .  6207117
    1102 574        .  6207117
    1102 575  6207117  6207117
    1102 576        . 12815413
    1102 577        . 12815413
    1102 578 12815413 12815413
    1102 579        . 20180703
    1102 580        . 20180703
    1102 581 20180703 20180703
    1102 582        . 17554606
    1102 583        . 17554606
    1102 584 17554606 17554606
    1102 585        . 16931368
    1102 586        . 16931368
    1102 587 16931368 16931368
    end
    format %tm tm
    Ho-Chuan (River) Huang
    Stata 17.0, MP(4)

    Comment


    • #3


      It seems you want to fill missing monthly values with the next non missing value (half-year before 2007; quarterly after).
      I hope you know why you're that, because giving arbitrary values to missing values is generally not a good idea.

      Anyway here with your data you can do:
      Code:
      gen year = yofd(dofm(tm))
      gen x2=.
      bysort id year (tm) : replace x2=x1[6] if _n<=6  & year<2008
      bysort id year (tm) : replace x2=x1[_N] if _n<=12 & _n>6  & year<2008
      bysort id year (tm) : replace x2=x1[3] if _n<=3  & year>=2008
      bysort id year (tm) : replace x2=x1[6] if _n<=6 & _n>3  & year>=2008
      bysort id year (tm) : replace x2=x1[9] if _n<=9 & _n>6  & year>=2008
      bysort id year (tm) : replace x2=x1[12] if _n<=12 & _n>9  & year>=2008
      
      format %13.0g x2
      But again, I doubt it would be really useful,
      Best,
      Charlie

      Comment


      • #4
        Originally posted by Charlie Joyez View Post

        It seems you want to fill missing monthly values with the next non missing value (half-year before 2007; quarterly after).
        I hope you know why you're that, because giving arbitrary values to missing values is generally not a good idea.

        But again, I doubt it would be really useful,
        Best,
        Charlie
        Charlie, Thanks for the suggestion. It seems to work with the example but I need to test your suggested procedure in a much larger dataset. In fact, it is not uncommon to replace (missing, unavailable) monthly observations (in the same quarter or half-year) with observed data of quarterly or half-year frequency.
        Ho-Chuan (River) Huang
        Stata 17.0, MP(4)

        Comment


        • #5
          You can do this easily. Here is another way using mipolate (SSC): For more on mipolate, see its help or http://www.statalist.org/forums/foru...-interpolation or other mentions here.


          Code:
          clear
          input int id float tm long(x1 y1)
          1101 564        . 11922113
          1101 565        . 11922113
          1101 566        . 11922113
          1101 567        . 11922113
          1101 568        . 11922113
          1101 569 11922113 11922113
          1101 570        .  9817626
          1101 571        .  9817626
          1101 572        .  9817626
          1101 573        .  9817626
          1101 574        .  9817626
          1101 575  9817626  9817626
          1101 576        . 12423117
          1101 577        . 12423117
          1101 578 12423117 12423117
          1101 579        . 11760556
          1101 580        . 11760556
          1101 581 11760556 11760556
          1101 582        . 10615331
          1101 583        . 10615331
          1101 584 10615331 10615331
          1101 585        . 13033535
          1101 586        . 13033535
          1101 587 13033535 13033535
          1102 564        .  6037845
          1102 565        .  6037845
          1102 566        .  6037845
          1102 567        .  6037845
          1102 568        .  6037845
          1102 569  6037845  6037845
          1102 570        .  6207117
          1102 571        .  6207117
          1102 572        .  6207117
          1102 573        .  6207117
          1102 574        .  6207117
          1102 575  6207117  6207117
          1102 576        . 12815413
          1102 577        . 12815413
          1102 578 12815413 12815413
          1102 579        . 20180703
          1102 580        . 20180703
          1102 581 20180703 20180703
          1102 582        . 17554606
          1102 583        . 17554606
          1102 584 17554606 17554606
          1102 585        . 16931368
          1102 586        . 16931368
          1102 587 16931368 16931368
          end
          format %tm tm
           
          mipolate x1 tm, by(id) gen(y2) backward
          
          assert y1 == y2
          But I agree with Charlie. For this kind of data, the procedure is arbitrary, the increase in sample size is illusory and the correlation (dependence) structure of the result is meaningless.

          I don't doubt that this is a common procedure in some fields, presumably because people have data on different bases which they wish to relate. The more easily defensible procedure is to aggregate monthly data to coarser resolution.

          Comment


          • #6
            In fact, it is not uncommon to replace (missing, unavailable) monthly observations (in the same quarter or half-year) with observed data of quarterly or half-year frequency.
            Ok, so good for you, but I guess it depends what you aim at. Because you are transforming some discrete interval into a (sort of) continuous one, but with large discontinuities, and stagnation parts, which could be confusing, because the "real" trend is probably smoother than the one here constructed.

            Anyway, the code should be easily extended to several variables, if the break between 2007 and 2008 periodicity of values is the same. A loop could be useful for the extension e.g:

            Code:
            foreach v in x1 y1 x2 y2 .... { /* Add all variables you want*/ 
             bysort id year (tm) : replace `v'=`v'[6] if _n<=6  & year<2008 bysort id year (tm) : replace `v'=`v'[_N] if _n<=12 & _n>6  & year<2008 bysort id year (tm) : replace `v'=`v'[3] if _n<=3  & year>=2008 bysort id year (tm) : replace `v'=`v'[6] if _n<=6 & _n>3  & year>=2008 bysort id year (tm) : replace `v'=`v'[9] if _n<=9 & _n>6  & year>=2008 bysort id year (tm) : replace `v'=`v'[12] if _n<=12 & _n>9  & year>=2008 }
            The size of the dataset shouldn't change anything as long as the id / tm observation is unique.

            Best,
            Charlie

            Comment


            • #7
              Originally posted by Nick Cox View Post
              You can do this easily. Here is another way using mipolate (SSC): For more on mipolate, see its help or http://www.statalist.org/forums/foru...-interpolation or other mentions here.

              But I agree with Charlie. For this kind of data, the procedure is arbitrary, the increase in sample size is illusory and the correlation (dependence) structure of the result is meaningless.

              I don't doubt that this is a common procedure in some fields, presumably because people have data on different bases which they wish to relate. The more easily defensible procedure is to aggregate monthly data to coarser resolution.
              Nick, thanks again. I agree with you that aggregation might be a better way.
              Ho-Chuan (River) Huang
              Stata 17.0, MP(4)

              Comment

              Working...
              X