  • Grouped from ungrouped data

    Hi all,

    I have a data set with 2.2 million observations (individuals), nested in 28 countries, spanning 1973 to 2017. Because of the size of the dataset and the complexity of the models, running them on the individual-level data is taking far too long. Fortunately, the outcome is binary, so it is possible to group the data into unique covariate patterns with no danger to inference, hopefully reducing the data set considerably.

    Unfortunately, I have no idea how to get from ungrouped to grouped data based on these covariate patterns. I actually only have three individual level covariates (age, gender, education), and not sure how the country-level variables can be factored into this. Does anyone have any idea on how to group the data like this?


  • #2
    Code:
    help contract
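
    As a minimal sketch of what -contract- does in this setting, the following uses simulated data. Every variable name (country, year, age, female, educ, y) is an assumption for illustration and is not taken from the original poster's data. Each row of the contracted dataset is one unique combination of covariates and outcome, with n counting how many individuals share it. Country-level variables are constant within a country-year, so keeping country and year in the variable list keeps them identifiable; they could also be listed explicitly or merged back in afterwards.

    Code:
    * simulated stand-in for the individual-level data (all names hypothetical)
    clear
    set seed 12345
    set obs 100000
    generate byte country = ceil(28*runiform())
    generate int  year    = 1973 + floor(45*runiform())
    generate byte age     = 18 + floor(60*runiform())
    generate byte female  = runiform() < 0.5
    generate byte educ    = ceil(3*runiform())
    generate byte y       = runiform() < invlogit(-1 + 0.02*age + 0.3*female)

    * individual-level fit, for comparison
    logit y age female i.educ

    * one observation per covariate-and-outcome pattern, n = frequency
    contract country year age female educ y, freq(n)
    count

    * same model on the contracted data, using n as a frequency weight
    logit y age female i.educ [fweight = n]

    The weighted fit should reproduce the individual-level estimates. How much the dataset shrinks depends on the number of distinct patterns: with age in single years crossed with country and year, the reduction may be modest unless age is grouped.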


    • #3
      I just generated 2.2 million observations on 4 variables. The dataset came out at 44 MB (which is nothing), and estimating the probit model took 3.49 seconds (which is not prohibitive).

      This kind of grouping was done a long time ago, when people did their calculations with punched cards on computers that occupied whole floors.

      Given that you are not familiar with the technique of aggregation, you should just do your analysis at the individual level.


      • #4
        Daniel:
        I agree with Joro.
        Given that individuals are nested within countries and your outcome is binary, you can go with -melogit-.
        Last edited by Carlo Lazzaro; 10 Dec 2018, 05:40.
        Kind regards,
        Carlo
        (Stata 18.0 SE)
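
        For readers who want to combine this suggestion with the grouping asked about in post #1, here is a minimal sketch under stated assumptions: the data are simulated, all variable names (country, age, female, y, k, n, u) are hypothetical, and a single random intercept for country is only one possible specification. It collapses the data to binomial form (k successes out of n trials per covariate pattern) and passes the trial count to -melogit- through its binomial() option.

        Code:
        * simulated data: 28 countries, each with its own random intercept u
        clear
        set seed 12345
        set obs 28
        generate byte   country = _n
        generate double u       = rnormal(0, 0.5)
        expand 2000
        generate byte age    = 18 + floor(60*runiform())
        generate byte female = runiform() < 0.5
        generate byte y      = runiform() < invlogit(-1 + 0.02*age + 0.3*female + u)

        * binomial form: k = number of successes, n = number of individuals
        collapse (sum) k = y (count) n = y, by(country age female)

        * random-intercept logit on the grouped data
        melogit k age female || country:, binomial(n)

        Country-level covariates, being constant within the by() groups that include country (and year, if they vary over time), can be added to the by() list and then to the model.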


        • #5
          Thank you Nick.

          Joro: seems like a bizarre comment to make considering you are not aware of the type of model I am running, and only included 4 variables when I said there were many country level variables. I have, of course, run the models I intend to use which have taken many hours without finishing.


          • #6
            Daniel:
            bizarre as they may seem, replies are mostly based on the details provided by the original poster and (often) a bit of guesswork from interested readers.
            Kind regards,
            Carlo
            (Stata 18.0 SE)


            • #7
              If your model is proving hard to fit, my guess is mostly that it's hard to fit. It would be good if contracting to a smaller dataset solved the problem, but I agree with others that we need more detail on what you're trying to do to add further advice. Usually, when a model won't converge easily you need to backtrack to something much simpler and then build in complications more gradually until you get a sense of which predictor or parameterisation is a difficulty too far.
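
              One way to picture that backtracking, as a sketch only: the data below are simulated, every variable name is hypothetical, and the three models are placeholders for whatever sequence fits the real problem. Each step adds one complication and is timed with -timer-, so the step at which fitting becomes slow or fails to converge is easy to spot.

              Code:
              * simulated data with a country random intercept (u) and random slope (b)
              clear
              set seed 12345
              set obs 28
              generate byte   country = _n
              generate double u       = rnormal(0, 0.5)
              generate double b       = rnormal(0, 0.3)
              expand 500
              generate byte age    = 18 + floor(60*runiform())
              generate byte female = runiform() < 0.5
              generate byte y      = runiform() < invlogit(-1 + 0.02*age + (0.3 + b)*female + u)

              timer clear

              * step 1: plain logit, no country structure
              timer on 1
              logit y age female
              timer off 1

              * step 2: add a random intercept for country
              timer on 2
              melogit y age female || country:
              timer off 2

              * step 3: add a random slope on female
              timer on 3
              melogit y age female || country: female
              timer off 3

              * elapsed seconds for each step
              timer list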


              • #8
                Daniel -

                Let me just add that, if a response seems inappropriate to you, perhaps you need to take the author's good intent as a given and ask yourself why they might have written what they did. The answer in this case is that your question in post #1 lacked detail, as Carlo pointed out.

                Unfortunately, I have no idea how to get from ungrouped to grouped data based on these covariate patterns. I actually only have three individual level covariates (age, gender, education), and not sure how the country-level variables can be factored into this. Does anyone have any idea on how to group the data like this?
                I'd suggest you revisit the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, noting especially sections 9-12 on how to best pose your question.

                You'll find that Statalist members are reluctant to accept an assertion of a problem and a desired approach to solving it in the absence of supporting information. In general this comes from a reluctance to give a correct answer to an inappropriate question that fails to address the actual problem, thereby sending the recipient of the advice off in a direction that ultimately proves unhelpful in solving the actual problem.

                Nick suggests as much in post #7, and I expect this is why he gave just a terse response in post #2, limited to the name of the command. Indeed, you might have found that command without his help by reviewing the table of contents of the Stata Data Management Reference Manual PDF included in your Stata installation and accessible from Stata's Help menu.

                I specifically agree with the final sentence in Nick's post #7.


                • #9
                  Daniel, if my comment seems bizarre to you, how about you do the following:

                  1. Load your data and type the following on your Stata command line, and report what Stata is telling you in response, like I have done for the example dataset:

                  Code:
                  . clear
                  
                  . webuse invest2
                  
                  .  xtset company time
                         panel variable:  company (strongly balanced)
                          time variable:  time, 1 to 20
                                  delta:  1 unit
                  
                  . xtdes
                  
                   company:  1, 2, ..., 5                                      n =          5
                      time:  1, 2, ..., 20                                     T =         20
                             Delta(time) = 1 unit
                             Span(time)  = 20 periods
                             (company*time uniquely identifies each observation)
                  
                  Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                                          20      20      20        20        20      20      20
                  
                       Freq.  Percent    Cum. |  Pattern
                   ---------------------------+----------------------
                          5    100.00  100.00 |  11111111111111111111
                   ---------------------------+----------------------
                          5    100.00         |  XXXXXXXXXXXXXXXXXXXX
                  
                  
                  . xtsum
                  
                  Variable         |      Mean   Std. Dev.       Min        Max |    Observations
                  -----------------+--------------------------------------------+----------------
                  invest   overall |   248.957   267.8654      12.93     1486.7 |     N =     100
                           between |             246.9354    42.8915     608.02 |     n =       5
                           within  |             149.9249   -101.363   1127.637 |     T =      20
                                   |                                            |
                  market   overall |  1922.223   1420.783      191.5     6241.7 |     N =     100
                           between |             1491.225     670.91   4333.845 |     n =       5
                           within  |             470.8022   380.5779   3830.078 |     T =      20
                                   |                                            |
                  stock    overall |   311.067   371.5523         .8     2226.3 |     N =     100
                           between |              228.435      85.64    648.435 |     n =       5
                           within  |             309.6505   -334.568   1888.932 |     T =      20
                                   |                                            |
                  company  overall |         3   1.421338          1          5 |     N =     100
                           between |             1.581139          1          5 |     n =       5
                           within  |                    0          3          3 |     T =      20
                                   |                                            |
                  time     overall |      10.5   5.795331          1         20 |     N =     100
                           between |                    0       10.5       10.5 |     n =       5
                           within  |             5.795331          1         20 |     T =      20
                  
                  .

                  2. Then copy and paste one or more of the models that you have run and that took hours. If you have output showing how they completed, paste the output as well; if you don't, just post the commands that you wrote.
