Grouped from ungrouped data

Daniel Devine

Join Date: Dec 2017

Posts: 21
#1

10 Dec 2018, 05:10

Hi all,

I have a data set with 2.2m observations (individuals), nested in 28 countries, between 1973-2017. Because of the size of the dataset and complexity of models, running them with the individual data is taking far too long. Fortunately, the outcome is binary, and so it is possible to group the data into unique covariate patterns with no danger to inference, hopefully reducing the data set considerably.

Unfortunately, I have no idea how to get from ungrouped to grouped data based on these covariate patterns. I actually only have three individual level covariates (age, gender, education), and not sure how the country-level variables can be factored into this. Does anyone have any idea on how to group the data like this?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

10 Dec 2018, 05:18

Code:

help contract
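
As a rough sketch of what that could look like (not necessarily what Nick had in mind): the example below simulates data loosely shaped like the description in #1, contracts it to unique covariate patterns, and refits the same logit with frequency weights. All variable names (agegrp, female, educ, gdp, n) and all coefficients are made up for illustration. Country-level variables that are constant within country-year can simply be listed in the contract varlist, since they add no new patterns. Note that age in single years would produce many more patterns than the six age groups assumed here.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 1
set obs 28
generate country = _n
expand 45
bysort country: generate year = 1972 + _n    // 1973 to 2017
generate gdp = rnormal()                      // hypothetical country-year variable
expand 1000                                   // 1,000 individuals per country-year
generate byte agegrp = runiformint(1, 6)
generate byte female = runiform() < 0.5
generate byte educ   = runiformint(1, 3)
generate byte y = runiform() < invlogit(-1 + 0.1*agegrp + 0.3*female + 0.2*gdp)

* Individual-level fit, for comparison
logit y i.agegrp i.female i.educ gdp

* One row per unique pattern of outcome and covariates, with its count in n
contract y agegrp female educ country year gdp, freq(n)

* Same model on the contracted data, weighting each row by its frequency
logit y i.agegrp i.female i.educ gdp [fweight = n]
```

The two logit fits should report the same coefficients and standard errors; only the number of rows Stata has to process changes.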
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#3

10 Dec 2018, 05:24

I just generated 2.2 million observations on 4 variables. The dataset came out at 44 MB (which is nothing), and the estimation of the probit model took 3.49 seconds (which is not prohibitive).
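
A timing check of this kind can be sketched as follows. This is not the code actually used for the figures above; the variable names, the data-generating process, and the seed are all assumptions, and the timing will differ by machine.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 12345
set obs 2200000
generate byte age    = runiformint(18, 90)
generate byte female = runiform() < 0.5
generate byte educ   = runiformint(1, 4)
* Binary outcome from a latent-variable probit model with made-up coefficients
generate byte y = rnormal() < -0.5 + 0.01*age + 0.2*female

describe, short          // reports the number of observations and variables

timer clear 1
timer on 1
probit y age i.female i.educ
timer off 1
timer list 1             // elapsed seconds for the probit fit
```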

This grouping of data was done a long time ago, when people were doing the calculations with punched cards on computers occupying whole floors.

Given that you're not familiar with the technique of aggregation, you should just do your analysis at the individual level.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

10 Dec 2018, 05:28

Daniel:
I agree with Joro.
Given that individuals are nested within countries and your outcome is binary, you can go -melogit-.
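
A minimal sketch, on simulated data with made-up variable names (agegrp, female, events, trials), of a random-intercept -melogit- fitted first to individual records and then to the same data collapsed to binomial counts. The grouped fit relies on the binomial() option of -melogit-; please check -help melogit- for your Stata version before relying on it.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 2018
set obs 28
generate country = _n
generate u = rnormal(0, 0.5)                 // country random intercept
expand 5000                                  // 5,000 individuals per country
generate byte agegrp = runiformint(1, 6)
generate byte female = runiform() < 0.5
generate byte y = runiform() < invlogit(-1 + 0.1*agegrp + 0.3*female + u)

* Individual-level random-intercept logit
melogit y i.agegrp i.female || country:

* Collapse to one row per covariate pattern within country:
* events = number of 1s, trials = number of individuals
collapse (sum) events = y (count) trials = y, by(country agegrp female)

* Same model on the grouped data
melogit events i.agegrp i.female || country:, binomial(trials)
```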

Last edited by Carlo Lazzaro; 10 Dec 2018, 05:40.

Kind regards,
Carlo
(Stata 19.0)
Daniel Devine

Join Date: Dec 2017

Posts: 21
#5

10 Dec 2018, 06:06

Thank you Nick.

Joro: seems like a bizarre comment to make considering you are not aware of the type of model I am running, and only included 4 variables when I said there were many country level variables. I have, of course, run the models I intend to use which have taken many hours without finishing.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

10 Dec 2018, 06:59

Daniel:
bizarre as they may seem, replies are mostly based on the details provided by the original poster and (often) a bit of guesswork from interested readers.

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35724
#7

10 Dec 2018, 08:25

If your model is proving hard to fit, my guess is mostly that it's hard to fit. It would be good if contracting to a smaller dataset solved the problem, but I agree with others that we need more detail on what you're trying to do to add further advice. Usually, when a model won't converge easily you need to backtrack to something much simpler and then build in complications more gradually until you get a sense of which predictor or parameterisation is a difficulty too far.
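
One way to read that advice, sketched on simulated data: work on a subsample, start with the simplest model, and add one complication at a time, noting where fitting slows down or stops converging. The variable names (gdp, age, female), the 10% subsample, and the order of the steps are assumptions for illustration only.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 42
set obs 28
generate country = _n
generate gdp = rnormal()                     // hypothetical country-level predictor
generate u   = rnormal(0, 0.5)               // country random intercept
expand 2000
generate byte age    = runiformint(18, 90)
generate byte female = runiform() < 0.5
generate byte y = runiform() < invlogit(-1 + 0.01*age + 0.3*female + 0.2*gdp + u)

preserve
sample 10, by(country)                       // explore on a 10% subsample first

* Step 1: plain logit, no country structure
logit y age i.female

* Step 2: add a random intercept for country
melogit y age i.female || country:

* Step 3: add the country-level predictor
melogit y age i.female gdp || country:
restore
```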
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

10 Dec 2018, 09:21

Daniel -

Let me just add that, if a response seems inappropriate to you, perhaps you need to take the author's good intent as a given and ask yourself why they might have written what they did. The answer in this case is that your question in post #1 lacked detail, as Carlo pointed out.

Unfortunately, I have no idea how to get from ungrouped to grouped data based on these covariate patterns. I actually only have three individual level covariates (age, gender, education), and not sure how the country-level variables can be factored into this. Does anyone have any idea on how to group the data like this?

I'd suggest you revisit the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, noting especially sections 9-12 on how to best pose your question.

You'll find that Statalist members are reluctant to accept an assertion of a problem and a desired approach to solving it in the absence of supporting information. In general this comes from a reluctance to give a correct answer to an inappropriate question that fails to address the actual problem, thereby sending the recipient of the advice off in a direction that ultimately proves unhelpful in solving the actual problem.

Nick suggests as much in post #7, and I expect this is why he gave just a terse response in post #2, limited to the name of the command that indeed you might have found without his help by reviewing the table of contents of the Stata Data Management Reference Manual PDF included in your Stata installation and accessible from Stata's Help menu.

I specifically agree with the final sentence in Nick's post #7.

Joro Kolev

Join Date: Aug 2018
Posts: 3050

10 Dec 2018, 10:48

Daniel, if my comment seems bizarre to you, how about you do the following:

1. Load your data and type the following on your Stata command line, and report what Stata is telling you in response, like I have done for the example dataset:

Code:

. clear

. webuse invest2

.  xtset company time
       panel variable:  company (strongly balanced)
        time variable:  time, 1 to 20
                delta:  1 unit

. xtdes

 company:  1, 2, ..., 5                                      n =          5
    time:  1, 2, ..., 20                                     T =         20
           Delta(time) = 1 unit
           Span(time)  = 20 periods
           (company*time uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                        20      20      20        20        20      20      20

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+----------------------
        5    100.00  100.00 |  11111111111111111111
 ---------------------------+----------------------
        5    100.00         |  XXXXXXXXXXXXXXXXXXXX


. xtsum

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
invest   overall |   248.957   267.8654      12.93     1486.7 |     N =     100
         between |             246.9354    42.8915     608.02 |     n =       5
         within  |             149.9249   -101.363   1127.637 |     T =      20
                 |                                            |
market   overall |  1922.223   1420.783      191.5     6241.7 |     N =     100
         between |             1491.225     670.91   4333.845 |     n =       5
         within  |             470.8022   380.5779   3830.078 |     T =      20
                 |                                            |
stock    overall |   311.067   371.5523         .8     2226.3 |     N =     100
         between |              228.435      85.64    648.435 |     n =       5
         within  |             309.6505   -334.568   1888.932 |     T =      20
                 |                                            |
company  overall |         3   1.421338          1          5 |     N =     100
         between |             1.581139          1          5 |     n =       5
         within  |                    0          3          3 |     T =      20
                 |                                            |
time     overall |      10.5   5.795331          1         20 |     N =     100
         between |                    0       10.5       10.5 |     n =       5
         within  |             5.795331          1         20 |     T =      20

.

2. Then copy and paste one or more of the models that you have run and that took hours without finishing. If you have any output from them, paste the output as well; if you don't, just post the commands that you wrote.