  • How to perform simple analysis using data organised in a nested structure?

    Dear all,

    I have a dataset organised in four levels (L=4):
    • Geographical areas
    • Households
    • Individuals
    • Activity

    Each observation (row in the dataset) corresponds to an activity.
    Each individual may have undertaken a different number of activities (or even no activity - in this case the activity-related attributes are missing values).
    Each household contains at least one person.
    Areas contain households.

    Note that there are unique correspondences between these entities (i.e. an individual cannot belong to more than one household, and a household cannot belong to more than one area).
    Each of these entities has its own attributes (for example, household income refers to households and does not vary within a household; activity duration is an attribute of an activity, etc.).

    A simplified and hypothetical representation of the dataset would be:

    Area household household_income person activity_duration
    1 1 2000 1 15
    1 1 2000 1 20
    1 2 2500 1 5
    1 2 2500 2 10
    1 2 2500 2 15
    1 3 1500 1 35
    1 3 1500 1 40
    1 3 1500 1 10
    1 4 6000 1 5
    … … … … …

    If I am interested in getting some basic descriptive statistics on activity duration, then I can run for instance tabstat and it would be fine.

    But if I am interested in analysing household income, running the same command is misleading, because Stata treats each activity (row) as the unit of analysis. In this simplified example with four households, the average household income would be reported as 2444 (22000 over 9 rows) rather than the expected 3000 (12000 over 4 households). In short: how can I calculate statistics on income with households as the unit of analysis (UoA), i.e. counting each household only once?

    I would like to avoid transforming the dataset with the reshape command, because I would have to do it L-1 times for a dataset containing L levels of analysis.

    Apologies if this question has already been asked. The closest forum entry I found was this one (http://www.statalist.org/forums/foru...y-nested-group), but I am not sure whether it provides a straightforward answer to my question. I was expecting something like tabstat household_income, uoa(household) stats(mean), if such an option were available.

    Thanks in advance,

    Thiago
    Last edited by Thiago Guimaraes; 01 May 2017, 05:45.

  • #2
    In presenting data examples, please use dataex (from SSC) to generate them.

    As with most Stata commands, you can restrict the sample using an if qualifier; see help if. Your example has no activity identifier, so I generate one arbitrarily. It is needed because Stata leaves the order of observations with tied sort keys arbitrary; including the activity identifier makes the sort order fully determined, so the first observation in each group is the same on every run.

    What you need is to pick one observation to represent the person and one observation to represent the household. This is easy to do once you understand how to group observations using the by prefix (see help by). Under by, the _n system variable (see help _variables) gives the observation number within each by group, so _n == 1 identifies the first observation of each group.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(Area household household_income person activity_duration)
    1 1 2000 1 15
    1 1 2000 1 20
    1 2 2500 1  5
    1 2 2500 2 10
    1 2 2500 2 15
    1 3 1500 1 35
    1 3 1500 1 40
    1 3 1500 1 10
    1 4 6000 1  5
    end
    
    * you should have a unique activity identifier, the following is arbitrary
    gen long activ_id = _n
    
    * fully sort observations across all levels
    isid Area household person activ_id, sort
    
    * tag first activity per person to represent the person
    by Area household person: gen person1 = _n == 1
    
    * tag first activity of the first person to represent the household
    by Area household: gen household1 = _n == 1
    
    tabstat household_income if household1
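
    As a quick check of the arithmetic in the question, here is a minimal sketch that compares the activity-level mean with the household-level mean. It reuses the example data and the tag variable from the code above; the extra assert is my own addition, to confirm that income really is constant within each household before one row is trusted to represent it.

    ```stata
    clear
    input float(Area household household_income person activity_duration)
    1 1 2000 1 15
    1 1 2000 1 20
    1 2 2500 1  5
    1 2 2500 2 10
    1 2 2500 2 15
    1 3 1500 1 35
    1 3 1500 1 40
    1 3 1500 1 10
    1 4 6000 1  5
    end

    * arbitrary activity identifier, as in the code above
    gen long activ_id = _n
    isid Area household person activ_id, sort

    * one tagged row per household
    by Area household: gen household1 = _n == 1

    * stop with an error if income varies within any household
    by Area household: assert household_income == household_income[1]

    * activity-level mean: all 9 rows count, 22000/9
    summarize household_income, meanonly
    display r(mean)

    * household-level mean: only the 4 tagged rows count, 12000/4
    summarize household_income if household1, meanonly
    display r(mean)
    ```

    The first mean is the misleading 2444 from the question; the second is the expected 3000.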
    Last edited by Robert Picard; 01 May 2017, 09:36.



    • #3
      Robert Picard's answer shows basic Stata logic in his usual impeccable style. It may help to know that this approach has also long been implemented as a standard egen function.

      Here's a translation. There is no gain in brevity, and certainly not in efficiency, as calling up egen just adds a layer of code to be interpreted.

      Bauhaus-Shaker-Quaker-Ikea-Tufte minimalists will prefer doing as much as possible with ground-level Stata code. Others may want to know the egen way to do it.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(Area household household_income person activity_duration)
      1 1 2000 1 15
      1 1 2000 1 20
      1 2 2500 1  5
      1 2 2500 2 10
      1 2 2500 2 15
      1 3 1500 1 35
      1 3 1500 1 40
      1 3 1500 1 10
      1 4 6000 1  5
      end
      
      * you should have a unique activity identifier, the following is arbitrary
      gen long activ_id = _n
      
      * fully sort observations across all levels
      isid Area household person activ_id, sort
      
      * tag first activity per person to represent the person
      egen person1 = tag(Area household person)
      
      * tag first activity of the first person to represent the household
      egen household1 = tag(Area household)
      
      tabstat household_income if household1
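
      The tag variables can also be combined across levels. As a sketch only, and assuming you want a count of persons per household (the variable name n_persons is mine, not from the original data), you can total the person tags within each household and then summarise over the household tags:

      ```stata
      clear
      input float(Area household household_income person activity_duration)
      1 1 2000 1 15
      1 1 2000 1 20
      1 2 2500 1  5
      1 2 2500 2 10
      1 2 2500 2 15
      1 3 1500 1 35
      1 3 1500 1 40
      1 3 1500 1 10
      1 4 6000 1  5
      end

      * one tagged row per person and one per household
      egen person1 = tag(Area household person)
      egen household1 = tag(Area household)

      * persons per household: sum the person tags within each household
      * (n_persons is a hypothetical name chosen for this illustration)
      bysort Area household: egen n_persons = total(person1)

      * household-level statistics: each household counts once
      tabstat n_persons household_income if household1, stats(mean n)
      ```

      Person identifiers repeat across households in the example, which is why the person tag is defined on Area, household and person together and not on person alone. With the example data the four households hold 1, 2, 1 and 1 persons, so the mean of n_persons should be 1.25.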
