How to perform simple analysis using data organised in a nested structure?

Thiago Guimaraes

Join Date: May 2017

Posts: 1
#1

01 May 2017, 04:51

Dear all,

I have a dataset organised in four levels (L=4):
• Geographical areas
• Households
• Individuals
• Activity

Each observation (row in the dataset) corresponds to an activity.
Each individual may have undertaken a different number of activities (or even no activity - in this case the activity-related attributes are missing values).
Each household contains at least one person.
Areas contain households.

Note that there are unique correspondences between these entities (i.e. an individual cannot belong to more than one household, and a household belongs to no more than one area).
Each of these entities is associated with attributes (for example, household income refers to households and does not vary within the same household; activity duration is an attribute of an activity, etc.).

A simplified and hypothetical representation of the dataset would be:

Area household household_income person activity_duration
1 1 2000 1 15
1 1 2000 1 20
1 2 2500 1 5
1 2 2500 2 10
1 2 2500 2 15
1 3 1500 1 35
1 3 1500 1 40
1 3 1500 1 10
1 4 6000 1 5
… … … … …

If I am interested in getting some basic descriptive statistics on activity duration, then I can run for instance tabstat and it would be fine.

But if I am interested in analysing household income, running the same command would be misleading, as Stata treats the activity as the unit of analysis. The average household income in this simplified example with four households would be about 2444 (and not 3000, which would be the expected result). In short: how can I calculate statistics on income with households as the unit of analysis (UoA), i.e. counting each household only once?
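To make the discrepancy concrete, here is a minimal sketch that reproduces both figures from the nine example rows listed above. It only restates the arithmetic; the second figure is computed by hand from the four distinct household incomes.

```stata
* the nine example rows from the table above
clear
input float(Area household household_income person activity_duration)
1 1 2000 1 15
1 1 2000 1 20
1 2 2500 1  5
1 2 2500 2 10
1 2 2500 2 15
1 3 1500 1 35
1 3 1500 1 40
1 3 1500 1 10
1 4 6000 1  5
end

* activity-level mean: each household is weighted by its number of rows
* (22000 / 9, i.e. roughly 2444)
summarize household_income
display r(mean)

* household-level mean by hand: each household counted once (12000 / 4 = 3000)
display (2000 + 2500 + 1500 + 6000) / 4
```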

I would like to avoid transforming the dataset with the command reshape, because I would then have to do it L-1 times for a dataset containing L levels of analysis.

Apologies if this question has already been asked. The closest forum entry I found was this one (http://www.statalist.org/forums/foru...y-nested-group), but I am not sure if it provides a straightforward answer to my question. I was expecting something like tabstat household_income, uoa(household) stats(mean), if such an option were available.

Thanks in advance,

Thiago

Last edited by Thiago Guimaraes; 01 May 2017, 05:45.
Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

01 May 2017, 09:22

In presenting data examples, please use dataex (from SSC) to generate them.

As with most Stata commands, you can restrict the sample using an if qualifier, see help if. Your example has no activity identifier so I generate one arbitrarily. This is needed because Stata's sort is not stable: observations with the same values of the sort variables are placed in random order, so the data must be fully sorted for the results to be reproducible. The activity identifier is used to fully sort the observations.

What you need is to pick one observation to represent the person and one observation to represent the household. This is easy to do once you understand how to group observations using the by: prefix (see help by). The _n system variable (see help _variables) is used to identify the observation number within each by group.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Area household household_income person activity_duration)
1 1 2000 1 15
1 1 2000 1 20
1 2 2500 1  5
1 2 2500 2 10
1 2 2500 2 15
1 3 1500 1 35
1 3 1500 1 40
1 3 1500 1 10
1 4 6000 1  5
end

* you should have a unique activity identifier, the following is arbitrary
gen long activ_id = _n

* fully sort observations across all levels
isid Area household person activ_id, sort

* tag first activity per person to represent the person
by Area household person: gen person1 = _n == 1

* tag first activity of the first person to represent the household
by Area household: gen household1 = _n == 1

tabstat household_income if household1
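Once the tag variables exist, the same if qualifier works with any command and at any level. The following is only a sketch on the example data: it repeats the setup so that it runs on its own, and the particular statistics requested are illustrative choices rather than part of the original question.

```stata
clear
input float(Area household household_income person activity_duration)
1 1 2000 1 15
1 1 2000 1 20
1 2 2500 1  5
1 2 2500 2 10
1 2 2500 2 15
1 3 1500 1 35
1 3 1500 1 40
1 3 1500 1 10
1 4 6000 1  5
end

* same setup as above: arbitrary activity id, full sort, one tag per level
gen long activ_id = _n
isid Area household person activ_id, sort
by Area household person: gen person1 = _n == 1
by Area household: gen household1 = _n == 1

* household-level statistics: one observation per household
tabstat household_income if household1, statistics(n mean sd min max)

* the same restriction with summarize, which leaves results in r()
summarize household_income if household1
display "households = " r(N) "  mean income = " r(mean)

* person-level count: one observation per person
count if person1
```

With the nine example rows this should report 4 households with a mean income of 3000, and 5 persons.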

Last edited by Robert Picard; 01 May 2017, 09:36.

Nick Cox

Join Date: Mar 2014
Posts: 35709

01 May 2017, 10:50

Robert Picard's answer shows basic Stata logic in his usual impeccable style. It may help to know that this approach has also long been implemented as a standard egen function.

Here's a translation. There is no gain in brevity, and certainly not in efficiency, as calling up egen just adds a layer of code to be interpreted.

Bauhaus-Shaker-Quaker-Ikea-Tufte minimalists will prefer doing as much as possible with ground-level Stata code. Others may want to know the egen way to do it.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Area household household_income person activity_duration)
1 1 2000 1 15
1 1 2000 1 20
1 2 2500 1  5
1 2 2500 2 10
1 2 2500 2 15
1 3 1500 1 35
1 3 1500 1 40
1 3 1500 1 10
1 4 6000 1  5
end

* you should have a unique activity identifier, the following is arbitrary
gen long activ_id = _n

* fully sort observations across all levels
isid Area household person activ_id, sort

* tag first activity per person to represent the person
egen person1 = tag(Area household person)

* tag first activity of the first person to represent the household
egen household1 = tag(Area household)

tabstat household_income if household1
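As a quick check that the two routes agree, here is a sketch that builds both household tags side by side on the example data. The variable names household1_by and household1_egen are made up for this comparison. The two tags may land on different rows within a household, which does not matter here because household income is constant within a household.

```stata
clear
input float(Area household household_income person activity_duration)
1 1 2000 1 15
1 1 2000 1 20
1 2 2500 1  5
1 2 2500 2 10
1 2 2500 2 15
1 3 1500 1 35
1 3 1500 1 40
1 3 1500 1 10
1 4 6000 1  5
end

gen long activ_id = _n
isid Area household person activ_id, sort

* by-group version: first observation in each household
by Area household: gen household1_by = _n == 1

* egen version: one observation tagged per household
egen household1_egen = tag(Area household)

* both tag exactly one observation per household
count if household1_by
count if household1_egen

* so the household-level summaries are the same
tabstat household_income if household1_by, statistics(n mean)
tabstat household_income if household1_egen, statistics(n mean)
```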
