Replicating the values of a variable to fill specific gaps

Tommaso Bechini

Join Date: Oct 2016

Posts: 9
#1

Replicating the values of a variable to fill specific gaps

01 Mar 2017, 10:46

Dear Statalisters,

I am working with a dataset originating from two different datasets, one with a series of variables regarding the household quality of a number of families, the other certain aspects of the quality of life of its members.

I need to work with both, but, logically, the number of observations is considerably greater in the "individual" dataset.

To fill the gaps in the "household" dataset, I would like to replicate the values of the "household" dataset's variables so that it features the same value each time the same family code is present in the "individual" dataset.

To further make myself clear: if person #1, #2 and #3 are all members of family #1, in which I have a corresponding value of 1000 for variable X, I would like for each member of family #1 to feature the value 1000 in variable X.

I think that some sort of use of the "by/bysort" command is what I need, but I am not sure.

I checked if this question was previously asked, but I don't think so, if that should be the case, please redirect me to the thread.

I thank in advance for any contribution!

Tommaso Bechini
Tags: bysort, replicate, Suggestion
Roger Chu

Join Date: Feb 2016

Posts: 84
#2

01 Mar 2017, 10:58

Hello Tommaso,

As I understand your question, you have individual data, and then household data that has 1 observation per household with other descriptive variables linked to each household.

Does the dataset with individual data have a variable that links each observation to the household dataset? For example, does the individual dataset have a variable family that corresponds to a variable family in the household dataset. If so, you should be able to do a simple merge.

Code:

use individual_data merge m:1 family using household_data
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

01 Mar 2017, 11:12

It's difficult for me to imagine how you have found yourself in this circumstance. I would think that if you had two data sets, one with household level information only, and the other with individual level information, that you would combine them with the -merge- command

Code:

use household_data, clear merge 1:m household_id_variable using individual_data

Had you done that, all of the household-level information from the household data set would automatically be added to the observations of all individuals in the household.

But it seems that you have somehow created a data set in which the household information appears only in one member of each household. So, here's how you can handle that.

[clear]
by household_id_variable (X), sort: assert X == X[1] if !missing(X)
by household_id_variable (X), sort: replace X = X[1] if missing(X) // FOR NUMERIC VARIABLES

// OR

by household_id_variable (X), sort: assert X == X[_N] if !missing(X)
by household_id_variable (X), sort: replace X = X[_N] if missing(X) // FOR STRING VARIABLES
[/code]

That said, since I'm a bit perplexed about how you got to this situation in the first place, I'm worried that there may be other issues. For example, the above code implicitly relies on the variable X being non-missing for only one person in each household, or at least that if it is non-missing for more than one they at least all have the same value. If that isn't true, then the situation isn't really as described and the results produced could be seriously wrong. That's why I put those -assert- commands in there, so Stata will stop and not blunder through inappropriately.

Added: Crossed with Roger Chu's response, which proposes the same as my first approach.
1 like
Comment
Tommaso Bechini

Join Date: Oct 2016

Posts: 9
#4

04 Mar 2017, 04:48

Dear all,

The problem was originated due to my working dataset being originally in Excel format, which I had imported. In said dataset, the observations from the two original datasets had simply being "glued" together, without adapting the number of observations.

I found the two original datasets and merged them in Stata, it seems everything is fine now.

Thank you kindly for the time you have dedicated me.

Tommaso Bechini
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

04 Mar 2017, 09:54

The problem was originated due to my working dataset being originally in Excel format, which I had imported. In said dataset, the observations from the two original datasets had simply being "glued" together, without adapting the number of observations.

I'm glad you solved your problem, and glad you closed the thread explaining how.

I think this is a good illustration of the profound differences between spreadsheets and statistical software. There is a general lesson here. Whenever you import data from a spreadsheet, you really have to scrutinize the spreadsheet and understand just what it contains. It is not uncommon to find, as here, that different parts of the spreadsheet need to be imported into separate data sets and then recombined in a different way in Stata (or any other statistical software, for that matter). In spreadsheets, there is no constraint on what and what kind of data goes where: all sorts of visually attractive but analytically unsuitable arrangements are possible. By contrast, had you wanted to create in Stata, from other Stata data sets, a layout that, in the Browser window, would resemble what you found in your Excel file, it would have required going out of your way to do so.

You can never say it to often: despite their superficial resemblances, spreadsheets are not statistical packages. What works well in one can fare poorly in the other. Each is good for different purposes, and substituting one for the other often ends badly. And when you are interfacing between them you have to be careful.
1 like
Comment

Announcement

Replicating the values of a variable to fill specific gaps

Comment

Comment

Comment

Comment