Help with Looking up values in dataset

Adam Tapaszto

Join Date: Sep 2023

Posts: 1
#1

Help with Looking up values in dataset

26 Sep 2023, 06:52

Dear Stata experts,

I have an immense patent dataset with company names and ids. Some of the company ids are missing and I would like to fill them by looking them up based on the company names from rows in which they are matched. Here is an example of the issue:

patent_id company_id company_name

1 1 company1

2 . company2

3 . company3

4 2 company2

5 . company3

6 . company1

7 3 company3

8 . company1

9 . company2

Through what commands could I fill those missing values with the ids given that the ids are matched to a company at least in one row for for all companies? Please bear in mind that this is a very simplistic representation of my dataset. I would need something similar to the lookup method in excel. Thank you in advance!
Tags: data, lookup, missing
Nick Cox

Join Date: Mar 2014

Posts: 35775
#2

26 Sep 2023, 07:02

You're safe if only one distinct non-missing identifier is given for each company name. This works for your example:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(patent_id company_id) str8 company_name 1 1 "company1" 2 . "company2" 3 . "company3" 4 2 "company2" 5 . "company3" 6 . "company1" 7 3 "company3" 8 . "company1" 9 . "company2" end egen max_id = max(company_id), by(company_name) egen min_id = min(company_id), by(company_name) bysort company_name (company_id) : replace company_id = company_id[1] if max_id == min_id list, sepby(company_name)

For your full data, you likely have other problems given changes over time and minor inconsistencies in name, but that's the problem you asked about.

Clashes of identifier can be explored by looking at observations where maximum and minimum differ.

Last edited by Nick Cox; 26 Sep 2023, 07:06.
1 like
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10274
#3

26 Sep 2023, 07:03

Code:

bysort company_name (company_id): replace company_id=company_id[1]

Missing values are sorted last. Therefore, the above fills all observations of a particular company_name if there is at least one nonmissing company_id with the first sorted value. But it assumes consistency in the values of company_id, i.e., these values do not vary within company_name.
Comment
Daniel Feenberg

Join Date: Oct 2014

Posts: 330
#4

26 Sep 2023, 07:12

Try something like:

Code:

sort company_name company_id by company_name: company_id=company_id[1]

but I worry that there may be companies where no company_id is non-missing, or there may be companies with multiple company_id. Then this would at least list the problem records:

Code:

sort company_name company_id by company_name company_id: company_id=company_id[1] by company_name: list if company_id[1]!= company_id[_N]| company_id==.

None of this is tested, and will no doubt need revision. See also https://www.stata.com/manuals13/u13.pdf especially sections 13.7.1 and 13.7.2 if you are not familiar with the use of square brackets with -by-.
Comment

patent_id	company_id	company_name
1	1	company1
2	.	company2
3	.	company3
4	2	company2
5	.	company3
6	.	company1
7	3	company3
8	.	company1
9	.	company2

Announcement

Help with Looking up values in dataset

Comment

Comment

Comment