Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with Looking up values in dataset

    Dear Stata experts,

    I have an immense patent dataset with company names and ids. Some of the company ids are missing and I would like to fill them by looking them up based on the company names from rows in which they are matched. Here is an example of the issue:
    patent_id company_id company_name
    1 1 company1
    2 . company2
    3 . company3
    4 2 company2
    5 . company3
    6 . company1
    7 3 company3
    8 . company1
    9 . company2
    Through what commands could I fill those missing values with the ids given that the ids are matched to a company at least in one row for for all companies? Please bear in mind that this is a very simplistic representation of my dataset. I would need something similar to the lookup method in excel. Thank you in advance!


  • #2
    You're safe if only one distinct non-missing identifier is given for each company name. This works for your example:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(patent_id company_id) str8 company_name
    1 1 "company1"
    2 . "company2"
    3 . "company3"
    4 2 "company2"
    5 . "company3"
    6 . "company1"
    7 3 "company3"
    8 . "company1"
    9 . "company2"
    end
    
    egen max_id = max(company_id), by(company_name)
    egen min_id = min(company_id), by(company_name)
    bysort company_name (company_id) : replace company_id = company_id[1] if max_id == min_id
    list, sepby(company_name)
    For your full data, you likely have other problems given changes over time and minor inconsistencies in name, but that's the problem you asked about.

    Clashes of identifier can be explored by looking at observations where maximum and minimum differ.
    Last edited by Nick Cox; 26 Sep 2023, 07:06.

    Comment


    • #3
      Code:
      bysort company_name (company_id): replace company_id=company_id[1]
      Missing values are sorted last. Therefore, the above fills all observations of a particular company_name if there is at least one nonmissing company_id with the first sorted value. But it assumes consistency in the values of company_id, i.e., these values do not vary within company_name.

      Comment


      • #4
        Try something like:

        Code:
        sort company_name company_id
        by company_name:  company_id=company_id[1]
        but I worry that there may be companies where no company_id is non-missing, or there may be companies with multiple company_id. Then this would at least list the problem records:

        Code:
        sort company_name company_id
        by company_name company_id: company_id=company_id[1]
        by company_name: list if company_id[1]!= company_id[_N]| company_id==.
        None of this is tested, and will no doubt need revision. See also https://www.stata.com/manuals13/u13.pdf especially sections 13.7.1 and 13.7.2 if you are not familiar with the use of square brackets with -by-.

        Comment

        Working...
        X