Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Match and manipulation

    Hi, friends. I have a dataset containing six vars, just as below.

    clear
    input id str35 name1 str35 name2 str35 date str35 date1 str35 date2
    1001 "ENERGY INC" "Gas Natural " "11/28/2016" "06/31/1986" "08/31/2017"
    1001 "GAS NATURAL INC" "Gas Natural " "11/28/2016" "01/31/1986" "08/31/2017"
    1012 "TECH DATA CORP" "Tech Data " "09/21/2017" "04/30/1986" "12/29/2017"
    end

    I want to generate a variable denoting "match": if "name1" or "name2" is matched for each same "id", and its respective "date" is within the period from "date1" to "date2", then "match" ==1, or else "match" ==0. How to do that? Thank you for any advice.


  • #2
    if "name1" or "name2" is matched for each same "id"
    What does this mean?

    Comment


    • #3
      I wish you could provide (1) more observations in your, as well as (2) your desired dataset. If you provide more detailed information about your data, we can help you easier.

      Based on my understanding, I can suggest you the followings.

      a) Finding out the matches in dates:You can simply convert the dates to the Stata dates and then use qualifiers to find out whether the date is within the period.

      Code:
      clear
      input id str35 name1 str35 name2 str35 date str35 date1 str35 date2
      1001 "ENERGY INC" "Gas Natural " "11/28/2016" "06/30/1986" "08/31/2017"
      1001 "GAS NATURAL INC" "Gas Natural " "11/28/2016" "01/31/1986" "08/31/2017"
      1012 "TECH DATA CORP" "Tech Data " "09/21/2017" "04/30/1986" "12/29/2017"
      end
      gen date_m=date(date,"MDY")
      gen date1_m=date(date1,"MDY")
      gen date2_m=date(date2,"MDY")
      gen matchindate=.
      replace matchindate=1 if date_m>=date1_m & date_m<=date2_m
      replace matchindate=0 if date_m<date1_m | date_m>date2_m
      Alternatively, you can create two datasets: a dataset with ‘date’ and another dataset with ‘date1’ and ‘date2’. Next, you can use the 'rangejoin' command (ssc install rangejoin) to merge based on dates.

      b) Finding out the matches in the names:As far as I understood, the variables ‘name1’ and ‘name2’ are not exactly similar to each other. Thus, you might want to consider fuzzy matching (see e.g., the matchit or reclink commands). However, an issue might arise because the names can be very different from each other (depending on your data). Therefore, I suggest you to first, manually work on the names to make sure that they are at least quite similar to each other. Besides, before using Stata, you might want to investigate your data using the fuzzy add-in in excel (check the similarity score and investigate the matched, and non-matched items.).

      Comment

      Working...
      X