  • Creating Variable based on NAIC Digit Level match for Firms

    I have a list of firms that I am comparing to each other. Next I want to compare the NAIC code of each firm_id to each firm_id2 on each digit of the 6-digit level NAIC code. If the digit is the same, I am saying they are similar at the "X-level."

    I am also trying to determine if the region is the same (same_reg) among two firms - do I generate another variable firm_id2 and then see if region matches across each firm?

    Data:
    firm_id M_Acq_Naic M_Acq_Reg
    1 511210 "JP"
    2 236116 "EU"
    3 451120 "AM"
    3 451120 "AM"
    3 451120 "AM"
    3 451120 "AM"
    4 441110 "AM"
    4 441110 "AM"
    5 811310 "EU"
    6 221119 "EU"
    7 813212 "JP"

    where similarity between firm 1 and 2 is 0. But similarity between firm 2 and 6 is 1 (first digit) and firms 6 and 8 are similar at the 2-digit level so similarity will be equal to 2.
    where same_reg between firm 1 and 2 is 0 but is 1 between firms 3 and 4.

    Thanks in advance for your help!

  • #2
    You have accidentally posted your topic in Statalist's Mata Forum, which is used for discussions of Stata's Mata language. Your question will see a much larger audience if you post it in Statalist's General Forum.

    Also, if you have not already done so, take a look at the Statalist FAQ linked to at the top of this page for posting guidelines and suggestions.

    • #3
      (Crossed in the ether with William's posting; this inserted as an edit.)

      Welcome to StataList. Per the title, this particular section of StataList is for questions about Mata, Stata's matrix programming language, which is not at all needed for your problem. Probably 10 times as many people will look at your question if it is posted in the Stata section rather than here. Also, you'll want to take a look at the FAQ about how to show example data using -dataex-.

      Here's a stab at your problem, recognizing that I don't quite understand what you want to do with the NAIC codes. (Note that most people here won't even know what they are, since we're not all economists.) Given that, to my limited knowledge, the NAIC codes have a hierarchical structure, I'm not sure you want just a digit-by-digit comparison as you ask, but that's what I can easily do.

      The key thing here is that your question inherently involves pairs of observations, so the best way to go is to actually have a file of pairs, which the -cross- command can do for you. How does the following look?

      Code:
      // Example data
      clear
      input firm_id M_Acq_Naic str2 M_Acq_Reg
      1 511210 "JP"
      2 236116 "EU"
      3 451120 "AM"
      3 451120 "AM"
      3 451120 "AM"
      3 451120 "AM"
      4 441110 "AM"
      4 441110 "AM"
      5 811310 "EU"
      6 221119 "EU"
      7 813212 "JP"
      end
      // Duplicate firms in the example create a mess.  I presume they're not
      // part of the actual data.
      duplicates drop firm_id, force
      // NAIC codes are easier to compare digit by digit as strings.
      tostring M_Acq_Naic*, replace
      //
      // Now, we're ready to do something.
      // Make a file to pair with itself.
      preserve
      tempfile temp
      rename * *_2  
      save `temp'
      restore
      //
      rename * *_1
      cross using `temp'  // the workhorse here
      //
      drop if  (firm_id_1 == firm_id_2) // no self pairs
      // Create indicator variables for digit matches on NAIC codes
      forval i = 1/6 {
         gen PairSameDigit`i' = substr(M_Acq_Naic_1,`i',1) == substr(M_Acq_Naic_2,`i', 1)
      }
      // Drop duplicate firm pairs
      gen min = min(firm_id_1, firm_id_2)
      gen max = max(firm_id_1, firm_id_2)
      bysort min max: keep if _n ==1
      drop min max
      // Now do what you want with the NAIC digit match indicator variables.
      // ???
      //
      // The example data do contain a region variable (M_Acq_Reg), which the renames
      // above turned into M_Acq_Reg_1 and M_Acq_Reg_2, so a same-region indicator is:
      gen RegionSame = M_Acq_Reg_1 == M_Acq_Reg_2
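
The "// ???" step in the code above is left open. One possible way to fill it, assuming "similar at the X-level" means the two codes share their first X digits (an interpretation of the original question, not something confirmed in this thread), is to take the length of the longest common prefix. The sketch below uses three hand-typed pairs of codes from the example data rather than the output of -cross-, and the variable name similarity is only illustrative:

```stata
// Three illustrative pairs taken from the example data (typed by hand)
clear
input str6 M_Acq_Naic_1 str6 M_Acq_Naic_2
"236116" "221119"
"811310" "813212"
"511210" "236116"
end

// Longest common prefix: start at 0 and raise the level each time
// the first i digits of both codes agree.
gen byte similarity = 0
forvalues i = 1/6 {
    replace similarity = `i' if substr(M_Acq_Naic_1, 1, `i') == substr(M_Acq_Naic_2, 1, `i')
}

// The three pairs should come out as 1, 2 and 0 respectively.
list
```

Because a match on the first i digits implies a match on every shorter prefix, looping upward leaves the longest matching prefix in similarity. On the paired file built above, the same loop would run unchanged, since the renames already produce M_Acq_Naic_1 and M_Acq_Naic_2.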

      • #4
        Thank you for your help! I will be more careful about where I post...

        With respect to a portion of the code:
        cross using `temp'

        I get the following error, r(459): "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations." I believe that is because there are duplicate firms. The data track each firm's deal activity by year (1990-2016), so a firm can be listed up to 27 times. There are 373,088 unique firms but a total of 1,210,053 observations.
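
A rough size check may help here. -cross- forms every pairwise combination, so the result has (observations in memory) times (observations in the using file) rows. The sketch below simply plugs in the two counts quoted in this post and the limit quoted in the error message; the local macro names are only illustrative:

```stata
// Counts quoted in the post above
local n_obs   = 1210053
local n_firms = 373088
// Maximum number of observations, as quoted in the r(459) message
local limit   = 2147483620

// Size of the full cross with all firm-year rows
display %20.0fc `n_obs'^2
// Size of the full cross with one row per firm
display %20.0fc `n_firms'^2
// 1 if even the deduplicated cross is over the limit, 0 otherwise
display `n_firms'^2 > `limit'
```

Even with one row per firm, the full cross would be on the order of 139 billion pairs, well above the limit, so dropping duplicates alone would not be expected to clear the error.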

        • #5
          Mike may come back to this, but you can re-post the question in the General Forum if you need a more expedited reply.

          • #6
            Discussion moved to

            https://www.statalist.org/forums/for...atch-for-firms
