Dear all,
I have two firm-level panel datasets; the first includes data from 2008-2010 and the second from 2011-2012. I want to create a panel dataset from 2008-2012. Both datasets do contain a firm ID variable that is consistent within the respective dataset, thus enabling panel analysis over the respective period. However, the firms do not have a consistent ID across datasets; in the first, the ID variable starts from 1, in the second from roughly 1,000,000. At the same time, I know that at least a fraction of firms appears in both datasets, but given the circumstances just described, I cannot use the firmID variable(s) to match the firms across datasets (i.e., I cannot simply append one dataset to the other, sort by firmID year and be done.)
There are four variables (call them A, B, C, D) that appear in both datasets and are time-invariant (founding year of the firm, post code, sector and legal status) (Note: I dropped those that change any of the above over time). Further, there are many variables that are time-variant and appear in both datasets (E, F, G, ..., W). At last, there is also a string variable X that lists the clients of the firm (in numbers=IDs, i.e. e.g. "1234, 2280, 34873"). What I aim to do is to determine an algorithm that does the following:
Match (=assign the same unique ID) if:
A_2010=A_2011, B_2010=B_2011, C_2010=C_2011, D_2010=D_2011,
AND
for variables E,F (I could also use, for example, E,F,G,H) it holds that
E_2010 ϵ {E_2011*0.9 , E_2011*1.1} AND F_2010 ϵ {F_2011*0.9 , F_2011*1.1}
AND
{Clients_2010} must contain at least one element of {Clients_2011}.
AND
If two ore more firms from 2010 meet all above requirements for a particular firm in 2011, then match the one that has a larger overlap in terms of clients.
I see that this matching practice is questionable from a conceptual perspective. However, I am asked to match firms along this way, and therefore I would be very happy if someone knows a way to do that. Perhaps there is even a user-written command/package to carry out such tasks? I have found the command nearmrg (available from ssc: ssc install nearmrg), but it only allows me to specify one variable along which the fuzzy match shall occur, whereas I have three fuzzy-match variables, of which one is not numeric, and also four exact-match variables...
I look forward to your replies!
I have two firm-level panel datasets; the first includes data from 2008-2010 and the second from 2011-2012. I want to create a panel dataset from 2008-2012. Both datasets do contain a firm ID variable that is consistent within the respective dataset, thus enabling panel analysis over the respective period. However, the firms do not have a consistent ID across datasets; in the first, the ID variable starts from 1, in the second from roughly 1,000,000. At the same time, I know that at least a fraction of firms appears in both datasets, but given the circumstances just described, I cannot use the firmID variable(s) to match the firms across datasets (i.e., I cannot simply append one dataset to the other, sort by firmID year and be done.)
There are four variables (call them A, B, C, D) that appear in both datasets and are time-invariant (founding year of the firm, post code, sector and legal status) (Note: I dropped those that change any of the above over time). Further, there are many variables that are time-variant and appear in both datasets (E, F, G, ..., W). At last, there is also a string variable X that lists the clients of the firm (in numbers=IDs, i.e. e.g. "1234, 2280, 34873"). What I aim to do is to determine an algorithm that does the following:
Match (=assign the same unique ID) if:
A_2010=A_2011, B_2010=B_2011, C_2010=C_2011, D_2010=D_2011,
AND
for variables E,F (I could also use, for example, E,F,G,H) it holds that
E_2010 ϵ {E_2011*0.9 , E_2011*1.1} AND F_2010 ϵ {F_2011*0.9 , F_2011*1.1}
AND
{Clients_2010} must contain at least one element of {Clients_2011}.
AND
If two ore more firms from 2010 meet all above requirements for a particular firm in 2011, then match the one that has a larger overlap in terms of clients.
I see that this matching practice is questionable from a conceptual perspective. However, I am asked to match firms along this way, and therefore I would be very happy if someone knows a way to do that. Perhaps there is even a user-written command/package to carry out such tasks? I have found the command nearmrg (available from ssc: ssc install nearmrg), but it only allows me to specify one variable along which the fuzzy match shall occur, whereas I have three fuzzy-match variables, of which one is not numeric, and also four exact-match variables...
I look forward to your replies!

Comment