I am having trouble transforming a large dataset into the variables I need to use in my Master thesis and would like to request help as Googling has not lead to the right solution yet.
My variable is the percentage of a pharmaceutical firm's total drugs that have a patent awarded between 9 and 13 years ago in the given year in my time frame (2003 - 2017).
The dataset consists of 119k patents. The datasample for specific drugs I need is smaller, but still has an N of 9.000. Now, for a given year, I have a column with the firm name, the drug name and its patent year awarded. I have calculated the difference between the given year and the patent year, coded it with 0 if it is not between 9 and 13 years in 2003 and a 1 if it is. This all works fine, but I have a problem with the next step.
The number of firms in the sample is very large, and as there are 14 years, I can not manually calculate the variable for every firm. So I am looking for excel or stata to do this for me. I believe this might be possible with a formula that calculates the percentage per firm, but I do not know how to have excel or stata include only the percentage per every individual firm. I believe I could create a formula that includes only the number IF the value in the other column is for instance Merck. But for this sample I can not manually create this formula for every name in the dataset. So I hope that there is a function to calculate the percentage of the combined drugs, for every individual firm.
Would you have any suggestions for tackling this issue?
I hope my description of the issue was clear enough to comprehend.
My variable is the percentage of a pharmaceutical firm's total drugs that have a patent awarded between 9 and 13 years ago in the given year in my time frame (2003 - 2017).
The dataset consists of 119k patents. The datasample for specific drugs I need is smaller, but still has an N of 9.000. Now, for a given year, I have a column with the firm name, the drug name and its patent year awarded. I have calculated the difference between the given year and the patent year, coded it with 0 if it is not between 9 and 13 years in 2003 and a 1 if it is. This all works fine, but I have a problem with the next step.
The number of firms in the sample is very large, and as there are 14 years, I can not manually calculate the variable for every firm. So I am looking for excel or stata to do this for me. I believe this might be possible with a formula that calculates the percentage per firm, but I do not know how to have excel or stata include only the percentage per every individual firm. I believe I could create a formula that includes only the number IF the value in the other column is for instance Merck. But for this sample I can not manually create this formula for every name in the dataset. So I hope that there is a function to calculate the percentage of the combined drugs, for every individual firm.
Would you have any suggestions for tackling this issue?
I hope my description of the issue was clear enough to comprehend.
Comment