Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to create new columns for each country observation and count this number of countries?

    Dear all,

    I wanted your help with the following. I am working on a patent project and in the following link (https://1drv.ms/f/s!APdCpn4GgD8tfg) you can find the Excel dataset with the patents of one of the companies, under ‘Sheet1’ Excel sheet.

    Columns:
    column A: pn (patent number. two first characters of each patent number refer to the country the patent was registered in. eg. For AU8425677-A the country code is AU. However, ‘EP’ as country code refers to European patents, so this refers to all country codes under ‘EP_codes’ Excel sheet)
    column B: year
    column C: environ_patents (1 for environmental patents, 0 for non-environmental patents)

    The variables I want to create are the following
    1. Which countries it has registered patents in a year.
    2. The number of countries all patents in a year have been registered in.
    3. Which countries it has registered non-environmental patents in a year
    4.The number of countries non-environmental patents have been registered in a year
    5. Which countries it has registered environmental patents in a year
    6. The number of countries environmental patents have been registered in a year


    The most important variables are the first and second one, since once created, the rest variables will be much easier to be created.
    For example, for the below observations:
    pn year environ_patents
    WO2000077147-A1; BR9001712-U2; BR9001712-Y1 2000 0
    AU9050094-A; ZA9001404-A; ZA9001404-B; BR9003529-T1 2000 0
    AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2 2000 1
    WO2000096237-A1; BE766843-A; FR2135376-A 2000 1
    IT766843-A; FR2135376-A 2001 0
    BE766843-A; WO2001102655-A1; FR2135376-A 2001 1
    I would like the result as below:
    1. Which countries it has registered patents in a year.
    2000: BR AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA DE
    2001: IT FR BE
    1. The number of countries all patents in a year have been registered in.
    2000: 44
    2001: 3
    1. Which countries it has registered non-environmental patents in a year.
    2000: AU ZA BR
    2001: IT FR
    4.The number of countries non-environmental patents have been registered in a year
    2000: 3
    2001: 2
    1. Which countries it has registered environmental patents in a year
    2000: AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA DE
    2001: BE FR
    1. The number of countries environmental patents have been registered in a year
    2000: 43
    2001: 2


    The above variables will ignore ‘WOxxxx’ codes(meaning World Organization), show only identical countries(not duplicates) per row and per year, and in case of ‘EPxxxx’ (stands for environmental patents, which includes European countries the company chose the patent to be registered in), to show the country codes under the ‘EP_codes’ Excel sheet.
    For the first variable, I would like to generate new columns(columns named: country_1, country_2, etc) for each identical country identified in the ‘pn’ column as in the results below -the below are columns that should appear on the right side of the previous table's columns:
    country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 country_10 country_11 country_12 country_13 country_14 country_15 country_16 country_17 country_18 country_19 country_20 country_21 country_22 country_23 country_24 country_25 country_26 country_27 country_28 country_29 country_30 country_31 country_32 country_33 country_34 country_35 country_36 country_37 country_38 country_39 country_40 country_41 country_42 country_43
    BR
    AU ZA BR
    AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA DE
    BE FR
    IT FR
    BE FR
    .Therefore, could you advise on the Stata commands required the get the results above?

    Thank you very much,
    C.
    Last edited by Christoforos Zardis; 24 Oct 2021, 12:41.

  • #2
    I think some structure like this may be easier for later analysis:

    Code:
    clear
    input str105 pn year environ_patents
    "WO2000077147-A1; BR9001712-U2; BR9001712-Y1" 2000 0
    "AU9050094-A; ZA9001404-A; ZA9001404-B; BR9003529-T1" 2000 0
    "AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2" 2000 1
    "WO2000096237-A1; BE766843-A; FR2135376-A" 2000 1
    "IT766843-A; FR2135376-A" 2001 0
    "BE766843-A; WO2001102655-A1; FR2135376-A" 2001 1
    end
    
    save temp01, replace
    
    * Split:
    split pn, gen(tpn) p(;)
    
    * Extract only the first two symbols:
    foreach x of varlist tpn*{
    replace `x' = trim(`x')
    replace `x' = substr(`x', 1, 2)
    }
    
    * Add an unique id and then reshape to get a list of unique countries
    gen uid = _n
    reshape long tpn, i(uid) j(instance)
    levelsof tpn, local(country_list)
    reshape wide
    
    * Create country variables
    foreach co in `country_list'{
        gen country_`co' = ""
        foreach ins of varlist tpn*{
            replace country_`co' = "`co'" if `ins' == "`co'"
        }
    }
    drop tpn* country_WO
    
    list year environ_patents country_AU-country_DE, sep(0)
    Results:
    Code:
         +------------------------------------------------------------------------+
         | year   enviro~s   countr~U   count~BE   count~BR   count~CA   count~DE |
         |------------------------------------------------------------------------|
      1. | 2000          0                               BR                       |
      2. | 2000          0         AU                    BR                       |
      3. | 2000          1         AU                               CA         DE |
      4. | 2000          1                    BE                                  |
      5. | 2001          0                                                        |
      6. | 2001          1                    BE                                  |
         +------------------------------------------------------------------------+
    You will need to fix the EP problem yourself with a simple loop that creates/replaces the EU countries. I (and probably most users here) am not very excited about downloading data from an external link, so I was not able to find out the list is.

    Comment


    • #3
      Hi Ken,

      Many thanks for your help. You are correct, I am attaching here the full dataset split into two txt files('Sheet1-part1 and Sheet1-part2') and the 'EP_codes'..

      I used the codes on the full dataset(Sheet1) and I can see the new columns created corresponding to new identical countries. However would it be possible to advise on the EP issue? Please see attached the 'EP_codes'. In the following example,(copied from above):
      pn year environ_patents
      AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2 2000 1


      Can you advise how can the 'EPxxxx' codes from above be replaced with the 'EP_codes' as below, but to show only identical countries(not duplicates) per row? For example 'DE' will show only once as below(same for 'EP_code' countries which will replace the three 'EPxxxx' codes but which will only appear once and not three times)

      country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 country_10 country_11 country_12 country_13 country_14 country_15 country_16 country_17 country_18 country_19 country_20 country_21 country_22 country_23 country_24 country_25 country_26 country_27 country_28 country_29 country_30 country_31 country_32 country_33 country_34 country_35 country_36 country_37 country_38 country_39 country_40 country_41 country_42
      AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA

      Regarding the six questions from my initial message, would it be possible to advise of the commands I can get the countries (identical ones/not duplicates) for questions: 1,3 and 5 and their number for questions: 2,4 and 6?




      Thank you very much,
      C.
      Attached Files
      Last edited by Christoforos Zardis; 25 Oct 2021, 06:19.

      Comment


      • #4
        Hi Ken,
        Please see below a revised version of my previous post, as I believe this is more explainable: Thank you!


        I used the codes on the full dataset(Sheet1) and I can see the new columns created corresponding to new identical countries. I am also attaching here the full dataset split into two txt files('Sheet1-part1 and Sheet1-part2') and the 'EP_codes'.,as you are correct about external links,.



        However, as a continuation to your codes, would it be possible to advise on the 'EPxxxx' codes (you can find attached the 'EP_codes')? In the following example,(copied from above):
        pn year environ_patents
        AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2 2000 1



        Can you advise how can the 'EPxxxx' codes from above be replaced with the 'EP_codes' as below, but to show only identical countries(not duplicates) per row(as you already did with the existing initial example dataset)? For example 'DE' will show only once as below(same for 'EP_code' countries which will replace the three 'EPxxxx' codes but which will only appear once and not three times)


        country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 country_10 country_11 country_12 country_13 country_14 country_15 country_16 country_17 country_18 country_19 country_20 country_21 country_22 country_23 country_24 country_25 country_26 country_27 country_28 country_29 country_30 country_31 country_32 country_33 country_34 country_35 country_36 country_37 country_38 country_39 country_40 country_41 country_42
        AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA


        Regarding the six questions from my initial message, would it be possible to advise of the commands on how I can get the countries (identical ones/not duplicates) for questions: 1,3 and 5 and their number for questions: 2,4 and 6? Questions 1 and 2 are the most critical ones since then questions 3-6 I guess should be get with a sort of 'environ_patents' column.

        Comment


        • #5
          Please don’t use attachments without good reason. This is explained in detail in the FAQ Advice all posters are asked to read before posting. Your thread is already asking a lot and expecting that people open attachments with other software and then work out what that might mean in Stata is asking even more.

          Comment


          • #6
            Hi Nick, understood - thanks for explaining. I will only be providing small example datasets as my initial message, with which I still get the answers.

            Comment


            • #7
              This revised version should be able to convert the EP into the list of the countries in EP:

              Code:
              clear
              input str105 pn year environ_patents
              "WO2000077147-A1; BR9001712-U2; BR9001712-Y1" 2000 0
              "AU9050094-A; ZA9001404-A; ZA9001404-B; BR9003529-T1" 2000 0
              "AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2" 2000 1
              "WO2000096237-A1; BE766843-A; FR2135376-A" 2000 1
              "IT766843-A; FR2135376-A" 2001 0
              "BE766843-A; WO2001102655-A1; FR2135376-A" 2001 1
              end
              
              save temp01, replace
              
              * Split:
              split pn, gen(tpn) p(;)
              
              * Extract only the first two symbols:
              foreach x of varlist tpn*{
              replace `x' = trim(`x')
              replace `x' = substr(`x', 1, 2)
              }
              
              * Add an unique id and then reshape to get a list of unique countries
              gen uid = _n
              reshape long tpn, i(uid) j(instance)
              levelsof tpn, local(country_list)
              reshape wide
              
              * Create country variables
              foreach co in `country_list'{
                  gen country_`co' = ""
                  foreach ins of varlist tpn*{
                      replace country_`co' = "`co'" if `ins' == "`co'"
                  }
              }
              drop tpn* country_WO
              
              * Set up the list of coutnries in EP:
              global eplist AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS ///
                            IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
              
              foreach x of global eplist{
                  capture gen country_`x' = ""
                  replace country_`x' = "`x'" if country_EP == "EP"
              }
              
              order country_*, alpha last
              drop country_EP
              And to be honest I didn't even realize that you're making this data set in order to answer those questions. For those questions, this format you requested is not the best way to get at them. A simply long format with repeated years is way better. My major issue is: for those questions, I'm very confused if you want the output to look that way, or if they are just data summary. This draft codes may be helpful, but you'd need to check the answers:

              Code:
              * Change to long form
              reshape long country_, i(uid) j(ins, string)
              drop if country_ == ""
              drop pn ins
              * Generate year-country tag
              egen yr_co = tag(year country_)
              * Generate year-country-env tag
              egen yr_co_en = tag(year country_ environ_patents)
              
              
              * Q1
              list country if year == 2000 & yr_co == 1, sep(0)
              list country if year == 2001 & yr_co == 1, sep(0)
              
              * Q2
              tab year yr_co if yr_co == 1
              
              * Q3
              list country if year == 2000 & yr_co_en == 1 & environ_patents == 0, sep(0)
              list country if year == 2001 & yr_co_en == 1 & environ_patents == 0, sep(0)
              
              * Q4
              tab year yr_co_en if environ_patents == 0 & yr_co_en == 1
              
              * Q5
              list country if year == 2000 & yr_co_en == 1 & environ_patents == 1, sep(0)
              list country if year == 2001 & yr_co_en == 1 & environ_patents == 1, sep(0)
              
              * Q6
              tab year yr_co_en if environ_patents == 1 & yr_co_en == 1

              Comment


              • #8
                Hi Ken,

                Many thanks, the codes return the results I wanted. I know there are more straight forward ways to get them, I just wanted this format as it will help in the future to see which of these countries are developing and underdeveloped.

                Thank you again
                C.

                Comment

                Working...
                X