Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract country names from affiliations

    Hi,

    I have a dataset of about 1000 articles with variables such as id, title, abstract and affiliation. I was unable to get a dataex due to an error:
    Code:
    input strL affiliation
    data width (7267 chars) exceeds max linesize. Try specifying fewer variables
    r(1000);
    I have to separate the affiliation details (e.g. school, department, email), street address and country names into different columns, but my main interest is in the country names for each row. Is it possible to do so using Stata 16? Here are first 2 rows.

    Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and University Department of Infectious and Tropical Diseases, University of Brescia and and ASST Spedali Civili, Brescia, Italy. and Department of Molecular and Translational Medicine and Clinical Chemistry and Laboratory ASST Spedali Civili, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy; [email protected]. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy.
    Division of Allergy, Pulmonary, and Critical Care Medicine, Vanderbilt University and Medical Center, Nashville, TN, USA. Electronic address: and [email protected]. and Vanderbilt Center for Biomedical Ethics and Society, Vanderbilt University and Medical Center, Nashville, TN, USA. and Office of Emergency Care Research, National Institute of Neurological Disorders and and Stroke, Division of Clinical Research, National Institutes of Health, and Bethesda, MD, USA.
    Last edited by Sonnen Blume; 26 Jun 2022, 23:32.

  • #2
    The issue here is that you should have an idea about the geography.
    The last two columns produced by -split- van be helpful.
    Code:
    . split var1 , p()
    variables created as string: 
    var11    var19    var117   var125   var133   var141   var149   var157   var165   var173   var181   var189   var197   var1105
    var12    var110   var118   var126   var134   var142   var150   var158   var166   var174   var182   var190   var198   var1106
    var13    var111   var119   var127   var135   var143   var151   var159   var167   var175   var183   var191   var199   var1107
    var14    var112   var120   var128   var136   var144   var152   var160   var168   var176   var184   var192   var1100  var1108
    var15    var113   var121   var129   var137   var145   var153   var161   var169   var177   var185   var193   var1101  var1109
    var16    var114   var122   var130   var138   var146   var154   var162   var170   var178   var186   var194   var1102  var1110
    var17    var115   var123   var131   var139   var147   var155   var163   var171   var179   var187   var195   var1103
    var18    var116   var124   var132   var140   var148   var156   var164   var172   var180   var188   var196   var1104
    
    . list var1109 var1110
    
         +--------------------+
         |  var1109   var1110 |
         |--------------------|
      1. | Brescia,    Italy. |
      2. |                    |
         +--------------------+
    
    .
    I must confess that Brescia was really easy for me to spot: it's a nice town about 100 Km north-east from Milan (on the A4 motorway that links Milan to Venice) . In addition, some years ago I was involved in two projects with physicians working at Spedali Civili, which is the teaching hospital of the local School of Medicine,
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Other than Carlo's advice, I have some ideas to share. In #1, the first obs contains 7 authors, and the second obs has 3 authors. Authors may come from different countries, so I think it's better to generate countries for all authors. I'm not sure if other observations have similar patterns, but at least the two lines show that a country name appears after ", " and ends with "." or ";". So my idea is to extract country names for all authors and then split them into multiple variables.

      Code:
      clear
      insobs 2
      
      gen affiliation = "Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and University Department of Infectious and Tropical Diseases, University of Brescia and and ASST Spedali Civili, Brescia, Italy. and Department of Molecular and Translational Medicine and Clinical Chemistry and Laboratory ASST Spedali Civili, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy; [email protected]. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy." in 1
      
      replace affiliation = "Division of Allergy, Pulmonary, and Critical Care Medicine, Vanderbilt University and Medical Center, Nashville, TN, USA. Electronic address: and [email protected]. and Vanderbilt Center for Biomedical Ethics and Society, Vanderbilt University and Medical Center, Nashville, TN, USA. and Office of Emergency Care Research, National Institute of Neurological Disorders and and Stroke, Division of Clinical Research, National Institutes of Health, and Bethesda, MD, USA." in 2
      
      * START HERE
      generate country = ustrregexra(affiliation, ".*?,\s(\w+)[\.|;]", " $1 ")
      split country
      
      . list country*
      
           +-------------------------------------------------------------------------+
        1. |                                           country | country1 | country2 |
           |  Italy  Italy  Italy  Italy  Italy  Italy  Italy  |    Italy |    Italy |
           |-------------------------------------------------------------------------|
           |  country3   |   country4   |   country5   |   country6   |   country7   |
           |     Italy   |      Italy   |      Italy   |      Italy   |      Italy   |
           +-------------------------------------------------------------------------+
      
           +-------------------------------------------------------------------------+
        2. |                                           country | country1 | country2 |
           |                                    USA  USA  USA  |      USA |      USA |
           |-------------------------------------------------------------------------|
           |  country3   |   country4   |   country5   |   country6   |   country7   |
           |       USA   |              |              |              |              |
           +-------------------------------------------------------------------------+

      Comment


      • #4
        Originally posted by Carlo Lazzaro View Post
        The issue here is that you should have an idea about the geography.
        The last two columns produced by -split- van be helpful.
        Code:
        . split var1 , p()
        variables created as string:
        var11 var19 var117 var125 var133 var141 var149 var157 var165 var173 var181 var189 var197 var1105
        var12 var110 var118 var126 var134 var142 var150 var158 var166 var174 var182 var190 var198 var1106
        var13 var111 var119 var127 var135 var143 var151 var159 var167 var175 var183 var191 var199 var1107
        var14 var112 var120 var128 var136 var144 var152 var160 var168 var176 var184 var192 var1100 var1108
        var15 var113 var121 var129 var137 var145 var153 var161 var169 var177 var185 var193 var1101 var1109
        var16 var114 var122 var130 var138 var146 var154 var162 var170 var178 var186 var194 var1102 var1110
        var17 var115 var123 var131 var139 var147 var155 var163 var171 var179 var187 var195 var1103
        var18 var116 var124 var132 var140 var148 var156 var164 var172 var180 var188 var196 var1104
        
        . list var1109 var1110
        
        +--------------------+
        | var1109 var1110 |
        |--------------------|
        1. | Brescia, Italy. |
        2. | |
        +--------------------+
        
        .
        I must confess that Brescia was really easy for me to spot: it's a nice town about 100 Km north-east from Milan (on the A4 motorway that links Milan to Venice) . In addition, some years ago I was involved in two projects with physicians working at Spedali Civili, which is the teaching hospital of the local School of Medicine,
        Thank you Carlo for the code. Nice co-incidence about Brescia!

        Comment


        • #5
          Originally posted by Fei Wang View Post
          Other than Carlo's advice, I have some ideas to share. In #1, the first obs contains 7 authors, and the second obs has 3 authors. Authors may come from different countries, so I think it's better to generate countries for all authors. I'm not sure if other observations have similar patterns, but at least the two lines show that a country name appears after ", " and ends with "." or ";". So my idea is to extract country names for all authors and then split them into multiple variables.

          Code:
          clear
          insobs 2
          
          gen affiliation = "Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and University Department of Infectious and Tropical Diseases, University of Brescia and and ASST Spedali Civili, Brescia, Italy. and Department of Molecular and Translational Medicine and Clinical Chemistry and Laboratory ASST Spedali Civili, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy; [email protected]. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy." in 1
          
          replace affiliation = "Division of Allergy, Pulmonary, and Critical Care Medicine, Vanderbilt University and Medical Center, Nashville, TN, USA. Electronic address: and [email protected]. and Vanderbilt Center for Biomedical Ethics and Society, Vanderbilt University and Medical Center, Nashville, TN, USA. and Office of Emergency Care Research, National Institute of Neurological Disorders and and Stroke, Division of Clinical Research, National Institutes of Health, and Bethesda, MD, USA." in 2
          
          * START HERE
          generate country = ustrregexra(affiliation, ".*?,\s(\w+)[\.|;]", " $1 ")
          split country
          
          . list country*
          
          +-------------------------------------------------------------------------+
          1. | country | country1 | country2 |
          | Italy Italy Italy Italy Italy Italy Italy | Italy | Italy |
          |-------------------------------------------------------------------------|
          | country3 | country4 | country5 | country6 | country7 |
          | Italy | Italy | Italy | Italy | Italy |
          +-------------------------------------------------------------------------+
          
          +-------------------------------------------------------------------------+
          2. | country | country1 | country2 |
          | USA USA USA | USA | USA |
          |-------------------------------------------------------------------------|
          | country3 | country4 | country5 | country6 | country7 |
          | USA | | | | |
          +-------------------------------------------------------------------------+
          Hi Fei, the code worked like Magic! Thank you so much. It makes me curious, could you please tell me a little about how this part works:
          ".*?,\s(\w+)[\.|;]", " $1 "

          Comment


          • #6
            Sonnen, it's an example of regular expression telling Stata to search through a string for particular patterns. ".*?,\s(\w+)[\.|;]" defines a pattern: any set of characters (.*?) followed by a comma (,), a space (\s), and a set of letters and numbers (\w+), ending with a period or a semicolon ([\.|;]). " $1 " tells Stata to extract the part of (\w+) from any matched patterns and place a leading and a trailing space around it. You may find online many tutorials of regular expression and it's really worth learning comprehensively.

            Comment

            Working...
            X