Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Re-elaboration of string variable

    Hi all,

    my data look as follows:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long docdb_family_id strL new_a2 str46 person_ctrycode
     569328 "[1984 1983 1981 1982]"      "['  ']"     
     574660 "[1989 1994 1987 1990 1991]" "['  ']"     
    1187498 "[1998 1999 1996]"           "['RU']"     
    1226468 "[1996 1992 1994 1993 1997]" "['DE']"     
    1236571 "[1991 1993 1990 1996]"      "['US']"     
    1239098 "[1995 1992 1993]"           "['US']"     
    1239277 "[1996 1993 1992]"           "['IL']"     
    1239483 "[1991 1994 1996 1992]"      "['US']"     
    1239622 "[1995 1997 1992 1994 1990]" "['FI']"     
    1239624 "[1992 1993 1997 1994]"      "['US']"     
    1239749 "[1995 1993 1992 1997]"      "['JP']"     
    1334477 "[1985]"                     "['  ']"     
    1340405 "[1992 1994 1995]"           "['US']"     
    1340418 "[1997 1994 1991 1995]"      "['GB']"     
    1340462 "[1993 1995 1994]"           "['US']"     
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"
    1340485 "[1993 1994 1995]"           "['US']"     
    1340488 "[1994 1993 1995 1996]"      "['US' '  ']"
    1340508 "[1997 1995 1993 1994]"      "['US']"     
    1340519 "[1995 1994 1993 1996]"      "['US']"     
    1340541 ""                           "['US']"     
    1340647 "[1998 1996 1994 1995]"      "['US' 'CA']"
    1340659 "[1995 1997 1996]"           "['CA' '  ']"
    1340673 "[1997 1995 1994 1996]"      "['FI' 'US']"
    1340826 "[1995 1997]"                "['US']"     
    1341006 "[1996 1995 1997]"           "['US']"     
    1341170 "[1997 1996 1998]"           "['US']"     
    1341175 "[1998 1997]"                "['US']"     
    1341188 "[1998 1996 2000]"           "['  ']"     
    1578920 "[1972 1971 1970]"           "['DE' 'NL']"
    1763337 "[1997 1995 1989]"           "['US' '  ']"
    2185207 "[1972]"                     "['  ']"     
    2267654 "[1972]"                     "['  ']"     
    2410050 "[1972]"                     "['  ']"     
    2456718 "[1972]"                     "['  ']"     
    2463680 "[1995 1993 1994 1997]"      "['RU']"     
    2588926 "[1972]"                     "['  ']"     
    2590654 "[1972]"                     "['  ']"     
    2856998 "[1972]"                     "['  ']"     
    3411930 "[1996 1993 1994]"           "['JP']"     
    3459794 "[1972]"                     "['  ']"     
    3459918 "[1998]"                     "['JP']"     
    3460123 "[1972]"                     "['  ']"     
    3460237 "[1995]"                     "['JP']"     
    3460351 "[1972]"                     "['  ']"     
    3460409 "[1972]"                     "['  ']"     
    3460621 "[2008 2004 2005 2003]"      "['ZW']"     
    3460636 "[1989]"                     "['GB']"     
    3460679 "[1992]"                     "['IT']"     
    3460785 "[2001]"                     "['AR']"     
    3460796 "[2001 2002 2003]"           "['AR']"     
    3460811 "[2001 2002 2004 2003]"      "['AR']"     
    3460812 "[2005 2001 2004 2002]"      "['AR']"     
    3460829 "[2002 2001 2005]"           "['AR']"     
    3460972 "[1998 2000]"                "['AR']"     
    3460979 "[2000 1998 1999]"           "['AR']"     
    3460980 "[2000]"                     "['AR']"     
    3460981 "[2000]"                     "['AR']"     
    3461002 "[1998 2000 2002 1999]"      "['  ']"     
    3461010 "[1998 1999 2000]"           "['AR' '  ']"
    3461018 "[2002 1999 1998]"           "['AR']"     
    3461040 "[1999 2001]"                "['US' 'AR']"
    3461044 "[1999]"                     "['AR']"     
    3461068 "[2000]"                     "['AR']"     
    3461084 "[1998 1999 2000]"           "['JP']"     
    3461088 "[1999]"                     "['AR']"     
    3461099 "[2001]"                     "['AR']"     
    3461127 "[2003 2001 2004]"           "['AR']"     
    3461128 "[2001 2000 2003 1999]"      "['AR' '  ']"
    3461146 "[2001]"                     "['AR']"     
    3461172 "[2002 2000]"                "['AR' '  ']"
    3461173 "[2002 2001]"                "['AR']"     
    3461192 "[2003 2002]"                "['AR']"     
    3461205 "[2001 2005]"                "['AR']"     
    3461208 "[2001]"                     "['  ' 'AR']"
    3461508 "[1970]"                     "['  ']"     
    3461552 "[1971]"                     "['  ']"     
    3461559 "[1973 1971]"                "['  ']"     
    3461677 "[1972]"                     "['  ']"     
    3461734 "[1973]"                     "['  ']"     
    3462350 "[1974]"                     "['  ']"     
    3463151 "[1975]"                     "['  ']"     
    3463640 "[1974 1973 1976 1975]"      "['  ']"     
    3463716 "[1977 1974 1973]"           "['  ']"     
    3464348 "[1975 1977 1976 1974 1978]" "['  ']"     
    3464855 "[1975 1976 1978 1974]"      "['AR']"     
    3466402 "[1974 1978]"                "['AR']"     
    3466535 "[1978 1979 1977]"           "['AR']"     
    3467766 "[1979 1978]"                "['  ']"     
    3467767 "[1978 1980]"                "['AR']"     
    3467833 "[1978 1977 1980 1981]"      "['AR']"     
    3467977 "[1977 1978 1976]"           "['AR']"     
    3467987 "[1977 1979 1985 1976 1978]" "['  ']"     
    3468001 "[1979 1978]"                "['  ']"     
    3468888 "[1978 1976]"                "['  ']"     
    3469170 "[1976 1977 1978]"           "['  ']"     
    3469171 "[1978 1976]"                "['AR']"     
    3469272 "[1976 1978]"                "['AR']"     
    3470490 "[1981 1979 1977 1982]"      "['AR']"     
    3471419 "[1978 1979 1980]"           "['AR']"     
    end
    I would like however to split first the string person_ctrycode and, whenever more than a country code is present, to put them into two separate lines. For instance,

    Code:
    input long docdb_family_id strL new_a2 str46 person_ctrycode
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"
    should become:

    Code:
    input long docdb_family_id strL new_a2 str46 person_ctrycode  person_ctrycode2 
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"                       "SE"
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"                       "US"
    and so on.

    After this, in a separate database, I should repeat the exercise for the variable new_a2. Following the example above, this

    Code:
    input long docdb_family_id strL new_a2 str46 person_ctrycode
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"
    should become:

    Code:
    input long docdb_family_id strL new_a2 str46 person_ctrycode year2
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"                       1994
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"                       1993
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"                       1995
    1340471 "[1994 1993 1995 1996]"      "['SE' 'US']"                       1996
    Thank you

  • #2
    Code:
    replace person_ctrycode = ustrregexra(person_ctrycode, "[\[\]']", "")
    gen n_countries = wordcount(person_ctrycode)
    expand = n_countries, gen(new)
    gen nr = new + 1
    gen person_ctrycode2 = word(person_ctrycode, nr) if n_countries != 0
    The expansion with years should work analogously. And I was not sure how you would like to treat the missing values.

    All best.

    Comment


    • #3
      Thank you for the reply. Missing values should be dropped

      Comment


      • #4
        Adam Ansel what if I would like to replicate the same exercise with the oecd_fields variable here:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long docdb_family_id str72 oecd_fields
        1340541 "[' Medical and Health Sciences']"                    
        3483406 "[' Natural Sciences']"                               
        3498148 "[' Engineering and Technology']"                     
        3499968 "[' Natural Sciences']"                               
        3500677 "[' Medical and Health Sciences']"                    
        3505151 "[' Engineering and Technology']"                     
        3506722 "[' Medical and Health Sciences']"                    
        3509337 "[' Natural Sciences']"                               
        3513514 "[' Medical and Health Sciences']"                    
        3515876 "[' Engineering and Technology']"                     
        3518628 "[' Natural Sciences']"                               
        3519452 "[' Natural Sciences' ' Medical and Health Sciences']"
        3520820 "[' Natural Sciences']"                               
        3522546 "[' Medical and Health Sciences' ' Natural Sciences']"
        3542434 "[' Natural Sciences']"                               
        3552234 "[' Engineering and Technology']"                     
        3553805 "[' Natural Sciences']"                               
        3636418 "[' Natural Sciences' ' Medical and Health Sciences']"
        3673165 "[' Engineering and Technology']"                     
        3680127 "[' Natural Sciences']"                               
        3688953 "[' Medical and Health Sciences' ' Natural Sciences']"
        3689983 "[' Natural Sciences']"                               
        3700898 "[' Engineering and Technology']"                     
        3768731 "[' Natural Sciences']"                               
        3769689 "[' Natural Sciences']"                               
        3769871 "[' Natural Sciences']"                               
        3770022 "[' Medical and Health Sciences' ' Natural Sciences']"
        3770463 "[' Medical and Health Sciences']"                    
        3770652 "[' Medical and Health Sciences']"                    
        3771195 "[' Natural Sciences']"                               
        3771350 "[' Medical and Health Sciences']"                    
        3771382 "[' Natural Sciences']"                               
        3771404 "[' Natural Sciences']"                               
        3771425 "[' Medical and Health Sciences']"                    
        3771432 "[' Natural Sciences']"                               
        3771495 "[' Medical and Health Sciences']"                    
        3771604 "[' Natural Sciences']"                               
        3772167 "[' Engineering and Technology']"                     
        3772274 "[' Engineering and Technology']"                     
        3772454 "[' Medical and Health Sciences']"                    
        3772510 "[' Engineering and Technology']"                     
        3772810 "[' Natural Sciences']"                               
        3772813 "[' Medical and Health Sciences']"                    
        3772940 "[' Engineering and Technology']"                     
        3773305 "[' Natural Sciences']"                               
        3773311 "[' Natural Sciences']"                               
        3773565 "[' Natural Sciences' ' Medical and Health Sciences']"
        3773807 "[' Natural Sciences']"                               
        3774869 "[' Natural Sciences']"                               
        3775109 "[' Natural Sciences']"                               
        3775561 `""[' Natural Sciences' ' Agricultural Sciences'"'    
        3776794 "[' Medical and Health Sciences']"                    
        3777101 "[' Natural Sciences']"                               
        3779413 "[' Engineering and Technology']"                     
        3780242 "[' Natural Sciences' ' Humanities']"                 
        3783526 "[' Engineering and Technology']"                     
        3783583 "[' Medical and Health Sciences']"                    
        3784332 "[' Natural Sciences']"                               
        3784469 "[' Engineering and Technology']"                     
        3786016 "[' Medical and Health Sciences']"                    
        3786144 "[' Agricultural Sciences']"                          
        3787179 "[' Natural Sciences']"                               
        3788008 "[' Medical and Health Sciences']"                    
        3790639 "[' Natural Sciences']"                               
        3790670 "[' Natural Sciences']"                               
        3792458 "[' Natural Sciences' ' Medical and Health Sciences']"
        3794389 "[' Natural Sciences']"                               
        3795015 "[' Natural Sciences']"                               
        3797904 "[' Medical and Health Sciences']"                    
        3798105 "[' Medical and Health Sciences']"                    
        3799670 "[' Natural Sciences']"                               
        3800683 "[' Natural Sciences']"                               
        3802132 "[' Medical and Health Sciences' ' Natural Sciences']"
        3802281 "[' Natural Sciences' ' Engineering and Technology']" 
        3802595 "[' Natural Sciences']"                               
        3803286 "[' Natural Sciences']"                               
        3803326 "[' Engineering and Technology']"                     
        3803551 "[' Natural Sciences']"                               
        3803728 "[' Natural Sciences']"                               
        3804321 "[' Natural Sciences']"                               
        3806474 "[' Natural Sciences']"                               
        3808232 "[' Engineering and Technology']"                     
        3808684 "[' Natural Sciences']"                               
        3809416 "[' Engineering and Technology']"                     
        3810055 "[' Natural Sciences']"                               
        3810064 "[' Natural Sciences']"                               
        3810114 "[' Natural Sciences' ' Medical and Health Sciences']"
        3810403 "[' Natural Sciences']"                               
        3810855 "[' Natural Sciences']"                               
        3810974 "[' Natural Sciences']"                               
        3811389 "[' Engineering and Technology']"                     
        3811482 "[' Agricultural Sciences']"                          
        3812435 "[' Natural Sciences']"                               
        3812594 "[' Natural Sciences']"                               
        3813073 "[' Humanities']"                                     
        3813963 "[' Natural Sciences']"                               
        3814849 "[' Natural Sciences']"                               
        3814850 "[' Natural Sciences']"                               
        3815498 "[' Engineering and Technology']"                     
        3815502 "[' Natural Sciences']"                               
        end
        so that for instance
        Code:
        3810114 "[' Natural Sciences' ' Medical and Health Sciences']"

        becomes

        Code:
        3810114 "Natural Sciences"
        3810114 "Medical and Health Sciences"

        Comment


        • #5
          In that case, you need a bit more cleaning:

          Code:
          replace oecd_fields = ustrregexra(oecd_fields, `"[\[\]\"]"', "") // remove unnecessary signs
          replace oecd_fields = ustrregexra(oecd_fields, " ", "_") // replace blanks
          replace oecd_fields = ustrregexra(oecd_fields, "'_'", " ") // leave one blank for the word count
          gen n_fields = wordcount(oecd_fields)
          expand = n_fields, gen(new)
          gen nr = new + 1
          gen oecd_fields2 = word(oecd_fields, nr) if n_fields != 0
          replace oecd_fields2 = ustrregexra(oecd_fields2, "_", " ")
          replace oecd_fields2 = ustrregexra(oecd_fields2, "'", " ")
          HTH!

          Comment

          Working...
          X