Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to rearrange the values of string variables according to a criterion?

    Hello,

    I want to rearrange the values of 314 string variables (containing 107 to 166 country names) so that each specific value of each variable is in the same observation line of all other variables for that specific value. The string variables represent different calendar dates on which a country is present.

    Each country is intended to be rearranged in one line, either present or absent in each of the string variables. I am using Stata SE 14.2.

    The data set is a list of country names present or absent in calendar dates. On the first date, 107 countries are present in full data (9 countries in sample data below). On each of the next dates, some additional countries are also present, while some of the countries present on previous dates might be absent. As such, each of the string variables represents a date on which each country is present or absent. Within each date, country names are sorted alphabetically. Variable names include the dates, e.g., the variable named countries20200417 shows the countries present on 17 April 2020.

    The objective is to generate (a) a variable that shows every country present on any date (a list of all countries), (b) a variable that shows the earliest date each country was present, and (c) a set of variables that show dates on which each country was absent (after their first presence). In the second data set below, these variables are named (a) countries_all, (b) date_present_first, and (c) date_absent_1 date_absent_2

    The question is how to go from the original data set to the described wanted data set. Could not find commands that rearrange the values of string variables as such.

    A sample of the original data is as follows.

    Code:
     
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str22(countries20200417 countries20200424) str24 countries20220915
    "Afghanistan"   ""                     "Afghanistan"           
    "Albania"       ""                     ""                      
    "Canada"        "Canada"               "Bosnia and Herzegovina"
    "Chile"         "Chile"                "Botswana"              
    "Colombia"      "Colombia"             "Brazil"                
    "Costa Rica"    "Congo (Brazzaville)"  "Bulgaria"              
    "Cote d'Ivoire" "Congo (Kinshasa)"     "Burkina Faso"          
    "Croatia"       "Costa Rica"           "Burundi"               
    "Cuba"          "Cote d'Ivoire"        "Cabo Verde"            
    ""              "United Arab Emirates" "Nigeria"               
    ""              "United Kingdom"       "North Macedonia"       
    ""              "Uruguay"              "Norway"                
    ""              "Uzbekistan"           "Oman"                  
    ""              "Venezuela"            "Pakistan"              
    ""              "Vietnam"              ""                      
    ""              ""                     "Papua New Guinea"      
    ""              ""                     "Paraguay"              
    ""              ""                     "US"                    
    ""              ""                     "Uganda"                
    ""              ""                     "Ukraine"               
    ""              ""                     "United Arab Emirates"  
    ""              ""                     "United Kingdom"        
    ""              ""                     "Uruguay"               
    ""              ""                     "Uzbekistan"            
    ""              ""                     "Venezuela"             
    ""              ""                     "Vietnam"               
    ""              ""                     "Yemen"                 
    ""              ""                     "Zambia"                
    ""              ""                     "Zimbabwe"              
    end

    The wanted data for the sample data is as follows.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str13 countries20200417 str20 countries20200424 str22(countries20220915 countries_all) long(date_present_first date_absent_1 date_absent_2)
    "Afghanistan"   ""                     "Afghanistan"            "Afghanistan"            20200417 20200424        .
    "Albania"       ""                     ""                       "Albania"                20200417 20200424 20220915
    ""              ""                     "Bosnia and Herzegovina" "Bosnia and Herzegovina" 20220915        .        .
    ""              ""                     "Botswana"               "Botswana"               20220915        .        .
    ""              ""                     "Brazil"                 "Brazil"                 20220915        .        .
    ""              ""                     "Bulgaria"               "Bulgaria"               20220915        .        .
    ""              ""                     "Burkina Faso"           "Burkina Faso"           20220915        .        .
    ""              ""                     "Burundi"                "Burundi"                20220915        .        .
    ""              ""                     "Cabo Verde"             "Cabo Verde"             20220915        .        .
    "Canada"        "Canada"               ""                       "Canada"                 20200417 20220915        .
    "Chile"         "Chile"                ""                       "Chile"                  20200417 20220915        .
    "Colombia"      "Colombia"             ""                       "Colombia"               20200417 20220915        .
    ""              "Congo (Brazzaville)"  ""                       "Congo (Brazzaville)"    20200424 20220915        .
    ""              "Congo (Kinshasa)"     ""                       "Congo (Kinshasa)"       20200424 20220915        .
    "Costa Rica"    "Costa Rica"           ""                       "Costa Rica"             20200417 20220915        .
    "Cote d'Ivoire" "Cote d'Ivoire"        ""                       "Cote d'Ivoire"          20200417 20220915        .
    "Croatia"       ""                     ""                       "Croatia"                20200417 20200424 20220915
    "Cuba"          ""                     ""                       "Cuba"                   20200417 20200424 20220915
    ""              ""                     "Nigeria"                "Nigeria"                20220915        .        .
    ""              ""                     "North Macedonia"        "North Macedonia"        20220915        .        .
    ""              ""                     "Norway"                 "Norway"                 20220915        .        .
    ""              ""                     "Oman"                   "Oman"                   20220915        .        .
    ""              ""                     "Pakistan"               "Pakistan"               20220915        .        .
    ""              ""                     "Papua New Guinea"       "Papua New Guinea"       20220915        .        .
    ""              ""                     "Paraguay"               "Paraguay"               20220915        .        .
    ""              ""                     "Uganda"                 "Uganda"                 20220915        .        .
    ""              ""                     "Ukraine"                "Ukraine"                20220915        .        .
    ""              "United Arab Emirates" "United Arab Emirates"   "United Arab Emirates"   20200424        .        .
    ""              "United Kingdom"       "United Kingdom"         "United Kingdom"         20200424        .        .
    ""              "Uruguay"              "Uruguay"                "Uruguay"                20200424        .        .
    ""              ""                     "US"                     "US"                     20220915        .        .
    ""              "Uzbekistan"           "Uzbekistan"             "Uzbekistan"             20200424        .        .
    ""              "Venezuela"            "Venezuela"              "Venezuela"              20200424        .        .
    ""              "Vietnam"              "Vietnam"                "Vietnam"                20200424        .        .
    ""              ""                     "Yemen"                  "Yemen"                  20220915        .        .
    ""              ""                     "Zambia"                 "Zambia"                 20220915        .        .
    ""              ""                     "Zimbabwe"               "Zimbabwe"               20220915        .        .
    end
    Thank you,
    Farshad

  • #2
    You are starting with a data set that is very dysfunctionally organized for purposes of Stata data management and analysis and you are looking to create another one that is nearly as dysfunctional. I recommend that you go for a different, more usable, layout in the final result:

    Code:
    gen long obs_no = _n
    reshape long countries, i(obs_no) j(str_date) string
    gen int real_date = daily(str_date, "YMD")
    assert missing(str_date) == missing(real_date)
    format real_date %td
    drop str_date
    
    rename countries country
    drop if missing(country)
    fillin country real_date
    rename _fillin absent
    drop obs_no
    
    by country (real_date), sort: egen date_present_first = min(cond(!absent, real_date, .))
    format date_present_first %td
    
    by country (real_date): gen absent_dates = ///
        cond(absent, string(real_date, "%tdCCYYNNDD"), "") if _n == 1
    by country (real_date): replace absent_dates = ///
        absent_dates[_n-1] + cond(absent, " " + string(real_date, "%tdCCYYNNDD"), "") ///
        if _n > 1
    by country (real_date): replace absent_dates = absent_dates[_N]
    split absent_dates, gen(date_absent)
    drop absent_dates
    isid country real_date, sort
    In fact, I suspect that for most, if not all, of what you will want to do with this date, you don't even need the date_present_first and date_absent* variables because the information is readily obtainable from the variables real_date and absent, but they are there as a possibly convenient alternative.

    If you really want to go back to the layout you show in your example, you can -reshape wide-. But I think you will just regret it as you proceed with further work on this data set. The layout you are looking for is suitable for working with spreadsheets. But Stata is not a spreadsheet, and trying to work with it as if it were usually leads to frustration and errors.
    Last edited by Clyde Schechter; 27 Oct 2022, 13:01.

    Comment


    • #3
      Thank you

      Comment

      Working...
      X