Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate new variables with multiple for and if conditions

    Hello all!
    I'm fairly new to Stata. I have a bunch of farming data for each crop. I have 12 crops and 10 different pieces of information for each crop. So I have a variables crop_name_'name' like this:

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 crop_name_1 str17(crop_name_2 crop_name_3 crop_name_4 crop_name_5)
    "Cow Peas" "Pigeon Peas" "Green Gram" "Maize" "Nyota Nyayo Beans"
    "Cow Peas" "Green Gram" "Millet" "" ""
    "Green Gram" "Maize" "Sorghum" "" ""
    "Green Gram" "" "" "" ""
    "Green Gram" "Maize" "" "" ""
    "Cow Peas" "Green Gram" "" "" ""
    "Green Gram" "Maize" "" "" ""
    "Cow Peas" "Green Gram" "" "" ""
    "Cow Peas" "Green Gram" "Millet" "" ""

    Now, for each crop, I have corresponding information about various things (type of seed, amount harvested etc.) Each variable about crop information is of a different type. Here is what data for crop_name_1 looks like:

    "Less than half of the land used of cultivation" "All year-around" "Respondent does not know the size of harvest in bags or kgs" . . . . . "No" . .
    "Less than half of the land used of cultivation" "Two seasons in the year" "Kilos" . 40 . . . "Yes" 30 .
    "Half of the land used for cultivation" "All year-around" "Kilos" . 50 . . . "No" . .
    "All land used for cultivation" "Two seasons in the year" "Kilos" . 0 . . . "" . .
    "More than half of the land used for cultivation" "Two seasons in the year" "Kilos" . 10 . . . "No" . .
    "Half of the land used for cultivation" "Two seasons in the year" "Kilos" . 200 . . . "No" . .
    "Less than half of the land used of cultivation" "Two seasons in the year" "Kilos" . 28 . . . "Yes" 14 .
    "Half of the land used for cultivation" "Two seasons in the year" "Kilos" . 20 . . . "No" . .
    "Half of the land used for cultivation" "One season in the year" "Bags" 90 . 3 . . "Yes" 2.5 .
    "Less than half of the land used of cultivation" "Two seasons in the year" "Kilos" . 70 . . . "Yes" 70 .
    "Less than half of the land used of cultivation" "One season in the year" "Respondent does not know the size of harvest in bags or kgs" . . . . . "No" . .
    "Half of the land used for cultivation" "Two seasons in the year" "Kilos"

    I also have some questions that take on multiple choice (1/0) for each crop.
    input byte(b14_h_1_1 b14_h_2_1 b14_h_3_1 b14_h_4_1 b14_h_5_1 b14_h__97_1)
    . . . . . .
    0 0 1 0 0 0
    . . . . . .
    . . . . . .
    . . . . . .
    . . . . . .
    0 0 1 0 0 0
    . . . . . .
    0 0 1 0 0 0
    0 0 1 0 0 0
    . . . . . .
    . . . . . .
    . . . . . .
    . . . . . .
    . . . . . .
    . . . . . .
    1 0 0 0 0 0
    0 0 0 0 0 0
    . . . . . .
    . . . . . .
    . . . . . .
    0 0 1 0 0 0
    0 0 1 0 0 0

    As you can imagine. this is a nightmare for analysis. I have figured a way around this. I want to generate a variable for each of the 12 crops based on the value in "crop_name_#". I could use the following code:

    gen b14a_cowpeas = b14_a_1 if crop_name_1 == "Cow Peas"
    replace b14_a_cowpeas = b14_a_2 if crop_name_2 == "Cow Peas"

    However, I cannot figure out how to put this in a loop for each crop and each variable. I
    How would I go about writing this loop?
    Last edited by Summer Evans; 24 Jul 2022, 01:55.

  • #2
    Are these pieces of information in different datasets? Creating more variables would cause more confusion, I think that all you need to do is some sort of reshape. Is there an identifier for each observation in case they are in different datasets? What I recommend is to extract a sample of 3 crops and their corresponding pieces of information and present here using dataex. The particular information does not matter to illustrate a technique, you can replace the text with "xxx" so that you do not hit dataex limits. So something like:

    Code:
    preserve
    replace text1="xxx"
    replace text2= "yyy"
    replace text3= "zzz"
    dataex crop_name_1 crop_name_2 crop_name_3 text1 text2 text3 b14_h_1_1 b14_h_2_1 b14_h_3_1 in 1/20
    restore
    This should be sufficient to suggest some strategy for the full dataset.

    Comment


    • #3
      Originally posted by Andrew Musau View Post
      Are these pieces of information in different datasets? Creating more variables would cause more confusion, I think that all you need to do is some sort of reshape. Is there an identifier for each observation in case they are in different datasets? What I recommend is to extract a sample of 3 crops and their corresponding pieces of information and present here using dataex. The particular information does not matter to illustrate a technique, you can replace the text with "xxx" so that you do not hit dataex limits. So something like:

      Code:
      preserve
      replace text1="xxx"
      replace text2= "yyy"
      replace text3= "zzz"
      dataex crop_name_1 crop_name_2 crop_name_3 text1 text2 text3 b14_h_1_1 b14_h_2_1 b14_h_3_1 in 1/20
      restore
      This should be sufficient to suggest some strategy for the full dataset.
      To answer your question: No, these are in the same dataset. What I need to do is consolidate each variable depending on crop_name. As you can see in the example below, names can appear in any column ("Green Gram" appears under both crop_name_1, crop_name_2 and crop_name_3).

      crop_name_1 crop_name_2 crop_name_3
      "Cow Peas" "Pigeon Peas" "Green Gram"
      "Cow Peas" "Green Gram" "Millet"
      "Green Gram" "Maize" "Sorghum"
      "Green Gram" "" ""
      "Green Gram" "Maize" ""
      "Cow Peas" "Green Gram" ""
      "Green Gram" "Maize" ""
      "Cow Peas" "Green Gram" ""
      "Cow Peas" "Green Gram" "Millet"

      All variables that follow have suffixes _1, _2, _3 etc, and correspond to the crop in crop_name_*.

      b14_a_2 b14_a_3 b14_b_3 b14_c_unit_2
      3 2 1 1
      3 2 2 2
      3 3 2 1
      . . . .
      4 . . 2
      1 . . 2
      3 . . 2
      3 . . 2
      2 4 2 1
      3 . . 2


      Ideally, I would consolidate all information variables such that I'd get one 'master variable' for each crop (b14_a_cowpea, b14_b_cowpea, b14_e_greengram).

      Does this additional information help?

      Comment


      • #4
        Originally posted by Summer Evans View Post
        Does this additional information help?
        It helps somewhat, but adding variables incorporating information on type of seed, amount harvested etc. would have made your implementation a lot easier. reshape needs stubs, and below you will see how I identify these for these two sets of variables. You need to extend this for the omitted variables.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str11 crop_name_1 str17(crop_name_2 crop_name_3 crop_name_4 crop_name_5) byte(b14_h_1_1 b14_h_2_1 b14_h_3_1 b14_h_4_1 b14_h_5_1)
        "Cow Peas"   "Pigeon Peas" "Green Gram" "Maize" "Nyota Nyayo Beans" 0 0 1 0 0
        "Cow Peas"   "Green Gram"  "Millet"     ""      ""                  1 0 1 0 1
        "Green Gram" "Maize"       "Sorghum"    ""      ""                  0 0 1 0 0
        "Green Gram" ""            ""           ""      ""                  1 1 1 0 0
        "Green Gram" "Maize"       ""           ""      ""                  0 0 1 0 0
        "Cow Peas"   "Green Gram"  ""           ""      ""                  1 1 1 1 1
        "Green Gram" "Maize"       ""           ""      ""                  0 0 0 0 0
        "Cow Peas"   "Green Gram"  ""           ""      ""                  1 0 1 0 0
        "Cow Peas"   "Green Gram"  "Millet"     ""      ""                  0 0 1 0 1
        end
        
        g farm=_n
        rename b14_h_*_1 b14_h_1_*
        reshape long crop_name_ b14_h_1_, i(farm) j(which)
        drop if missing(crop_name_)
        replace crop_name_ = strtoname( crop_name_)
        drop which
        reshape wide b14_h_1_, i(farm) j(crop_name_) string
        rename b14_h_1_* *
        Res.:

        Code:
        . l, sep(0)
        
             +-----------------------------------------------------------------------------+
             | farm   Cow_Peas   Green_~m   Maize   Millet   Nyota_~s   Pigeon~s   Sorghum |
             |-----------------------------------------------------------------------------|
          1. |    1          0          1       0        .          0          0         . |
          2. |    2          1          0       .        1          .          .         . |
          3. |    3          .          0       0        .          .          .         1 |
          4. |    4          .          1       .        .          .          .         . |
          5. |    5          .          0       0        .          .          .         . |
          6. |    6          1          1       .        .          .          .         . |
          7. |    7          .          0       0        .          .          .         . |
          8. |    8          1          0       .        .          .          .         . |
          9. |    9          0          0       .        1          .          .         . |
             +-----------------------------------------------------------------------------+
        
        .

        Ideally, I would consolidate all information variables such that I'd get one 'master variable' for each crop (b14_a_cowpea, b14_b_cowpea, b14_e_greengram).

        With more stubs, include them and the final rename is not needed. In case you are still having problems, present an example including these using dataex.
        Last edited by Andrew Musau; 24 Jul 2022, 04:36.

        Comment


        • #5
          I am having trouble uploading the complete dataset using dataex. I tried to follow your instructions but I am getting an initial set of variables with crop names that do not necessarily match with the values in b14_h*.
          I have attached the subset of data I would like to reshape to explain it better.

          1. Each row (farmer) should have 1/0 for whether the crop was grown on not. This information can be attained from crop_name_* (which contains the name of the crop the following set of questions correspond to).

          2. Following this, I would want one variable that gives me information about each crop: b14_a_cowpeas, b14_b_cowpeas, b14_c_unit_cowpeas, b14_c_bagsize_cowpeas, b14_c_bags_cowpeas, b14_c_max_cowpeas, b14_d_cowpeas, b14_d_trade_cowpeas, b14_e_cowpeas, b14_d_price_cowpeas, b14_f_cowpeas, b14_time_1_cowpeas b14_time_unit_1_cowpeas b14_time_2_cowpeas b14_time_unit_2_cowpeas b14_time_3_cowpeas b14_time_unit_3_cowpeas b14_time_4_cowpeas b14_time_unit_4_cowpeas b14_time_5_cowpeas b14_time_unit_5_cowpeas ... and so on for EACH crop (so I would have other variables: b14_a_greengram, b14_a_yellowbeans etc for all b14_* variables).

          Here, the '_1' would be the name of crop_name_1.

          3. Then, if crop_name_2 is a crop that already has a variable (say cowpeas), the code should assign those values to the relevant variable. For example; if crop_name_2 = Green Gram, then b14_a_2 should be assigned to b14_a_greengram.

          In this way, I would end up with a set of initial variables that have each crop name and take on 1/0 if they were grown by that farmer. I'd then have b14_a_cropname to b14_i_cropname for each crop.

          Reshape seems the only way to go, but I am not sure how I would create a variables for every name under crop_name_*. I am thinking this would require a loop, which I can't seem to figure out!
          Attached Files

          Comment


          • #6
            This will get you starting, then you will have to clean up the dataset before proceeding. For example, merging the information in b14_* and b14_*_label and getting rid of unneeded variables. Then it will be easier to get the structure that you want.

            Code:
            import excel "crop_merge.xlsx", sheet("Sheet1") firstrow clear
            drop formdef_version -endtime
            qui ds *_?
            local stubs
            foreach stub in `r(varlist)'{
                local stubs "`stubs'  `=ustrregexra("`stub'", "(.*)(\_)(\d{1}$)", "$1$2")'"
            }
            local stubs: list uniq stubs
            g id=_n, before(croplist_count)
            foreach var of varlist *{
                cap tostring `var', replace
            }
            reshape long `stubs', i(id) j(which)

            Comment


            • #7
              Thank you for this!

              I tried cleaning up my dataset a little beforehand by removing the label variables. I still get some ambiguous additional variables (v179, v189 etc.). Do you have any advice on how to properly identify what these variables are?

              Comment


              • #8
                They seem to be variables that identify the status of produce sales. Very few respondents answered these questions e.g., 3/80 for v179, so maybe just drop them. Or if there is a variable for sales status, then merge these responses.

                Code:
                import excel "crop_merge.xlsx", sheet("Sheet1") firstrow clear
                drop formdef_version -endtime
                foreach v of varlist v*{
                    list `v' if !missing(`v') , notrim
                }
                Res.:

                Code:
                                 v179  
                 18.      Didn't sell  
                 41.   Sold to school  
                 60.        Neighbour  
                
                                          v189  
                 18.               Didn't sell  
                 65.   I took to the container  
                
                                                                               v202  
                 18.                                                    Didn't sell  
                 65.   They did not come for my produce so I decided to take myself  
                
                                         v266  
                 64.   I sold to my neighbour  
                
                                            v276  
                 65.   I took to their container  
                
                                                   v289  
                 65.   They did not come for my harvest  
                
                            v353  
                 21.   Neighbour  
                 49.    Expoters  
                
                                                                                                                                                        v376  
                 49.   He sells highly perishable mangoes produce to the local market and the more durable mangoes and of high quality produce to exporters.  
                
                           v527  
                  1.   A friend  
                
                .

                Comment

                Working...
                X