Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unusual table to reshape long

    Hi, I have an unusual table I would like to reshape long where the first column represents the identifier "AEA000101013" and so on. The second column the year. The third column the month. The rest of the columns would be the variables' names which are "Dividend Yield - Common - Net - Issue - %" and "Asset Turnover". I could not find any satisfactory answer in previous comments about reshaping.
    Please find a piece of the data below.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str18 v1 str41 v2 str16 v3 str46 v4 str22 v5
    "A"                  "AEA000101013"                              "C"              "BBF000101901"                              "E"             
    ""                   "Dividend Yield - Common - Net - Issue - %" "Asset Turnover" "Dividend Yield - Common - Net - Issue - %" "Asset Turnover"
    "Date"               ""                                          ""               ""                                          ""              
    "31dec1998 00:00:00" "1"                                         ".3"             "2"                                         ".01"           
    "31dec1999 00:00:00" "5"                                         ".01"            "0"                                         ".3"            
    "31mar2000 00:00:00" ""                                          ""               ""                                          ""              
    "30jun2000 00:00:00" ""                                          ""               ""                                          ""              
    "31aug2000 00:00:00" ""                                          ""               ""                                          ""              
    "30sep2000 00:00:00" ""                                          ""               ""                                          ""              
    "31oct2000 00:00:00" ""                                          ""               ""                                          ""              
    "31dec2000 00:00:00" "4.8"                                       ".21"            "1.1"                                       ".03"           
    "31mar2001 00:00:00" ""                                          ""               ""                                          ""              
    "30jun2001 00:00:00" ""                                          ""               ""                                          ""              
    "31aug2001 00:00:00" ""                                          ""               ""                                          ""              
    "30sep2001 00:00:00" ""                                          ""               ""                                          ""              
    "31oct2001 00:00:00" ""                                          ""               ""                                          ""              
    "31dec2001 00:00:00" "0"                                         ".11"            "8"                                         ".23"           
    end

  • #2
    Here is a general guideline - nothing seems complicated here, it's just tedious data cleaning.

    1. Rename the variables as you describe them.

    Code:
    help rename
    2. Drop observations 1 to 3

    Code:
    drop in 1/3
    3. Extract the year and month from the string and then reshape. For example:

    Code:
    display year(dofc(tc("31dec2001 00:00:00")))
    display month(dofc(tc("31dec2001 00:00:00")))
    Res.:

    Code:
    . display year(dofc(tc("31dec2001 00:00:00")))
    2001
    
    . display month(dofc(tc("31dec2001 00:00:00")))
    12
    see

    Code:
    help  datetime

    Comment


    • #3
      Thank Andrew Musau for your quick answer. I agree that it is a matter of rearragment. However, you will be dropping the identifier (first line) and the variable name (second line). Also, the variables keep repeating at a frequent space with a new identifier.

      Comment


      • #4
        For Stata variable names, characters are at a premium as you are limited to 32 characters total. See

        Code:
        help naming_conventions
        If your goal is to totally automate data cleaning, you will not achieve that goal. However, the following may give you some tips on how to make progress and assumes consistency in the data structure.
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str18 v1 str41 v2 str16 v3 str46 v4 str22 v5
        "A"                  "AEA000101013"                              "C"              "BBF000101901"                              "E"            
        ""                   "Dividend Yield - Common - Net - Issue - %" "Asset Turnover" "Dividend Yield - Common - Net - Issue - %" "Asset Turnover"
        "Date"               ""                                          ""               ""                                          ""              
        "31dec1998 00:00:00" "1"                                         ".3"             "2"                                         ".01"          
        "31dec1999 00:00:00" "5"                                         ".01"            "0"                                         ".3"            
        "31mar2000 00:00:00" ""                                          ""               ""                                          ""              
        "30jun2000 00:00:00" ""                                          ""               ""                                          ""              
        "31aug2000 00:00:00" ""                                          ""               ""                                          ""              
        "30sep2000 00:00:00" ""                                          ""               ""                                          ""              
        "31oct2000 00:00:00" ""                                          ""               ""                                          ""              
        "31dec2000 00:00:00" "4.8"                                       ".21"            "1.1"                                       ".03"          
        "31mar2001 00:00:00" ""                                          ""               ""                                          ""              
        "30jun2001 00:00:00" ""                                          ""               ""                                          ""              
        "31aug2001 00:00:00" ""                                          ""               ""                                          ""              
        "30sep2001 00:00:00" ""                                          ""               ""                                          ""              
        "31oct2001 00:00:00" ""                                          ""               ""                                          ""              
        "31dec2001 00:00:00" "0"                                         ".11"            "8"                                         ".23"          
        end
        
        rename v1 date
        local i 2
        while `i'<=`c(k)'{
            if !mod(`i'/2, 1) & !missing(v`i'[2]){
                local current DY`=v`i'[1]'
                rename v`i' DY`=v`i'[1]'
                local ++i
            }
            if  mod(`i'/2, 1) & !missing(v`i'[2]){
                rename v`i' AT`=`current'[1]'
                local ++i
            }  
        }
        rename date strdate
        drop in 1/3
        gen date= dofc(clock(strdate, "DMYhms")), before(strdate)
        format date %td
        drop strdate
        reshape long DY AT, i(date) j(ID) string
        destring DY AT, replace
        rename(DY AT) (Dividend_Yield Asset_Turnover)
        Res.:

        Code:
        . sort ID date
        
        . l, sepby(ID)
        
             +------------------------------------------------+
             |      date             ID   Divide~d   Asset_~r |
             |------------------------------------------------|
          1. | 31dec1998   AEA000101013          1         .3 |
          2. | 31dec1999   AEA000101013          5        .01 |
          3. | 31mar2000   AEA000101013          .          . |
          4. | 30jun2000   AEA000101013          .          . |
          5. | 31aug2000   AEA000101013          .          . |
          6. | 30sep2000   AEA000101013          .          . |
          7. | 31oct2000   AEA000101013          .          . |
          8. | 31dec2000   AEA000101013        4.8        .21 |
          9. | 31mar2001   AEA000101013          .          . |
         10. | 30jun2001   AEA000101013          .          . |
         11. | 31aug2001   AEA000101013          .          . |
         12. | 30sep2001   AEA000101013          .          . |
         13. | 31oct2001   AEA000101013          .          . |
         14. | 31dec2001   AEA000101013          0        .11 |
             |------------------------------------------------|
         15. | 31dec1998   BBF000101901          2        .01 |
         16. | 31dec1999   BBF000101901          0         .3 |
         17. | 31mar2000   BBF000101901          .          . |
         18. | 30jun2000   BBF000101901          .          . |
         19. | 31aug2000   BBF000101901          .          . |
         20. | 30sep2000   BBF000101901          .          . |
         21. | 31oct2000   BBF000101901          .          . |
         22. | 31dec2000   BBF000101901        1.1        .03 |
         23. | 31mar2001   BBF000101901          .          . |
         24. | 30jun2001   BBF000101901          .          . |
         25. | 31aug2001   BBF000101901          .          . |
         26. | 30sep2001   BBF000101901          .          . |
         27. | 31oct2001   BBF000101901          .          . |
         28. | 31dec2001   BBF000101901          8        .23 |
             +------------------------------------------------+

        Comment


        • #5
          Thank you Andrew Musau, your code works very well.

          Comment


          • #6
            Hi Andrew Musau, I would like to generalize your code to more variables like in this exemples. Could you help me with that?
            Thanks

            ----------------------- copy starting from the next line -----------------------
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str18 v1 str41 v2 str16 v3 str46 v4 str22 v5 str41 v6 str16 v7 str46 v8 str22 v9
            "A"                  "AEA000101013"     "C"              "D"             "E"                "BBF000101901"   "G"                    "H"                        "I"                        
            ""                   "Dividend Yield " "Asset Turnover"   "Book Value"    "Debt ratio"        "Dividend Yield" "Asset Turnover"    "Book Value"    "Debt ratio"
            "Date"               ""                 ""                ""               ""              ""               ""                  ""                      ""      
            "31dec1998 00:00:00" "1"                ".3"              ".01"             "32"           "2"              ".01"               ".01"                   ".01"   
            "31dec1999 00:00:00" "5"                ".01"             ".3"             ".30"            "0"              ".3"                ".3"                    ".3"    
            "31mar2000 00:00:00" ""                 ""                ""               ""              ""               ""                  ""                      ""      
            "30jun2000 00:00:00" ""                 ""                ""               ""              ""               ""                  ""                      ""      
            "31aug2000 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "30sep2000 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "31oct2000 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "31dec2000 00:00:00" "4.8"              ".21"              ".03"            "10"          "1.1"             ".03"               ".03"                   ".03"  
            "31mar2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "30jun2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "31aug2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "30sep2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "31oct2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""     
            "31dec2001 00:00:00" "0"                ".11"              ".23"            "0"              "8"               ".23"               ".23"                   ".23"  
            end
            ------------------ copy up to and including the previous line ------------------

            Comment


            • #7
              Assuming the first 15 characters define distinct variable names, you can use the following. Otherwise, you can increase this limit, but 32 characters is the maximum variable length, i.e., length of identifier+variable name prior to reshape.

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str18 v1 str41 v2 str16 v3 str46 v4 str22 v5 str41 v6 str16 v7 str46 v8 str22 v9
              "A"                  "AEA000101013"     "C"              "D"             "E"                "BBF000101901"   "G"                    "H"                        "I"                        
              ""                   "Dividend Yield " "Asset Turnover"   "Book Value"    "Debt ratio"        "Dividend Yield" "Asset Turnover"    "Book Value"    "Debt ratio"
              "Date"               ""                 ""                ""               ""              ""               ""                  ""                      ""      
              "31dec1998 00:00:00" "1"                ".3"              ".01"             "32"           "2"              ".01"               ".01"                   ".01"  
              "31dec1999 00:00:00" "5"                ".01"             ".3"             ".30"            "0"              ".3"                ".3"                    ".3"    
              "31mar2000 00:00:00" ""                 ""                ""               ""              ""               ""                  ""                      ""      
              "30jun2000 00:00:00" ""                 ""                ""               ""              ""               ""                  ""                      ""      
              "31aug2000 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "30sep2000 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "31oct2000 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "31dec2000 00:00:00" "4.8"              ".21"              ".03"            "10"          "1.1"             ".03"               ".03"                   ".03"  
              "31mar2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "30jun2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "31aug2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "30sep2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "31oct2001 00:00:00" ""                 ""                 ""               ""             ""                ""                  ""                      ""    
              "31dec2001 00:00:00" "0"                ".11"              ".23"            "0"              "8"               ".23"               ".23"                   ".23"  
              end
              
              local stubs
              rename v1 date
              local i 2
              while `i'<=`c(k)'{
                 *4 below is a count of number of variables [MANUAL INPUT]
                  if !mod((`i'+2)/4, 1) & !missing(v`i'[2]){
                      local vname= substr(strtoname(trim(itrim("`=v`i'[2]'"))), 1, 15)
                      local stubs `stubs' `vname'
                      local stubs: list uniq stubs
                      local current `vname'`=v`i'[1]'
                      rename v`i' `vname'`=v`i'[1]'
                      local ++i
                  }
                  if  mod((`i'+2)/4, 1) & !missing(v`i'[2]){
                      local vname= substr(strtoname(trim(itrim("`=v`i'[2]'"))), 1, 15)
                      local stubs `stubs' `vname'
                      local stubs: list uniq stubs
                      rename v`i' `vname'`=`current'[1]'
                      local ++i
                  }  
              }
              
              rename date strdate
              drop in 1/3
              gen date= dofc(clock(strdate, "DMYhms")), before(strdate)
              format date %td
              drop strdate
              reshape long `stubs', i(date) j(ID) string
              destring `stubs', replace
              Res.:

              Code:
              . sort ID date
              
              . l, sepby(ID)
              
                   +----------------------------------------------------------------------+
                   |      date             ID   Divide~d   Asset_~r   Book_V~e   Debt_r~o |
                   |----------------------------------------------------------------------|
                1. | 31dec1998   AEA000101013          1         .3        .01         32 |
                2. | 31dec1999   AEA000101013          5        .01         .3         .3 |
                3. | 31mar2000   AEA000101013          .          .          .          . |
                4. | 30jun2000   AEA000101013          .          .          .          . |
                5. | 31aug2000   AEA000101013          .          .          .          . |
                6. | 30sep2000   AEA000101013          .          .          .          . |
                7. | 31oct2000   AEA000101013          .          .          .          . |
                8. | 31dec2000   AEA000101013        4.8        .21        .03         10 |
                9. | 31mar2001   AEA000101013          .          .          .          . |
               10. | 30jun2001   AEA000101013          .          .          .          . |
               11. | 31aug2001   AEA000101013          .          .          .          . |
               12. | 30sep2001   AEA000101013          .          .          .          . |
               13. | 31oct2001   AEA000101013          .          .          .          . |
               14. | 31dec2001   AEA000101013          0        .11        .23          0 |
                   |----------------------------------------------------------------------|
               15. | 31dec1998   BBF000101901          2        .01        .01        .01 |
               16. | 31dec1999   BBF000101901          0         .3         .3         .3 |
               17. | 31mar2000   BBF000101901          .          .          .          . |
               18. | 30jun2000   BBF000101901          .          .          .          . |
               19. | 31aug2000   BBF000101901          .          .          .          . |
               20. | 30sep2000   BBF000101901          .          .          .          . |
               21. | 31oct2000   BBF000101901          .          .          .          . |
               22. | 31dec2000   BBF000101901        1.1        .03        .03        .03 |
               23. | 31mar2001   BBF000101901          .          .          .          . |
               24. | 30jun2001   BBF000101901          .          .          .          . |
               25. | 31aug2001   BBF000101901          .          .          .          . |
               26. | 30sep2001   BBF000101901          .          .          .          . |
               27. | 31oct2001   BBF000101901          .          .          .          . |
               28. | 31dec2001   BBF000101901          8        .23        .23        .23 |
                   +----------------------------------------------------------------------+
              
              .
              Last edited by Andrew Musau; 15 Dec 2023, 16:15.

              Comment


              • #8
                Another way is to replace entries of the 2nd observation (row) for each variable (column) with entries of the previous variable from left to right, if the entry is less than 10 characters long. Then concatenate the 2nd and 3rd observations for each variable and rename based on the concatenated strings. Finally reshape. This will involve far less coding.
                Last edited by Andrew Musau; 15 Dec 2023, 16:43.

                Comment


                • #9
                  Thank you very much Andrew Musau

                  Comment

                  Working...
                  X