Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshape problem data from Excel file to long format

    Hello

    I have unfortunately (again) an problem to reshape a data file that is present in quite a weird format (see example):

    1. line: empty (can be deleted)
    2. line: Variable Name (Single for constant values like Names etc. ; repeated for time changing values like Sales)
    3. line: Variable ID (can be deleted)
    4. line: Time Variable (only for variables that change over time otherwise empty)

    My concrete Problem is that I have two lines in which date and variable names are stored separately (YearQuarter format).

    I want to reshape the data in long format to have a Panel Dataset in the end.

    I extracted an example from Stata here. I had to censor the data that is why there is always written RN , the variable might be string or numeric though. Also the variables are named from A to MF. I included some variables to illustrate the problem with the time Variable that is in the 4th row only for some Variables.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str17 A str16 B str13 C str26 D str21 E str17 F str20 G str25(EG EH EI)
    "Source"            ""                 ""              ""                           ""                      ""                  ""                     ""                          ""                          ""                        
    "Institution Name " "Institution Key " "Ticker "       "Trading Symbol & Exchange " "S&P CapIQ Company ID " "Ownership Status " "Company Name, Short " "Total Deposits (Reported)" "Total Deposits (Reported)" "Total Deposits (Reported)"
    "130509"            "130992"           "131166"        "139474"                     "263978"                "255389"            "131160"               "132288"                    "132288"                    "132288"                  
    ""                  ""                 ""              ""                           ""                      ""                  ""                     "2008Q4"                    "2008Q3"                    "2008Q2"                  
    ""                  ""                 ""              ""                           ""                      ""                  ""                     "Current/Restated"          "Current/Restated"          "Current/Restated"        
    "Company1"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company2"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company3"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company4"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company5"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company6"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company7"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company8"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company9"          "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    "Company10"         "RANDOM NUMBER"    "RANDOM NUMBER" "RANDOM NUMBER"              "RANDOM NUMBER"         "RANDOM NUMBER"     "RANDOM NUMBER"        "RANDOM NUMBER"             "RANDOM NUMBER"             "RANDOM NUMBER"            
    end

    Edit: Since I recognized the wording above might be misleading here a more precise example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str17 A str16 B str10 C str26 D str21 E str23(AO AP AQ AR AS)
    "Source"            ""                 ""           ""                           ""                      ""                        ""                        ""                        ""                        ""                      
    "Institution Name " "Institution Key " "Ticker "    "Trading Symbol & Exchange " "S&P CapIQ Company ID " "Total Assets (Reported)" "Total Assets (Reported)" "Total Assets (Reported)" "Total Assets (Reported)" "Total Assets (Reported)"
    "130509"            "130992"           "131166"     "139474"                     "263978"                "132264"                  "132264"                  "132264"                  "132264"                  "132264"                
    ""                  ""                 ""           ""                           ""                      "2012Q4"                  "2012Q3"                  "2012Q2"                  "2012Q1"                  "2011Q4"                
    ""                  ""                 ""           ""                           ""                      "Current/Restated"        "Current/Restated"        "Current/Restated"        "Current/Restated"        "Current/Restated"      
    "Company1"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company2"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company3"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company4"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company5"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company6"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company7"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company8"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company9"          "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    "Company10"         "Identifier"       "Identifier" "Identifier"                 "Identifier"            "1"                       "1"                       "1"                       "1"                       "1"                      
    end





    (extracted by using dataex: https://ideas.repec.org/c/boc/bocode...ml#biblio-body)

    Thank you for any suggestions or tips how to circumvent manually reshaping this in excel.

    Best

    Justus
    Last edited by Justus F.C. Meyer; 28 Mar 2017, 07:18.

  • #2
    It seems your first problem is related to properly importing data from Excel than reshaping it. Your data has both string and numeric in the same column, which Stata will always read a string. First, you need to work on the Excel sheet to remove any string values from the numeric variables, remove empty cells at the top of columns and then import. Once your data is properly imported, then show us an example of your data using dataex again.
    Regards
    --------------------------------------------------
    Attaullah Shah, PhD.
    Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
    FinTechProfessor.com
    https://asdocx.com
    Check out my asdoc program, which sends outputs to MS Word.
    For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

    Comment


    • #3
      To the excellent advice given in #2 I would add the following.

      In Excel, use the row with the time variable to enter what will become variable names when the data is imported into Stata. In the repeated entries for columns like "Total Assets (Reported)" include the date at the end of the variable name. So, for your example, your spreadsheet would look something like this (where I have shortened "Identifier" to "Id" to reduce the size).




      To import this spreadsheet into Stata:
      Code:
      clear
      import excel example, firstrow
      Here the Excel data after being imported into Stata. Note that by getting rid of the extra rows of text, import excel now gets the TA values as numbers rather than strings.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str17 InstName str16 InstKey str10 Ticker str26 TSandE str21 SandP byte(TA2012Q4 TA2012Q3 TA2012Q2 TA2012Q1 TA2011Q4)
      "Company1"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company2"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company3"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company4"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company5"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company6"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company7"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company8"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company9"  "Id" "Id" "Id" "Id" 1 1 1 1 1
      "Company10" "Id" "Id" "Id" "Id" 1 1 1 1 1
      end
      The following code applied to this data
      Code:
      clear
      import excel example, firstrow
      generate id = _n
      reshape long TA, i(id) j(datestr) string
      generate quarter = quarterly(datestr,"YQ")
      format quarter %tq
      creates a dataset that looks like this
      Code:
      . list in 1/10, sepby(id)
      
           +----------------------------------------------------------------------------+
           | id   datestr   InstName   InstKey   Ticker   TSandE   SandP   TA   quarter |
           |----------------------------------------------------------------------------|
        1. |  1    2011Q4   Company1        Id       Id       Id      Id    1    2011q4 |
        2. |  1    2012Q1   Company1        Id       Id       Id      Id    1    2012q1 |
        3. |  1    2012Q2   Company1        Id       Id       Id      Id    1    2012q2 |
        4. |  1    2012Q3   Company1        Id       Id       Id      Id    1    2012q3 |
        5. |  1    2012Q4   Company1        Id       Id       Id      Id    1    2012q4 |
           |----------------------------------------------------------------------------|
        6. |  2    2011Q4   Company2        Id       Id       Id      Id    1    2011q4 |
        7. |  2    2012Q1   Company2        Id       Id       Id      Id    1    2012q1 |
        8. |  2    2012Q2   Company2        Id       Id       Id      Id    1    2012q2 |
        9. |  2    2012Q3   Company2        Id       Id       Id      Id    1    2012q3 |
       10. |  2    2012Q4   Company2        Id       Id       Id      Id    1    2012q4 |
           +----------------------------------------------------------------------------+
      Note that quarter contains the date in Stata Internal Format as a number, which is what you will need for your work.

      Comment


      • #4
        I would take a slightly different approach from #2. The Excel data is a mish-mosh that was clearly created for human visual inspection, not with data analysis in mind. And you will actually need to import different pieces of it separately into different Stata files and then piece the results together with appropriate -merge- commands. For example, columns A through E of the spreadsheet have a different data structure from that of the rightmost columns. You will need to do some -reshaping- on the latter, but probably not on the former.

        But rather than doing "surgery" on the spreadsheet, I would take advantage of the -cellrange()- option of -import excel- and separately bring in the different parts of the data into different Stata files, clean them separately, and then -merge 1:m- them together at the end. The advantage of this is that whatever you do in Excel goes undocumented, and you can never reconstruct what happened if you need to later. By contrast, your do-file will contain all of your -import excel- commands, so you can see what was taken from where, and where it went. Moreover, if you later decide that a different approach to some aspect of it was needed, you can just modify the relevant sections of the code rather than having to start everything from scratch. Also, it is really easy to make copy/cut/paste errors and not notice that you have one too many or one too few rows/columns.

        In fact, my own practice on this is quite strong and regular: the first thing I do with any Excel spreadsheet I get for use in analysis is make it a read-only file. All data extraction, simple or complex, is then done in Stata. I can only recall one or two occasions where things got so complicated working this way that I backed off and did some data rearrangements in Excel. (The one thing that I regularly do manually, because I know of no way to do it in Stata, nor in Stat Transfer, is eliminate password protection.)

        Comment


        • #5
          I agree with Clyde Schechter to do every possible task in Stata for the sake of having a record of the process. However, I have observed my students struggling with simple import of data from Excel, let alone complicated cases such as reported above. I also accept that one has to invest time in Stata learning, but occasional users would care less as they would see the payoff from such an investment fairly meager, especially when they do not intend to do such exercise on frequent basis.
          Regards
          --------------------------------------------------
          Attaullah Shah, PhD.
          Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
          FinTechProfessor.com
          https://asdocx.com
          Check out my asdoc program, which sends outputs to MS Word.
          For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

          Comment


          • #6
            I too would, for my own work, take an approach like that described by Clyde Schechter in post #4. But as Attaullah Shah suggests in post #5, that requires a substantial investment in Stata knowledge, beyond what I wanted to invest in in post $3. There, my objective was to solve what I saw as the crux of the problem, somenow associating the row of dates with the resultant values in the column below. Building the dates into the Stata variable names seemed an effective approach.

            And, also, the manipulation I performed manually in Excel could have been done more systematically, and reliably, using Excel formulas to build the TA variable names by concatenating "TA" and the quarterly date string. So the decision to adopt the approach suggested in post #4 is one best made by the user, balancing skills in Stata with those in Excel. My experience is that the benefits of the programmatic approach are best learned by the experience of investing time in the manual approach for what is understood to be a "one time task", only to be told "that's great, we'll update that data every quarter and you can update your analysis", facing a view of an uninteresting quarterly task for the foreseeable future. Or as a friend once said, the programmer should never be in the middle of a loop.

            Comment

            Working...
            X