Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • vertical concat/ combination of two string observations

    Hello everybody,
    I hope you can give me some ideas about a small data cleaning problem which I can't get fixed even after doing research.
    I use STATA 13, and importing data of the Freedom in the World Index, the structure looks like this:

    Click image for larger version

Name:	statalist.JPG
Views:	1
Size:	20.3 KB
ID:	1369937


    Now I would like to combine row 1 and 2 to get e.g. "PR_2013".
    I tried something like gen newvar = B[_n]+B[_n+1] and tried to use foreach var of varlist
    but nothing worked out...
    Probably there is a very simple command which joins string values of two cells... would be glad if you have any hint,
    thanks
    Ida
    Attached Files
    Last edited by Ida Verspohl; 10 Jan 2017, 14:10.

  • #2
    You don't tell us what kind of file you are importing from, nor what commands you are using to do the import, but I believe the following code (which assumes it is a tab delimited text file brought in with -import delimited-) will work generally:

    Code:
    foreach v of varlist _all {
        local newname = `v'[2] + `v'[1]
        rename `v' `newname'
    }
    drop in 1/2
    In the future, please do not post screenshots of data, whether from Stata or other sources. There is nothing you can do to work with a screen shot. Use -dataex- to post examples of Stata data (-ssc install dataex-, -help dataex- for instructions). For data from text files, copy and paste it between code delimiters (see FAQ #12 for details and more general information about reader-friendly posting.)

    Comment


    • #3
      Clyde,

      thanks a lot for your feedback!

      Sorry about the screenshot, here a proper version of the data:

      input str32 A str4(B C) str6 D str4(E F) str6 G str4(H I) str6 J
      "Year(s) covered" "1972" "1972" "1972" "1973" "1973" "1973" "1974" "1974" "1974"
      "" "PR" "CL " "Status" "PR" "CL " "Status" "PR" "CL " "Status"
      "Afghanistan" "4" "5" "PF" "7" "6" "NF" "7" "6" "NF"
      "Albania" "7" "7" "NF" "7" "7" "NF" "7" "7" "NF"
      "Algeria" "6" "6" "NF" "6" "6" "NF" "6" "6" "NF"
      "Andorra" "4" "3" "PF" "4" "4" "PF" "4" "4" "PF"
      "Angola" ".." ".." ".." ".." ".." ".." ".." ".." ".."
      "Antigua & Barbuda" ".." ".." ".." ".." ".." ".." ".." ".." ".."
      "Argentina" "6" "3" "PF" "2" "2" "F" "2" "4" "PF"


      Hope this works better.
      I imported an excel table by the command
      import excel "I:\...\Country Ratings and Status, 1973-2016 (FINAL)_0.xlsx", sheet(" Country Ratings") cellrange(A6:EM213) clear
      and did some other data cleaning (filling empty cells above the "PR" and "Status" columns by a tiring proedure with the command
      replace BC = BB[1] in 1
      before turning to the first/second row problem.

      Using your command, I get back this error:

      syntax error
      Syntax is
      rename oldname newname [, renumber[(#)] addnumber[(#)] sort ...]
      rename (oldnames) (newnames) [, renumber[(#)] addnumber[(#)] sort ...]
      rename oldnames , {upper|lower|proper}
      r(198);


      Maybe you have an idea what went wrong?

      Thank you

      Ida

      Comment


      • #4
        No; we can have no idea what went wrong until you show us exactly what you typed. in your rename command.

        I sometimes import everything in an Excel file as string and then brace myself for some dedicated programming, peeling off the metadata from the contents of the first few observations. That's not an approach for those new to Stata. On the whole such users are best advised only to read in the dataset proper, and ignore the header lines. You'll still need some surgery to fix variable names and labels either way.
        Last edited by Nick Cox; 11 Jan 2017, 03:29.

        Comment


        • #5
          Thank you Nick, I didn't get yet to that point of renaming variables, still struggling to unite row 1 and 2 for then using the new row 1 as variable names, the error also occurs directly after importing, this is what I did in full detail:

          I first downloaded the Excel file (https://freedomhouse.org/sites/defau...INAL%29_0.xlsx),

          and this is the dofile so far:

          import excel "I:\02 Evaluierung\02-1 Evaluierungen\02-1-5 2016\02-1-5-3 Nachhaltigkeit\05_Daten\Evaluierung Nachhaltigkeit in der deutschen EZ\MAKRO_DATENBANK\rohdaten_extern\FreedomInTheWor ld_Country Ratings and Status, 1973-2016 (FINAL)_0.xlsx", sheet(" Country Ratings") cellrange(A6:EM212) clear

          replace B = C[1] in 1
          replace E = F[1] in 1
          replace H = I[1] in 1
          replace K = L[1] in 1
          replace N = O[1] in 1
          replace Q = R[1] in 1
          replace T = U[1] in 1
          replace W = X[1] in 1
          replace Z = AA[1] in 1

          replace AC = AD[1] in 1
          replace AF = AG[1] in 1
          replace AI = AJ[1] in 1
          replace AL = AM[1] in 1
          replace AO = AP[1] in 1
          replace AR = AS[1] in 1
          replace AU = AV[1] in 1
          replace AX = AY[1] in 1

          replace BA = BB[1] in 1
          replace BD = BE[1] in 1
          replace BG = BH[1] in 1
          replace BJ = BK[1] in 1
          replace BM = BN[1] in 1
          replace BP = BQ[1] in 1
          replace BS = BT[1] in 1
          replace BV = BW[1] in 1
          replace BY = BZ[1] in 1

          replace CB = CC[1] in 1
          replace CE = CF[1] in 1
          replace CH = CI[1] in 1
          replace CK = CL[1] in 1
          replace CN = CO[1] in 1
          replace CQ = CR[1] in 1
          replace CT = CU[1] in 1
          replace CW = CX[1] in 1
          replace CZ = DA[1] in 1

          replace DC = DD[1] in 1
          replace DF = DG[1] in 1
          replace DI = DJ[1] in 1
          replace DL = DM[1] in 1
          replace DO = DP[1] in 1
          replace DR = DS[1] in 1
          replace DU = DV[1] in 1
          replace DX = DY[1] in 1


          replace D = C[1] in 1
          replace G = F[1] in 1
          replace J = I[1] in 1
          replace M = L[1] in 1
          replace P = O[1] in 1
          replace S = R[1] in 1
          replace V = U[1] in 1
          replace Y = X[1] in 1

          replace AB = AA[1] in 1
          replace AE = AD[1] in 1
          replace AH = AG[1] in 1
          replace AK = AJ[1] in 1
          replace AN = AM[1] in 1
          replace AQ = AP[1] in 1
          replace AT = AS[1] in 1
          replace AW = AV[1] in 1
          replace AZ = AY[1] in 1

          replace BC = BB[1] in 1
          replace BF = BE[1] in 1
          replace BI = BH[1] in 1
          replace BL = BK[1] in 1
          replace BO = BN[1] in 1
          replace BR = BQ[1] in 1
          replace BU = BT[1] in 1
          replace BX = BW[1] in 1

          replace CA = BZ[1] in 1
          replace CD = CC[1] in 1
          replace CG = CF[1] in 1
          replace CJ = CI[1] in 1
          replace CM = CL[1] in 1
          replace CP = CO[1] in 1
          replace CS = CR[1] in 1
          replace CV = CU[1] in 1
          replace CY = CX[1] in 1

          replace DB = DA[1] in 1
          replace DE = DD[1] in 1
          replace DH = DG[1] in 1
          replace DK = DJ[1] in 1
          replace DN = DM[1] in 1
          replace DQ = DP[1] in 1
          replace DT = DS[1] in 1
          replace DW = DV[1] in 1
          replace DZ = DY[1] in 1


          The command proposed by Clyde then was


          foreach v of varlist _all {
          local newname = `v'[2] + `v'[1]
          rename `v' `newname'
          }
          drop in 1/2


          Thank you!

          Comment


          • #6
            On the contrary: your error message clearly indicates a problem with rename.

            Thanks, but we can't see your I: drive, so that code is not reproducible. So, somewhat contrary to my principles, I looked at the spreadsheet file.

            You are, I thinking, trying to make new names out of material like

            Code:
              +-------------------------------------------------------------------------------------+
              |               A |  B |    C |      D |  E |    F |      G |  H |    I |      J |  K |
              | Year(s) covered |    | 1972 |        |    | 1973 |        |    | 1974 |        |    |
              |-------------------------------------------------------------------------------------|
              |                     L                   |                       M                   |
              |                  1975                   |                                           |
              +-------------------------------------------------------------------------------------+
            
              +-------------------------------------------------------------------------------------+
              |               A |  B |    C |      D |  E |    F |      G |  H |    I |      J |  K |
              |                 | PR |  CL  | Status | PR |  CL  | Status | PR |  CL  | Status | PR |
              |-------------------------------------------------------------------------------------|
              |                     L                   |                       M                   |
              |                   CL                    |                  Status                   |
              +-------------------------------------------------------------------------------------+
            .

            That's doomed to failure. For example, there would be an attempt to rename several variables "Status". That can only work once. So, Clyde's advice was very good for what you showed in #1 but can be no use for your real case.

            That's the bad news. The good news is that you just need some custom code. This is self-contained, and not only reads in the data, but also reshapes them to a practical layout.


            Code:
            import excel using "https://freedomhouse.org/sites/default/files/Country%20Ratings%20and%20Status%2C%201973-2016%20%28FINAL%29_0.xlsx", clear
            keep in 8/212
            drop EA-EM
            compress
            
            local year = 1973
            local j = 0
            
            foreach v of var B-DZ {
                local j = cond(`j' == 3, 1, `j' + 1)
                local stub = word("PR CL Status", `j')
                rename `v' `stub'`year'
                if `j' == 3 local year = `year' + 1
            }
                
            rename A Country
            reshape long PR CL Status, i(Country) j(year)




            Comment


            • #7
              Thank you so much, it perfectly works out! Thank you for your time and advice!

              Comment


              • #8
                In my experience inconvenient layouts are especially common with multi-country data when someone has decided that every country should be a row in a spreadsheet. Unfortunately the people who compile the spreadsheets rarely have to make sense of the data in statistical programs.

                Comment


                • #9
                  yes I share your impression...

                  Comment


                  • #10
                    Dear Nick,

                    I hope it is ok to continue in this post here.

                    You helped a lot with an issue to merge the string values of first and the second row of my data by your solution, yet there is another data issue linked with the previous: some columns (edition years 1982 to 1989) do not necessarily represent a year, i.e. edition year 1982 covers almost a 2-year period.

                    This is how the data looks like:

                    input str4 AC str18 AD str6 AE str7 AF str17 AG str6 AH str7 AI str17 AJ str6 AK
                    "1982" "" "" "1983-84" "" "" "1984-85" "" ""
                    "" "Jan.1981-Aug. 1982" "" "" "Aug.1982-Nov.1983" "" "" "Nov.1983-Nov.1984" ""
                    "PR" "CL " "Status" "PR" "CL " "Status" "PR" "CL " "Status"
                    "7" "7" "NF" "7" "7" "NF" "7" "7" "NF"
                    "7" "7" "NF" "7" "7" "NF" "7" "7" "NF"
                    "6" "6" "NF" "6" "6" "NF" "6" "6" "NF"
                    end
                    [/CODE]
                    ------------------


                    This leads to the fact that there is one column less than years covered. We will consider e.g. period "Nov1985-Nov1986" as year 1986, so on this side, your solution works fine for us.

                    Do you think there is a possibility to maybe stop your above procedure in the middle (after edition year 1982), create an empty column which would give room for imputation values (giving room for a virtual edition year 1983), and resume the operation until the end, edition year 2016?

                    I hope I could make clear the problem, sounds more tricky than it actually is I guess,

                    thanks a lot!

                    Ida

                    Comment


                    • #11
                      I think you're now referring to a different dataset with different structure.

                      That's the downside of custom code; it needs to be customised.

                      Note that I didn't merge (or even combine) two rows (meaning observations) at all; I ignored them and constructed variable names differently.

                      Note that I am recommending against reading metadata into data, so I don't even advise starting where you are now starting. You can't get variable names well out of empty strings.

                      I can't advise on good code without seeing the dataset.
                      Last edited by Nick Cox; 11 Jan 2017, 10:35.

                      Comment


                      • #12
                        Thanks Nick for your response. The dataset is still the same as in the first question,

                        import excel using "https://freedomhouse.org/sites/default/files/Country%20Ratings%20and%20Status%2C%201973-2016%20%28FINAL%29_0.xlsx", clear
                        keep in 5/212
                        drop EA-EM
                        compress


                        ...looking at columns AD to AZ (which cause the problem that we need an additional column after "year 1982"):

                        input str18 AD str6 AE str7 AF str17 AG str6 AH str7 AI
                        "" "" "1983-84" "" "" "1984-85"
                        "Jan.1981-Aug. 1982" "" "" "Aug.1982-Nov.1983" "" ""
                        "CL " "Status" "PR" "CL " "Status" "PR"
                        "7" "NF" "7" "7" "NF" "7"
                        "7" "NF" "7" "7" "NF" "7"
                        "6" "NF" "6" "6" "NF" "6"
                        ".." ".." ".." ".." ".." ".."
                        "7" "NF" "7" "7" "NF" "7"
                        end
                        [/CODE]
                        ------------------

                        Of course you're right, with your code the architecture of the data set changed (I hope I got you right). Omitting the original years and going forward n+1 is superb, just that it causes a problem: since in the original data, one column is missing, because there was this one period (columns AC-AE, edition year 1982) covering an almost-2-years period, in our new data set there is also one column missing. This is why we would need an additional 3 empty columns after edition year 1982 (between AE and AF) to either leave it blank or impute the data, to make it compatible with other data sets.

                        Do you think this is possible?

                        Thanks

                        Ida

                        Comment


                        • #13
                          This is a difficult art, especially for beginners. The major point is that you have to look at the spreadsheet to write code to get sensible labels.

                          It seems that you need to be registered with this site to get full access to the original spreadsheet, and in any case I can't commit to support every individual download here.

                          The following gives you a more general approach. You just need to define value labels for the different periods, if necessary by hand.


                          Code:
                          import excel using "https://freedomhouse.org/sites/default/files/Country%20Ratings%20and%20Status%2C%201973-2016%20%28FINAL%29_0.xlsx", clear
                          
                          keep in 8/212 
                          keep A-DZ 
                          
                          local j = 0 
                          local k = 1 
                          foreach v of var B-DZ { 
                              if `j' == 3 { 
                                  local j = 1 
                                  local k = `k' + 1 
                              } 
                              else local j = `j' + 1 
                              local stub = word("PR CL Status", `j') 
                              
                              rename `v' `stub'`k' 
                          } 
                          rename A country 
                          
                          reshape long PR CL Status, i(country) j(period)

                          Comment


                          • #14
                            Note that if the problem only occurs in one spot, it is sometimes easier to just edit the spreadsheet (manually, or through the dofile) rather than trying to code your loop around it. (I don't know enough about the problem at hand to say whether this is true in this case)

                            Comment


                            • #15
                              I imagine that Jesse's advice is very good. To me spreadsheet programs are like bus stations; sometimes you have to visit, but there is no interest in lingering.

                              What a competent spreadsheet user needs to have at the top of their list of Stata desiderata (other points could surely be added):

                              0. Getting the dataset into good Stata condition is almost always tedious but usually justified by easier and less error-prone analysis in Stata thereafter.

                              1. Variable names must be unique, really meaning unique, i.e. all distinct. Column names such as B to Z, BA to BZ, etc. do qualify but are a real pain to work with in Stata

                              2. It's a bonus if variable names are structured, e.g.. in groups, and informative. But the rules for variable names in Stata are inviolate.

                              3. One row in the spreadsheet that contains usable variable labels is often quite easy to work with in Stata, but it should be complete. Many spreadsheets have blanks for metadata where the user is supposed to know that you impute text mentally from neighbouring cells, or by analogy with them, but such blanks are also a real pain and should be filled in.

                              Comment

                              Working...
                              X