vertical concat/ combination of two string observations

Ida Verspohl

Join Date: Jan 2017

Posts: 10
#1

vertical concat/ combination of two string observations

10 Jan 2017, 13:54

Hello everybody,
I hope you can give me some ideas about a small data cleaning problem which I can't get fixed even after doing research.
I use STATA 13, and importing data of the Freedom in the World Index, the structure looks like this:

Now I would like to combine row 1 and 2 to get e.g. "PR_2013".
I tried something like gen newvar = B[_n]+B[_n+1] and tried to use foreach var of varlist
but nothing worked out...
Probably there is a very simple command which joins string values of two cells... would be glad if you have any hint,
thanks
Ida

Attached Files

Last edited by Ida Verspohl; 10 Jan 2017, 14:10.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#2

10 Jan 2017, 15:40

You don't tell us what kind of file you are importing from, nor what commands you are using to do the import, but I believe the following code (which assumes it is a tab delimited text file brought in with -import delimited-) will work generally:

Code:

foreach v of varlist _all { local newname = `v'[2] + `v'[1] rename `v' `newname' } drop in 1/2

In the future, please do not post screenshots of data, whether from Stata or other sources. There is nothing you can do to work with a screen shot. Use -dataex- to post examples of Stata data (-ssc install dataex-, -help dataex- for instructions). For data from text files, copy and paste it between code delimiters (see FAQ #12 for details and more general information about reader-friendly posting.)
Comment
Ida Verspohl

Join Date: Jan 2017

Posts: 10
#3

11 Jan 2017, 03:01

Clyde,

thanks a lot for your feedback!

Sorry about the screenshot, here a proper version of the data:

input str32 A str4(B C) str6 D str4(E F) str6 G str4(H I) str6 J
"Year(s) covered" "1972" "1972" "1972" "1973" "1973" "1973" "1974" "1974" "1974"
"" "PR" "CL " "Status" "PR" "CL " "Status" "PR" "CL " "Status"
"Afghanistan" "4" "5" "PF" "7" "6" "NF" "7" "6" "NF"
"Albania" "7" "7" "NF" "7" "7" "NF" "7" "7" "NF"
"Algeria" "6" "6" "NF" "6" "6" "NF" "6" "6" "NF"
"Andorra" "4" "3" "PF" "4" "4" "PF" "4" "4" "PF"
"Angola" ".." ".." ".." ".." ".." ".." ".." ".." ".."
"Antigua & Barbuda" ".." ".." ".." ".." ".." ".." ".." ".." ".."
"Argentina" "6" "3" "PF" "2" "2" "F" "2" "4" "PF"

Hope this works better.
I imported an excel table by the command
import excel "I:\...\Country Ratings and Status, 1973-2016 (FINAL)_0.xlsx", sheet(" Country Ratings") cellrange(A6:EM213) clear
and did some other data cleaning (filling empty cells above the "PR" and "Status" columns by a tiring proedure with the command
replace BC = BB[1] in 1
before turning to the first/second row problem.

Using your command, I get back this error:

syntax error
Syntax is
rename oldname newname [, renumber[(#)] addnumber[(#)] sort ...]
rename (oldnames) (newnames) [, renumber[(#)] addnumber[(#)] sort ...]
rename oldnames , {upper|lower|proper}
r(198);

Maybe you have an idea what went wrong?

Thank you

Ida
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35589
#4

11 Jan 2017, 03:10

No; we can have no idea what went wrong until you show us exactly what you typed. in your rename command.

I sometimes import everything in an Excel file as string and then brace myself for some dedicated programming, peeling off the metadata from the contents of the first few observations. That's not an approach for those new to Stata. On the whole such users are best advised only to read in the dataset proper, and ignore the header lines. You'll still need some surgery to fix variable names and labels either way.

Last edited by Nick Cox; 11 Jan 2017, 03:29.
Comment
Ida Verspohl

Join Date: Jan 2017

Posts: 10
#5

11 Jan 2017, 05:24

Thank you Nick, I didn't get yet to that point of renaming variables, still struggling to unite row 1 and 2 for then using the new row 1 as variable names, the error also occurs directly after importing, this is what I did in full detail:

I first downloaded the Excel file (https://freedomhouse.org/sites/defau...INAL%29_0.xlsx),

and this is the dofile so far:

import excel "I:\02 Evaluierung\02-1 Evaluierungen\02-1-5 2016\02-1-5-3 Nachhaltigkeit\05_Daten\Evaluierung Nachhaltigkeit in der deutschen EZ\MAKRO_DATENBANK\rohdaten_extern\FreedomInTheWor ld_Country Ratings and Status, 1973-2016 (FINAL)_0.xlsx", sheet(" Country Ratings") cellrange(A6:EM212) clear

replace B = C[1] in 1
replace E = F[1] in 1
replace H = I[1] in 1
replace K = L[1] in 1
replace N = O[1] in 1
replace Q = R[1] in 1
replace T = U[1] in 1
replace W = X[1] in 1
replace Z = AA[1] in 1

replace AC = AD[1] in 1
replace AF = AG[1] in 1
replace AI = AJ[1] in 1
replace AL = AM[1] in 1
replace AO = AP[1] in 1
replace AR = AS[1] in 1
replace AU = AV[1] in 1
replace AX = AY[1] in 1

replace BA = BB[1] in 1
replace BD = BE[1] in 1
replace BG = BH[1] in 1
replace BJ = BK[1] in 1
replace BM = BN[1] in 1
replace BP = BQ[1] in 1
replace BS = BT[1] in 1
replace BV = BW[1] in 1
replace BY = BZ[1] in 1

replace CB = CC[1] in 1
replace CE = CF[1] in 1
replace CH = CI[1] in 1
replace CK = CL[1] in 1
replace CN = CO[1] in 1
replace CQ = CR[1] in 1
replace CT = CU[1] in 1
replace CW = CX[1] in 1
replace CZ = DA[1] in 1

replace DC = DD[1] in 1
replace DF = DG[1] in 1
replace DI = DJ[1] in 1
replace DL = DM[1] in 1
replace DO = DP[1] in 1
replace DR = DS[1] in 1
replace DU = DV[1] in 1
replace DX = DY[1] in 1

replace D = C[1] in 1
replace G = F[1] in 1
replace J = I[1] in 1
replace M = L[1] in 1
replace P = O[1] in 1
replace S = R[1] in 1
replace V = U[1] in 1
replace Y = X[1] in 1

replace AB = AA[1] in 1
replace AE = AD[1] in 1
replace AH = AG[1] in 1
replace AK = AJ[1] in 1
replace AN = AM[1] in 1
replace AQ = AP[1] in 1
replace AT = AS[1] in 1
replace AW = AV[1] in 1
replace AZ = AY[1] in 1

replace BC = BB[1] in 1
replace BF = BE[1] in 1
replace BI = BH[1] in 1
replace BL = BK[1] in 1
replace BO = BN[1] in 1
replace BR = BQ[1] in 1
replace BU = BT[1] in 1
replace BX = BW[1] in 1

replace CA = BZ[1] in 1
replace CD = CC[1] in 1
replace CG = CF[1] in 1
replace CJ = CI[1] in 1
replace CM = CL[1] in 1
replace CP = CO[1] in 1
replace CS = CR[1] in 1
replace CV = CU[1] in 1
replace CY = CX[1] in 1

replace DB = DA[1] in 1
replace DE = DD[1] in 1
replace DH = DG[1] in 1
replace DK = DJ[1] in 1
replace DN = DM[1] in 1
replace DQ = DP[1] in 1
replace DT = DS[1] in 1
replace DW = DV[1] in 1
replace DZ = DY[1] in 1

The command proposed by Clyde then was

foreach v of varlist _all {
local newname = `v'[2] + `v'[1]
rename `v' `newname'
}
drop in 1/2

Thank you!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35589

11 Jan 2017, 06:06

On the contrary: your error message clearly indicates a problem with rename.

Thanks, but we can't see your I: drive, so that code is not reproducible. So, somewhat contrary to my principles, I looked at the spreadsheet file.

You are, I thinking, trying to make new names out of material like

Code:

  +-------------------------------------------------------------------------------------+
  |               A |  B |    C |      D |  E |    F |      G |  H |    I |      J |  K |
  | Year(s) covered |    | 1972 |        |    | 1973 |        |    | 1974 |        |    |
  |-------------------------------------------------------------------------------------|
  |                     L                   |                       M                   |
  |                  1975                   |                                           |
  +-------------------------------------------------------------------------------------+

  +-------------------------------------------------------------------------------------+
  |               A |  B |    C |      D |  E |    F |      G |  H |    I |      J |  K |
  |                 | PR |  CL  | Status | PR |  CL  | Status | PR |  CL  | Status | PR |
  |-------------------------------------------------------------------------------------|
  |                     L                   |                       M                   |
  |                   CL                    |                  Status                   |
  +-------------------------------------------------------------------------------------+

.

That's doomed to failure. For example, there would be an attempt to rename several variables "Status". That can only work once. So, Clyde's advice was very good for what you showed in #1 but can be no use for your real case.

That's the bad news. The good news is that you just need some custom code. This is self-contained, and not only reads in the data, but also reshapes them to a practical layout.

Code:

import excel using "https://freedomhouse.org/sites/default/files/Country%20Ratings%20and%20Status%2C%201973-2016%20%28FINAL%29_0.xlsx", clear
keep in 8/212
drop EA-EM
compress

local year = 1973
local j = 0

foreach v of var B-DZ {
    local j = cond(`j' == 3, 1, `j' + 1)
    local stub = word("PR CL Status", `j')
    rename `v' `stub'`year'
    if `j' == 3 local year = `year' + 1
}
    
rename A Country
reshape long PR CL Status, i(Country) j(year)

Comment

Ida Verspohl

Join Date: Jan 2017

Posts: 10
#7

11 Jan 2017, 06:45

Thank you so much, it perfectly works out! Thank you for your time and advice!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35589
#8

11 Jan 2017, 06:54

In my experience inconvenient layouts are especially common with multi-country data when someone has decided that every country should be a row in a spreadsheet. Unfortunately the people who compile the spreadsheets rarely have to make sense of the data in statistical programs.
1 like
Comment
Ida Verspohl

Join Date: Jan 2017

Posts: 10
#9

11 Jan 2017, 07:13

yes I share your impression...
Comment
Ida Verspohl

Join Date: Jan 2017

Posts: 10
#10

11 Jan 2017, 09:23

Dear Nick,

I hope it is ok to continue in this post here.

You helped a lot with an issue to merge the string values of first and the second row of my data by your solution, yet there is another data issue linked with the previous: some columns (edition years 1982 to 1989) do not necessarily represent a year, i.e. edition year 1982 covers almost a 2-year period.

This is how the data looks like:

input str4 AC str18 AD str6 AE str7 AF str17 AG str6 AH str7 AI str17 AJ str6 AK
"1982" "" "" "1983-84" "" "" "1984-85" "" ""
"" "Jan.1981-Aug. 1982" "" "" "Aug.1982-Nov.1983" "" "" "Nov.1983-Nov.1984" ""
"PR" "CL " "Status" "PR" "CL " "Status" "PR" "CL " "Status"
"7" "7" "NF" "7" "7" "NF" "7" "7" "NF"
"7" "7" "NF" "7" "7" "NF" "7" "7" "NF"
"6" "6" "NF" "6" "6" "NF" "6" "6" "NF"
end
[/CODE]
------------------

This leads to the fact that there is one column less than years covered. We will consider e.g. period "Nov1985-Nov1986" as year 1986, so on this side, your solution works fine for us.

Do you think there is a possibility to maybe stop your above procedure in the middle (after edition year 1982), create an empty column which would give room for imputation values (giving room for a virtual edition year 1983), and resume the operation until the end, edition year 2016?

I hope I could make clear the problem, sounds more tricky than it actually is I guess,

thanks a lot!

Ida
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35589
#11

11 Jan 2017, 10:10

I think you're now referring to a different dataset with different structure.

That's the downside of custom code; it needs to be customised.

Note that I didn't merge (or even combine) two rows (meaning observations) at all; I ignored them and constructed variable names differently.

Note that I am recommending against reading metadata into data, so I don't even advise starting where you are now starting. You can't get variable names well out of empty strings.

I can't advise on good code without seeing the dataset.

Last edited by Nick Cox; 11 Jan 2017, 10:35.
Comment
Ida Verspohl

Join Date: Jan 2017

Posts: 10
#12

12 Jan 2017, 03:39

Thanks Nick for your response. The dataset is still the same as in the first question,

import excel using "https://freedomhouse.org/sites/default/files/Country%20Ratings%20and%20Status%2C%201973-2016%20%28FINAL%29_0.xlsx", clear
keep in 5/212
drop EA-EM
compress

...looking at columns AD to AZ (which cause the problem that we need an additional column after "year 1982"):

input str18 AD str6 AE str7 AF str17 AG str6 AH str7 AI
"" "" "1983-84" "" "" "1984-85"
"Jan.1981-Aug. 1982" "" "" "Aug.1982-Nov.1983" "" ""
"CL " "Status" "PR" "CL " "Status" "PR"
"7" "NF" "7" "7" "NF" "7"
"7" "NF" "7" "7" "NF" "7"
"6" "NF" "6" "6" "NF" "6"
".." ".." ".." ".." ".." ".."
"7" "NF" "7" "7" "NF" "7"
end
[/CODE]
------------------

Of course you're right, with your code the architecture of the data set changed (I hope I got you right). Omitting the original years and going forward n+1 is superb, just that it causes a problem: since in the original data, one column is missing, because there was this one period (columns AC-AE, edition year 1982) covering an almost-2-years period, in our new data set there is also one column missing. This is why we would need an additional 3 empty columns after edition year 1982 (between AE and AF) to either leave it blank or impute the data, to make it compatible with other data sets.

Do you think this is possible?

Thanks

Ida
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35589
#13

12 Jan 2017, 04:09

This is a difficult art, especially for beginners. The major point is that you have to look at the spreadsheet to write code to get sensible labels.

It seems that you need to be registered with this site to get full access to the original spreadsheet, and in any case I can't commit to support every individual download here.

The following gives you a more general approach. You just need to define value labels for the different periods, if necessary by hand.

Code:

import excel using "https://freedomhouse.org/sites/default/files/Country%20Ratings%20and%20Status%2C%201973-2016%20%28FINAL%29_0.xlsx", clear keep in 8/212 keep A-DZ local j = 0 local k = 1 foreach v of var B-DZ { if `j' == 3 { local j = 1 local k = `k' + 1 } else local j = `j' + 1 local stub = word("PR CL Status", `j') rename `v' `stub'`k' } rename A country reshape long PR CL Status, i(country) j(period)
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#14

12 Jan 2017, 05:51

Note that if the problem only occurs in one spot, it is sometimes easier to just edit the spreadsheet (manually, or through the dofile) rather than trying to code your loop around it. (I don't know enough about the problem at hand to say whether this is true in this case)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35589
#15

12 Jan 2017, 06:05

I imagine that Jesse's advice is very good. To me spreadsheet programs are like bus stations; sometimes you have to visit, but there is no interest in lingering.

What a competent spreadsheet user needs to have at the top of their list of Stata desiderata (other points could surely be added):

0. Getting the dataset into good Stata condition is almost always tedious but usually justified by easier and less error-prone analysis in Stata thereafter.

1. Variable names must be unique, really meaning unique, i.e. all distinct. Column names such as B to Z, BA to BZ, etc. do qualify but are a real pain to work with in Stata

2. It's a bonus if variable names are structured, e.g.. in groups, and informative. But the rules for variable names in Stata are inviolate.

3. One row in the spreadsheet that contains usable variable labels is often quite easy to work with in Stata, but it should be complete. Many spreadsheets have blanks for metadata where the user is supposed to know that you impute text mentally from neighbouring cells, or by analogy with them, but such blanks are also a real pain and should be filled in.
2 likes
Comment

Announcement