Separating rows following data in a specific cell separated by semicolon

George Mane

Join Date: Apr 2019

Posts: 24
#1

Separating rows following data in a specific cell separated by semicolon

26 Apr 2023, 12:22

Dear statlist team,

I would like to ask for your help with managing some large data I have. I need to generate new observations that maintain the value for all variables, while separating the "hs6s" variable following the codes separated by semicolon within.

Below a small portion of the database to illustrate it.

I should mention that the complete database includes some cells with more than 5000 different hs6 codes. I hope you can help me with this challenge.

clear
input str33 reporter strL partner str4 ntm str11 target strL hs6s
"Afghanistan" "All Members" "P14" "Target 12.4" "290314;290319;290339;380891;380892;380893;380894; 380899;381300;381400;382200;382471;382472;382473;3 82474;382475;382476;382477;382478;382479;382490"
"Afghanistan" "All Members" "P14" "Target 12.5" "290314;290319;290339;380891;380892;380893;380894; 380899;381300;381400;382200;382471;382472;382473;3 82474;382475;382476;382477;382478;382479;382490"
"Afghanistan" "All Members" "P14" "Target 12.4" "381300;841510;841520;841581;841582;841583;841590; 841810;841821;841829;841830;841840;841850;841861;8 41869;841891;841899;841960;842410;845110;847621;84 7780"
"Afghanistan" "All Members" "P14" "Target 12.5" "381300;841510;841520;841581;841582;841583;841590; 841810;841821;841829;841830;841840;841850;841861;8 41869;841891;841899;841960;842410;845110;847621;84 7780"
end
[/CODE]
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

26 Apr 2023, 14:18

Code:

gen `c(obs_t)' obs_no = _n split hs6s, gen(item) parse(";") reshape long item, i(obs_no) drop if missing(item)

Note: This code requires running under a version and flavor of Stata that permits you to have 5000 + original number of variables in memory. See -help limits- to see if yours qualifies. If yours doesn't, post back and I can provide a more complicated approach that doesn't require all those extra variables. If we do come to that, also please clarify one thing about your data. Most of the tokens between semi-colons in hs6s are numbers. But a few of them have internal blank spaces. Are those internal blanks actually correct or do they represent data errors? The code can be simpler if it is OK to remove the blanks.

Added: Assuming you have the memory to run this in the first place, you may find that the -reshape- is very slow with this many variables. Consider using the community-contribute -greshape- (part of the -gtools- suite) or -tolong- instead. Both accept the same syntax as -reshape- and both are available from SSC.

Last edited by Clyde Schechter; 26 Apr 2023, 14:50.
Comment
George Mane

Join Date: Apr 2019

Posts: 24
#3

01 May 2023, 02:25

Dear Clyde,

Thank you, this worked slowly but did the job.

Thanks again for your precious help.

Best regards,

George
Comment

Announcement

Separating rows following data in a specific cell separated by semicolon

Comment

Comment