Loop over string values of a varibale

Navid Asgari

Join Date: Jul 2025

Posts: 30
#1

Loop over string values of a varibale

14 Nov 2014, 11:26

Hi all,

I wonder if Sata has the capability for looping over the string values of a variable. For example,

City
---------
LA
Berlin
London
New York

Now, I want to create a loop which starts with LA and then moves to Berlin etc and performs certain calculations in each round of the loop.

Thanks,
David
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

14 Nov 2014, 11:33

Yes, certainly, and this is well documented. See (e.g.) the help for foreach and http://www.stata.com/support/faqs/data-management/try-all-values-with-foreach/

P.S. We strongly prefer full real names here and encourage you to change your identifier away from "Str_researcher" to (e.g.) "David <surname>". Many active members here are reluctant to support people with anonymous or cryptic identifiers.
Comment
Tobias Wendler

Join Date: Nov 2017

Posts: 51
#3

08 Feb 2018, 07:09

Hey

I have a similar issue, on which I`m vividly trying to find a solution (using STATA 13).

I have a string variable that looks like the following (has overall 20 or so different labels it may take):

input str213 Indicator
"Domestic Material Input (Non-metallic minerals) (Tonnes)"
"Domestic Material Input (Non-metallic minerals) (Tonnes)"
"Domestic extraction (Total) (Tonnes)"
"Domestic extraction (Total) (Tonnes)"

What I now want to do is:

Writing a loop which in each step only keeps those observations where the string variable has an identical "value".
Then I want to save the dataset in which only these values are kept. Then save it naming it by the first letters of each word of the manifestation in Indication.

For example the first two lines of the example should be saved as: "\\Desktop\DMINmT.dta"
The second as "\\Desktop\DeTT.dta"

Then I want to open again my full dataset to be able to run the loop again:

use "\\Desktop\MF_all.dta"

Examples on what I currently tried and failed with :

foreach i in Indicator {
keep if Indicator == `i'
save "\\Desktop\`i'.dta"
use "\\Desktop\MF_all_indicators.dta", clear
}

Now, the recent approach I am thinking about is to generate two loops:

1. Generating a new variable which gets running numbers, which are the same for each unique label in the string variable
2. Generate a forvalues loop using the numbers from the variable in the first loop

I`ll continue to try to figure out solutions, and will let know as soon as I found one.
For any hints or solutions on how to do this in a more efficient coding way I would be very happy

Thanks and best regards
Tobi
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

08 Feb 2018, 07:43

Code:

ren Indicator indlong
gen indicator = strtrim(indlong)


levelsof indicator, local(levels) 
foreach l of local levels{
preserve
keep if indicator == "`l'"
save "`l'.dta"
restore
}

Comment

Danial Hoepfner

Join Date: Apr 2017
Posts: 6

08 Feb 2018, 07:56

#4 gets most of what you want done, to get the file named the first letter of each word in the string you just need to add a little bit

Code:

clear

input str213 Indicator
"Domestic Material Input (Non-metallic minerals) (Tonnes)"
"Domestic Material Input (Non-metallic minerals) (Tonnes)"
"Domestic extraction (Total) (Tonnes)" 
"Domestic extraction (Total) (Tonnes)" 
"Another Example"
end
expand 20

levelsof Indicator
foreach i in `r(levels)' {
    preserve
    local sname
    foreach w in `i' {
        local sname=`"`sname'`=substr(`"`=subinstr(`"`w'"',"(","",1)'"',1,1)'"'
    }
    keep if Indicator==`"`i'"'
    sa `"`sname'.dta"', replace
    restore
}

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35734
#6

08 Feb 2018, 08:06

I don't see any obvious need to save lots of different datasets here. My advice is that you're just creating a burden for yourself. If you explain why you think you want or need that, the reason may be convincing, or you may get further advice on how to tackle the real problem.

That said, your syntax is some way off and would be a lot simpler as something more like

Code:

egen group = group(Indicator) su group, meanonly forval i=1/`r(max)' { <do something> if group == `i' }

This approach is documented e.g https://www.stata.com/support/faqs/d...-with-foreach/

What's wrong with your code? I can't test it without example data but the following problems spring to mind.

1. Testing for exact equality with literal strings requires " " around those values.

2. foreach i in varname does not give you a loop over the distinct values of varname Perhaps that guess is based on knowing what some other software does with a loosely similar construct. The syntax you have gives you a loop that is legal in this case but will loop just once. Your code would be legal but yield only

Code:

keep if Indicator == Indicator

which is true for the entire dataset. This error means that #1 doesn't bite, but it would with other code.

3. Putting local macro references after backslashes is explicitly warned against as a source of bugs. See e.g.

[U] 18.3.11 within https://www.stata.com/manuals/u18.pdf

http://www.stata-journal.com/sjpdf.h...iclenum=pr0042
Comment
Tobias Wendler

Join Date: Nov 2017

Posts: 51
#7

08 Feb 2018, 10:48

Hey Jorrit,

thank you so much for the Code, it worked perfectly
Thanks also for the other suggestions and comments, I will take them into account, try out and reply asap.

Thanks and best regards
Tobi

Last edited by Tobias Wendler; 08 Feb 2018, 10:51.
Comment
Tobias Wendler

Join Date: Nov 2017

Posts: 51
#8

16 Feb 2018, 03:36

Hey,

first of all concerning Nicks argument that I am just creating a burden you are right. I tried to splice the big problem into many small ones, but I should look for the full solution.
The actual problem looks like this, I have data that looks the following:

input str74 indicator long(value year) str129 Country
"Domestic extraction (Fossil fuels) (Tonnes)" 2256 48 "Austria"
"Domestic extraction (Fossil fuels) (Tonnes)" 2137 49 "Austria"
"Domestic extraction (Biomass) (Tonnes)" 3508 2 "Austria"
"Domestic extraction (Biomass) (Tonnes)" 3592 3 "Austria"
"Domestic extraction (Non-metallic minerals) (Tonnes)" 10759 48 "Finland"
"Domestic extraction (Non-metallic minerals) (Tonnes)" 10777 49 "Finland"
"Domestic extraction (Biomass) (Tonnes)" 3949 2 "Finland"
"Domestic extraction (Biomass) (Tonnes)" 3982 3 "Finland"
end

What I actually want is to now get according to the different variations of indicator the value for these as its own variable for each country,
so that the data would look like the following if we would name each variable by its first letters, e.g. Domestic extraction (Non-metallic minerals) = DeNm:

Year Country DeNm DeMo DMIFf
1970 Austria Value of Domestic extraction Non-metallic minerals Value of Domestic extraction Metal ores Value of Domestic Material Input Fossil fuels
1 970 Finland
. .
.

If I could get this handled within the dataset without having to create several, edit and merge them that would of course be great.
Thanks in advance for any advice on this issue.

Now, by trying the suggestions above based on my old approach the following things occurred:

1) Danials Code
I got the error code 603, which I have not yet been able to resolve

2) Nick`s code

The code worked and saved me the appropriate files with the names 1 to 20.

Code:

egen group = group(Indicator) su group, meanonly forval i=1/`r(max)' { keep if group == `i' save "\\Desktop\\`i'.dta", replace }

3) Comments to my old code

In fact it was the problem I did not write "". However, the second problem also occured in fact meaning with my code I only saved one dataset.

foreach i in indicator {
keep if indicator == "`i'"
save "\\Desktop\\`i'.dta", replace
}

I will keep trying and reporting as soon as I come to a new solution.

Thanks and best regards
Tobi
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

16 Feb 2018, 09:47

Like Nick, I don't quite understand why break-up the dataset. It kind of looks like Tobias wants the data in a wide layout and the contraction of the indicator variable is needed to have manageable variable names.

Anyway, here's another way to create the variable name contraction using moss (from SSC). This uses a regular expression pattern to match all cases of a letter "([A-Za-z])" that is preceded by a non-letter "[^A-Za-z-]". The advantage of creating this variable name contraction at the observation level is that you can easily check that each contraction matches only one indicator value. The example then shows the mapping of contraction to indicator value.

Finally, I reshape the data to a wide layout. Since each observation in the #8 example was uniquely identified by year and Country, I added an extra observation to show what happens when there's more than one indicator per year and Country.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str52 indicator int value byte year str7 Country
"Domestic extraction (Fossil fuels) (Tonnes)"           2256 48 "Austria"
"Domestic extraction (Fossil fuels) (Tonnes)"           2222  2 "Austria"
"Domestic extraction (Fossil fuels) (Tonnes)"           2137 49 "Austria"
"Domestic extraction (Biomass) (Tonnes)"                3508  2 "Austria"
"Domestic extraction (Biomass) (Tonnes)"                3592  3 "Austria"
"Domestic extraction (Non-metallic minerals) (Tonnes)" 10759 48 "Finland"
"Domestic extraction (Non-metallic minerals) (Tonnes)" 10777 49 "Finland"
"Domestic extraction (Biomass) (Tonnes)"                3949  2 "Finland"
"Domestic extraction (Biomass) (Tonnes)"                3982  3 "Finland"
end

* to install moss, type: ssc install moss
gen s = " " + indicator
moss s, match("[^A-Za-z-]([A-Za-z])") regex
egen vname = concat(_match*)
drop _* s

* make sure that the variable name is matched to only 1 indicator
bysort vname (indicator): assert indicator[1] == indicator[_N]

* show the mapping of vname to indicator
by vname: gen tag = _n == 1
leftalign    // to install, type: ssc install leftalign 
list vname indicator if tag
drop tag

* convert to a wide layout
drop indicator
reshape wide value, i(year Country) j(vname) string
rename value* *

list, sepby(year)

And the results from the first listing

Code:

. list vname indicator if tag

     +--------------------------------------------------------------+
     | vname   indicator                                            |
     |--------------------------------------------------------------|
  1. | DeBT    Domestic extraction (Biomass) (Tonnes)               |
  5. | DeFfT   Domestic extraction (Fossil fuels) (Tonnes)          |
  8. | DeNmT   Domestic extraction (Non-metallic minerals) (Tonnes) |
     +--------------------------------------------------------------+

and the final results

Code:

. list, sepby(year)

     +---------------------------------------+
     | year   Country   DeBT   DeFfT   DeNmT |
     |---------------------------------------|
  1. |    2   Austria   3508    2222       . |
  2. |    2   Finland   3949       .       . |
     |---------------------------------------|
  3. |    3   Austria   3592       .       . |
  4. |    3   Finland   3982       .       . |
     |---------------------------------------|
  5. |   48   Austria      .    2256       . |
  6. |   48   Finland      .       .   10759 |
     |---------------------------------------|
  7. |   49   Austria      .    2137       . |
  8. |   49   Finland      .       .   10777 |
     +---------------------------------------+

.

Comment

Tobias Wendler

Join Date: Nov 2017

Posts: 51
#10

19 Feb 2018, 08:30

Hey Robert,

thank you very much for the awesome suggestion, everything worked perfectly as you suggested and it did exactly what I wanted to do, saving me of all the unnecessary effort.
Thank you and Nick so much for this correction.

Best wishes
Tobi
Comment
Arash Latifi

Join Date: Aug 2019

Posts: 1
#11

30 Aug 2019, 05:18

Dear all,

I am working with the labor data from different cities where the total number of employees alongside their qualification levels is given for each year. The dataset contains 7 variables, namely; Year, City, Qualification_Level, Total_Number, Total_Men, Total_Women (Qualification_Level is a string variable, containg a gap for some cases) and the dataset contains the information from the 2000 to 2018 period.

The dataset looks like:

Year City Qualification_Level Total_Number Total_Men Total_Women
2018 A None … … …
2018 A Undergraduate … … …
2018 A Post Graduate … … …
2018 A Total … … …
2017 A None … … …
2017 A Undergraduate … … …
2017 A Post Graduate … … …
2017 A Total … … …
…..
.
.
.
2018 B None
2018 B Undergraduate
2018 B Post Graduate
2018 B Total
2017 B None
2017 B Undergraduate
2017 B Post Graduate
2017 B Total
….
.
.
.
The values of the variables (Total_Number, Total_Men, Total_Women) are Numeric.
Now I would like to calculate the “Ratio” of Qualification Level for each City at each Year, by calculating the following ratios: (None/Total), (Undergraduate /Total) and (Post Graduate/Total).
I have already tried looping over String values and defining my dataset as a Panel Data, but both approaches were not successful.
Would appreciate your comments regarding looping over String Variables in a situation like this.

Best Regards
Comment
Bradley Tompkins

Join Date: Jun 2023

Posts: 3
#12

20 Mar 2025, 09:42

Hi All,

I am also trying to get Stata (18.5) to loop over the string values of a single variable and for the life of me, I cannot get it to work appropriately. I have reviewed the help for foreach and also the material on http://www.stata.com/support/faqs/da...-with-foreach/, but still not get this to work appropriately. Here is an example of my data:

input str5 hosp_name float(qtr glucose_done)
"HFMMC" 248 0
"HFMMC" 248 0
"HFMMC" 248 0
"HFMMC" 248 0
"PMC" 248 1
"HFMMC" 248 0
"ABHC" 248 0
"PMC" 248 1
"HFMMC" 248 1
"HFMMC" 248 0
end
format %tq qtr
label values glucose_done label_yn
label def label_yn 0 "No", modify
label def label_yn 1 "Yes", modify

and here is the looping code that I wrote:

foreach i in hosp_name {
tab glucose_done qtr if hosp_name==`i', col
}

The code is running, but it is essentially running "tab glucose_done qtr" once for the entire dataset, not one time for each value of hosp_name in the dataset.

Any suggestions for how to get this working appropriately would be greatly appreciated.

Bradley
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#13

20 Mar 2025, 09:53

Your code in #12 has a couple of problems. First is the way you have used -foreach ... in...-. It does not look inside a variable and extract the values. Rather it needs to be followed by a list of the specific values. To loop over the values taken on by the variable hosp_name requires first creating a list of those. That last can then be used in a -foreach ... in...- command, or, better, in a -foreach ... of local ...- loop.

Next, you will get into trouble with -if hosp_name == `i'- because this will cause Stata to look for a variable whose name is `i' and compare whether the value of hosp_name is the same as the value of variable `i'. But, of course, variable `i', if you have constructed the loop properly, will not exist, and you will get an error there. [Added: to get Stata to see whether the value of variable hosp_name is the same as the value in `i' requires that you surround `i' in quotes: `"`i'"'.]

The reason the loop runs at all is that, as you wrote it, the loop runs exactly once, with the value of i set to hosp_name. The -tab- command is then conditioned on -if hosp_name == hosp_name-, which, of course, is true in every observation. So in that one iteration of the loop, you will get -tab glucose_done qtr- executed for the entire data set.

The following code fixes these problems.

Code:

levelsof hosp_name, local(hosp_names) foreach h of local hosp_names { display "`h'" tab glucose_done qtr if hosp_name == `"`h'"' }

I have also added a command to show the name of the hospital before each cross tab, so you can tell which cross tab corresponds to which value of hosp_name.

Last edited by Clyde Schechter; 20 Mar 2025, 10:04.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1413

#14

20 Mar 2025, 10:07

Clyde Schechter specifies the correct way to loop, but if all you need is the tabulation above, you don't actually need a loop at all:

Code:

. bysort hosp_name: tab glucose_done qtr


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> hosp_name = ABHC

glucose_do |    qtr
        ne |    2022q1 |     Total
-----------+-----------+----------
        No |         1 |         1
-----------+-----------+----------
     Total |         1 |         1

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> hosp_name = HFMMC

glucose_do |    qtr
        ne |    2022q1 |     Total
-----------+-----------+----------
        No |         6 |         6
       Yes |         1 |         1
-----------+-----------+----------
     Total |         7 |         7

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> hosp_name = PMC

glucose_do |    qtr
        ne |    2022q1 |     Total
-----------+-----------+----------
       Yes |         2 |         2
-----------+-----------+----------
     Total |         2 |         2

If you are using Stata 17 or later, you could also just do:

Code:

. table (glucose_done) (qtr) (hosp_name), nototal

hosp_name = ABHC
----------------------
             |    qtr 
             |  2022q1
-------------+--------
glucose_done |        
  No         |       1
----------------------

hosp_name = HFMMC
----------------------
             |    qtr 
             |  2022q1
-------------+--------
glucose_done |        
  No         |       6
  Yes        |       1
----------------------

hosp_name = PMC
----------------------
             |    qtr 
             |  2022q1
-------------+--------
glucose_done |        
  Yes        |       2
----------------------

Of course, if this was just dummy code and you actually need to do a set of operations for each value of hosp_name, then the loop may be right way to proceed.

Last edited by Hemanshu Kumar; 20 Mar 2025, 10:09.

Comment

Bradley Tompkins

Join Date: Jun 2023

Posts: 3
#15

20 Mar 2025, 13:31

Thank you Clyde Schechter for the corrected code and detailed explanation of what was wrong with my original attempt. And Hemanshu Kumar thanks for the reminder on the value of bysort to accomplish this as well.
Comment

Announcement