Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Foreach function: looping over all observations not just unique observations

    Hi,

    I am new to Stata, and would appreciate some guidance on the following task:

    I have a dataset of 5,000 series with 2 variables (var1 and var2). There are 4,000 observations in both var1 and var2. Each row for var1 and var2 are unique pairs.

    I need to look for the string position of both var1 and var2 against each series. Then, I need to calculate the difference in the character position between var1 and var2 in each series.

    Below is the code I am using to get to the sample data output as shown in the below table:

    Code:
    levelsof var1, local(var1list)
    foreach i of local var1list{
    generate var1`i'=0
    replace var1`i'= strpos(series,"`i'")
    }

    levelsof var2, local(var2list)
    foreach i of local var2list{
    generate var2`i'=0
    replace var2`i'= strpos(series, "`i'")
    }

    order series var1 var2 var1325620 var1221100 var14A0000 var1327992 var1518200 var1211000 var1211000 var1327992 var2326190 var252A000 var2519130 var2326210 var2541512 var2333130 var2F01000 var2314900


    local list1 "a1 a2 a3 a4 a5 a6 a7 a8"
    local list2 "b1 b2 b3 b4 b5 b6 b7 b8"
    local n : word count `list1'

    forvalues i = 1/`n' {
    local var1list : word `i' of `list1'
    local var2list : word `i' of `list2'

    generate diff`i' = `var1list' - `var2list' if `var1list' >0 & `var2list' >0
    }

    As you can see, I am using the foreach function to look for var1 and var2 against each of the series, but the problem with the foreach function is that it is only looping over unique values for var1 and var2 (e.g. var1 "211000" appears twice, however only one column "var1211000" is generated).

    This prevents me from running the rest of the commands to calculate the difference in the character position between var1 and var2 in each of the series.

    Question:
    How do I update the foreach function to loop over all var1 and var2 observations (including those unique and repeated observations)?
    If foreach is not the correct command to be used here, what is the correct command to use?

    I was also thinking to concatenate var1 and var2, so that each observation is unique. Then, use the strpos function to look for the character position but I can only figure out how to look for the position for the whole string in each series (e.g. the 1st observation in var1_var2 "325620326190"). I do not know how to look for the position of the 1st 6 characters of var1_var2 (e.g. "325620326190" ), and then the position of the last 6 characters (e.g. "325620326190" ). Please advise.

    These questions are really challenging for me, given that I am quite new to Stata, so it would be much appreciated if someone can correct my code.

    Thanks a lot.
    series var1 var2 var1_var2 var1211000 var1221100 var1325620 var1327992 var14A0000 var1518200 var2314900 var2326190 var2326210 var2333130 var2519130 var252A000 var2541512 var2F01000
    322210322120322299322110 325620 326190 325620326190 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    322210322120113000115000325310327992 221100 52A000 22110052A000 0 0 0 31 0 0 0 0 0 0 0 0 0 0
    322210322130113000115000325310327992 4A0000 519130 4A0000519130 0 0 0 31 0 0 0 0 0 0 0 0 0 0
    322210322120325180334513 327992 326210 327992326210 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    3222103221201130001150003253102123A0 518200 541512 518200541512 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    3222103221301130001150003253102123A0 211000 333130 211000333130 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    F010004A0000531ORE22110021100033120033111033141921 2230 211000 F01000 211000F01000 25 19 0 0 7 0 0 0 0 0 0 0 0 1
    F010004A0000531ORE23030133341533142033141133141921 2230 327992 314900 327992314900 0 0 0 0 7 0 0 0 0 0 0 0 0 1
    F010004A0000531ORE54130033592033142033141133141921 2230 0 0 0 0 7 0 0 0 0 0 0 0 0 1
    F010004A0000531ORE22110033361133211A33149033141921 2230 0 19 0 0 7 0 0 0 0 0 0 0 0 1
    F010004A0000531ORE22110021100033120033111048200033 1510331419212230 25 19 0 0 7 0 0 0 0 0 0 0 0 1
    F0100022110021100033120033111048200033151033141921 2230 13 7 0 0 0 0 0 0 0 0 0 0 0 1
    F0100022120021100033120033111048200033151033141921 2230 13 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100048100032411021100033120033111048200033151033 1419212230 19 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100048400032411021100033120033111048200033151033 1419212230 19 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100032411021100033120033111048200033151033141921 2230 13 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033611132621033120033111048200033151033141921 2230 0 0 0 0 0 0 0 0 13 0 0 0 0 1
    F0100033611232621033120033111048200033151033141921 2230 0 0 0 0 0 0 0 0 13 0 0 0 0 1
    F0100048400032621033120033111048200033151033141921 2230 0 0 0 0 0 0 0 0 13 0 0 0 0 1
    F0100033611233260033120033111048200033151033141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033790033260033120033111048200033151033141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100031161A1121A033260033120033111048200033151033 1419212230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F010004A0000531ORE23030133232033111048200033151033 1419212230 0 0 0 0 7 0 0 0 0 0 0 0 0 1
    F01000S0020323030133232033111048200033151033141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033611133639033637033111048200033151033141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033611233639033637033111048200033151033141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033611133621133612033361833632033441A33141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033611233621133612033361833632033441A33141921 2230 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    F0100033611132621031490031330031320031310011190011 5000325310327992 0 0 0 61 0 0 19 0 13 0 0 0 0 1
    F0100033611232621031490031330031320031310011190011 5000325310327992 0 0 0 61 0 0 19 0 13 0 0 0 0 1
    F0100048400032621031490031330031320031310011190011 5000325310327992 0 0 0 61 0 0 19 0 13 0 0 0 0 1
    F0100032621031490031330031320031310011190011500032 5310327992 0 0 0 55 0 0 13 0 7 0 0 0 0 1
    F0100042000031490031330031320031310011190011500032 5310327992 0 0 0 55 0 0 13 0 0 0 0 0 0 1
    F0100045200031490031330031320031310011190011500032 5310327992 0 0 0 55 0 0 13 0 0 0 0 0 0 1
    F0100031490031330031320031310011190011500032531032 7992 0 0 0 49 0 0 7 0 0 0 0 0 0 1
    F0100045200031500031330031320031310011190011500032 5310327992 0 0 0 55 0 0 0 0 0 0 0 0 0 1

  • #2
    I have reread your post several times and am still unable to understand what you are trying to do. Your code contains, in the middle, the mysterious appearance out of nowhere of new variables a1 through a8 and b1 through b8, about which you say nothing else, nor do they appear in your example data. Yet in the final loop you write, these variables are critical for defining the values of the 8 diff variables you want to create.

    Part of what you are asking to do appears to be a logical impossibility. You are bothered that "ar1 "211000" appears twice, however only one column "var1211000" is generated." But in a Stata data set, there can never be more than one variable with the same name. And were you to somehow trick Stata into violating that principle, nothing you did thereafter with those variables would be meaningful, for when you referred to var1211000, how would Stata know which one you had in mind?

    Here's my suggestion. First, read the Forum FAQ for lots of excellent advice on how to post clear questions and the useful ways to show data and results. Take the advice to heart: it is not a set of rules imposed from on high. It is advice that has developed from years of experience and following it makes communication clearer, easier, and more effective and efficient. Then post back. When doing so, be sure to use the -dataex- command to show a brief example of the data you are starting with. I would also recommend then showing an example of what you want the results of your calculations based on that data to look like. It is more likely that you will get a helpful response if you do that.

    One final thought. It may be that the calculations you are trying to do here are, in fact, not really what you ultimately want to do. Perhaps this is just one step in pursuit of some other goal. If so, tell us what the ultimate goal is. It may be that somebody can show you a different and better way to get there.

    Comment


    • #3
      Dear Clyde,

      Thank you for your advice. I have installed dataex as suggested and now provide a sample of the original dataset as follows:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str80 series str6(var1 var2)
      "F010004A0000531ORE221100211000331200331110331419212230"             "211000" "F01000"
      "F010004A0000531ORE221100211000331200331110482000331510331419212230" "211000" "4A0000"
      "F01000221100211000331200331110482000331510331419212230"             "212230" "F01000"
      "F01000484000324110211000331200331110482000331510331419212230"       ""       ""      
      "F01000324110211000331200331110482000331510331419212230"             ""       ""      
      end
      In the original dataset, there are 4,000 long chains in the series column. For each of the var1 and var2 observations (var1 and var2 are unique pairs), I need to query whether they appear together in each of the chains. If they do appear, I need to note the position and calculate the distance between each "var1 and var2" pair in each chain.

      Below is the code I have so far to ascertain the position of all obs in var1 and var2 in each chain, which is the first step of what I need to do:

      Code:
      levelsof var1, local(var1list)
      foreach i of local var1list{
        generate var1`i'=0
        replace var1`i'= strpos(series,"`i'")
      }
      
      levelsof var2, local(var2list)
      foreach i of local var2list{
        generate var2`i'=0
        replace var2`i'= strpos(series, "`i'")
      }

      What is missing in the above code is logic to calculate the distance between each "var1 and var2" pair in each chain and ultimately, the maximum distance for each var1 and var2 pair in all chains.

      Below is the desired output, which has the desired columns "dif_var1ob1_var2ob1", "dif_var1ob2_var2ob2", "dif_var1ob3_var2ob3", and "dif_max".

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str80 series str6(var1 var2) float(var1211000 var1212230 var24A0000 var2F01000 dif_var1ob1_var2ob1 dif_var1ob2_var2ob2 dif_var1ob3_var2ob3 dif_max)
      "F010004A0000531ORE221100211000331200331110331419212230"             "211000" "F01000" 25 49 7 1 24 18 48 24
      "F010004A0000531ORE221100211000331200331110482000331510331419212230" "211000" "4A0000" 25 61 7 1 24 18 60 18
      "F01000221100211000331200331110482000331510331419212230"             "212230" "F01000" 13 49 0 1 12  . 48 60
      "F01000484000324110211000331200331110482000331510331419212230"       ""       ""       19 55 0 1 18  . 54  .
      "F01000324110211000331200331110482000331510331419212230"             ""       ""       13 49 0 1 12  . 48  .
      end
      The first desired column "dif_var1ob1_var2ob1" refers to the 1st pair in var1 and var2 (i.e. var1 = 211000 and var2 = F01000). The 2nd desired column refers to the 2nd pair until all pairs are looped.
      Using the 1st desired column as an example, the 1st pair appears in all the 5 chains and their positions within each chain are shown in columns var1211000 and var2F01000.
      So column "dif_var1ob1_var2ob1" is the distance between columns "var1211000" and "var2F01000" (e.g. in the 1st chain, var1(obs 1) is in position 25, and var2(obs 2) is in position 1, so the distance is 25 - 1 = 24).
      i.e. F010004A0000531ORE221100211000331200331110331419212230

      For the 2nd desired column "dif_var1ob2_var2ob2", this refers to the 2nd pair (var1 = 211000, and var2 = 4A0000). However, you will note that var2 does not appear in chains 3 to 5, and so the output in the 2nd desired column "dif_var1ob2_var2ob2" for chains 3 to 5 should be blank (I only need to calculate the distance where var 1 and var 2 both appear in the chain).

      Finally, after calculating the distance between all var1 and var2 pairs, the last step is to find out the maximum distance for each pair over all chains.

      The desired column for the maximum distance is named "dif_max". For example, looking at the distance calculated for the first pair (refer to "dif_var1ob1_var2ob1" column), the observations are:
      24
      24
      12
      18
      12

      Therefore, the maximum distance for our 1st pair is 24.

      Again, the ultimate goal is to calculate the max distance for each pair for all chains where they both appear. So far, I have only been able to run the 1st step, but I am not sure how to get to the ultimate goal. Please advise how I can achieve the ultimate goal.

      Thank you for your patience and I hope this explanation is a little clearer than my first.

      Comment


      • #4
        OK, I think I get it now. As I suspected, to achieve your final goal you do not need to go through the steps you were working on. And, there is no need for any loops to do this. The problem is that your data organization is dysfunctional: the "rows" (in Stata we call them observations) in which var1 and var2 have nothing to do with the observations of series. They are really two more or less unrelated data sets that have been pasted together side-by-side. That kind of organization is often seen in spreadsheets but almost always leads to disaster when attempted in Stata.

        So once we split up the data set into one for series and a separate one for var1 and var2, it gets very easy:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str80 series str6(var1 var2)
        "F010004A0000531ORE221100211000331200331110331419212230"             "211000" "F01000"
        "F010004A0000531ORE221100211000331200331110482000331510331419212230" "211000" "4A0000"
        "F01000221100211000331200331110482000331510331419212230"             "212230" "F01000"
        "F01000484000324110211000331200331110482000331510331419212230"       ""       ""      
        "F01000324110211000331200331110482000331510331419212230"             ""       ""      
        end
        
        //    PUT series AND var1 var2 INTO SEPARATE DATA SETS
        preserve
        keep var1 var2
        drop if missing(var1, var2)
        tempfile probes
        save `probes'
        
        restore
        keep series
        
        //    CROSS THESE DATA SETS
        cross using `probes'
        gen long p1 = strpos(series, var1)
        gen long p2 = strpos(series, var2)
        gen long diff = p1 - p2 if p1 & p2
        
        //    IDENTIFY MAX DIFF FOR EACH VAR1 VAR2 PAIR
        collapse (max) diff, by(var1 var2)
        Now, with the real data involving 5000 series and 4000 var1 var2 pairs, this is going to be slow. But it will still be much faster than what would be required to do it the way you were starting out. Also, your approach would ultimately have attempted to create a very wide data set involving 20,000,000 variables (5000 * 4000), which is far beyond the maximum number of variables allowed even in Stata MP.


        Comment


        • #5
          Dear Clyde,

          Thank you very much for your help. The code you provided works perfectly well. I have been reading your response on Statalist for a while and found your guidance is always helpful. How fortunate I am to get the response from you in my first pose on this forum. I really appreciate your help!

          I just got two silly questions to ask in order to better understand the code.

          (1) Why the following two commands work well when the data set is split into one for series and a separate one for var1 and var2 but not when the dataset is as a whole. By splitting the dataset, I can see that Stata checks whether each unique var1 var2 pair appears in all the chains. However, when I use the same command without splitting the dataset, Stata only checks the var1 var2 pair in the chain that is next to them, for example, check the 1st var1 var2 pair in the first chain, then check the 2nd var1 var2 pair in the second chain. Why is that?

          Code:
            
           gen long p2 = strpos(series, var2)  
           gen long p1 = strpos(series, var1)



          (2) The second question is about the tempfile command. I read your response in another pose (see the link below) mentioning that the file (in my case, it is 'probes') is a temporary file, after exiting the Stata, neither the name 'probes' nor the file exists any longer. However, when I run the code for the second time, it says "probes already exists" so I have to use another name. I exited and reopened Stata before running the code for the second time. Does it mean "probes" is not a temporary file? but I couldn't see where it is located.

          Below is where I found your response in another pose.
          https://www.statalist.org/forums/for...-tempfile-save


          Thanks again for your great help. I appreciate it!

          Regards,
          Liyu

          Comment


          • #6
            Regarding question 1. In any Stata command, all references to variables are construed as referring to values in the same observation, not any others, unless the variables are explicitly subscripted. The reason that you get results for every series after splitting the data set is because the -cross- command that comes between the splitting and the calculation of p1 and p2 creates a new data set in which every pairing of values of series with values of var1 var2 is a separate observation. So the normal rules of interpretation apply. If you want to get an intuitive feel for what is going on, run the example code up through the -cross- command but no farther. Then -browse- to see what the data looks like.

            Regarding question 2. I don't have a good answer for you. I have occasionally encountered this myself. Typically it happens when the program does not go to completion but breaks somewhere, and I guess whatever mechanism Stata uses to clean up temporary files is never reached, and the file gets left behind. As far as I can remember, however, exiting Stata and relaunching it has always solved the problem. So that is unlike what you are encountering and I don't really know what to tell you. Something is wrong, but I don't know what. I can, however, suggest a workaround. Before the -tempfile `probes'- command, insert a new command: -capture erase `probes'-. That way if `probes' is still there, it will be erased, and if it is not still there Stata will just move on.


            Comment


            • #7
              Dear Clyde,

              Thank you very much for your explanation. It is all clear now!

              Much appreciated!

              Regards,
              Liyu

              Comment

              Working...
              X