Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge on string using "regexm"

    Basically, I use vector approach to merge string with "regexm".
    That is,
    1. reshape wide (one string per column)
    2. keep one column
    3. set observation number equal to the number of rows on another database
    4. make the column all the same with one string
    5. merge on _n with another database
    6. use regexm to drop those not match
    7. save the result
    8. do the loop again

    However as the database grows, this becomes inefficient.
    size: dataset in memory: hundred of thousand of rows, database in using: millions of rows
    Any thought on improvement?

  • #2
    This sounds like potentially an interesting problem, but you have never told us 1) what the nature of your existing data is; 2) how you want your data to turn out. You seem to have told us what you are doing, but *not* what you want the result to be.

    Further, I can't imagine any sense in which "regexm" is used to merge anything, and I don't know what you mean by "a vector approach." I'd suggest, per the FAQ at the top of the screen, you show us an example of what your current data is, and and an example of what you would like the result to be. You might also try showing your problem description to some friend, before you re-post it, to make sure that other persons can understand what you want.

    Regards, Mike

    Comment


    • #3
      Thank you for considering this question, Mike Lacy.

      This is part of big project to identify every groups (in master) using existing database (in using), both are updated regularly.

      For "regexm" part, the following is relevant:

      Master: (in memory)
      groupname var1 var2 time_variable individual_variable ...................
      --------------- ------ ------ ------------------ -------------------------
      name1
      name1
      ac
      rgh
      ...........

      Using: (the database with unique id)
      pseudonyms group_ID group_var1 group_var2 ............
      ------------------ ------------ ---------------- ---------------
      ab~ac~ae~af 00000001 ...................
      ee~ve~4e
      rg~rgh
      hy
      ..................... 00000005 ..................


      What is the nature of existing data in master?
      It is scrapped from web regularly. The groupname is input by the user.
      So I cannot pre-code groupname.

      How I want my data to turn out:
      pseudo-code:
      merge m:1 using "database",if regexm(pseudonyms, groupname)

      It's totally possible that one groupname would have been identified with more than one group_ID using this method. So after regexm, other variable will be compared to drop further.
      And of course, groupname contains missing value or partial string.

      What do I mean by "vector approach"?

      In python's pandas, doing regular expression on vector(column) is much faster.
      I think it is also true in STATA.

      Note:
      Since STATA 14 supporting UTF-8, I have totally done all data management on STATA.
      It's much intuitive than python's pandas due to STATA's data editor(viewer). I write my web crawler totally using STATA. Congratulation!
      Last edited by Jimmy Yang; 30 Jun 2015, 23:40.

      Comment


      • #4
        Unfortunately, I just find out that "regexm" is somewhat buggy in Stata. Instead, use strpos() to match string which contains another string.

        Comment


        • #5
          I'm sorry, but I still don't understand what you want. Perhaps someone else here will do better than I can.

          Regards, Mike

          Comment


          • #6
            I think the objective is to merge the master file to the using file, but instead of having groupname in both files that matches exactly, groupname in the master file is to match a substring of pseudonyms in the using file. This would be a simple SQL join with a where clause comparing groupname to pseudonyms. Unfortunately, this is not in general doable using a Stata merge.

            My approach would be to transform the using file by expanding it to have one row per pseudonym, with all the data for the group duplicated for each pseudonym.

            Code:
            pseudonyms         group_ID     group_var1       group_var2 ............
            ------------------ ------------ ---------------- ---------------
            ab~ac~ae~af        00000001     ...................
            Code:
            groupname          group_ID     group_var1       group_var2 ............
            ------------------ ------------ ---------------- ---------------
            ab                 00000001     ...................
            ac                 00000001     ...................
            ae                 00000001     ...................
            af                 00000001     ...................
            Then this transformed using file could be merged with the master file by groupname.

            Comment


            • #7
              Originally posted by Jimmy Yang View Post
              How I want my data to turn out:
              pseudo-code:
              merge m:1 using "database",if regexm(pseudonyms, groupname)
              Instead of pseudo code, could you please show what the merged data is supposed to look like? Please use CODE tags (see FAQ, section 12).

              Comment


              • #8
                William Lisowski's understanding is right.
                The objective is to do the following code efficiently
                Code:
                use master, clear
                gen id=1
                save, replace
                use using, clear
                gen id=1
                save, replace
                use master, clear
                joinby id using using.dta
                keep if strpos(pseudonyms,groupname)>0
                Last edited by Jimmy Yang; 01 Jul 2015, 21:39.

                Comment


                • #9
                  Originally posted by Jimmy Yang View Post
                  Unfortunately, I just find out that "regexm" is somewhat buggy in Stata. Instead, use strpos() to match string which contains another string.
                  Perhaps you can show an example of how regexm is buggy?

                  On the task that you are asking, what are the dimensions of your problem?

                  Comment


                  • #10
                    I think William Lisowski has provided the best solution. My only two cents to it is that -- depending on the scale of the problem -- it might be better to have two tables out of the first pseudonym table. The first one just with the bridge between groupname and group_ID and the second one pretty much the original without the pseudonyms column/variable.

                    Just a thought

                    Comment


                    • #11
                      I will accept William Lisowski's answer if this is like stackoverflow.
                      Julio Raffo is right. I have tailored to the extend that it only deal with two column, groupname and group_ID.
                      About regexm, I just finded out Extended ASCII will break "regexm" sometimes.
                      However, it has been fixed in Stata 14 using UTF-8. I tested program that day using Stata 13.1.
                      It's somewhat painful for me to migrate to Stata 14. My code is more than thousands of lines.
                      I would urge Stata Corp to adopt Perl's standard on regular expression in Stata 15.
                      Stata is not just a statistical program, it can be used more like Perl or Python now.

                      Comment


                      • #12
                        We don't have a formal "accept answer" apparatus here as on Stack Exchange sites (for those who don't know what on earth this means, it means a big tick next to your answer (!) and 15 points added to your reputation (!!); yes, we're back in primary school).

                        But FAQ Advice does make clear that indicated what solved your problem, most nearly, is often helpful as well as courteous.

                        Comment

                        Working...
                        X