merge on string using "regexm"

Jimmy Yang

Join Date: May 2015

Posts: 54
#1

merge on string using "regexm"

30 Jun 2015, 05:00

Basically, I use vector approach to merge string with "regexm".
That is,
1. reshape wide (one string per column)
2. keep one column
3. set observation number equal to the number of rows on another database
4. make the column all the same with one string
5. merge on _n with another database
6. use regexm to drop those not match
7. save the result
8. do the loop again

However as the database grows, this becomes inefficient.
size: dataset in memory: hundred of thousand of rows, database in using: millions of rows
Any thought on improvement?
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#2

30 Jun 2015, 10:16

This sounds like potentially an interesting problem, but you have never told us 1) what the nature of your existing data is; 2) how you want your data to turn out. You seem to have told us what you are doing, but *not* what you want the result to be.

Further, I can't imagine any sense in which "regexm" is used to merge anything, and I don't know what you mean by "a vector approach." I'd suggest, per the FAQ at the top of the screen, you show us an example of what your current data is, and and an example of what you would like the result to be. You might also try showing your problem description to some friend, before you re-post it, to make sure that other persons can understand what you want.

Regards, Mike
1 like
Comment
Jimmy Yang

Join Date: May 2015

Posts: 54
#3

30 Jun 2015, 22:40

Thank you for considering this question, Mike Lacy.

This is part of big project to identify every groups (in master) using existing database (in using), both are updated regularly.

For "regexm" part, the following is relevant:

Master: (in memory)
groupname var1 var2 time_variable individual_variable ...................
--------------- ------ ------ ------------------ -------------------------
name1
name1
ac
rgh
...........

Using: (the database with unique id)
pseudonyms group_ID group_var1 group_var2 ............
------------------ ------------ ---------------- ---------------
ab~ac~ae~af 00000001 ...................
ee~ve~4e
rg~rgh
hy
..................... 00000005 ..................

What is the nature of existing data in master?
It is scrapped from web regularly. The groupname is input by the user.
So I cannot pre-code groupname.

How I want my data to turn out:
pseudo-code:
merge m:1 using "database",if regexm(pseudonyms, groupname)

It's totally possible that one groupname would have been identified with more than one group_ID using this method. So after regexm, other variable will be compared to drop further.
And of course, groupname contains missing value or partial string.

What do I mean by "vector approach"?

In python's pandas, doing regular expression on vector(column) is much faster.
I think it is also true in STATA.

Note:
Since STATA 14 supporting UTF-8, I have totally done all data management on STATA.
It's much intuitive than python's pandas due to STATA's data editor(viewer). I write my web crawler totally using STATA. Congratulation!

Last edited by Jimmy Yang; 30 Jun 2015, 23:40.
Comment
Jimmy Yang

Join Date: May 2015

Posts: 54
#4

01 Jul 2015, 03:24

Unfortunately, I just find out that "regexm" is somewhat buggy in Stata. Instead, use strpos() to match string which contains another string.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#5

01 Jul 2015, 09:31

I'm sorry, but I still don't understand what you want. Perhaps someone else here will do better than I can.

Regards, Mike
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

01 Jul 2015, 09:51

I think the objective is to merge the master file to the using file, but instead of having groupname in both files that matches exactly, groupname in the master file is to match a substring of pseudonyms in the using file. This would be a simple SQL join with a where clause comparing groupname to pseudonyms. Unfortunately, this is not in general doable using a Stata merge.

My approach would be to transform the using file by expanding it to have one row per pseudonym, with all the data for the group duplicated for each pseudonym.

Code:

pseudonyms group_ID group_var1 group_var2 ............ ------------------ ------------ ---------------- --------------- ab~ac~ae~af 00000001 ...................

Code:

groupname group_ID group_var1 group_var2 ............ ------------------ ------------ ---------------- --------------- ab 00000001 ................... ac 00000001 ................... ae 00000001 ................... af 00000001 ...................

Then this transformed using file could be merged with the master file by groupname.
1 like
Comment
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#7

01 Jul 2015, 10:18

Originally posted by Jimmy Yang View Post

How I want my data to turn out:
pseudo-code:
merge m:1 using "database",if regexm(pseudonyms, groupname)

Instead of pseudo code, could you please show what the merged data is supposed to look like? Please use CODE tags (see FAQ, section 12).
Comment
Jimmy Yang

Join Date: May 2015

Posts: 54
#8

01 Jul 2015, 20:56

William Lisowski's understanding is right.
The objective is to do the following code efficiently

Code:

use master, clear gen id=1 save, replace use using, clear gen id=1 save, replace use master, clear joinby id using using.dta keep if strpos(pseudonyms,groupname)>0

Last edited by Jimmy Yang; 01 Jul 2015, 21:39.
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#9

01 Jul 2015, 22:25

Originally posted by Jimmy Yang View Post

Unfortunately, I just find out that "regexm" is somewhat buggy in Stata. Instead, use strpos() to match string which contains another string.

Perhaps you can show an example of how regexm is buggy?

On the task that you are asking, what are the dimensions of your problem?
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#10

03 Jul 2015, 02:01

I think William Lisowski has provided the best solution. My only two cents to it is that -- depending on the scale of the problem -- it might be better to have two tables out of the first pseudonym table. The first one just with the bridge between groupname and group_ID and the second one pretty much the original without the pseudonyms column/variable.

Just a thought
1 like
Comment
Jimmy Yang

Join Date: May 2015

Posts: 54
#11

03 Jul 2015, 06:05

I will accept William Lisowski's answer if this is like stackoverflow.
Julio Raffo is right. I have tailored to the extend that it only deal with two column, groupname and group_ID.
About regexm, I just finded out Extended ASCII will break "regexm" sometimes.
However, it has been fixed in Stata 14 using UTF-8. I tested program that day using Stata 13.1.
It's somewhat painful for me to migrate to Stata 14. My code is more than thousands of lines.
I would urge Stata Corp to adopt Perl's standard on regular expression in Stata 15.
Stata is not just a statistical program, it can be used more like Perl or Python now.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35734
#12

03 Jul 2015, 06:19

We don't have a formal "accept answer" apparatus here as on Stack Exchange sites (for those who don't know what on earth this means, it means a big tick next to your answer (!) and 15 points added to your reputation (!!); yes, we're back in primary school).

But FAQ Advice does make clear that indicated what solved your problem, most nearly, is often helpful as well as courteous.
Comment

Announcement

merge on string using "regexm"

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment